Extracting HTML from browser - how?

Status
Not open for further replies.

TeddyBillewicz

Programmer
Jan 31, 2011
4
DE
In order to parse elements (for further processing in VFP) I need to extract the actual HTML-Code from a running browser.

Approach 1:

Using the basic technique from thread184-1207892 I built a form that uses the web browser ole control. In a commandbutton I'm using

Code:
        LOCAL loDoc

        loDoc = THISFORM.Olecontrol1.object.Document
        loDoc.focus() && document must have focus

to get a reference to the document, but I fail to find the pertinent properties (for example, to access the HTML). Is there documentation for the object model of this OLE control? Or does anyone know how I could get hold of the actual HTML, the way a regular browser shows it when you right-click the page and choose "View source" (the wording may not be translated correctly, but I mean the function that shows the website's HTML in IE, Chrome or Firefox)?

Approach 2:

Can I get access to an external browser (Chrome, Firefox, Internet Explorer) and then extract that same information simply while the user is online with his own browser?

Either approach could work for my application. I just don't know how to implement it under VFP 9.0 (SP2).

TIA for any help or suggestions

Teddy



 
Try this:

Code:
LOCAL loDoc

loDoc = THISFORM.Olecontrol1.object.Document
loDoc.focus() && document must have focus

lcText = loDoc.Body.InnerText

That will give you the text, without the HTML tags. Is that what you want?

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips, training, consultancy
 
You can also do:

loDoc.Body.OuterText
loDoc.Body.OuterHTML
loDoc.Body.InnerHTML

I can't off-hand remember the difference between Inner and Outer, but you should be able to figure it out with a bit of experimenting.
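
For anyone who wants to see the difference side by side, here is a minimal sketch. It assumes the form and Olecontrol1 from the first post, with a page already loaded in the control. In general, InnerHTML is the markup between an element's opening and closing tags, while OuterHTML also includes the element's own tags:

```foxpro
* Sketch only: assumes the form and Olecontrol1 from the first post,
* with a page already loaded in the control.
LOCAL loDoc
loDoc = THISFORM.Olecontrol1.object.Document

* Show the first 80 characters of each, to compare how they start.
? LEFT(loDoc.Body.InnerHTML, 80)   && markup inside the body only
? LEFT(loDoc.Body.OuterHTML, 80)   && starts with the BODY tag itself
? LEFT(loDoc.Body.InnerText, 80)   && text only, tags stripped
```

For the Body element the two HTML versions differ only by the BODY tag itself, so for parsing the page contents either one should do.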

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips, training, consultancy
 
Mike,

I am amazed! "I wish I had your brains" was my first reaction, but I decided that wasn't really accurate. I just wish I had what's in your brain in mine, too.

Actually your answer was exactly what I needed, and it (.InnerHTML as well as .OuterHTML) works well.

The differences between InnerHTML and outerHTML are minor and in portions of the code that I'm not using anyway.

I really appreciate your help!!!

Teddy
 
Actually the object model is the same as for Internet Explorer, as the web browser control is the HTML rendering part of IE (mainly, but not only, shdocvw.dll is involved).

The Document node (object) is the root of the often-cited DOM (Document Object Model). Detailed documentation exists, e.g. in the MSDN Library.
The problem, of course, is the sheer number of details. Finding something is often like searching for a needle in a haystack.

I find it useful to know who has been dealing with lots of details about a topic. In this case, you must know or get to know Rick Strahl and West-Wind. Starting with this, perhaps:
Bye, Olaf.
 
Teddy,

I wouldn't wish my brains on anyone, but thanks for your kind words.

If you want to do some heavy-duty parsing on the contents of the page, you should do as Olaf suggests, and look into the Document Object Model (DOM). Basically, it represents the content as a tree, which you can parse in a variety of ways, for example, to iterate through all the images or all the H1 headings.
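
To give an idea of what walking the DOM looks like from VFP, here is a minimal sketch. The procedure name ListImagesAndHeadings is made up for illustration, and toDoc is assumed to be the Document reference you already get from the web browser control:

```foxpro
* Sketch only: ListImagesAndHeadings is a hypothetical name; toDoc is
* assumed to be the Document object of the web browser control.
PROCEDURE ListImagesAndHeadings(toDoc)
    LOCAL loImg, loHeads, lnI

    * Document.images is the collection of all IMG elements on the page.
    FOR EACH loImg IN toDoc.images
        ? loImg.src
    ENDFOR

    * getElementsByTagName() returns a zero-based collection.
    loHeads = toDoc.getElementsByTagName("H1")
    FOR lnI = 0 TO loHeads.length - 1
        ? loHeads.item(lnI).innerText
    ENDFOR
ENDPROC
```

You would call it with something like ListImagesAndHeadings(THISFORM.Olecontrol1.object.Document) once the page has finished loading.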

But for more basic work, you can get by with VFP's string-handling features. I often use functions like OCCURS() and STREXTRACT() to parse information out of HTML code. It works quite well.
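
As a rough illustration of the string-handling route, here is a small sketch. The HTML string is made up for the example; in practice it would come from InnerHTML:

```foxpro
* Sketch only: the HTML string below is invented for illustration.
LOCAL lcHTML, lnCount, lnI
lcHTML = '<h1>Offers</h1><img src="a.png"><p>text</p><img src="b.png">'

* STREXTRACT() returns the text between two delimiters.
? STREXTRACT(lcHTML, "<h1>", "</h1>")          && Offers

* OCCURS() counts the IMG tags. The 4th parameter of STREXTRACT()
* picks the n-th occurrence of the begin delimiter.
lnCount = OCCURS("<img ", lcHTML)
FOR lnI = 1 TO lnCount
    ? STREXTRACT(lcHTML, 'src="', '"', lnI)    && a.png, then b.png
ENDFOR
```

Both functions are case-sensitive by default. The browser may hand back tag names in upper case, so it can help to pass 1 as the fifth parameter of STREXTRACT() for a case-insensitive search, and to run OCCURS() against LOWER(lcHTML).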

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips, training, consultancy
 
@Olaf

It's not that I'm not getting to your answers (you and Mike), it's just that you gave me a lot of homework and I wanted to get it done before asking more questions that get solved by reading what you pointed me to.

Rick's paper (using the apishell) is such an eye-opener. It's a whole new world out there... (you see, I hadn't done much in that area - VFP and the web).

Anyway, I'm doing my homework now. And each bit of reading and testing gets me going on something else somewhere else.

I'll be back shortly...

Teddy
 
In the meantime I have done a lot of reading, testing, trying, and proving what doesn't work, and I am real close to solving all the core issues I have. In some cases the required syntax baffles me, and the documentation _does_ resemble a haystack (big one :)).

I have also started experimenting with the object browser and MSHTML but seem to be on kindergarten-level in terms of finding the things I need to find...

Here is a current problem that I'm working on:

On the site I'm working with, I get a lot of my information using InnerHTML from the current URL. Under certain conditions, when the user clicks "on a farmer", a billboard offer pops up, and I want to capture its text.

I can do it with copy & paste, which is what I did in my previous versions, and then feed it to VFP via _CLIPTEXT (the user does a Ctrl-A, Ctrl-C in that case).

But this information doesn't show up in InnerHTML, OuterHTML, InnerText or OuterText, since it's in an iFrame. Manually, though, I can do a Save As after clicking on that billboard and get the full information I need by parsing that code (the inner or outer HTML of the iFrame only).

This happens to be a frame or iFrame that I can even identify in the url. When I activate any of the 1 through 13 billboards (offers from farmers to buy something with a shopping list and an amount they are willing to pay)


In each of these situations I can get at the details via right-click and Save As, but programmatically I fail to figure out the appropriate syntax for addressing the InnerHTML of that iFrame.

After getting a reference to my browser with:

DO Form wi_webbrowser NAME oForm linked
o = oform.oBrowser

I think I should somehow be able to get my hands on the iFrame involved, or at least go through the iFrame collection and get loFrames.InnerHTML

For testing I tried something along the lines of

Code:
		loAll = o.DOCUMENT.ALL
		* use a separate loop variable so the collection reference is not overwritten
		FOR EACH loElem IN loAll
			lcHTML = loElem.InnerHTML
		ENDFOR

but don't have any luck.

I think that, using the above address (which I can get from the URL), I should somehow get the corresponding InnerHTML. But how do I request that?

The third frame (=billboard) has


and I would like to address this like:

Code:
		* locate all iFrames -  using TagName
		FOR EACH loTag IN o.DOCUMENT.ALL
			IF UPPER(loTag.Tagname) == "IFRAME"

				* how do I address the InnerHTML of the iFrame itself?

				lcResult = loTag.Tagname  && here I get "IFRAME", the name
				
				
				*produce errors
				* lcResult = loTag.inhherhtml
				* lcResult = lotag.parent.innerhtml
				* lcResult = lotag.body.innerhtml     

				* lcResult = lotag.document.body.innerhtml
				* Eureka - this works... NOPE - it's not the iFrame
			
				* what could be the appropriate syntax here to catch the iFrame?
				lcResult = lotag.document.body.innerhtml
				
			ENDIF
		ENDFOR


Or: when I have the iframe I could do something like...

url:
Code:
	lchtml= http://s36.wurzelimperium.de/verkauf.php?kunde=19059542&kundemap=i2.innerhtml
How would I address the object in question?

Or:

Code:
o.document.frames(1).innerHTML && ?  - if  I know from the url that it's frame 2 (i1) that I'm dealing with?

Any feedback would be greatly appreciated. I don't mind all the learning I'm doing currently, in fact I'm kind of thrilled. But sometimes I wish I could just continue with the debugger from my VFP-Debugger into the browser-object and then see what's there and what it is called...

TIA, Teddy
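
One direction that may be worth testing here, offered as an untested sketch rather than a confirmed answer: in the IE object model an IFRAME element exposes a contentWindow property, and that window has its own document. The sketch assumes "o" is the web browser reference from the post above and that the page has finished loading. If the frame is served from a different domain than the parent page, the browser may refuse access, which is why each read is wrapped in TRY/CATCH:

```foxpro
* Sketch only, not tested against the site in question.
* Assumes "o" is the web browser reference from the post above.
LOCAL loFrames, loFrame, lnI, lcHTML, loErr

* Collect all IFRAME elements of the parent document (zero-based).
loFrames = o.Document.getElementsByTagName("IFRAME")
FOR lnI = 0 TO loFrames.length - 1
    loFrame = loFrames.item(lnI)
    TRY
        * The frame's own document lives behind contentWindow.
        lcHTML = loFrame.contentWindow.document.body.innerHTML
        ? lnI, LEN(lcHTML)
    CATCH TO loErr
        * Typically raised when access to the frame is denied.
        ? lnI, "no access:", loErr.Message
    ENDTRY
ENDFOR
```

The Document.frames collection might be an alternative route to the same window objects. Either way the index is zero-based, so the second frame would be index 1.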
 