Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations SkipVought on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Webscrape vfp9 3

Status
Not open for further replies.

Luikj

Technical User
Jul 23, 2019
5
0
0
US
* Want to capture the data using this so I can do some "Machine Learning" using R.
* would like to get guidance if I can, Lui
DECLARE INTEGER ShellExecute IN shell32.dll ;
INTEGER hndWin, ;
STRING cAction, ;
STRING cFileName, ;
STRING cParams, ;
STRING cDir, ;
INTEGER nShowWin
cFileName = "cAction = "OPEN"
ShellExecute(0,cAction,cFileName,"","",1)
*Can view the web page view with the above and
*Want to capture that table and scrape it to
*my database table called "investorprices",*FIELDS: id_pk, commodityname,commoditymonth,last,high,low,
*change,changepercent,time,RecordTs

* On the view page source are lines 1505 to 2018
*<tbody>
* <tr>
* <td class="flag"><span title="" class="ceFlags gold">&nbsp;</span></td>
* <td class="bold left plusIconTd noWrap elp"><a title="Gold Futures" href="/commodities/gold">Gold
* </a><span class="alertBellGrayPlus js-plus-icon genToolTip oneliner" data-tooltip="Create Alert"
* data-name="Gold Futures" data-id="8830"></span></td>
* <td class="left noWrap">Aug 19 </td>
* <td class="pid-8830-last">1,422.85</td>
* <td class="pid-8830-high" >1,430.45</td>
* <td class="pid-8830-low" >1,414.70</td>
* <td class="bold redFont pid-8830-pc" >-4.05</td>
* <td class="bold redFont pid-8830-pcp" >-0.28%</td>
* <td class="pid-8830-time" data-value="1563902321" >13:18:41</td>
* <td class="icon"><span class="greenClockIcon">&nbsp;</span></td>
* </tr>
* .
* . 35 different items with this type of data each separated by <tr>...<tr>
* .
*<tbody>
 
How far did you get with it?



If you want to get the best response to a question, please check out FAQ184-2483 first.
 
Opening the URL in ShellExecute() will merely display the page in your default web browser. It doesn't give you programmatic access to the underlying code. To do that, you need to actually download the page into a variable.

There are several ways of doing that. One is to use the Internet Transfer control. Somehting like this:

Code:
oNet =  CREATEOBJECT("InetCtls.Inet.1")
lcPage = oNet.OpenURL("[URL unfurl="true"]https://www.investing.com/commodities/real-time-futures")[/URL]

lcPage will now contain the page's underlying HTML. You can now use VFP's full reportoire of string-handling functions to extract the actual content that you want. For example to extract everything between two given tags, you might use STREXTRACT().

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Luikj,

If the HTML page is prepared and transformed into an XML document, VFP's XMLTOCURSOR() function can be used to store the downloaded data and then processed as you want.

The all process can be done with a) some simple VFP functions to clean up the HTML and turn it into valid XML, and b) an XSL transformation that will rearrange the data so that XMLTOCURSOR() may be able to consume it.

Code:
LOCAL HTTP AS WinHttp.WinHttpRequest

* read the page from the website
m.HTTP = CREATEOBJECT("WinHttp.WinHttpRequest.5.1")
m.HTTP.Open("Get", "[URL unfurl="true"]https://www.investing.com/commodities/real-time-futures",[/URL] .F.)
m.HTTP.Setrequestheader("Content-type", "text/html")
* signal "I'm a browser", otherwise the server will not allow
m.HTTP.Setrequestheader("User-Agent", "Mozilla/VFP")
m.HTTP.Send()

LOCAL HtmlPage AS String
LOCAL CrossRateTable AS String

* get the page in HTML
m.HtmlPage = m.HTTP.Responsetext
* isolate the body of the cross rate table
m.CrossRateTable = STREXTRACT(STREXTRACT(m.HtmlPage, ' id="cross_rate_1"', ""), "<tbody>", "</tbody>", 1, 4)

* remove non-break-spaces (XML does not know what this is)
m.CrossRateTable = STRTRAN(m.CrossRateTable, "&" + "nbsp;", " ")
* now we have an XML clean document

LOCAL XML AS MSXML2.DOMDocument60
LOCAL XSL AS MSXML2.DOMDocument60

LOCAL XSLCode AS String

* prepare the XML document processor and its transformer
m.XML = CREATEOBJECT("MSXML2.DOMDocument.6.0")
m.XSL = CREATEOBJECT("MSXML2.DOMDocument.6.0")

* the transsformation code
TEXT TO m.XSLCode NOSHOW
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="[URL unfurl="true"]http://www.w3.org/1999/XSL/Transform"[/URL]
  version="1.0">

  <xsl:output method="xml"/>

  <xsl:template match="/">
    <xsl:element name="VFPData">
      <xsl:apply-templates select="tbody/tr"/>
    </xsl:element>    
  </xsl:template>

  <xsl:template match="tr">
    <xsl:element name="crossrate">
      <xsl:element name="commodity">
        <xsl:value-of select="td[2]/a"/>
      </xsl:element>
      <xsl:element name="month">
        <xsl:value-of select="td[3]"/>
      </xsl:element>
      <xsl:element name="last">
        <xsl:value-of select="translate(td[4], ',', '')"/>
      </xsl:element>
      <xsl:element name="high">
        <xsl:value-of select="translate(td[5], ',', '')"/>
      </xsl:element>
      <xsl:element name="low">
        <xsl:value-of select="translate(td[6], ',', '')"/>
      </xsl:element>
      <xsl:element name="change">
        <xsl:value-of select="translate(td[7], ',', '')"/>
      </xsl:element>
      <xsl:element name="change_percent">
        <xsl:value-of select="translate(td[8], ',%', '')"/>
      </xsl:element>
      <xsl:element name="time">
        <xsl:value-of select="td[9]"/>
      </xsl:element>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
ENDTEXT

* load the source
m.XML.LoadXML(m.CrossRateTable)
* and the transformer
m.XSL.LoadXML(m.XSLCode)

* transform and put the result inside a cursor
XMLTOCURSOR(m.XML.TransformNode(m.XSL), "CrossRateTable")

BROWSE

The result (as of today):
Clipboard01_nrrccm.png
 
António,

This example is worth more than a normal Star. Very good. And thank you for sharing.

Koen
 
Another way, using webbrowser control ( this extracts all the tables ):


Code:
************************************************
* Extract tables from html
* Marco Plaza , 2019 ( [URL unfurl="true"]https://github/nfTools[/URL] )
* no error management check etc..
************************************************

curl =  "[URL unfurl="true"]https://www.investing.com/commodities/real-time-futures"[/URL]

owbf = Createobject("webbrowser")


With owbf.wbc

	.silent = .T.
	.Navigate(m.curl)

	Do While .readystate # 4 Or .busy
		Wait 'contacting server.. Esc to cancel' Window Nowait
		if inkey(0.2) = 27 and messagebox('cancel?',4) = 6
			cancel
		endif
	Enddo

	tc=.Document.getelementsbytagname('table')

Endwith

Messagebox('Found '+Transform(tc.Length)+' tables ',0)

ntable  = 0

For Each Table In tc

	Set Textmerge To Memvar Xmltable Noshow
	Set Textmerge Delimiters To "<%","%>"
	Set Textmerge On

\<table>

	crows=Table.getelementsbytagname('tr')

	For Each tablerow In crows
	\<row>
		ccells=tablerow.getelementsbytagname('td')
		coln = 0
		For Each cell In ccells
			coln=coln+1
			ccn=Transform(coln,'@l 99')
			\<<%"col"+m.ccn%>><%cell.InnerText%></<%"col"+m.ccn%>>
		Endfor
	\</row>
	Endfor

\</table>

	Set Textmerge Off
	Set Textmerge To

	ntable = m.ntable+1

	Xmltable = Strtran( m.xmltable,'&','&amp;')

	tname = 'Table_'+Transform(m.ntable)
	Xmltocursor( m.xmltable,m.tname,4)

	Browse Normal Font 'consolas',14

Endfor


********************************************

Define Class webbrowser As Form
	Add Object wbc As se2
Enddefine

Define Class se2 As OleControl
	OleClass ='shell.explorer.2'
Enddefine


Marco Plaza
@nfoxProject
 
Marco, at that stage you could make use of Tamar Granor's advice on how HTML tables are easily transferred to Excel. I just don't find the thread we once had. Must have been years ago and I think it was about the fastest way to export to excel as XLSX was new and/or COPY TO TYPE XL5 didn't suffice.

faq184-3005 describes how to export DBF to HTML tables and then open them in Excel. You can skip the step to create the HTML table, as you come from there.

The step to get from Excel to DBF then remains, but if the data is just numeric, that should be easy.

So in summary: Once you are at HTML tag [ignore]<table>[/ignore] you can do

Code:
CD d:\temp && wherever you want to save the separated HTML tables

LOCAL lcHTMLTable, lcHTMLFile

ntable  = 0
For Each Table In tc
  lcHTMLTable = Table.OuterHTML

Text to lcHTMLFile textmerge noshow 
<HTML>)
    <HEAD>
    <META HTTP-EQUIV='Content-Type' CONTENT='text/html; charset=windows-1252'>
    <META NAME='Generator' CONTENT='VFP'>
    <TITLE>a</TITLE>
    </HEAD>
    <BODY><<lcHTMLTablel>></BODY>
</HTML>
EndText

ntable = ntable + 1 
StrToFile(lcHTMLFile,"table"+transform(ntable)+".html")
Endfor

Then open the HTML files with Excel and grab your data from there.

Bye, Olaf.

Olaf Doschke Software Engineering
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top