Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations SkipVought on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Scrape data from web site using VBA

Status
Not open for further replies.

FractalWalk

Technical User
Nov 18, 2002
141
0
0
US
I am trying to loop through a list of NFL players and extract stats from a website. I originally was using the InternetExplorer.application object but it was waaaay too slow and often run into problems with the page loads. After searching online I think using MSXML2.XMLHTTP60 is a quicker, cleaner way to get the data but I am having trouble parsing the html code to find what I want.

I have verified that the html is being returned correctly (I printed it), but when I try to use getElementsByTagName it doesn't find the tags. I'm new to XMLHTTP60, but I followed the steps I found on multiple sites of assigning the .repsonseText to an HTMLBody object. I would assume that would allow me to use .getElementsByClassName and .getElementsByTagName like I would with InternetExplorer.Document. But even though I am using the same code from these working examples, I can't get the code to see the tags.

Code:
Public Sub test()
    
    Dim IE As MSXML2.XMLHTTP60
    Dim HTMLDoc As MSHTML.HTMLDocument
    Dim HTMLBody As MSHTML.HTMLBody
    Dim [URL unfurl="true"]www As[/URL] String, txt As String
    Dim nme
    Dim objCollection As Object
    Dim n As Integer
   
    [URL unfurl="true"]www =[/URL] "[URL unfurl="true"]http://www.pro-football-reference.com/search/search.fcgi?search="[/URL]
    'this is an array of names read from Ecxcel
    nme = Cells(1, 1).CurrentRegion
    
    Set IE = New MSXML2.XMLHTTP60
    
    'Loop through all the names
    For x = 1 To UBound(nme)
    
        'search the website by name
        IE.Open "GET", [URL unfurl="true"]www &[/URL] nme(x, 1), False
        IE.send
        While IE.ReadyState <> 4
            DoEvents
        Wend

        Set HTMLDoc = New MSHTML.HTMLDocument
        Set HTMLBody = HTMLDoc.body
        HTMLBody.innerHTML = IE.responseText
        
        'The error occurs with the following line. No title tags are found (but there really is one) 
        'and so the obCollection is not set.  I have verified that the title tag exists as I printed it 
        'and saw it and using Split(IE.responseText,"<title>") correctly gives an array of two elements (i.e. one title tag).

        'extract the title tag to verify that the player page was returned
        Set objCollection = HTMLBody.getElementsByTagName("title")
        
        'NOTHING IS RETURNED FORM THE ABOVE LINE AND SO THE NEXT LINE ERRORS "Object variable not set"
        txt = objCollection(0).innerText        
        If txt Like "*" & nme(x, 1) & "*" Then
            'code for player found
        Else
            'code for player not found
        End If
        
    Next x
End Sub
 
HI,

In the Excel Ribbon, Data > Get external data > From Web

Paste the URL into the Address box. The site will be accessed. Scroll around to find the area of data that interests you. Notice the little YELLOW boxes, and the associated instruction to "Click [box] next to the tables you want to select, then click import."

This will put the data referenced on your sheet, where you can search for specific data.

Use your macro recorder to generate code that can then be modified as necessary.

Skip,
[sub]
[glasses]Just traded in my OLD subtlety...
for a NUance![tongue][/sub]
 

I appreciate the reply, unfortunately that doesn't help me.

When I do that there are no little yellow boxes to click. Secondly, that seems like it is opening up and loading the web page in a browser which is what I am trying to avoid due to my issues with page loading time errors and just plain taking too long. Lastly the table that I need to download alters by player and I won't know what it is called until the html is shown, so I need a dynamic way to search the page to find the specific table for each player.

I already have a way to extract the html quickly without loading the page to a browser. ANd it includes all of the data I need. I can even parse it using the Split function but that is messy and ineffecient. All I need is a way to parse the data using getElementsByTagName and getElementsByClass, which from everything I read should be easy, but I can't get it to work.

So the question I need answered is once I have retrieved the html code using IE As MSXML2.XMLHTTP60 object, what do I with that object be able to use getElementsByTagName and getElementsByClass? I thought is was this.

Code:
Set HTMLDoc = New MSHTML.HTMLDocument
Set HTMLBody = HTMLDoc.body
HTMLBody.innerHTML = IE.responseText


 
I've had this problem as well...

The thing is, the Title tag is in the Header of the document, and when you put the website in the HTMLBody element, you only get the BODY of the page....

I don't remember off hand how I handled that...

One way that might work, is to use the .createDocumentFromUrl method from MSHTML... that will put the entire webpage in the HTMLDoc element... then you can access the HTMLDoc.head.

Gcomyn
 
> opening up and loading the web page in a browser which is what I am trying to avoid

Except that's basically what your code does, since the XMLHTTP Object uses WinInet (i.e. Internet Explorer) for it's functionality (it just doesn't render it). SO maybe you have been suffering rendering [problems, as any page download problems should also affect the XMLHTTP object.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top