scrape hyperlinks from HTML file 1

Status
Not open for further replies.

Alt255

Programmer
May 14, 1999
1,846
US
I remember doing this many years ago but it's stumping me now. How can I extract the hyperlinks (as text) from a local HTML file?


Add water (makes its own sauce).
 
Never mind. I'll just parse any line that contains "HREF". That's probably what I did in the past. I had just hoped for something a little more elegant.
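
For reference, the line-scanning approach described above could look roughly like the following. This is only a minimal sketch, not the code used in the past: the names HrefsInLine and GetLinksByScanning are made up for illustration, it assumes every HREF value is wrapped in single or double quotes with no spaces around the equals sign, and it picks up any HREF attribute (for example on LINK tags), not just anchors.

```vba
Option Explicit

' Hypothetical helper: returns the quoted href values found in one line of HTML.
' Assumes the form href="..." or href='...' (case-insensitive, no spaces around "=").
Public Function HrefsInLine(ByVal strLine As String) As Collection
    Dim found As New Collection
    Dim pos As Long, startPos As Long, endPos As Long
    Dim quoteChar As String

    pos = InStr(1, strLine, "href=", vbTextCompare)
    Do While pos > 0
        startPos = pos + Len("href=")
        quoteChar = Mid$(strLine, startPos, 1)
        If quoteChar = """" Or quoteChar = "'" Then
            endPos = InStr(startPos + 1, strLine, quoteChar)
            If endPos = 0 Then Exit Do ' unterminated value: give up on this line
            found.Add Mid$(strLine, startPos + 1, endPos - startPos - 1)
            pos = InStr(endPos + 1, strLine, "href=", vbTextCompare)
        Else
            ' Unquoted values are skipped in this sketch
            pos = InStr(startPos, strLine, "href=", vbTextCompare)
        End If
    Loop

    Set HrefsInLine = found
End Function

' Hypothetical wrapper: scans a local file line by line and collects every href value.
Public Function GetLinksByScanning(ByVal strPath As String) As Collection
    Dim links As New Collection
    Dim item As Variant
    Dim fileNum As Integer
    Dim strLine As String

    fileNum = FreeFile
    Open strPath For Input As #fileNum
    Do While Not EOF(fileNum)
        Line Input #fileNum, strLine
        For Each item In HrefsInLine(strLine)
            links.Add item
        Next
    Loop
    Close #fileNum

    Set GetLinksByScanning = links
End Function
```

Something like GetLinksByScanning("D:\downloads\example.html") would then return the collection. Because this is plain string matching, an HREF value that is split across two lines will be missed.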


 
Code:
Option Explicit

Public Sub doit()
    Dim Links As Collection
    Set Links = GetLinksFromHTMLFile("file://D:\downloads\example.html")
    ' Links contains a collection of links from a file on the local system
    Set Links = GetLinksFromHTMLFile("https://www.google.com")
    ' Links contains a collection of links from a file on a web server
End Sub

' XMLHTTP Open doesn't figure out the protocol required, so you need to pass that as part of the URL
Public Function GetLinksFromHTMLFile(strURL As String) As Collection
    Dim objHttp As Object
    Dim myHTMLDoc As New HTMLDocument
    Dim linkcollection As New Collection
    Dim item As Variant
    
    Set objHttp = CreateObject("MSXML2.XMLHTTP") ' avoid using the old IE stack
    objHttp.Open "GET", strURL, False ' False = synchronous request
    objHttp.send

    myHTMLDoc.body.innerHTML = objHttp.responseText

    For Each item In myHTMLDoc.getElementsByTagName("A")
        linkcollection.Add item.href
    Next
    
    Set GetLinksFromHTMLFile = linkcollection

End Function
 
Grr - I copied and pasted from a work-in-progress version. The version below is better.

Code:
Option Explicit

Public Sub doit()
    Dim Links As Collection
    Set Links = GetLinksFromHTMLFile("file://D:\downloads\yourhtmlfile.html")
    ' Links contains a collection of link URLs from an HTML file on the local system
    Set Links = GetLinksFromHTMLFile("https://www.google.com")
    ' Links contains a collection of link URLs from an HTML file on a web server
    Stop ' pause here so Links can be inspected in the debugger
End Sub

' XMLHTTP Open doesn't figure out the protocol required, so you need to pass that as part of the URL
' Requires reference to Microsoft HTML Library
Public Function GetLinksFromHTMLFile(strURL As String) As Collection
    Dim objHttp As Object
    Dim myHTMLDoc As New HTMLDocument
    Dim linkcollection As New Collection
    Dim item As Variant
    
    Set objHttp = CreateObject("MSXML2.XMLHTTP") ' avoid using the old IE stack
    objHttp.Open "GET", strURL, False ' False = synchronous request
    objHttp.send

    ' Load the response into an in-memory HTML document so the DOM can be queried
    myHTMLDoc.body.innerHTML = objHttp.responseText

    For Each item In myHTMLDoc.getElementsByTagName("A")
        linkcollection.Add item.href
    Next
    
    Set GetLinksFromHTMLFile = linkcollection

End Function
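
The doit routine above stops in the debugger so the collection can be inspected. If the links are wanted as text, a small helper along these lines could print them to the Immediate window. This is just a sketch; PrintLinks is a made-up name and not part of the code posted above.

```vba
Option Explicit

' Hypothetical helper: writes each entry of a Collection of link strings
' to the Immediate window, preceded by a count.
Public Sub PrintLinks(ByVal links As Collection)
    Dim item As Variant

    Debug.Print links.Count & " link(s) found"
    For Each item In links
        Debug.Print item
    Next
End Sub
```

It could be called as PrintLinks GetLinksFromHTMLFile("file://D:\downloads\yourhtmlfile.html") in place of the Stop statement.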
 