Extract all hyperlinks from a Word 2010 document

Status
Not open for further replies.

pattyjean

Technical User
Mar 3, 2006
53
US
I would like to extract all hyperlinks from a Word document and list them all in one document.



 
That's nice.

However, unless you can tell us what issues you're having doing this, how the document is formatted, etc, it's hard to be sure if the following will work:
• Use Ctrl-A, then mark all text as hidden. If it disappears click on the ¶ symbol on the toolbar/ribbon to make it all visible again.
• Using Find/Replace, do a Find for all text in the Hyperlink Style, setting the Replace parameter to 'Not Hidden'
• Using Find/Replace, do a Find for all hidden text, setting the Replace parameter to ^p
• Using a wildcard Find/Replace, delete the 'hidden text' setting and do a Find for [^13]{1,}, setting the Replace parameter to ^p
What you should end up with is a list of all hyperlinks in the document. All of the above assumes your hyperlinks are formatted as such, with the Hyperlink Style.

Cheers
Paul Edstein
[MS MVP - Word]
 
Thank you Macropod.
I guess I don't work in Word enough to understand what you are asking me for. What do you mean by formatted? It's just a typical Word doc with hyperlinks attached to text.
I am also not sure how to set the Replace to 'Not Hidden' or how to find hidden text.

In the Find box, how do I do a Find for all text in the Hyperlink Style? Is there a special code? Thank you in advance, as we have over 2000 hyperlinks that we need to index at the end.

• Using Find/Replace, do a Find for all text in the Hyperlink Style, setting the Replace parameter to 'Not Hidden'
• Using Find/Replace, do a Find for all hidden text, setting the Replace parameter to ^p
• Using a wildcard Find/Replace, delete the 'hidden text' setting and do a Find for [^13]{1,}, setting the Replace parameter to ^p
 
Formatted: Do your hyperlinks look like & function as hyperlinks?

The rest is simply a matter of learning to use the options available to you on the Find/Replace dialogue. You may need to click on the 'More' button to access them, especially the 'Format' options you'll need to use.

Cheers
Paul Edstein
[MS MVP - Word]
 
The hyperlinks have nothing in front of them but are blue and underlined. There is no special format.
 
If they're blue & underlined, and act as hyperlinks when you click on them, then they are formatted as hyperlinks; if they don't act as hyperlinks, then they're not formatted as hyperlinks - they're simply blue underlined text formatted to look like hyperlinks. Another way to test is to press Alt-F9. Do the 'hyperlinks' change their appearance?

Cheers
Paul Edstein
[MS MVP - Word]
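 
For anyone wondering what the Alt-F9 test reveals: with field codes toggled on, a genuine hyperlink is displayed as a HYPERLINK field, for example { HYPERLINK "http://example.com/a.pdf" }, optionally with a \l switch naming a location inside the target. The following is only a rough Python sketch of how that field-code text splits into an address and a location. The function name `parse_hyperlink_field` is made up for illustration, and other switches (such as \o for a screen tip) are simply ignored.

```python
import re

# Quoted address that directly follows the HYPERLINK keyword (may be absent).
ADDRESS_RE = re.compile(r'HYPERLINK\s+"([^"]*)"', re.IGNORECASE)
# Optional \l switch naming a bookmark/location inside the target.
LOCATION_RE = re.compile(r'\\l\s+"([^"]*)"')


def parse_hyperlink_field(code):
    """Split HYPERLINK field-code text into (address, location).

    Returns None when the text is not a HYPERLINK field at all,
    e.g. blue underlined text that only looks like a link.
    """
    text = code.strip().strip("{}").strip()
    if not text.upper().startswith("HYPERLINK"):
        return None
    address = ADDRESS_RE.search(text)
    location = LOCATION_RE.search(text)
    return (
        address.group(1) if address else "",
        location.group(1) if location else "",
    )
```

Text that merely looks like a link has no field code behind it, so there is nothing for a parser like this to find, which is exactly what the Alt-F9 check shows visually.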
 
Got it! Thank you so much. One question - do I need to do all your steps mentioned in order?

Once we completed all the steps, we saved the Word document as an XML document and were able to open it with Excel, so we now have a list of the targets (the hyperlinks' true values, PDFs).

Hope I can make a macro to do all the steps.
thank you again.
 
Hi pattyjean,

Here's a macro to do the job:
Code:
Sub ExtractHyperlinks()
With ActiveDocument.Range
  .Font.Hidden = True
  With .Find
    .ClearFormatting
    .Replacement.ClearFormatting
    .Forward = True
    .Wrap = wdFindContinue
    .Format = True
    .MatchCase = False
    .MatchWholeWord = False
    .MatchWildcards = False
    .MatchSoundsLike = False
    .MatchAllWordForms = False
    .Style = "Hyperlink"
    .Text = ""
    .Replacement.Text = ""
    .Replacement.Font.Hidden = False
    .Execute Replace:=wdReplaceAll
    .ClearFormatting
    .Font.Hidden = True
    .Replacement.Text = "^p"
    .Execute Replace:=wdReplaceAll
    .ClearFormatting
    .MatchWildcards = True ' [^13]{1,} is only treated as a pattern in a wildcard search
    .Text = "[^13]{1,}"
    .Execute Replace:=wdReplaceAll
  End With
End With
End Sub

Cheers
Paul Edstein
[MS MVP - Word]
 
Hi,
I'm very far from a Word VBA guru, but would this macro not be a bit simpler? You get a clean Word doc with all the hyperlinks listed in paragraphs.

Code:
Function doHL()
    Dim nd As Document
    Dim a As Document
    Dim h As Hyperlink
    Dim r As Range
    
    Application.ScreenUpdating = False
    
    Set a = ActiveDocument
    Set nd = Documents.Add
    
    For Each h In a.Hyperlinks
        Set r = nd.Range
        r.Collapse
        r.InsertParagraph
        r.InsertAfter (h.Address)
    Next

    nd.Activate
    Application.ScreenUpdating = True
    Application.ScreenRefresh
End Function
 
Hi jpadie,

Your code might be 'simpler', but it's far less efficient once you get beyond a few hyperlinks. FWIW, for all its extra lines, my code does all the extraction, even in a document with 100,000 hyperlinks, in four simple steps. Yours would probably still be running hours after mine has finished.

Cheers
Paul Edstein
[MS MVP - Word]
 
On the other hand it has the advantage that, with a very minor change, it can show the real target, which pattyjean seems to have suggested is the goal in their post of 3 Dec 12 13:40, for example:

Code:
'Private Declare Function GetTickCount Lib "kernel32" () As Long

Public Sub GetHyperlinks()
    Dim myDoc As Document
    Dim wombat As Hyperlink
'    Dim starttime As Long
    Dim CurrentDoc As Document
    
    Application.ScreenUpdating = False
    Set CurrentDoc = ActiveDocument
    Set myDoc = Application.Documents.Add()

'    starttime = GetTickCount
    For Each wombat In CurrentDoc.Hyperlinks
        myDoc.Range.InsertAfter wombat.TextToDisplay & vbTab & wombat.Address & vbCrLf
    Next
'    Debug.Print GetTickCount - starttime

    Application.ScreenUpdating = True
    myDoc.Range.ParagraphFormat.TabStops.Add CentimetersToPoints(7.5), wdAlignTabLeft, wdTabLeaderSpaces 'basic formatting
End Sub

Furthermore, an actual test of your assertion on performance (against a 234 page document with over 8000 hyperlinks) indicates that the contrary is true - performance of jpadie's solution (or at least my variant above) starts to convincingly outstrip the find/replace solution as the number of hyperlinks goes up.
 
Setting 'Application.ScreenUpdating = False' makes a fairly fundamental difference. If you're going to use that for a timing comparison, you should use it in both implementations.

Cheers
Paul Edstein
[MS MVP - Word]
 
i suspect that paging between the documents would slow down the script, even with screen updating off.

I tried to experiment with storing the targets in a string and then finally inserting into a new document. I tested on a file with 280000 hyperlinks across 8000 pages and got bored after ten minutes (so force quit the app). in the meantime I wrote a php app to open the raw xml and retrieve the hyperlinks. that op takes milliseconds...

i know that VBA is not a real language but i'm still really surprised by how badly optimised it is. Luckily I never have to use it for anything other than the most trivial things.
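 
The PHP app mentioned above was not posted, so the following is only a rough Python sketch of the same idea: read the raw XML inside a .docx (which is a zip archive) rather than walking the Hyperlinks collection. It assumes the links are stored as w:hyperlink elements whose r:id points at an entry in word/_rels/document.xml.rels. Links stored as HYPERLINK field codes, and links in headers, footers or footnotes, are not picked up by this sketch. The function name `extract_hyperlinks` is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by WordprocessingML and by the package relationship parts.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK_TYPE = R + "/hyperlink"


def extract_hyperlinks(docx_path):
    """Return (display text, target) pairs from the main document part."""
    with zipfile.ZipFile(docx_path) as z:
        rels_root = ET.fromstring(z.read("word/_rels/document.xml.rels"))
        doc_root = ET.fromstring(z.read("word/document.xml"))

    # Map relationship ids (rId1, rId2, ...) to their hyperlink targets.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels_root.iter(f"{{{PKG}}}Relationship")
        if rel.get("Type") == HYPERLINK_TYPE
    }

    links = []
    for h in doc_root.iter(f"{{{W}}}hyperlink"):
        rid = h.get(f"{{{R}}}id")
        if rid not in targets:
            continue  # e.g. bookmark-only links, which carry w:anchor instead
        # The display text may be split across several runs; join them.
        text = "".join(t.text or "" for t in h.iter(f"{{{W}}}t"))
        links.append((text, targets[rid]))
    return links
```

Calling extract_hyperlinks("report.docx") (a hypothetical file name) returns the links in document order, much like the TextToDisplay/Address pairs produced by the VBA loop, and it never has to start Word.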
 
>If you're going to use that for a timing comparison, you should use it in both implementations.

I did
 
And this is the test version of your code that I used against the same document as my code:

Code:
' Needs the GetTickCount declaration at module level:
' Private Declare Function GetTickCount Lib "kernel32" () As Long
Sub ExtractHyperlinks()
Dim starttime As Long

Application.ScreenUpdating = False
starttime = GetTickCount
With ActiveDocument.Range
  .Font.Hidden = True
  With .Find
    .ClearFormatting
    .Replacement.ClearFormatting
    .Forward = True
    .Wrap = wdFindContinue
    .Format = True
    .MatchCase = False
    .MatchWholeWord = False
    .MatchWildcards = False
    .MatchSoundsLike = False
    .MatchAllWordForms = False
    .Style = "Hyperlink"
    .Text = ""
    .Replacement.Text = ""
    .Replacement.Font.Hidden = False
    .Execute Replace:=wdReplaceAll
    .ClearFormatting
    .Font.Hidden = True
    .Replacement.Text = "^p"
    .Execute Replace:=wdReplaceAll
    .ClearFormatting
    .MatchWildcards = True ' [^13]{1,} is only treated as a pattern in a wildcard search
    .Text = "[^13]{1,}"
    .Execute Replace:=wdReplaceAll
  End With
End With
Debug.Print GetTickCount - starttime
Application.ScreenUpdating = True ' restore screen updating once timing is done
End Sub
 
>i'm still really surprised by how badly optimised it is

it isn't really VBA itself that is the culprit with your code, it is the fact that you are using relatively expensive (slow) Word operations: Collapse and InsertParagraph.
 
I live and learn!

I wrote an alternative that just stored the addresses in a string and didn't write it anywhere (so no 'expensive' calls). I quit the app again after 25 minutes running on the same document (8k pages 200k+ hyperlinks).

Ho hum ...
 
Well, my admittedly paltry 8000 hyperlinks only took about 6 or 7 seconds on a somewhat ageing 3 GHz Pentium 4.
 
Now tried it against 48384 links in a 98000 word document. Took about 136 seconds. The find/replace solution is taking somewhat longer: currently 8 mins and counting. I suspect that memory will be a factor here. Will have to test on my monster at home tonight.
 
Curious. I'm using a 2.53 GHz Core 2 Duo with 4GB RAM, but am using MacWord, which may well not have an optimised memory handler or VBA compiler.

 