
Converting multiple HTML documents to PDF using Word 2

Status
Not open for further replies.

BlindPete

Programmer
Jul 5, 2000
711
US
Hello Tek-Tipers!

I have a long-time client whose website is composed of articles, posted regularly. Sort of a daily eZine, so to speak. They have asked me to determine a way that they could produce a quarterly PDF archive that would resemble a print magazine. Their CMS uses MySQL, and all the content in that database can be extracted as strict HTML markup.

The bit I am struggling with is how best to do it. I am hoping someone here has done something similar and can give me a nudge. My simplistic prototype does the following:
#1 Extracts Articles from the database generating an HTML document for each one.
#2 Each HTML document is read, and translated to Word 2003 markups and inserted into a Word Document. Actually I only translate a few items at present: headings and images.
#3 Once complete, a Table of Contents is generated and inserted using Word automation. The document is converted to PDF and it's done.

Where this methodology fails is mainly with hyperlinks: anchors in the same page and links to other related articles. These are of great value to the client, who wishes to retain them, and I cannot figure out how to do it. The link itself need not be preserved, but a reference indicating the page number of the target would be excellent.

I am not all that familiar with Word automation or Word's abilities. Is there a way I can create a tag (or tags) in the document that will allow me to track/insert a page reference to that tag?

-Pete
Games the old fashion way with Dice, Paper and Pencils!
 
In this KB article you might record Method 1 with the macro recorder and see a little bit about what's going on inside Word.


wjwjr

This old world keeps spinning round - It's a wonder tall trees ain't layin' down
 
Unless you are using a full version of Adobe, you will not be able to retain the hyperlinking ability within the created PDF. At least I do not think so.

You can "translate" the HTML hyperlinks, into Word hyperlinks - ie, links to other parts of the document. However, that is the Word document, not the PDF.

faq219-2884

Gerry
My paintings and sculpture
 
Thanks folks. Yes Gerry, you are correct. I have the full version of Adobe. You can instruct Adobe using Distiller (but not PDF Writer) to retain the Word links. PDF/Word cooperate nicely.

Gerry, do you know the official name for Word's links to other parts of the Word document? If I could just learn what to search for in Word's COM object model, I'm sure I can figure out a way to re-characterize the HTML hyperlinks.

-Pete
Games the old fashion way with Dice, Paper and Pencils!
 
"Do you know the official name for Word's links to other parts of the word document."

Yes. They are called hyperlinks. I am not being facetious, that is the official name. Or rather, that is the object in the object model. Hyperlinks.

If you are going to do this, then you would have to spell out: " I'm sure I can figure out a way to re-characterize the html hyper-links." in more detail.

NOTE!!!!! Word uses bookmarks as targets for hyperlinks within a document. You cannot just hyperlink to any old spot. You hyperlink to a specific spot, and that spot MUST be a bookmark. This makes sense. How else is Word to know where to jump to? It needs something. A URL? Sure thing. Another document with a valid path? Sure thing. A location within a document? Location within a Word document means a defined range. Which is precisely what bookmarks are - defined ranges.

In other words, yes, it is possible, but it may be more work/fussing than you wish.

faq219-2884

Gerry
My paintings and sculpture
 
Thank you White605: I use that trick all the time with Office Automation.

Thank you Gerry! That is a huge help. So what I have to do is:
- Convert HTML anchors to Word bookmarks (defined ranges).
- Create a bookmark for each imported HTML page.
- Convert HTML hyperlinks to Word hyperlinks that refer to bookmarks instead of anchors and/or other HTML documents.

I was vague, mainly because I did not know enough to be specific. You cut the fog nicely. A star for you!

I still have a big hurdle, namely keeping track of all the bookmarks. Hopefully I can rely on the url structure as a naming convention and not have to keep a table of them somewhere while processing it all.
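
One way the URL-based naming convention could work, sketched in Python since it is plain string handling that does not depend on Word. The function name `bookmark_name` and the `bm_` prefix are made up for illustration. The sketch assumes Word's bookmark naming rules (start with a letter; letters, digits and underscores only; at most 40 characters) and appends a short hash of the full URL so that names stay unique after truncation. Because the name is recomputed from the URL each time, no lookup table is needed.

```python
import hashlib
import re
from urllib.parse import urlsplit

MAX_LEN = 40  # assumed Word limit on bookmark name length


def bookmark_name(url: str) -> str:
    """Derive a deterministic, Word-safe bookmark name from a URL (with optional #anchor)."""
    parts = urlsplit(url)
    # Readable part: path plus fragment, every non-alphanumeric run collapsed to "_".
    raw = parts.path + ("_" + parts.fragment if parts.fragment else "")
    slug = re.sub(r"[^A-Za-z0-9]+", "_", raw).strip("_")
    # Short hash of the full URL keeps names distinct even when the slug is cut off.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    # The fixed "bm_" prefix guarantees the name starts with a letter.
    room = MAX_LEN - len("bm_") - len("_") - len(digest)
    return "bm_" + slug[:room] + "_" + digest


if __name__ == "__main__":
    print(bookmark_name("http://example.com/articles/2009/my-story.html#part2"))
```

The same URL always yields the same name, so the code that creates a bookmark and the code that later builds a hyperlink to it can each call the function independently.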

Thanks again!

-Pete
Games the old fashion way with Dice, Paper and Pencils!
 
Leon is now much bigger than that photo. He weighed about 15 lbs then, he is 21 now. It is painfully funny to watch him cram his bulk through the cat door. It takes him about 5 or 6 seconds of effort. The other cat just jumps through. I worry actually about Leon. During his squeeze he is very very vulnerable and we have coyotes here.

Let's cover your points:

- Convert HTML anchors to word Bookmarks (Defined Range).
- Create a bookmark for each imported HTML page.
- Convert HTML hyperlinks to word hyperlinks that refer to bookmarks instead of anchors and/or other html documents.

1. No, you want to convert the HTML targets to bookmarks. Oh...OK, never mind, I see what you mean. Yes.

2. "create a bookmark for each imported HTML page" - not following that. Why?

3. Yes to the first part. Again, if the target is a location in a document (and that can be the current document, or some other, as yet unopened, document), then that location MUST be a bookmark. If the hyperlink is to either a web page, or just a document, then it would NOT be a bookmark.

Here are some examples. Say a document with THREE Word hyperlinks in it:

Hyperlink # 1 - link to a bookmark (Name = "Here") in the same document.

Address = ""
SubAddress = "Here"

NOTE!!!! If you edit a hyperlink, in Word, and it links to a bookmark in the same document, you do NOT get the Address dialog. Address = "". "" means current document.

If you edit a hyperlink, in Word, and it links to a different document, a bookmark in a different document, or a web page, then you DO get an Address dialog.

Hyperlink # 2 - link to a different Word document ("c:\test\1-2 pages.doc"). The .Follow method (fired by clicking the hyperlink) will open that document.

Address = "c:\test\1-2 pages.doc"
SubAddress = ""

The content of the Address box (if you look in the Edit dialog) is: "c:\test\1-2 pages.doc"

Hyperlink # 3 - link to a specific bookmark (named "There") - a specific location - in a different Word document ("c:\test\7-8 pages.doc"). The .Follow method (fired by clicking the hyperlink) will open that document, and move the selection (the cursor) to THAT location.

Address = "c:\test\7-8 pages.doc"
SubAddress = "There"

The content of the Address box (if you look in the Edit dialog) is:
"c:\test\7-8 pages.doc#There"

Hopefully, you can see by these that there are significant differences in the data held as properties in the hyperlink objects, and the way that data is written/displayed in the dialog.

Address is the property (a pointer, really) to the container, the document.

SubAddress is the property holding the bookmark, that is, the named range within that document.

As properties, they are separate. As data in the dialog, they are combined.

VBA syntax requires:
Code:
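' Link the current selection to the bookmark "There" in another document.
' Address is the file; SubAddress is the bookmark inside it.
ActiveDocument.Hyperlinks.Add _
      Anchor:=Selection.Range, _
      Address:="C:\Test\7-8 pages.doc", _
      SubAddress:="There", _
      ScreenTip:="", _
      TextToDisplay:="Yadda yadda"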
Address and SubAddress (the bookmark) must be separate.

In the Edit Hyperlink dialog however, Address is written:

"c:\test\7-8 pages.doc#There"

You can edit directly in that dialog. Say you know the document 7-8 pages.doc has another bookmark - "AnotherThere". You can edit in the dialog to make it:

"c:\test\7-8 pages.doc#AnotherThere"

Pressing OK will write the separate properties to the hyperlink object.
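
The split between the combined dialog text and the two separate properties can be illustrated with a small Python sketch. This only mimics the string handling; it is not Word's own code, and the helper name `split_hyperlink_target` is invented for the example. It assumes the first "#" separates the document from the bookmark, so a file path that itself contains "#" would need special handling.

```python
from typing import Tuple


def split_hyperlink_target(dialog_text: str) -> Tuple[str, str]:
    """Split 'document#bookmark' (dialog form) into (Address, SubAddress)."""
    # partition() splits at the first "#"; if there is none, SubAddress is "".
    address, _, sub_address = dialog_text.partition("#")
    return address, sub_address


if __name__ == "__main__":
    # Bookmark in another document: both properties are filled.
    print(split_hyperlink_target(r"c:\test\7-8 pages.doc#There"))
    # Bookmark in the current document: Address is empty.
    print(split_hyperlink_target("#Here"))
    # Whole document, no bookmark: SubAddress is empty.
    print(split_hyperlink_target(r"c:\test\1-2 pages.doc"))
```

The two returned values correspond to the separate Address and SubAddress arguments that Hyperlinks.Add expects.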

faq219-2884

Gerry
My paintings and sculpture
 
That's going to save me a ton of time. Thank you so much.

I am combining a hundred or so web pages into a single PDF document. The links mainly refer to anchors within the same page; some refer to other web pages. I know which URLs are being incorporated into the document. Deciding which to leave as URL hyperlinks and which to convert to Word hyperlinks is doable.

For the Word hyperlinks, I'll have to be careful with the SubAddress value. I need each bookmark's SubAddress to be unique within the Word document. I'll have to be clever to make that work out, but I believe that is also doable.
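
A rough sketch of that decision in Python (pure string logic, nothing here talks to Word). The function `to_word_link` and the `included` mapping are hypothetical: `included` is assumed to map each page URL merged into the document to a bookmark name placed at the start of that page. Prefixing every anchor with its page's bookmark name is one possible way to keep SubAddress values unique; anchors containing characters that Word does not accept in bookmark names would still need cleaning.

```python
from typing import Dict, Tuple
from urllib.parse import urldefrag, urljoin


def to_word_link(href: str, page_url: str, included: Dict[str, str]) -> Tuple[str, str]:
    """Return (Address, SubAddress) for one HTML link found on page_url."""
    # Resolve same-page anchors ("#top") and relative paths against the page.
    absolute = urljoin(page_url, href)
    target_page, fragment = urldefrag(absolute)
    if target_page not in included:
        # Target is not part of the archive: leave it as an ordinary web link.
        return absolute, ""
    name = included[target_page]
    if fragment:
        # Page prefix keeps identically named anchors on different pages apart.
        name = name + "_" + fragment
    # An empty Address means "the current document".
    return "", name


if __name__ == "__main__":
    pages = {
        "http://example.com/a.html": "a",
        "http://example.com/b.html": "b",
    }
    here = "http://example.com/a.html"
    print(to_word_link("#top", here, pages))
    print(to_word_link("b.html#sec", here, pages))
    print(to_word_link("http://other.org/x", here, pages))
```

A link to another included page with no anchor resolves to that page's own bookmark, which is where a bookmark per imported HTML page becomes useful.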

Another star for you!

OT/BTW
I have a tuxedo with similar proportions to your Leon. He is an all-indoors cat though. 12+ years old now and not too agile anymore. We have the occasional coyote strike, but most neighborhood cats get into trouble with foxes here.

-Pete
Games the old fashion way with Dice, Paper and Pencils!
 
Hello Tek-Tipers!

Sort of closing out the loop here. I changed my strategy somewhat. I realized that Word does a superior job of converting HTML into Word format. Furthermore, I substituted a different CSS file, just for the import of the HTML. In this way I handle 100% of the format-related issues and all the image embeds.
Code:
oRange.InsertFile App.Path & "\qwerty.html", "", False, False, False

Furthermore, rather than use bookmarks, I am relying on Word's styles to structure the document for me. It is working exceedingly well, as Adobe Distiller changes all of those to bookmarks with hierarchy automatically.

I still have to work out converting the hyperlinks/anchors to in-document links. I am saving that for last, and I may drop them or leave them as web links, as I am close to using up the client's budget for this task.

Thanks again fumei aka Gerry.

-Pete
Games the old fashion way with Dice, Paper and Pencils!
 