Word VBA Question

Status
Not open for further replies.

FredAt

Programmer
Feb 9, 2007
8
LU
Hello All,
Sorry if this looks like a cross post - it looks like I originally posted in the wrong forum.

I am a very experienced programmer but haven't used VB in many years now and have never really done much Word automation at all. For a current project I need to extract detailed information from Word documents - page format, headers, footers, text content, tables, images... - and store it in a custom file for onward processing. I CAN do this by exporting the Word document to RTF and parsing the RTF. However, I suspect there is an easier route via Word and VBA.

I would be most grateful if someone here could confirm this and point me to some useful code samples.

Thanks in advance.
 



Hi,

Check out the MS Word Object Model. The Collections may have many of the properties that you need.

Skip,

Just traded in my old subtlety...
for a NUANCE!
 
To give specific suggestions we need specific questions. Your question is far too general.

I do not know what you mean by "onward processing". What do you mean by "custom file"? A Word file? A plain-text file?

Skip is quite correct. The objects and collections within Word will cover everything you need...whatever that is.

Most likely, yes, whatever it is you wish to do can be done with VBA. But again, we need specifics.

Gerry
 
Thank you. I have been studying the Word VBA help file. It all seems to be pretty easy most of the way. There is however, one thing I haven't yet been able to resolve.

To put things into context first - when I say custom file it could in fact either be my own binary format or I could simply settle for creating a text file in Word itself.

In the custom file I need to store detailed information that will allow me to study the structure of the Word document later in another application. The things I need to study are:

a. The header & footer - layout, images, fields...
b. Document layout - page format, page orientation, columns
b. All textual content - paragraphs, justification...
c. Character styles (i.e. bold etc) in paragraph text
d. Tables - table layout, placement and cell content
e. Image - Image placement and retrieval of the original image

I can see that the various objects in the Word COM interface - Documents, Paragraphs, Words, Characters, Sentences, Tables etc. - will allow me to do a lot of this. What remains unclear:

a. I can get tables and/or pictures. However, it is not clear to me that I can find the placement of those objects relative to the text content.
b. Likewise it is not immediately obvious that I can grab the original images in a picture and store it, say as a hexadecimal string as is the case when the document is exported to RTF.

I hope that makes my question clearer. Like I said - I can do all of this via an RTF export from Word. I am just wondering if the COM interfaces don't make for an easier way.
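
On point a, a minimal sketch of one way to locate tables and pictures relative to the text, assuming the objects of interest live in the main text story of the active document. Each Table and InlineShape exposes a Range whose Start and End are character offsets into the story, and a floating Shape exposes an Anchor range. The Sub name is hypothetical.

```vba
Sub ListObjectPositions()
    ' Sketch only: prints where each table, inline picture and floating
    ' shape sits in the main text story of the active document.
    Dim doc As Document
    Dim tbl As Table
    Dim ils As InlineShape
    Dim shp As Shape

    Set doc = ActiveDocument

    ' Range.Start / Range.End are character offsets, so they can be
    ' compared directly with Paragraph.Range.Start values.
    For Each tbl In doc.Tables
        Debug.Print "Table", tbl.Range.Start, tbl.Range.End, _
            "page " & tbl.Range.Information(wdActiveEndPageNumber)
    Next tbl

    For Each ils In doc.InlineShapes
        Debug.Print "InlineShape", ils.Range.Start, ils.Range.End, _
            "page " & ils.Range.Information(wdActiveEndPageNumber)
    Next ils

    ' Floating shapes are tied to the text through their anchor;
    ' Left and Top are positions in points (or special wdShape... values).
    For Each shp In doc.Shapes
        Debug.Print "Shape", shp.Anchor.Start, shp.Left, shp.Top
    Next shp
End Sub
```

Sorting these offsets together with the paragraph offsets would give the reading order of text, tables and pictures.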
 
Whoa! That is asking quite a lot, and I am seriously wondering what/how you are going to do that. For example:

"In the custom file I need to store detailed information that will allow me to study the structure of the Word document later in another application."

"or I could simply settle for creating a text file in Word itself."

OK, so let's suppose a text file. WHAT would be the text structure of the information about - say - "The header & footer - layout, images, fields... "?

First of all, in EVERY Section of a Word document there are 6 header/footer objects (three headers and three footers: primary, first page, and even pages). So, presumably you would have to have six...ummmm...chunks of text (in your text file about the document), one for each header/footer, in every Section.
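
For reference, a rough sketch of how those header/footer objects could be walked, assuming the active document is the one being examined. The Sub name is hypothetical, and the content is only touched when Exists is True.

```vba
Sub ListHeadersAndFooters()
    ' Sketch only: walks every Section and reports on its three
    ' headers and three footers (primary, first page, even pages).
    Dim sec As Section
    Dim hf As HeaderFooter

    For Each sec In ActiveDocument.Sections
        For Each hf In sec.Headers
            ReportHeaderFooter sec.Index, "Header", hf
        Next hf
        For Each hf In sec.Footers
            ReportHeaderFooter sec.Index, "Footer", hf
        Next hf
    Next sec
End Sub

Private Sub ReportHeaderFooter(ByVal secIndex As Long, _
                               ByVal kind As String, _
                               ByVal hf As HeaderFooter)
    ' hf.Index is a WdHeaderFooterIndex value:
    ' 1 = primary, 2 = first page, 3 = even pages.
    If hf.Exists Then
        Debug.Print "Section " & secIndex, kind & " " & hf.Index, _
            "fields=" & hf.Range.Fields.Count, _
            "inline pictures=" & hf.Range.InlineShapes.Count
    Else
        Debug.Print "Section " & secIndex, kind & " " & hf.Index, "not in use"
    End If
End Sub
```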

Structured how? This could be a huge amount of information, and I am trying to picture how it would look. Nevertheless, it could (I suppose) be done, but I am wondering...WHY??????
a. The header & footer - layout, images, fields...
b. Document layout - page format, page orientation, columns
b. All textual content - paragraphs, justification...
c. Character styles (i.e. bold etc) in paragraph text
d. Tables - table layout, placement and cell content
e. Image - Image placement and retrieval of the original image

a, b, the second b, c, and d are fairly doable (although definitely non-trivial and huge), but e will be a tough one.

"Likewise it is not immediately obvious that I can grab the original images in a picture and store it,"

No. Depending on what the image is (linked or embedded), you may not be able to do this.
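
As a small illustration of the linked/embedded distinction, the following sketch only classifies inline pictures. It does not extract any image data, and the Sub name is hypothetical.

```vba
Sub ListPictureKinds()
    ' Sketch only: reports whether each inline picture in the active
    ' document is embedded or linked. No image bytes are retrieved.
    Dim ils As InlineShape

    For Each ils In ActiveDocument.InlineShapes
        Select Case ils.Type
            Case wdInlineShapePicture
                Debug.Print ils.Range.Start, "embedded picture"
            Case wdInlineShapeLinkedPicture
                ' For a linked picture the source file path is available.
                Debug.Print ils.Range.Start, "linked picture", _
                    ils.LinkFormat.SourceFullName
            Case Else
                Debug.Print ils.Range.Start, "other inline shape, type " & ils.Type
        End Select
    Next ils
End Sub
```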

Gerry
 
Just to reiterate the sheer volume of information...
Code:
   With Selection.ParagraphFormat
      .LeftIndent = InchesToPoints(0.2)
      .RightIndent = InchesToPoints(0)
      .SpaceBefore = 0
      .SpaceBeforeAuto = False
      .SpaceAfter = 0
      .SpaceAfterAuto = False
      .LineSpacingRule = wdLineSpaceSingle
      .Alignment = wdAlignParagraphLeft
      .WidowControl = True
      .KeepWithNext = True
      .KeepTogether = False
      .PageBreakBefore = False
      .NoLineNumber = False
      .Hyphenation = True
      .FirstLineIndent = InchesToPoints(0)
      .OutlineLevel = wdOutlineLevelBodyText
      .CharacterUnitLeftIndent = 0
      .CharacterUnitRightIndent = 0
      .CharacterUnitFirstLineIndent = 0
      .LineUnitBefore = 0
      .LineUnitAfter = 0
      With .TabStops
         .ClearAll
         .Add Position:=InchesToPoints(3), _
            Alignment:=wdAlignTabLeft, _
            Leader:=wdTabLeaderSpaces
      End With
      With .Borders(wdBorderLeft)
         .LineStyle = wdLineStyleSingle
         .LineWidth = wdLineWidth050pt
         .Color = wdColorAutomatic
      End With
      With .Borders(wdBorderRight)
         .LineStyle = wdLineStyleSingle
         .LineWidth = wdLineWidth050pt
         .Color = wdColorAutomatic
      End With
      With .Borders(wdBorderTop)
         .LineStyle = wdLineStyleSingle
         .LineWidth = wdLineWidth050pt
         .Color = wdColorAutomatic
      End With
      With .Borders(wdBorderBottom)
         .LineStyle = wdLineStyleSingle
         .LineWidth = wdLineWidth050pt
         .Color = wdColorAutomatic
      End With
      With .Borders
         .DistanceFromTop = 1
         .DistanceFromLeft = 4
         .DistanceFromBottom = 1
         .DistanceFromRight = 4
         .Shadow = False
      End With
   End With
That is most (but not all) of the properties of a single paragraph. Now if you are going to - say - write that to a text file in an INI-style format, it could come out like:

[Paragraph_1]
' other stuff!!!!!
SpaceBefore = 0
SpaceBeforeAuto = False
SpaceAfter = 0
SpaceAfterAuto = False
LineSpacingRule = wdLineSpaceSingle
Alignment = wdAlignParagraphLeft
WidowControl = True

' LOTS of other stuff


[Paragraph_2]
' other stuff!!!!!
SpaceBefore = 0
SpaceBeforeAuto = False
SpaceAfter = 0
SpaceAfterAuto = False
LineSpacingRule = wdLineSpaceSingle
Alignment = wdAlignParagraphLeft
WidowControl = True

' LOTS of other stuff!!!!

etc. etc. for EACH paragraph (which could be in the thousands).
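
A minimal sketch of how such a file might be produced, assuming an INI-like layout as above and covering only a handful of properties. The output path and the Sub name are hypothetical. Note that enumerated properties such as Alignment are written as their numeric values, and Boolean-like ones as 0 or -1, not as names.

```vba
Sub DumpParagraphFormats()
    ' Sketch only: writes a few ParagraphFormat properties of every
    ' paragraph in the active document to an INI-style text file.
    Const OUT_FILE As String = "C:\Temp\paragraphs.ini"  ' hypothetical path
    Dim fileNum As Integer
    Dim i As Long
    Dim para As Paragraph

    fileNum = FreeFile
    Open OUT_FILE For Output As #fileNum

    i = 0
    For Each para In ActiveDocument.Paragraphs
        i = i + 1
        Print #fileNum, "[Paragraph_" & i & "]"
        With para.Format
            Print #fileNum, "SpaceBefore = " & .SpaceBefore
            Print #fileNum, "SpaceBeforeAuto = " & .SpaceBeforeAuto
            Print #fileNum, "SpaceAfter = " & .SpaceAfter
            Print #fileNum, "SpaceAfterAuto = " & .SpaceAfterAuto
            Print #fileNum, "LineSpacingRule = " & .LineSpacingRule
            Print #fileNum, "Alignment = " & .Alignment
            Print #fileNum, "WidowControl = " & .WidowControl
        End With
        Print #fileNum, ""   ' blank line between sections
    Next para

    Close #fileNum
End Sub
```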

Further, if this is a text file used by another application, you would not be able to use Word constant names (I assume). Maybe, but possibly not. If not, then that other app would have no idea what (for example) wdLineSpaceSingle means. In which case you would have to change:

LineSpacingRule = wdLineSpaceSingle

to

LineSpacingRule = 0

to use the actual numeric value. THAT would mean you would have to get those actual numeric values. That alone will not be trivial, as while I have a copy of all Word constants for version 2002 (it is two columns...30 pages long), I do not know if such a document exists for later versions. Microsoft got steadily more reticent about information. Yes, you CAN get the constant values using Object Browser, as the numeric value is displayed there. However, getting the actual numeric values via that route would drive anyone completely mad.
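
One possible shortcut, offered as a sketch: inside Word's own VBA environment the constants are ordinary numbers, so printing a constant, or a property that returns one, shows the numeric value directly. The constants below are just a few of those used in the code earlier in this post, and the Sub name is hypothetical.

```vba
Sub PrintSomeConstants()
    ' Sketch only: each line prints the name followed by the numeric
    ' value that Word uses for that constant.
    Debug.Print "wdLineSpaceSingle = " & wdLineSpaceSingle
    Debug.Print "wdAlignParagraphLeft = " & wdAlignParagraphLeft
    Debug.Print "wdOutlineLevelBodyText = " & wdOutlineLevelBodyText
    Debug.Print "wdLineStyleSingle = " & wdLineStyleSingle

    ' Reading a property gives the number as well, not the name.
    Debug.Print "Current LineSpacingRule = " & Selection.ParagraphFormat.LineSpacingRule
End Sub
```

This only helps for constants you already know by name; it does not produce a complete list.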

I don't know. This seems a gigantic project, and I really wonder WHY you need to do this.

Gerry
 
Thanks guys. I think you have convinced me that I am better off playing with the RTF as I am well used to doing. I am not at all persuaded that if I get started on the VBA approach I will get all the information I require.

As to HUGE - have you ever seen the RTF that Word puts out? It is positively GIGANTIC. The only reason I can handle it is that I have been there before, understand the format thoroughly (it is actually a very nice format) and have plenty of code for parsing it already in place.
 
"have you ever seen the RTF that Word puts out?"

Yep! And I was utterly astounded by the possible and impossible positions where Word feels like adding a line break.
Line breaks WITHIN field codes, WITHIN bookmark codes etc., plus absolutely nonsensical character formatting - again WITHIN field codes.

Word RTF is one of the ugliest pieces of [insert strong word] that I have seen...


"We had to turn off that service to comply with the CDA Bill."
- The Bastard Operator From Hell
 
Although not of great significance yet, you should be aware that RTF will not be enhanced to include new features starting from Word 2010.

Enjoy,
Tony

 
Having looked into this in more detail I found that I can actually get Word 2003+ to give me everything I want - just get it to save as a WordML file. Like RTF it has the benefit of having no hidden secrets but is at the same time nice and clean.
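
A minimal sketch of that save step in VBA, assuming Word 2003 or later, where the wdFormatXML file format is available. The output path and the Sub name are hypothetical.

```vba
Sub SaveCopyAsWordML()
    ' Sketch only: saves the active document in Word 2003 XML (WordML).
    ' Note that after SaveAs the open document is the new .xml file.
    Const OUT_FILE As String = "C:\Temp\export.xml"  ' hypothetical path

    ActiveDocument.SaveAs FileName:=OUT_FILE, FileFormat:=wdFormatXML
End Sub
```

The resulting file is plain XML, so it can be read with any XML parser in the downstream application.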

RTF in its original form was a nice idea. Then it got ugly as they made changes and every successive version of Word attempted to output RTF that would make sense to every prior version.

RTF was my first thought since I have used it a lot before. However, it is pretty much dead now.
 