Help trying to parse msword document from text format...

Status
Not open for further replies.

bppj

Programmer
Jun 6, 2002
US
I have a situation where I'm trying to turn Word documents and forms into XML. I would like to run regular expressions over the text output of a Word document in order to determine the format of my XML document.

I've done a reasonable enough job with the paragraphs and headers, etc, but I get lost in any embedded tables in the document. Which leads to my question:

Does anyone know of a good resource listing which ANSI or Unicode values MS Word uses for formatting in various scenarios (e.g. which values precede an embedded table column)?

Any other input on how to do this would be great!

Thanks in advance!

B.J.
 
If you are trying to get pure data from the table, cut and paste it into an Excel spreadsheet and save it as a CSV file.

The Data|Text to Columns... dialog is very useful for slicing and dicing text and numbers. A couple of suggestions before you start: insert an index column in the original data so you can sort the rows back into their original order. And if a change is successful, SAVE it quickly before you try something else.

Another Excel command that may be useful is search and replace, for inserting XML code or changing multiple instances of a string.

Good luck,

David
 
David,

Thanks for the input. I appreciate it!

I was hoping to avoid cutting and pasting because I am trying to build a tool that can convert the 100 or so Word docs I have been given in a batch. The tool will also be used for future one-time conversions. I am also trying to avoid relying heavily on the Office API, since one of the development goals is to not need the Word API available for the conversion (not likely, I am finding). I am having quite a bit of success with everything but the tables by looking for patterns in the formatting, i.e. converting:

Title: This document

Purpose: The purpose is something useful.

Into something like:

<Title>This document</Title>
<Purpose>The purpose is something useful.</Purpose>

This XML format can be used for various applications. The conversion looks for patterns in the text, i.e. it looks for a newline, followed by text, followed by the character ":", and uses that text as the element name. Then I take all the text up to the next match of the pattern and use it as the "inner text" of the element.
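
To make the pattern described above concrete, here is a minimal Python sketch. The regular expression, the function name fields_to_xml, and the choice to replace spaces in labels with underscores are all assumptions for illustration, not the actual tool. It also treats any line that starts with "text:" as a new label, so a body line such as "Note 10:30" would be misread.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical pattern: a label at the start of a line, then ":", then text
# running up to the next "label:" line or the end of the input.
FIELD = re.compile(
    r"^([A-Za-z][\w ]*?):[ \t]*(.*?)(?=^[A-Za-z][\w ]*?:|\Z)",
    re.MULTILINE | re.DOTALL,
)

def fields_to_xml(text):
    """Turn 'Label: text' runs into <Label>text</Label> lines."""
    lines = []
    for match in FIELD.finditer(text):
        # XML element names cannot contain spaces, so use underscores.
        name = re.sub(r"\s+", "_", match.group(1).strip())
        # Collapse internal whitespace and newlines in the inner text.
        body = " ".join(match.group(2).split())
        lines.append(f"<{name}>{escape(body)}</{name}>")
    return "\n".join(lines)

sample = "Title: This document\n\nPurpose: The purpose is something useful.\n"
print(fields_to_xml(sample))
```

The lookahead keeps the next label out of the current match, which is what lets the inner text run "to the next match in the pattern".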

If you understand regular expressions, the above explanation probably seems very basic; if not, it probably seems a little rough. Anyway, the point is, I am doing this by reading in the "text" of the document and looking for the patterns. However, when I get to the tables, I get lost because I don't know what "invisible characters" (i.e. which Unicode or ANSI characters) Word uses to separate columns and rows (probably newline), so I don't know what pattern to look for.

That is why I am looking for a resource that can tell me what MS Word uses to identify the different parts of a table, or any formatting, for that matter. These come out as the infamous little "blocks" when you look at a Word document in a text editor before converting it.

If anyone can tell me of an easier way to do this that relies less on the Office programs themselves, I would really appreciate it! Right now, because I don't know the encoding pattern, I can see two choices:

1. Reverse engineer and "figure it out"
2. Use the "table" objects in the Word API. (Not desirable, since I want to avoid the need for having Word installed on the machine running the conversion program.)
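
For the first choice, one way to start reverse engineering is to survey which non-printable characters actually appear in the text output and how often. The Python sketch below is only an illustration: the function names are made up, and the sample string uses U+0007 purely as a stand-in delimiter. Whether Word's text output really uses that value is exactly what a survey of a real document would have to confirm.

```python
from collections import Counter
import unicodedata

def survey_control_chars(text):
    """Count every control (Cc) or format (Cf) character in the text."""
    return Counter(
        ch for ch in text if unicodedata.category(ch) in ("Cc", "Cf")
    )

def describe(counts):
    """Render the counts as 'U+XXXX xN' strings, ordered by code point."""
    return [f"U+{ord(ch):04X} x{n}" for ch, n in sorted(counts.items())]

# Hypothetical sample: two table rows with a placeholder cell delimiter.
sample = "Name\x07Qty\x07\r\nBolt\x074\x07\r\n"
print(describe(survey_control_chars(sample)))
```

Running this over the text of a document with a known table (say, two columns by three rows) and comparing the counts against the number of cells and rows should show which characters act as the column and row separators.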

Anyway, I hope this verbose explanation of what I am trying to do inspires someone! Thank you so very much in advance!

B.J.
 