
How can I parse Word Documents and convert to XML

Status
Not open for further replies.

WestSide2003

Technical User
Jan 22, 2003
US
Hi,

I want to parse MS Word documents (.doc, .rtf, etc.)

and basically convert them to XML.

Is there a Word parser or something that will extract information from a Word doc?

My main goal is to convert a .doc file into XML format...

Can anyone help here?

Thanks....

 
Ok...

That will work for an individual, but we are wanting to have a process on our website/server that converts files from .doc, .rtf, .html, etc. to XML automatically.

I am sure that in Office 2003 you can open a doc and just hit convert, but we are looking for a solution that runs on the server: when a user uploads a document in .doc format, for example, it is converted to a structured format (XML) automatically. We have thousands of documents.

Any thoughts?

Thanks..

-WestSide
 
Unfortunately the .doc format is a proprietary binary format, so it would take a fair amount of code to extract its structure into XML. You would first need to know how the file is structured (I'm sure M$ won't be putting that into the public domain anytime soon).
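
The binary nature of .doc is easy to see from the first few bytes: a .doc file is an OLE2 compound file that starts with a fixed 8-byte signature, while RTF is plain text that starts with `{\rtf`. A minimal Python sketch for telling uploads apart before choosing a converter follows. The function names and the 64-byte read size are my own assumptions, and the HTML check is only a rough heuristic.

```python
# Sketch: guess an uploaded document's format from its leading bytes.
# The OLE2 signature is shared by .doc, .xls, .ppt and other compound files,
# so "ole2" means "possibly a Word binary document", not "definitely Word".
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
RTF_MAGIC = b"{\\rtf"


def sniff_format(head: bytes) -> str:
    """Return 'ole2', 'rtf', 'html', or 'unknown' for the first bytes of a file."""
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(RTF_MAGIC):
        return "rtf"
    text = head.lstrip().lower()
    if text.startswith(b"<!doctype html") or text.startswith(b"<html"):
        return "html"  # rough heuristic only
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the start of a file on disk and classify it."""
    with open(path, "rb") as f:
        return sniff_format(f.read(64))
```

This only identifies the container. Getting the document structure out of an OLE2 file still needs Word itself or a library that understands the format.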

There are products on the market that already do this though:


Even a freebie:
 
>> we are wanting to have a process on our website/server

Kind of an important item to leave out of that first post. Now we don't know what your server platform/environment is, because you didn't tell us. [sadeyes] If you want good answers, you have to ask good questions.

I will tackle two environments:
(A) Windows/IIS - With MS Word installed on the server you can use the COM interfaces to the MS Word Object Model to obtain information from documents.

(B) Windows or Unix/Java Web Server - Apache POI is a pure-Java library that can read/write MS Word documents and other MS Office files. I don't know which formats it covers other than Excel.

In both cases I don't know if they will provide a complete solution to your requirements.
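
For option (A), here is a rough sketch of what driving Word over COM could look like. It is written in Python with the third-party pywin32 package (my assumption, nothing in this thread says Python is available on the server), and it only runs on Windows with Word installed. `Documents.Open`, `Paragraphs`, `Range.Text` and `Style` come from the Word object model. The helper names are hypothetical.

```python
def clean_paragraph_text(raw: str) -> str:
    """Strip the trailing paragraph mark (\\r) and table cell mark (\\x07) Word appends."""
    return raw.rstrip("\r\x07")


def read_paragraphs(doc_path: str) -> list[tuple[str, str]]:
    """Return (style name, text) for each non-empty paragraph of a Word document.

    doc_path should be an absolute path, since Word resolves relative
    paths against its own working directory.
    """
    # Imported here so the module still loads on machines without pywin32.
    import win32com.client  # assumption: pywin32 is installed

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Positional arguments: FileName, ConfirmConversions=False, ReadOnly=True
        doc = word.Documents.Open(doc_path, False, True)
        try:
            result = []
            for para in doc.Paragraphs:
                text = clean_paragraph_text(para.Range.Text)
                if text:
                    result.append((para.Style.NameLocal, text))
            return result
        finally:
            doc.Close(False)  # False: do not save changes
    finally:
        word.Quit()
```

Each call starts and quits its own Word instance, which is slow but keeps a crashed conversion from leaving a stray WINWORD.EXE process behind. Reusing one instance for a whole batch would be faster.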

-pete
 
hi,

Based on styles, Word documents (say .rtf or .doc) can be converted into XML files using VBA programming.
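
To illustrate the style-based idea, here is a minimal sketch in Python rather than VBA. It assumes you already have each paragraph as a (style name, text) pair and maps every style to an element name. The `document` root element and the tag-naming rule are my own assumptions, not a standard schema.

```python
import re
import xml.etree.ElementTree as ET


def style_to_tag(style: str) -> str:
    """Turn a style name such as 'Heading 1' into a legal XML tag like 'heading_1'."""
    tag = re.sub(r"[^A-Za-z0-9_]+", "_", style.strip()).strip("_").lower()
    if not tag or not tag[0].isalpha():
        tag = "p_" + tag  # XML names cannot be empty or start with a digit
    return tag


def paragraphs_to_xml(paragraphs) -> str:
    """Build a flat XML document with one element per paragraph, named after its style."""
    root = ET.Element("document")
    for style, text in paragraphs:
        element = ET.SubElement(root, style_to_tag(style))
        element.text = text  # ElementTree escapes &, < and > on output
    return ET.tostring(root, encoding="unicode")
```

For example, `paragraphs_to_xml([("Heading 1", "Intro"), ("Normal", "Body text")])` gives a `<document>` containing a `<heading_1>` and a `<normal>` element. This only works well if the documents use styles consistently. Text formatted by hand as bold 14pt will just come out as `normal`.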
 
Hi,

Yes, we are using IIS on Windows 2000 Server.

Basically we want to convert a Word document to XML.

I have Word installed on the server as of now.

I am unclear on the path to take to convert our Word docs to XML format.

We have 1000-plus docs and are looking for a way to automate this, not just for those 1000 docs but for all future Word documents uploaded to our server.
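
A sketch of how the automated part might look: one routine that sweeps the upload folder and converts whatever has no up-to-date XML yet, so the same code covers the existing backlog and anything uploaded later. It is written in Python as an assumption. The folder layout, the function name, and the `convert` callback (whatever actually turns one file into an XML string) are all hypothetical.

```python
from pathlib import Path
from typing import Callable

WORD_SUFFIXES = {".doc", ".rtf"}


def convert_pending(upload_dir: str, xml_dir: str,
                    convert: Callable[[Path], str]) -> list[Path]:
    """Convert every Word file in upload_dir that has no up-to-date XML in xml_dir."""
    out_root = Path(xml_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(upload_dir).iterdir()):
        if not src.is_file() or src.suffix.lower() not in WORD_SUFFIXES:
            continue
        # Note: a.doc and a.rtf would both map to a.xml in this simple scheme.
        dest = out_root / (src.stem + ".xml")
        if dest.exists() and dest.stat().st_mtime >= src.stat().st_mtime:
            continue  # already converted and not modified since
        dest.write_text(convert(src), encoding="utf-8")
        written.append(dest)
    return written
```

Run once, it works through the backlog. Run from a scheduled task or right after each upload, it picks up only new or changed files.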

HTH

-WestSide
 