Reading MSWord .DOC as binary stream (offsets)

Status
Not open for further replies.

valtersp

Technical User
Aug 29, 2002
15
LV
Hi all!

I have to develop a cross-platform app that reads ONLY the text from an MS Word .DOC file into a text box.

Inspecting the .DOC file, I found that the text begins at offset 600h. So I set the file pointer to this offset and read the file as a binary stream until a run of 3 or more zero-valued bytes is encountered.
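For readers following along, the approach described above might look roughly like this in Python. It is only a sketch of the poster's heuristic, not a real .DOC parser: the function name `read_text_bytes` and its parameters are made up for illustration, and the 600h offset is simply the value observed in this post.

```python
TEXT_OFFSET = 0x600  # offset observed by the poster; NOT guaranteed by the file format


def read_text_bytes(path, offset=TEXT_OFFSET, zero_run=3):
    """Return the bytes from `offset` up to the first run of `zero_run` zero bytes.

    Hypothetical helper that mirrors the heuristic from the post.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Stop at the first run of zero bytes; keep everything if no run is found.
    end = data.find(b"\x00" * zero_run)
    return data if end == -1 else data[:end]
```

The result is raw bytes. Whether they are 8-bit or 16-bit characters still has to be decided before they can be shown in a text box.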

In most cases this works. But there are problems with some MS Word docs: the text DOES NOT BEGIN at 600h but somewhere else.

Maybe someone has met this problem before and found out where these offsets are stored. I've heard that the .DOC file header (or even footer) contains this information, but I haven't yet found my way through the garbage stored there.

Thanks!
 
DOC files are MUCH more complex than that, I'm afraid. They are examples of Structured Storage files, effectively a file image of one or more (hierarchical) COM objects.
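To make that concrete: a Structured Storage (compound) file starts with a fixed 8-byte signature, and its streams live in sectors that are chained together through an allocation table, which is why the text does not sit at one fixed file offset. The sketch below only checks the signature and reads the sector size from the header. The function name `describe_compound_file` is made up for illustration, and locating the actual Word text stream is well beyond this snippet.

```python
import struct

# Signature found at the start of every Structured Storage (compound) file.
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def describe_compound_file(header):
    """Return basic header info, or None if `header` is not a compound file.

    `header` is expected to hold at least the first 32 bytes of the file.
    """
    if len(header) < 32 or header[:8] != CFB_MAGIC:
        return None
    # The sector shift is a little-endian 16-bit value at byte offset 30;
    # the sector size is 2 ** shift (a shift of 9 means 512-byte sectors).
    (sector_shift,) = struct.unpack_from("<H", header, 30)
    return {"sector_size": 1 << sector_shift}
```

Usage would be something like `describe_compound_file(open(path, "rb").read(32))`, with `path` pointing at the .DOC file under test.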
 
I've already implemented another offset (A00h) for files created on PCs that were probably damaged by a virus. The text in these .DOCs is not contiguous, but divided into 2 parts separated by a gap of length 61Ah. I hope the exception list will not grow too long :)
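A fallback over several known offsets could be sketched as below. This is a guess at how such an exception list might be coded, not the poster's actual implementation: `find_text_offset`, the 16-byte sample size, and the acceptance rule (first byte non-zero, no run of three zero bytes) are all assumptions, and only the two offsets come from this thread.

```python
CANDIDATE_OFFSETS = (0x600, 0xA00)  # the two offsets reported in this thread


def find_text_offset(data, candidates=CANDIDATE_OFFSETS, sample=16):
    """Return the first candidate offset that appears to start a text run.

    Hypothetical heuristic: a candidate is accepted when a full sample is
    available there, its first byte is non-zero, and it contains no run of
    three zero bytes. Returns None when no candidate qualifies.
    """
    for offset in candidates:
        chunk = data[offset:offset + sample]
        if len(chunk) == sample and chunk[0] != 0 and b"\x00\x00\x00" not in chunk:
            return offset
    return None
```

A list like this only covers the layouts seen so far, so any file with a different layout will still come back as None (or, worse, match the wrong offset).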

For now I have given the app to my colleagues to test and asked them for feedback if something goes wrong.
 
The layout of Word97 (and loads of other file formats) can be found at:

As strongm says, it's more complex than it looks!
________________________________________________________________
If you want to get the best response to a question, please check out FAQ222-2244 first

'People who live in windowed environments shouldn't cast pointers.'
 
Thanks, johnwm!

It won't be a piece of cake, I know. But anyway, I have to write the app; otherwise my colleagues have to open the .DOC file outside Word, manually locate the text, copy it, and throw away all the Unicode garbage.

Unfortunately, not all software supports Unicode characters :)
 