Reading MSWord .DOC as binary stream (offsets)

valtersp · Feb 24, 2003

Hi all!

I have had to develop a cross-platform app that reads ONLY the text from an MS Word .DOC file into a text box.

Inspecting the .DOC file, I found that the text begins at offset 600h. So I set the file pointer to this offset and read the file as a binary stream until 3 or more consecutive zero-valued bytes are encountered.
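
A minimal sketch of the approach described above, written in Python (the thread names no language, so that choice is an assumption). The function name `read_text_run` and its parameters are hypothetical; the 600h start offset and the three-zero-byte terminator come from the post. It only illustrates the heuristic and is not a reliable .DOC parser.

```python
def read_text_run(data: bytes, start: int = 0x600, zero_run: int = 3) -> bytes:
    """Return the bytes from `start` up to the first run of `zero_run` zero bytes.

    Heuristic only: assumes the text sits at a fixed offset, as in the post.
    """
    end = data.find(b"\x00" * zero_run, start)
    if end == -1:
        # No terminator found: take everything to the end of the file.
        end = len(data)
    return data[start:end]


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as f:
        raw = read_text_run(f.read())
    # Assumption: 8-bit text; undecodable bytes are replaced rather than raising.
    print(raw.decode("cp1252", errors="replace"))
```

If the document stores its text as 16-bit characters, every second byte of plain ASCII text is zero, which is presumably why the terminator has to be a run of zeros rather than a single one.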

In most cases this is OK. But there are problems with some MS Word docs: the text DOES NOT BEGIN at 600h but somewhere else.

Maybe someone has met such a problem and tracked down these offsets. I've heard that the .DOC file header (or even footer) contains such information, but I haven't yet found my way through the garbage put there.

Thanks!

strongm · Feb 24, 2003

DOC files are MUCH more complex than that, I'm afraid. They are examples of Structured Storage files, effectively a file image of one or more (hierarchical) COM objects.
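
As a hedged illustration of what "Structured Storage" means at the byte level: such files start with a fixed 8-byte signature, and the 512-byte header records the sector size and where the directory of streams begins. The sketch below only reads those header fields. The function name `describe_compound_header` is hypothetical, and the field offsets are taken from the published compound file header layout, not from anything in this thread.

```python
import struct

# Signature found at the start of every compound (Structured Storage) file.
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def describe_compound_header(data: bytes):
    """Return basic header fields, or None if `data` is not a compound file."""
    if len(data) < 512 or data[:8] != CFB_MAGIC:
        return None
    # Sector size is stored as a power of two (9 means 512-byte sectors).
    (sector_shift,) = struct.unpack_from("<H", data, 0x1E)
    # Sector number where the directory (the list of streams) starts.
    (first_dir_sector,) = struct.unpack_from("<I", data, 0x30)
    return {
        "sector_size": 1 << sector_shift,
        "first_dir_sector": first_dir_sector,
    }
```

Because the streams inside the file are chains of sectors that need not be stored back to back, text at a fixed offset is a coincidence of how a particular file was laid out, which would explain why one hard-coded offset works for some documents and not others.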

valtersp · Feb 24, 2003

I've already implemented another offset (A00h) for files created on PCs that were probably infected by a virus. The text in these .DOCs is not contiguous, but divided into 2 parts separated by a gap of 61Ah bytes. I hope the exception list will not get too long.
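
A rough sketch of the two-part case described above, again in Python. The names `read_run` and `read_two_part_text` are hypothetical. It assumes the 61Ah gap is measured from the end of the first text run to the start of the second; the post does not say exactly how the gap was measured.

```python
def read_run(data: bytes, start: int, zero_run: int = 3):
    """Return (text_bytes, end_offset) for the run beginning at `start`."""
    end = data.find(b"\x00" * zero_run, start)
    if end == -1:
        end = len(data)
    return data[start:end], end


def read_two_part_text(data: bytes, start: int = 0xA00, gap: int = 0x61A) -> bytes:
    """Read one text run at `start`, skip `gap` bytes, then read a second run.

    Assumption: the gap starts where the first run ends.
    """
    first, first_end = read_run(data, start)
    second, _ = read_run(data, first_end + gap)
    return first + second
```

Each new layout found in the field would need another offset and gap pair, so the exception list grows with every file that does not match.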

For now I have given the app to my colleagues to test and asked them to give feedback if something goes wrong.

johnwm · Feb 24, 2003

The layout of Word97 (and loads of other file formats) can be found at:

http://www.wotsit.org/

As strongm says, it's more complex than it looks!
________________________________________________________________
If you want to get the best response to a question, please check out FAQ222-2244 first

'People who live in windowed environments shouldn't cast pointers.'

valtersp · Feb 25, 2003

Thanks, johnwm!

It won't be a piece of cake, I know. But anyway, I have to write the app; otherwise my colleagues have to open the .DOC file outside Word, manually locate the text, copy it, and throw away all the Unicode garbage.

Unfortunately, not all software supports Unicode characters.
