Extract text from Word document 2

Status
Not open for further replies.

sglab

Technical User
Jul 29, 2003
104
US
Hello everyone,

Does anyone know how to extract text from a Word document?
I want to export the text to a text file while retaining the page structure. Saving the document as a text file creates a file without page breaks. I need to retrieve the text from each page and insert it into a text file with a page break character, chr(12), between pages. I don't know in advance whether a document will contain tables or other non-text elements, so the approach has to be generic.
I can't seem to find the appropriate objects or properties to do this. Of course, that's mostly because of my lack of experience working with Word.
Any help or tips on that would be greatly appreciated.

Thank you.
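
For illustration only, here is a rough sketch of the output format described above, written in Python rather than VB. The first two functions just join per-page text with chr(12) and write the file. The third, word_pages, is a hypothetical helper that assumes Windows, an installed copy of Word, and the pywin32 package; it locates the start of each page through Word's GoTo method and has not been tried against real documents. All function names are made up for this example.

```python
FORM_FEED = chr(12)  # the page break character named in the post


def join_pages(pages):
    """Join per-page text with exactly one form feed between pages."""
    # Drop trailing form feeds so a manual page break inside the
    # document does not produce a doubled separator.
    return FORM_FEED.join(page.rstrip(FORM_FEED) for page in pages)


def write_pages(path, pages):
    """Write the joined pages to a text file without newline translation."""
    with open(path, "w", encoding="utf-8", newline="") as out:
        out.write(join_pages(pages))


def word_pages(doc_path):
    """Hypothetical helper: return the text of each page of a Word document.

    Assumes Windows, Word, and pywin32. Only the main document body is
    read; headers, footers, and text boxes are not included, and table
    cells come through with Word's own cell marker characters.
    """
    import win32com.client  # imported here so the rest works without pywin32

    WD_STATISTIC_PAGES = 2  # Word constant wdStatisticPages
    WD_GOTO_PAGE = 1        # Word constant wdGoToPage
    WD_GOTO_ABSOLUTE = 1    # Word constant wdGoToAbsolute

    word = win32com.client.Dispatch("Word.Application")
    doc = word.Documents.Open(doc_path)
    try:
        count = doc.ComputeStatistics(WD_STATISTIC_PAGES)
        # Character offset at which each page starts, plus the document end.
        starts = [
            doc.GoTo(WD_GOTO_PAGE, WD_GOTO_ABSOLUTE, n).Start
            for n in range(1, count + 1)
        ]
        starts.append(doc.Content.End)
        return [doc.Range(starts[i], starts[i + 1]).Text for i in range(count)]
    finally:
        doc.Close(False)  # close without saving changes
        word.Quit()
```

Under those assumptions, write_pages("out.txt", word_pages(r"C:\docs\sample.doc")) would produce the text file; the path is a placeholder.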
 
Duncan,

Yes indeed, this post was more informative.
I'll try, probably for the last time, to explain why TIFF images along with OCR text are a standard for many of our clients.
I don't know if you'd want to listen, but anyway.
Documents for processing come to us on different media: paper that we scan; removable media such as CDs, DVDs, and HDDs; entire CPUs. The electronic ones range from Office documents to AutoCAD drawings to, say, Crystal Reports to Outlook or Lotus Notes mailboxes. I think it's easy to see that no law firm can have every imaginable application installed on its desktops just so its lawyers can look at the documents. That's why there are companies like mine that convert all of these files to a "common denominator", TIFF images, which an IT support person in a law firm can load into the system so that a lawyer or paralegal can open a legal support application and work with them using the application's image viewer. Besides that, we code the documents or extract metadata. Moreover, the lawyers and paralegals want to be able to search the image collection based on some criteria. That's where OCR and meta- or coded data come in. The OCR text, along with the other data, is stored in a database on the back end, and the database's full-text search capability allows fast searches across thousands of documents that used to be Office documents, AutoCAD drawings, etc. That's basically it, and no matter what we or anybody else think of it, imaging and OCR are just that: parts of the standard process.
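
To make the search side of that concrete, here is a minimal sketch in Python using SQLite's FTS5 full-text extension. It is not the system our clients actually run; the table name, column names, and sample rows are invented for the example, and it assumes the SQLite build behind Python's sqlite3 module has FTS5 enabled.

```python
import sqlite3


def build_index(pages):
    """Create an in-memory full-text index over (doc_id, page, ocr_text) rows."""
    conn = sqlite3.connect(":memory:")
    # doc_id and page are stored but not searched; only body is indexed.
    conn.execute(
        "CREATE VIRTUAL TABLE ocr_pages "
        "USING fts5(doc_id UNINDEXED, page UNINDEXED, body)"
    )
    conn.executemany(
        "INSERT INTO ocr_pages (doc_id, page, body) VALUES (?, ?, ?)", pages
    )
    return conn


def search(conn, query):
    """Return (doc_id, page) pairs whose OCR text matches an FTS5 query."""
    # The query uses FTS5 syntax, so raw user input containing
    # punctuation may need quoting before it is passed in.
    rows = conn.execute(
        "SELECT doc_id, page FROM ocr_pages "
        "WHERE ocr_pages MATCH ? ORDER BY doc_id, page",
        (query,),
    )
    return [(doc_id, int(page)) for doc_id, page in rows]


# Invented sample data standing in for OCR output of converted images.
SAMPLE = [
    ("DOC001", 1, "Lease contract between the parties"),
    ("DOC001", 2, "Signature page"),
    ("DOC002", 1, "Drawing notes referring to contract number 17"),
]
```

Calling search(build_index(SAMPLE), "contract") returns the pages of both sample documents that mention the word, regardless of what application the originals came from.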
As for your Perl suggestion: first of all, I have to say that I have already extracted the text using my VB application, but it never hurts to take a look at something different. So if you want to share some of your ideas, I'd appreciate it.
Please thank your friend for me for the link to the MSDN Library topic.

Kind regards.
Sergey
 
Right you are, Gerry.
I want to thank you one more time for all your help.

Have a great day.
 