Extract text from Word document 2

Status
Not open for further replies.

sglab

Technical User
Jul 29, 2003
104
US
Hello everyone,

Does anyone know how to extract text from Word document?
I want to be able to export text to text file while retaining page structure. Saving document as text file creates a file without page breaks. I need to retrieve text from each page and insert it into text file with page break character - chr(12) - between them. I don't know in advance if document will contain tables or any non-text elements. So there has to be some generic approach.
I don't seem to find appropriate object or properties to work with in order to do this. Of course it's mostly because of lack of experience working with Word.
Any help or tips on that would be greatly appreciated.

Thank you.
 
Hi sglab,

It would be possible to write some code to output any characters you want to a text file - and to work with whatever document structure you had - but it could become quite complex. However, there isn't really much point unless the end result is usable, and text files don't work with formatting characters - they are just plain text. What exactly are you trying to achieve? If you want to be able to maintain the formatting but have a text file, take a look at using RTF files - or XML files, depending on your Word version.

Enjoy,
Tony

 
More details would help, but in principle this could be very easy.

Word has predefined bookmarks, one of which is a "\page" bookmark. You could loop through all pages in the Word document, extract them (copy), add a Chr(12), then paste into your text file. Something like:
Code:
Sub ExtractTextPages()
Dim SourceDoc As Document
Dim TextDoc As Document
Dim var

' make source doc an object
  Set SourceDoc = ActiveDocument
  
' make new document and make it an object
   Documents.Add
   Set TextDoc = ActiveDocument

' reactivate source doc
   SourceDoc.Activate

' explicitly go to top of file
Selection.HomeKey unit:=wdStory
For var = 1 To _
   ActiveDocument.Range.Information(wdNumberOfPagesInDocument)
   ActiveDocument.Bookmarks("\page").Range.Select
   
   With Selection
      .Copy
   ' activate other doc and paste
      TextDoc.Activate
      .Paste
   ' go to end and add Chr(12)
      .EndKey unit:=wdStory
      .TypeText Text:=Chr(12)
   ' return to source doc, go to next page
      SourceDoc.Activate
      .GoTo what:=wdGoToPage, Count:=1
   End With
Next
' activate textdoc and save as text file
With TextDoc
   .Activate
   .SaveAs FileName:="whatever.txt", _
      FileFormat:=wdFormatText
End With
Set SourceDoc = Nothing
Set TextDoc = Nothing
End Sub

Gerry
See my Paintings and Sculpture
 
Hi Tony and Gerry,

Thanks for your replies.

1. To Tony - Well, this is what I'm trying to do here. I work for a litigation support bureau, and at the moment we're converting native files provided by a client - DOCs, XLSs, PPTs etc - to TIFF images. After that we need to OCR these TIFF images to provide searching capabilities for the client. Usually it's not a problem, but we have a few Word documents as big as 15550 pages each, and the OCR process would take ages to finish. So I thought, why not try to extract the text from these docs and create, so to speak, OCR .txt files?
I will only need to insert page breaks - chr(12) - in these OCR files per each page in Word documents. Formatting is not important here: just line for line and page for page. Any thoughts?

2.To Gerry - probably your code is what I'm looking for. I only noticed that the destination file is also Word document, but I need it to be ASCII text file. Is there a way first to copy contents of the page, get it from say Clipboard, store in some string variable and then write to text file?

To both of you, guys - Is there a way, in case there's some special formatting in the documents to get rid of it or ignore it? Pardon my terminology.

Thanks guys. I really appreciate your time and help.
 
Gerry,

I tried the code you posted and there's a problem: the code selects, copies and pastes only the first page and doesn't move to the next page. What might be the problem?

Thanks.
 
1. The destination IS a text file.

Ooops, sorry, try this:
Code:
   With Selection
      .Copy
      .Collapse direction:=wdCollapseStart   ' added
   ' activate other doc and paste
      TextDoc.Activate
      .Paste
   ' go to end and add Chr(12)
      .EndKey unit:=wdStory
      .TypeText Text:=Chr(12)
   ' return to source doc, go to next page
      SourceDoc.Activate
      .GoTo What:=wdGoToPage, Which:=wdGoToNext, Count:=1, Name:=""   ' added wdGoToNext
   End With

Sorry, I was typing into the post directly...I forgot the wdGoToNext.....doh!
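
For anyone puzzling over the GoTo arguments, here is a rough sketch of how they fit together: What names the kind of item to jump to, Which says how Count is to be read (absolute or relative), and Count is the number itself. The macro name GoToPageExamples is made up for illustration, and the first line assumes the document has at least three pages. Leaving Which out seems to make Word treat Count as an absolute page number, which would explain why the first version kept landing on page 1; that is inferred from the behaviour reported above, not checked against the documentation.
Code:
Sub GoToPageExamples()
   ' absolute: jump to the start of page 3
   Selection.GoTo What:=wdGoToPage, Which:=wdGoToAbsolute, Count:=3
   ' relative: move forward one page from wherever the Selection is now
   Selection.GoTo What:=wdGoToPage, Which:=wdGoToNext, Count:=1
   ' relative: move back one page
   Selection.GoTo What:=wdGoToPage, Which:=wdGoToPrevious, Count:=1
End Sub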

Gerry
 
Hey Gerry,

1. Yeah, after looking at the code more carefully I noticed that the destination was text file indeed. My bad. Sorry.

2. I tried to play with this issue, but couldn't figure out all these WDs and other arguments for GoTo.
There are a couple more things:
a.) the line ".Paste" only works if I write it like this: "TextDoc.Content.Paste" ...
Is it supposed to work like that?
b.) after execution of the line ".EndKey unit:=wdStory" the insertion point goes right to the very end of the document. At least it did. I haven't tried the corrections to your code yet.

Thanks a lot, Gerry.
 
Actually, I really should have tried RUNNING the code in a document, rather than just WRITING it in my posts.

I have found some odd behaviour. Usually an instruction NOT associated with a WITH can run fine. But for some reason, my instructions to activate the documents seem to clunk up within the With, as in:
Code:
With Selection
   .Copy
   .Collapse direction:=wdCollapseStart
   ' activate other doc and paste
      TextDoc.Activate

However, it definitely runs properly - now that I actually executed the darn thing...
Code:
   ActiveDocument.Bookmarks("\page").Range.Select
   
   Selection.Copy

   ' activate other doc and paste
      TextDoc.Activate
   With Selection
      .Paste
   ' go to end and add Chr(12)
      .EndKey unit:=wdStory
      .TypeText Text:=Chr(12)
   End With
   ' return to source doc, go to next page
   SourceDoc.Activate
   With Selection
   .Collapse direction:=wdCollapseStart
      .GoTo What:=wdGoToPage, Which:=wdGoToNext, Count:=1, Name:=""
   End With

Separating the Activation instructions out.

The .Paste should work, within the With, as above. At least it does when I run it.
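
On the earlier question of getting each page into a string variable and writing it straight to a plain text file: a minimal sketch, not tried on anything like a 15,000-page document, that reads each page through a Range instead of the Selection, so nothing goes via the Clipboard and no document has to be activated. The macro name ExportPagesToTextFile and the output path are made up for illustration. It assumes that Range.Text marks paragraph ends with Chr(13) and table cell ends with Chr(7), and that manual page and section breaks come through as Chr(12); those are stripped so each page ends with exactly one form feed. If With Selection binds to one window's Selection at the moment the With is entered (an assumption, but it would account for the odd behaviour above), working with Ranges sidesteps that as well.
Code:
Sub ExportPagesToTextFile()
   ' hypothetical output path - change to suit
   Const OUT_PATH As String = "C:\Temp\whatever.txt"
   Dim SourceDoc As Document
   Dim PageRange As Range
   Dim PageText As String
   Dim PageCount As Long
   Dim i As Long
   Dim FileNum As Integer

   Set SourceDoc = ActiveDocument
   PageCount = SourceDoc.ComputeStatistics(wdStatisticPages)

   FileNum = FreeFile
   Open OUT_PATH For Output As #FileNum

   For i = 1 To PageCount
      ' collapsed range at the top of page i
      Set PageRange = SourceDoc.GoTo(What:=wdGoToPage, _
         Which:=wdGoToAbsolute, Count:=i)
      ' expand it to the whole page via the predefined "\page" bookmark
      Set PageRange = PageRange.GoTo(What:=wdGoToBookmark, Name:="\page")
      PageText = PageRange.Text
      ' drop manual breaks and table cell markers
      PageText = Replace(PageText, Chr(12), "")
      PageText = Replace(PageText, Chr(7), "")
      ' Word ends paragraphs with a bare CR; make them CR+LF
      PageText = Replace(PageText, vbCr, vbCrLf)
      ' trailing semicolons stop Print from adding its own line break
      Print #FileNum, PageText; Chr(12);
   Next i

   Close #FileNum
   Set PageRange = Nothing
   Set SourceDoc = Nothing
End Sub

Text in headers, footers, text boxes and footnotes is not part of the main story and would not be picked up by this loop. Run from a VB application rather than from inside Word, the Document and Range types and the wd constants would need a reference to the Word object library.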


Gerry
 
Hey Gerry,

Just entered latest corrections into the code and looks like it's going to work. I'll give it a try from within my VB application.

Thanks a lot for all your help. I wish I could give you more than just 1 star for that.

Have a great day.

Sergey.
 
Hi sglab

Have I got this right!? You are printing out a document of 15,550 pages and then scanning each page - and then OCR'ing the TIFFs to get text back again?

Blimey!!!!!!

This is surely crazy! Not only will it take ages - I can't even imagine how long and I have been a scanner operator for 17 years - it will take up a heck of a lot of space - and be full of errors!

I must have misunderstood - you are converting TEXT to a PICTURE then back to TEXT again.

I eagerly await your reply.

If this is the case I am sure we can help.


Kind Regards
Duncan
 
Hi Duncan,

Well, actually you're a little mistaken here. We're not exactly printing Word docs - we are converting them to TIFFs using a TIFF printer driver. So there's no paper there and hence no scanning.
As far as scheme TEXT - PICTURE - TEXT AGAIN, I could say that:
1. Word documents are not exactly text files;
2. We have to convert to TIFF since that's the format client - law firm - wants and it is pretty much industry standard;
3. We have to OCR since that is requirement too - client wants to be able to search the documents and in many litigation support applications OCR text is included as data field in load files.

I agree with you that it'll take up a lot of space, but then again, what can we do?

As far as OCR, usually the quality of OCR for images derived from Word documents is very high unlike the ones from PDFs, for example. So the idea of extracting text from original document rather than OCR-ing image serves only one purpose - speeding up processing and avoiding possible errors during OCR by not doing OCR at all.

And the last thing, what actually can you help me with?

Thank you.

 
Actually, I have to agree. This may be industry standard, but it is dumb.

It seems insane to have a large document, image each page, then OCR it so you can have text searching capabilities. The document has search capabilities.
"So the idea of extracting text from original document rather than OCR-ing image serves only one purpose - speeding up processing and avoiding possible errors during OCR by not doing OCR at all."
What exactly is the "processing"????? Properly executed, you can do pretty much whatever you want with Word.

Sounds like a case for a needs analysis....industry standard or not.

Gerry
 
Hey Gerry,

That was rather harsh.
I agree with Tony: law is an ass and dumb or not, this is how it works.

I agree that document has search capabilities, but what if you need to perform search based on some criteria across hundreds of documents? I think that task could take quite some time to accomplish.
On the other hand, the text of the documents - these could be not only Word docs, but also Excel files, PPTs, MPPs, DWGs, PDFs, RPTs, emails and attachments from PSTs and NSFs, and tens of other formats - is extracted during the conversion process and OCR (you could think of that as the PROCESSING). Along with the metadata, it can be stored as fields in a database system like SQL Server, for example, and searched much faster using the full-text search capabilities of the database.

By the way, Gerry, I modified the code to work from within VB and it did work. For help with my problem I thank you very much.
 
I am terribly sorry. I certainly did not mean to be harsh in any personal way. My apologies if you were offended.

I get the same thing within my own field. It has standards..including some dumb ones. I am not a diplomat (as is obvious I guess), and when I am asked to do a needs analysis, I do real ones. Not that I was asked in this case, and I think I should have kept my thoughts to myself.

I understand that organizations, particularly those with a long history, have great inertia, with processes and standards that can not be changed quickly - or at all.

I did some work for an aircraft technology company (cutting edge stuff, they even did work on the Air Force One fleet), but their documentation processes were archaic...and dumb. Wasteful of resources, bloated documents that no one could read and on and on. Sure things were "documented", but they were documented badly. They did not like that I clearly demonstrated that it was a dumb way of doing things, but hated it more when I demonstrated, with real documents, how they could do it much better. I looked at earlier consultant reports. They delivered reports that only hinted that there may be issues, with suggestions that only tweaking was needed. Ha.

Again, I DO apologize if you, personally, were offended. I had no intention of doing that at all. However, unfortunately, just because something is standard, does not make it good, or smart, or efficient, or anything. Standards, unless continually checked again and again, ALWAYS ossify. Regardless of the industry, or organization, some standards are, in fact, dumb.

Here is one: a company that puts 600 page manuals on their intranet - as one big HTML file. This has become their standard internal documentation process. This is dumb.

My apologies to you personally. Your industry will be able to safely ignore me....that's a joke.

Gerry
 
No, I wasn't offended in any way and I don't disagree with you as far as standards go. The problem is that I'm not in a position to change that. So what have I got left? Just do whatever is needed to get the job done.
And as people in Russia say, 'Those who pay - they order the music'.
 
Thanks for your comments fumei!

Hi sglab

I too - as fumei was kind enough to voice - think this is particularly dumb... but also do not mean to offend you. I do appreciate there are some awkward aspects to achieving your goal here.

(I have to spell it out in full I'm afraid)
But let's cut through the crap and agree one thing - when you OCR, i.e. OPTICAL CHARACTER RECOGNITION, a document you are doing the following: you are asking a program to start walking round each and every individual character, in whichever typeface that character is in, serif or sans-serif, its particular size, weight, leading, tracking, kerning, horizontal spacing, proportional font vs. fixed font, etc. etc. etc. The OCR software even has to judge spaces and tabs - ambiguous in a lot of situations. What is plain obvious to Microsoft Word, or any other WP s/w, and to you with your eyes, is simply not so for OCR s/w. You may not be scanning with hardware - but you are digitising the image to several megabytes - from several K... multiplying it up hundreds of times in size - only to turn it back into text again!? AND you will have lost accuracy at the same time. I appreciate that this is not quite the same scenario when the source material is not within WP s/w. You mention that just one of your documents could be over 15 thousand pages long. Ouch.

SOLUTION

File / Save As... / Text Only (ASCII)


Kind Regards
Duncan
 
Duncan,

Since neither one of us is trying to offend the other - or so we say - let's stop sending apologies back and forth.
Or better yet, let's just close this discussion.
Gerry has already helped me a lot, and I implemented his idea. But you don't seem to have any "know-how's" to offer - I won't even consider the SaveAs option as one - but only "why's". No offense intended...Again.

Kind regards

Sergey.
 
Hi Sergey

O.K. - I'll try to be of some use

How about using something like Perl with a module to parse Word files? Iterate through the document and save the text out... very quickly. You could incorporate a host of features while grabbing hold of the text in each page - like building an index, etc. I use Perl every day for all sorts of things and think it would kick-arse in this situation. Or carry on learning VBA - it MUST be able to do what you need - SURELY!!!

It is not that I disagree with the TIFF thing - if that is the 'law' then you have to adhere to that. It is just that I have a lot of image manipulation experience (17 years) and many years of Perl also. I just can't fathom the need to OCR what was a textual document - i.e. turning it into a pixellated image - then back into text again.

I really hope you find this post a little more helpful. I do want to help you find a (far) better solution - please believe me.


Kind Regards
Duncan
 