Extracting text from PDFs.

Status
Not open for further replies.

mmerlinn

Programmer
May 20, 2005
748
US
I have searched the forum list and have not found the correct forum for the problem I am having. Are there any forums for PDF? Or Excel? I can't find any in the forum list.

I am getting about 5 PDF files per month containing tables with about 3500 rows that I need to port to Excel, but I have not found a way to do it in one step, even though I have spent hours searching with Google. The best I have found is to convert PDF to XLS, scrape the XLS files with FoxPro into CSV files, then upload the CSV files into Excel.

I would like to find something free that either converts PDF files directly to something Excel can use, or converts PDF files to CSV files which I can then upload into Excel.

The other alternative is that I scrape the PDF files with a FoxPro program directly, but so far I have not found anything that gives me any indication of how I can extract the needed text from the PDF files.

So, my questions are:

1) Is there a PDF or Excel forum where this would be more appropriate?
2) If not, does anyone know of a free PDF to CSV converter that works with XP or Mac OS 9?
3) Or does anyone know how to scrape the text from PDF files using FP?

When I scrape files I am limited to FP2.6 on a Mac, so using Win plugins won't work.

mmerlinn


Poor people do not hire employees. If you soak the rich, who are you going to work for?

"We've found by experience that people who are careless and sloppy writers are usually also careless and sloppy at thinking and coding. Answering questions for careless and sloppy thinkers is not rewarding." - Eric Raymond
 
I have had a situation where we used VFP to automatically create PDFs to send to the customers and then, as a secondary customer send, had to 'convert' each PDF to text to send as a text 'image' to the customer's web-consolidator, because they wanted the text 'document' laid out just like the PDF.

I used a tool called:
PTConverter
My client's VFP application had to launch the conversion via command-line strings, but the tool did the job reasonably well.

There were a few after-conversion fixes that I needed to make because it missed a few spaces between words, etc.
But since the PDF documents were largely standardized, the conversion 'misses' recurred in the same way every time, so they were 'findable' and changeable with STRTRAN().
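
As a rough sketch of that kind of fix-up pass (shown in Python rather than FoxPro, purely to illustrate the idea): keep a table of the known 'misses' and apply each replacement in turn, much like a chain of STRTRAN() calls. The entries in FIXUPS below are made up for illustration; real ones would come from inspecting the converter's actual output.

```python
# Hypothetical fix-up table: (text as the converter emitted it, corrected text).
# These pairs are invented examples, not real converter output.
FIXUPS = [
    ("InvoiceNumber", "Invoice Number"),
    ("ShipTo:", "Ship To:"),
]


def apply_fixups(text, fixups=FIXUPS):
    """Apply every known correction in order, like a chain of STRTRAN() calls."""
    for wrong, right in fixups:
        text = text.replace(wrong, right)
    return text
```

This only works because the documents are standardized; a one-off layout would need its own entries in the table.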

Good Luck,
JRB-Bldr
 
I don't envy you this requirement with FP/Mac as the driving force. I spent a lot of time on the pointy end of FP/M but not since 1997.

The best information about working with PDF files will probably come from the Adobe forums (at adobe.com), but don't expect anyone there to have any information about working with a version of Foxpro that is well-outdated. (Heck, I'd be surprised if they could help with the *current* outdated version of VFP. <g>)

Check out the Adobe forums. See what they have to say.
 
Oops, I guess I overlooked the part that says "I am limited to FP2.6 on a Mac".

If that is the case, then recommending a Win application to use won't be much use to you.

Good Luck,
JRB-Bldr
 
If I don't need to scrape the files, a Win App will work for me, as I have a Win machine available. I just don't have any way to scrape files and build CSV files on a Win machine. Once everything is in the Mac Excel file, I can do everything I need to do without ever needing to use FP.

Scraping files with low level functions is one of the easiest tasks I have ever done in FP, so I don't really need to find anyone that knows anything about FP.

I either need a direct converter (which bypasses FP totally), or I need to find out how the text is formatted in the PDF files so I can scrape the files and avoid the extra step I am currently doing. If I understand how the PDF files are organized, I can modify one of my low level FP programs and spit out CSV files to be uploaded to Excel. In fact, I would probably modify the XLS to CSV converter that I have already written since I would no longer need it.
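
On how the text is organized: in a PDF, page content lives in stream objects between the keywords stream and endstream, and those streams are usually compressed (most often with the Flate/zlib filter). Inside a decompressed content stream, text is drawn by operators such as Tj, which takes a string in parentheses. The sketch below, in Python for illustration only, shows that naive approach. It is an assumption-laden simplification: it ignores TJ arrays, hex strings, escape sequences, font encodings, other filters, and text positioning, so it will not recover table columns from every PDF. The function name is hypothetical.

```python
import re
import zlib

# Raw bytes between a "stream" keyword and the next "endstream" keyword.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A literal string operand followed by the Tj "show text" operator, e.g. (Hello) Tj
TJ_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def extract_text_chunks(pdf_bytes):
    """Return the literal strings shown with Tj, in file order (naive sketch)."""
    chunks = []
    for match in STREAM_RE.finditer(pdf_bytes):
        data = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes left before "endstream"
            data = zlib.decompressobj().decompress(data)
        except zlib.error:
            pass  # not Flate-compressed; fall back to scanning the raw bytes
        for literal in TJ_RE.findall(data):
            chunks.append(literal.decode("latin-1"))
    return chunks
```

The decompression step is the part that low-level file functions alone cannot do, which is why a converter that emits plain text first is often the easier route.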

Both of you have given me some ideas. As soon as I can, I will follow up. Maybe one of those ideas will work for me.

mmerlinn

 
"I just don't have any way to scrape files and build CSV files on a Win machine."

Well, the PTConvert application that I mentioned above will do that, or at least create a TXT file which you can then use FP to convert into CSV.
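
A minimal sketch of that TXT-to-CSV step, in Python for illustration only. It assumes the converter keeps the page layout, so that table columns end up separated by runs of two or more spaces while single spaces occur only inside a cell. That assumption would need checking against the real output, and the function name is hypothetical.

```python
import csv
import io
import re


def text_table_to_csv(text):
    """Turn layout-preserved text rows into CSV, splitting on runs of 2+ spaces."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        writer.writerow(re.split(r"\s{2,}", line))
    return out.getvalue()
```

The csv writer quotes any cell that contains a comma, so a description like "Widget, large" still lands in a single Excel column.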

Based on my own experience with the tool, the only issue you will run into is where it 'misses' the conversion on one or more parts of the text. However, my own testing (albeit not totally comprehensive) showed this conversion tool to be better than many.

And it can be run via FP automation by having the FP application build and issue the appropriate command string.

Actually, what I did was build the command string, write it to a BAT file with STRTOFILE(), and then execute the BAT file with ShellExecute().
I don't remember now why I did it in that convoluted manner, but it worked.
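
For anyone wanting the shape of that approach, here is a rough sketch in Python rather than VFP: build the command line, write it to a BAT file, then run the file. The executable name and its arguments are placeholders, not the converter's real command-line syntax, which would have to come from its own documentation. All function names are hypothetical.

```python
import subprocess


def build_command(*parts):
    """Join command parts, quoting any that contain spaces (e.g. paths)."""
    return " ".join('"{}"'.format(p) if " " in p else p for p in parts)


def write_batch_file(command_line, path):
    """Write a one-command BAT file, roughly what STRTOFILE() did in the post."""
    with open(path, "w", newline="") as handle:
        handle.write("@echo off\r\n" + command_line + "\r\n")
    return path


def run_batch_file(path):
    """Run the BAT file through cmd.exe and wait for it (Windows only)."""
    return subprocess.run(["cmd", "/c", str(path)], check=True)
```

A typical call would be write_batch_file(build_command("converter.exe", "in.pdf", "out.txt"), "convert.bat") followed by run_batch_file("convert.bat"), with converter.exe standing in for whatever tool is actually used.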

Good Luck,
JRB-Bldr

 