Is there any way to read the data in a PDF file?

Status
Not open for further replies.

JCruz063

Programmer
Feb 21, 2003
716
US
Hi All,

I'm working on a project in which I might have to programmatically read the data in PDF files. I need to parse the PDFs (as if they were text files) and understand what each element of data is.

Let's say, for instance, that the content of one of the PDFs looks like this:

[tt]001 - Microsoft Word
002 - Microsoft Excel
003 - Microsoft Access
004 - Microsoft Power Point[/tt]

I will need to parse this information, extract the numbers, and attach them to the corresponding strings. That is, 1 is "Microsoft Word", 2 is "Microsoft Excel", and so on. Hopefully all this makes sense. The bottom line is that I need to be able to know what the PDF file contains using a programming language. I won't be literally looking at the file; it'll all be done programmatically. Each number and each string has a special meaning in the project, so I need to accurately extract each number and each string for the project to be successful.

The programming language to be used is either Visual Basic 6.0 or C#.NET.

Is there a specific way to do this?

Thanks!

JC

_________________________________________________
To get the best response to a question, read faq222-2244.
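
Once the text has been extracted from the PDF, the number-to-string mapping described in the question is the easy part. The following is only a minimal sketch, written in Python for brevity rather than in VB6 or C#; it assumes the extracted text arrives one entry per line in the "NNN - Label" layout shown above. The names LINE_PATTERN and parse_entries are made up for illustration, and the same regular expression should carry over to .NET's Regex class.

```python
import re

# Matches lines such as "001 - Microsoft Word": digits, a dash, then the label.
# (Assumed layout, taken from the sample in the question.)
LINE_PATTERN = re.compile(r"^\s*(\d+)\s*-\s*(.+?)\s*$")


def parse_entries(text):
    """Return a dict mapping each leading number to its label."""
    entries = {}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:  # skip blank lines and anything that does not fit the pattern
            entries[int(match.group(1))] = match.group(2)
    return entries


sample = """001 - Microsoft Word
002 - Microsoft Excel
003 - Microsoft Access
004 - Microsoft Power Point"""

entries = parse_entries(sample)
print(entries[1])  # Microsoft Word
```

Lines that do not match the pattern are silently skipped here; a real project would probably want to log or reject them, since every number and string matters.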
 
Thanks for your reply Thomas!

Now, did that link mean yes? Can I accurately parse/extract numbers and strings from PDF files?

Thanks!

JC

 
It depends entirely on the PDF. If the PDF contains scanned page images, for example, then no, you can't extract anything meaningful.

Extracting text can be tricky too, because of the nature of PDFs and the original source documents.

My link was meant to show you that, if you don't want to try to process/parse the PDF itself, you have to use an API. That link is to the API.



Thomas D. Greer

Providing PostScript & PDF
Training, Development & Consulting
 
Thomas,
Thanks again!

Well, I guess that's good news for me, given that my PDFs will be composed of just text. They have no images. Just plain text (like that in the example in my first post), inserted in tables. Some of the table cells may have different background colors, but that's about it. So how exactly do I go about parsing them?

Thanks!

JC

 
Thanks Tom; I'll look into it!

JC

 
Thomas,
I started looking at the documentation of the API you pointed me to. There are quite a few things that I need to learn and get used to. Before I continue my journey, though, I would like to know something: How accurate will the parsed text be? In other words, will I be losing some accuracy in the extracted text?

The reason I ask is that I have a PDF utility that converts PDF files to a variety of formats, including .txt. Using this utility, I converted one of the files that I'll be working with, and 99% of the words in the resulting text file contain spaces that do not exist in the original PDF document. The word "Network", for example, was translated as "N etw ork" (with a space after the 'N' and after the 'w'). The word "CART" came out as "C AR T". The number "292136" came out as "2 92 136" and, now that I take a closer look at the text file, all words have this kind of problem.

Based on what I've read thus far in the API documentation from DataLogics, I get the impression that the translated text will be accurate. The results of the utility that I have are outrageously unacceptable, however, so I want to make sure that this API will be useful.

Again, the PDF documents I'll be dealing with are composed of simply a table with values (numbers and strings) in its cells. Before the table, the document may also contain some heading text, describing some useful information about the document itself. The text in the first row may be underlined and some cells may have different background colors. There is no other content in the file whatsoever... So, again, the question is: Will the text be accurate?

Thanks!

JC

 
I don't know. I don't use the API. PDF is built upon the PostScript imaging model, and I'm a PostScript programmer. So I do most of the nitty-gritty work in PostScript.

Your text is split up because of what is called "kerning": individual words are broken into characters and substrings so that very fine adjustments can be made to the spacing between characters for enhanced readability.

That's why PDF isn't a useful format for data processing.

I have noticed that Acrobat itself somehow concatenates strings that have been so treated back together for selection and searching. So with their API, presumably you could extract entire strings.
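
To make the kerning point concrete: in a PDF content stream, kerned text is typically written with the TJ operator, whose array mixes string fragments with numeric position adjustments, e.g. [(N) 20 (etw) -15 (ork)] TJ. A converter that emits a space at every fragment boundary produces "N etw ork". The sketch below shows one plausible way to rejoin the fragments; it is not a description of how Acrobat or the DataLogics API does it. The function name join_tj_array and the threshold of 200 are assumptions for illustration, and the sketch ignores font encodings, octal escapes and nested parentheses.

```python
import re

# Tokens inside a TJ array: a parenthesised string (with backslash escapes)
# or a number. Anything else, such as the brackets, is skipped.
TOKEN = re.compile(r"\((?P<text>(?:\\.|[^\\()])*)\)|(?P<num>-?\d*\.?\d+)")


def join_tj_array(array_body, space_threshold=200):
    """Concatenate the string fragments of a TJ array body.

    Adjustments are in thousandths of a text-space unit and are subtracted
    from the current position, so a large negative value widens the gap.
    An adjustment below -space_threshold (an assumed heuristic) is treated
    as a word break; smaller adjustments are treated as kerning and ignored.
    """
    parts = []
    for m in TOKEN.finditer(array_body):
        if m.group("text") is not None:
            # Unescape \( \) and \\ only; other escapes are left untouched.
            parts.append(re.sub(r"\\([()\\])", r"\1", m.group("text")))
        elif float(m.group("num")) < -space_threshold:
            parts.append(" ")
    return "".join(parts)


print(join_tj_array("[(N) 20 (etw) -15 (ork)]"))            # Network
print(join_tj_array("[(C) 30 (AR) 10 (T) -300 (292136)]"))  # CART 292136
```

The right threshold depends on the font and its size, which is part of why naive converters get this wrong.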

There is an interesting series of articles on the internal PDF structure here:


Thomas D. Greer
 
Thomas,
Thanks once again!

I spoke to someone from DataLogics and they said I won't have an accuracy problem with the API. I'll try out the API and see what results I have.

Thanks for the article link also. I had actually read this article before, and found it a bit complex. I'll re-read it to better work with the API.

Thank you!

JC

 