Import OCR PDF File into Access

Status
Not open for further replies.

BikeToWork

Programmer
Jun 16, 2010
50
US
I am trying to get data into Access from a PDF file that was created by OCR. The PDF looks like a table, but when I export it to Excel, some rows split into cells correctly while others put every field from the row into a single cell. I have also tried exporting to XML and RTF, but the data comes out skewed. I've also tried selecting pages from the PDF and exporting them to HTML, with very limited success. Am I missing something simple? Thanks for any advice on this.
 
I attempt to export it to Excel [...] I've tried the export to xml and rtf

How did you do that?
Did you copy and paste the data? If so, what do you get if you paste it into MS Word? In Word you can show the characters that sit between your data, such as a Tab or a Space.

Have fun.

---- Andy
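
If Word is not handy, the same check can be sketched in a few lines of Python. This is only an illustration, assuming the exported text has been saved or pasted somewhere Python can read it; the function name show_separators and the <TAB>/<SP> markers are made up for this example.

```python
import unicodedata


def show_separators(line):
    """Return the line with whitespace and control characters made visible."""
    out = []
    for ch in line:
        if ch == "\t":
            out.append("<TAB>")
        elif ch == " ":
            out.append("<SP>")
        elif ch.isspace() or unicodedata.category(ch).startswith("C"):
            # Anything else invisible (non-breaking space, form feed, ...)
            # is shown by its Unicode code point.
            out.append("<U+{:04X}>".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical sample row; replace with a line copied from the export.
    print(show_separators("Widget\t12  4.50"))
```

Running this on a few good rows and a few bad rows should show whether the OCR output separates fields with tabs, runs of spaces, or something less obvious such as non-breaking spaces.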
 
Andy, thanks for the response. I tried exporting from Adobe Acrobat in all of the export formats it supports. The data still came out garbled because it is not a real table (although it does look like one) but rather OCR data. Some of the pages were fine but most were not. I also tried export to rtf and export to Word, neither of which worked successfully. In Word the data is strung together and fields are not delimited properly. I figure it is probably a GIGO problem and that there is not much one can do with garbled OCR data.
 
"fields are not delimited properly" - how are they delimited?
Is there any logic to how they are delimited?
If there are several ways of delimiting, you can code against each of them to make the data 'pretty'.

Have fun.

---- Andy
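
Coding against several delimiters could look roughly like the Python sketch below. The delimiters it handles (a tab, a pipe, or a run of two or more spaces) are assumptions, since the thread never says what the OCR output actually uses, and the names split_fields and parse_rows are hypothetical. Rows that do not yield the expected number of fields are set aside for manual review rather than guessed at.

```python
import re

# Assumed delimiters: a pipe with optional surrounding whitespace,
# a single tab, or a run of two or more spaces.
DELIMITER = re.compile(r"\s*\|\s*|\t| {2,}")


def split_fields(line):
    """Split one exported line into trimmed fields."""
    return [field.strip() for field in DELIMITER.split(line.strip())]


def parse_rows(lines, expected_fields):
    """Separate rows with the expected field count from rows needing review."""
    good, bad = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = split_fields(line)
        if len(fields) == expected_fields:
            good.append(fields)
        else:
            bad.append(fields)
    return good, bad


if __name__ == "__main__":
    # Hypothetical sample lines standing in for the exported OCR text.
    sample = ["Widget\t12  |  4.50", "Gadget   7\t9.99", "Garbled row"]
    good, bad = parse_rows(sample, expected_fields=3)
    print(good)
    print(bad)
```

The clean rows could then be written out with Python's csv module and imported into Access as a delimited text file, leaving only the rejected rows to fix by hand.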
 
I think what you have is not actual text characters. You probably have a bit-mapped image, formatted in the way your particular OCR software needs in order to read it. If so, you'll have to write a .NET program that uses the OCR interpreter's API to read the document into the interpreter and dump the data into an Access table; there may be no way to handle it in Access alone. The vendor is not going to make this easy. After all, why would you need their software if you could just read the document with Access? They are going to want a developer's license fee out of you at the least.
 