Using XFRX to Extract PDF's Text

Status
Not open for further replies.

stanlyn

Programmer
Sep 3, 2003
945
US
Hi,

Does anyone know if the XFRX sdk can create a searchable pdf from an image when outputting to pdf?

Is there a function that will return the text output from the OCR process available so it can be inserted into a table for searching?

Why is OCR mentioned here? Because if I load a non-searchable pdf in Acrobat, I have to run the OCR process to make it searchable, and it is this text I'm trying to programmatically extract to a var.

The answer and a lengthy discussion can be found within convoluted thread:
This thread is a continuation of this subject matter from that thread and also serves as a link.

My apologies go out to Griff for hijacking his thread... Sorry buddy... I'll start new threads for all my future questions, no matter how small they may be.

Thanks,
Stanley



 
Simply? No.

XFRX has no OCR.
If you have a PDF made of images (each page is an image, or each piece of text is an image), then you need to extract the images from the PDF and run OCR on them.

How to extract images from PDF:

How to extract text from PDF:

mJindrova
 
If the PDF contains an image, there is no way of searching it, because you can only search for text. That's true even if there is text contained within the image.

There are, however, a number of tools available for converting PDFs to searchable text. One that I have used is Nitro PDF Pro, which can convert from PDF to Microsoft Word. It's not free, but there is a free trial period. It doesn't always do a great job of preserving the formatting, but that's not an issue if your aim is to search the text.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
It might help to know what your overall goal is here. Do you need to search for any arbitrary text anywhere in a PDF? Or is the search more structured - searching for documents with specific invoice numbers or customer names, for example? If the latter, then it might make more sense to store the searchable text in an ordinary table. The user would search the table in the normal way, and that in turn would give a pointer to the corresponding PDF.
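A minimal sketch of that second approach, written in Python with SQLite rather than VFP so that it runs on its own. The table name, column names and sample rows are all made up for illustration; in a real application the table would be an ordinary DBF, filled when the documents are scanned or indexed.

```python
import sqlite3

# Hypothetical index table: one row per PDF, holding only the fields
# users actually search on, plus the path of the corresponding file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pdf_index ("
    " invoice_no TEXT, customer TEXT, pdf_path TEXT)"
)
conn.executemany(
    "INSERT INTO pdf_index VALUES (?, ?, ?)",
    [
        ("INV-1001", "Acme Ltd", "scans/0001.pdf"),
        ("INV-1002", "Bolt & Co", "scans/0002.pdf"),
        ("INV-1003", "Acme Ltd", "scans/0003.pdf"),
    ],
)


def find_pdfs(customer):
    """Return the paths of all PDFs indexed under the given customer."""
    rows = conn.execute(
        "SELECT pdf_path FROM pdf_index WHERE customer = ? ORDER BY pdf_path",
        (customer,),
    )
    return [path for (path,) in rows]
```

The search never touches the PDFs themselves: `find_pdfs("Acme Ltd")` only queries the table, and the returned paths tell the application which files to open.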

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Let me take this over from thread184-1820648 to this thread, your own thread.

stanlyn said:
You keep saying I should know what the text is
Maybe it's clear by now I wasn't detecting these questions came from you, but thought they came from Griff.
Griff was generating PDFs from data, so he knows his texts.

I also wrongly addressed this to Griff:
me said:
Or what did you use so far? Printing to a TIFF file that is by default just images of the pages and then converting that to PDF?

That is exactly what you have, so you only have images in your PDF. Set aside the topic of how PDFs are generated from VFP FRXes, with or without tools like FoxyPreviewer or XFRX: your PDFs don't come from FRXes, they have a completely different origin and generation process. A scanner generating a PDF will usually only embed the scanned images into the PDF, nothing else. A TIFF is even simpler, just a bunch of images, which are then also embedded into a PDF and that's it. Some scanners also come with OCR capabilities, but it's questionable whether they then combine their OCR capability and PDF generation capability.

As so often, why don't you try for yourself whether you can search in a PDF file of your scanned documents within a PDF reader? Or whether you can select text in it? Only if that works is it even viable to try to do such things programmatically with any tool. But XFRX is not the place to start.
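If you would rather script that check than open each file by hand, here is a rough sketch in Python. It assumes the pdftotext command-line tool (one of the two utilities atlopes linked to) is installed and on the PATH. The function names and the 20-character threshold are my own inventions, and the test is only a heuristic, not a guarantee.

```python
import subprocess


def extract_text(pdf_path):
    """Run pdftotext and return whatever text the PDF contains.

    Assumes the pdftotext command-line tool is installed and on the PATH.
    The "-" argument sends the text to standard output instead of a file.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def has_text_layer(extracted, min_chars=20):
    """Heuristic: call the PDF searchable only if a reasonable number of
    non-whitespace characters came out. min_chars is an arbitrary cutoff."""
    visible = "".join(extracted.split())  # drops spaces, newlines, form feeds
    return len(visible) >= min_chars
```

Calling `has_text_layer(extract_text("some.pdf"))` on an image-only scan should come back False, because pdftotext finds little more than page breaks in it. A True result means there is a text layer worth extracting.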

You can see from the name XFRX that it's all about processing FRXes to other output formats, not the other way around. It's not about PDF text extraction in the first place. It's even unexpected that it has that reader feature. It makes sense, in that if you know how to embed text into a PDF file you can also offer the reverse. It's not a given or natural to provide that in a tool that's mainly concerned with acting as an FRX converter to other formats, though.

Martina already gave you the answers of what XFRX can and can't do for you. From my comments on TIFF and scanning you could also already have deduced that those origins produce PDFs with images and not text. So from such PDFs you can't expect to be able to read text, only images.

And one last thing, you asked atlopes:
stanlyn said:
I looked and did not see pdf2xml or pdftotext utilities in the VFP9 help. Can you be more specific about their location?
Atlopes posted links; at least now his post contains links to pdf2xml and pdftotext. I think they always were links. So are you not aware that if a word or text is underlined blue, it is a link in a Tek-Tips post? Click on them.

Chriss
 

Chris Miller said:
You can see from the name XFRX that it's all about processing FRXes to other output formats, not the other way around. It's not about PDF text extraction in the first place. It's even unexpected that it has that reader feature. It makes sense, in that if you know how to embed text into a PDF file you can also offer the reverse. It's not a given or natural to provide that in a tool that's mainly concerned with acting as an FRX converter to other formats, though.

Yes. The base functionality is converting VFP reports to some output, most often PDF.
But PDF#READER has four basic functions:
- Read information about the PDF object, because XFRX supports append mode (adding output from a VFP report to an existing file) for PDF.
- Extract images
- Extract attachments
- Read a page's content


mJindrova
 
I'm curious.

What kind of application must extract text AFTER the PDF is created, but cannot extract it BEFORE the PDF is created (as suggested previously)? Either way (OCR or program code), it would seem to me a separate file of text or some other type would be produced anyway.

Just curious.

Steve
 
Well, Steve, two of those options have been mentioned:

- The origin of the PDF is a TIFF file, which only contains images.
- The origin of the PDF is a scanner. Some scanners actually come with a scan-to-PDF button, and the software that comes with them will turn the scanned pages into a PDF, which again is usually images only.

And indeed, as Griff's thread was all about PDF generation from an FRX, that puzzled me too, but this thread isn't about FRXes. The problem case is that you have a bunch of PDFs. In the more general case, where the origin of the PDFs is unknown, you would rather work on them with a PDF-specific tool than with an FRX-specific tool. Stanlyn only asked because XFRX was mentioned in Griff's thread.


Chriss
 