XFRX - Are newer versions any faster...

Status
Not open for further replies.

GriffMG

I've been using xfrx for a very long time, 2007 I think, and am currently using (I think) version 191.9

I have a project where I need to produce a very large number of reports (call it 2 million) in PDF format and
it looks like it will take about 10 days to complete... which is a tiny bit too long.

As part of my efforts to improve this, I am wondering if anyone has tried any of the more recent versions
before I invest in updating mine.

Regards

Griff
Keep [Smile]ing

There are 10 kinds of people in the world, those who understand binary and those who don't.

I'm trying to cut down on the use of shrieks (exclamation marks), I'm told they are !good for you.

There is no place like G28 X0 Y0 Z0
 
I'll try that Vernpace, thank you

Regards

Griff
 
Thank you Vernpace, that made a significant difference, I put it in the process just before each report was run.

Not sure of the overall effect yet, I will measure in due course... big leap forward

Regards

Griff
 
Griff, more times than not, the simple solutions are the best.
 
Chris,

Hmm, let me try this a different way...

Chris said:
Even just the question is pointing out you have a totally wrong idea about PDF generation. Or what did you use so far? Printing to a TIFF file that is by default just images of the pages and then converting that to PDF? I mean, you have to work hard to get a PDF that's only composed of images and thus is not searchable.

I have been processing tiffs, jpgs and many other raster formats for 25+ years using the Leadtools 11.5 SDK, where I load the raster image, then OCR it. At this point, I have the text that was produced by the OCR process to do whatever is needed. I normally save it into a field appropriately named "ocr_text" which facilitates full-text searching.

For those wondering about performance: a full-text search across multiple fields in a 900,000-row DBF takes about 2 seconds. The "ocr_text" field in each row averages about 3000 characters. I use PHDBase on DBFs to get the speed, and it has worked reliably since 1998...

It took six months to solve this DBF speed issue back then because, using native Fox, search time grew linearly with the number of rows: if it took one second to search 10,000 rows, it would take 2 seconds to search 20,000 rows. I knew that once all their data was in there would be around 350,000 rows, making the app unusable, as no one would wait 2 minutes for results.

Doing the same in a MSSQL table with a catalog that has 4.5 million rows with the same data returns results in about a second. MSSQL is much faster.

I have zero experience doing the same for pdfs in code, which is why I asked here... Creating the pdfs is easy, and I have lots of experience doing that.

My original question was "how is the text exposed" so I can grab it and stuff it into a field, all in code. I've used both FoxyPreviewer and Ghostscript + pdf driver to create the actual pdf files and have never found how to get only the text & page number from it.

I know that any pdf viewer will allow manual copying the text from the file, however, this is not what I need. I need to create the pdf, then stuff the pdf text into a table, and all programmatically.

It would also be good if the page number is also available for multi-page files so additional things can be done.

Thanks,
Stanley
 
Stanley,

Stanley said:
I need to create the pdf, then stuff the pdf text into a table, and all programmatically.

You can invoke one of pdf2xml or pdftotext utilities from VFP and check for the output.

For instance, pdf2xml produces an XML document that can be loaded into an MSXML2.DOMDocument object. Getting each page's contents would then look like this:

Code:
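* m.XML is assumed to be an MSXML2.DOMDocument already loaded with the pdf2xml output
m.Pages = m.XML.selectNodes("/DOCUMENT/PAGE")

* the node list is zero-based
FOR m.PageNumber = 0 TO m.Pages.Length - 1
  m.PageContents = m.Pages.item(m.PageNumber).text
  * do whatever you need to process the text
ENDFOR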
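
As a rough sketch of the pdftotext route: neither utility ships with VFP, both are external command-line tools, so the snippet below assumes pdftotext.exe is installed and on the PATH. The file paths and the pdfpages cursor are made-up names for illustration. pdftotext ends each page with a form feed character, which is what gives you the page number. An image-only PDF has no text to extract, so it would still need OCR first.

```foxpro
* Sketch only: per-page text from an existing PDF via pdftotext.
* Assumptions: pdftotext.exe is on the PATH; the paths and the
* pdfpages cursor are hypothetical names.
LOCAL lcPdf, lcTxt, lcCmd, loShell, lnPages, lnPage
LOCAL ARRAY laPages[1]

lcPdf = "c:\temp\sample.pdf"
lcTxt = FORCEEXT(lcPdf, "txt")
lcCmd = 'pdftotext "' + lcPdf + '" "' + lcTxt + '"'

* Window style 0 = hidden, .T. = wait until the tool has finished
loShell = CREATEOBJECT("WScript.Shell")
loShell.Run(lcCmd, 0, .T.)

CREATE CURSOR pdfpages (iPage I, mText M)
IF FILE(lcTxt)
   * Split on the form feed, CHR(12), that ends each page
   lnPages = ALINES(laPages, FILETOSTR(lcTxt), 0, CHR(12))
   FOR lnPage = 1 TO lnPages
      INSERT INTO pdfpages (iPage, mText) VALUES (lnPage, laPages[lnPage])
   ENDFOR
ENDIF
```

After this runs, pdfpages holds one row per page, so the mText column can be copied into whatever table you search.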
 
Stanlyn said:
need to create the pdf, then stuff the pdf text into a table,
That demand seems odd to me, I didn't see that coming. Because isn't the origin of texts in the final PDFs text or char values of your data? So you know the texts. So is it to know on which page they end up?

Examining a PDF programmatically isn't an easy task, as it embeds streams that are compressed, so you won't find the text in it as plain ASCII, ANSI, or any Unicode variant.


I don't know why you think along these lines, you are the creator of the PDFs by reports, aren't you? The time to do this is while you report. You have the _pageno telling you the current page.

Instead of printing "field" you print "logpage(field)" and thereby call a function (known by SET PROCEDURE or a prg of that name in a path VFP finds).

And the logpage function will just do that:
Code:
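* Log the value together with the current report page,
* then return it unchanged so the report prints it as usual
Lparameters tvField
Insert Into pagedata (iPage, mText) values (_Pageno, transform(tvField))

Return tvField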

Creating a pagedata cursor or dbf is just done before the report run.

You know this at report time, so you do this at report time. It doesn't even need anything fancy like report listener.

Chriss
 
Here is the pagedata for the 90frx. I did modify the expression of the description report field only, but you get the idea. You can call the same log function from any object in the report, also from the labels with their directly given text, if it doesn't come from report data. It's just a matter of passing everything through logpagedata.prg, which logs and then returns it.

Code used for printing:
Code:
Create Cursor pagedata (iPage int, mText M)
Report Form 90frx_withlogging.frx Preview && modified 90frx.frx

Data in attachment.

Chriss
 
 https://files.engineering.com/getfile.aspx?folder=a25ced25-9f68-4a9d-8b9b-da28e7b55f08&file=pagedata.zip
Griff said:
My original question was "how is the text exposed"
I don't see that asked here, but maybe that's why you say it now.

You literally asked:
Griff said:
Does anyone know if the xfrx sdk can create a searchable pdf when outputting to pdf?

And the idea I got from that is that you want the PDFs to be searchable for the end users, i.e. the PDFs are user friendly.

You also asked
Griff said:
Is the text output from the OCR process available so it can be inserted into a table for searching?

I don't even know how you get the idea of an OCR process. There is text to begin with, not an image of text. We are still talking about you running FRXes, aren't we? And I assume the texts you bring in are stored as texts, character data, or numbers, dates, etc. that are transformed into text when printing them. OCR is text recognition from images, but you don't start with images you print, do you? If you do, that would just be new information you didn't tell us beforehand.

I really don't get your thinking. As for knowing where the text is in the PDF, there is a point when you know this: while you print. And that's where to "note this down", or log it to a table. That's the natural way of programming: doing things at the times they are easiest to do. Why would anyone do this as postprocessing when you have the _pageno at hand during reporting?

Chriss
 
Err, did I ask that? or was it

Is there an option to have searchable text instead of images?

Regards

Griff
 
Yes, that's all just about whether text you print is turned into images or not. And it's not, unless you actually force it. By default the PDFCreator does not turn text it prints into images. By default FoxyPreviewer's PDF implementation does not. You seem to think that is the norm, even Martina seems to think that. It's not.

If you open a PDF in notepad and you don't see your texts in it, that's not because they are turned into images. A PDF reader does not do OCR on images of text to search text. It uncompresses the streams that are embedded in the PDF, and I'd not dive into the details to be able to get there, too. Why, if you can have it while you create the report? Doesn't even matter what the result is, you know on which page each text goes.

Chriss
 
stanlyn said:
I know that any pdf viewer will allow manual copying the text from the file, however, this is not what I need. I need to create the pdf, then stuff the pdf text into a table, and all programmatically.

It would also be good if the page number is also available for multi-page files so additional things can be done.

XFRX has class PDF#READER, this class can get page content:

mJindrova
 
Griff,

again: If you concatenate all output data of the logpage function you get a fulltext representation of the PDF with pageno excluding images, formatting/layout. If that's what you want to extract from the PDF, then why go that route if you already have it while you create it?
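
A minimal sketch of that concatenation, assuming the pagedata cursor (iPage, mText) filled by the logging function during the report run is still open; pagetext is a made-up name for the result:

```foxpro
* Sketch: one full-text row per page from the logged report fields.
* Assumes cursor pagedata (iPage, mText) exists; pagetext is hypothetical.
CREATE CURSOR pagetext (iPage I, mText M)

SELECT pagedata
SCAN
   SELECT pagetext
   LOCATE FOR iPage = pagedata.iPage
   IF NOT FOUND()
      INSERT INTO pagetext (iPage, mText) VALUES (pagedata.iPage, "")
   ENDIF
   * Append this field's text to the row of its page
   REPLACE mText WITH mText + pagedata.mText + CHR(13) + CHR(10) IN pagetext
ENDSCAN
```

The texts end up in the order they were logged, which is the order the report printed them.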

Chriss
 
Chris

I didn't ask to do that, so far no one has asked me to make the pdf searchable (they might, but they haven't), I asked Martina just out of interest.

I have the original data, I can search that for anything I need...

Regards

Griff
 
Then what are you still asking? You were told very early on that a PDF is not images of text.
Martina said:
mJindrova said:
Why can XFRX or FoxyPreviewer "print" a PDF where text is text? Because "print" is not "print" but "export". In both cases the derived report listener is only a wrapper: it gets information about the objects (from the base report listener) and creates the PDF.

And you said
Griff said:
I need to create the pdf, then stuff the pdf text into a table, and all programmatically.
Way after Martina already told you a PDF is not images of text because of how it's done in XFRX and FoxyPreviewer. And I told you (and her) PDFCreator also does not turn text into images.

Well, and you say it yourself now, of course you can search the original data. But you explicitly asked how to get text from a PDF, didn't you? With that quoted question. If you don't need to do that, then why do you tell that you do? You're just contradicting yourself. Or you become senile.

I really just want to help you get what you want and need. But your posts make it impossible to get a clear picture of your demands, really.

Chriss
 
Chriss

I didn't say that, maybe Stanlyn did, but don't attribute it to me.

Regards

Griff
 
Griff,

fine, yes, I overlooked that. Makes things a bit clearer. What remains unclear is what your processing bottleneck now is and how to help with it. Anyway, I think I'll just opt out of this thread, as it's very convoluted and I also don't want to contribute more to that.

Chriss
 
No problem Chriss, have a nice weekend.

Regards

Griff
 
Chris said:
Because isn't the origin of texts in the final PDFs text or char values of your data? So you know the texts.

Not necessarily, 99% of my documents come from scans to .tif and .jpg, which are image only.

You keep saying I should know what the text is. The process goes like this...
1. scan a document to a .tif,
2. ocr the document and store the ocr text to the table.
3. repeat

In a new version of the product, I need to support searching a large document database containing all raster formats as well as PDF files that I did not create and whose source I do not have; therefore, I don't have the text.

As I understand it, PDF files have 2 layers, one image and one text, where the text layer is empty if the PDF is not searchable. If searchable, the text layer contains the text as well as positional metadata.

Original question is: how do I use Foxy or XFRX to get the text layer data so it can be saved into a table for searching?

Thanks,
Stanley
 
Atlopes said:
You can invoke one of pdf2xml or pdftotext utilities from VFP and check for the output.

I looked and did not see pdf2xml or pdftotext utilities in the VFP9 help. Can you be more specific about their location?

Thanks,
Stanley
 
StanLyn & Chris said:
need to create the pdf, then stuff the pdf text into a table,

Let me restate this differently to avoid confusion...

I need to create the pdf by printing a .tif to .pdf or converting a .tif to a pdf. Therefore, ocr must be done at some level and my question was if XFRX can do this?

Thanks,
Stanley
 