OCR Progress, but still need some help

Scott24x7 · Sep 2, 2015

Hi All,
So I've managed to make good progress on my OCR issues, using EZTWAIN and Transym. But the funny thing is, neither company has any VFP experience in this space.
EZTWAIN integrates nicely with VFP, and to that end I'm now able to scan a business card front and back simultaneously, and then output 2 (well, 4) files.
First is a JPG image of the front and back of the card. Very useful for display, since the image file is linked by a path stored in a Memo field.

Second, the scan with Transym, using one of its calls, outputs a PDF that is OCR'd. So I have the file sitting there.
But... how do I "scrape" the data from the PDF? If I do it manually, I can open the PDF, CTRL+A it (select all), then CTRL+C it (copy all). How can I do the same in code and store the result in a Memo field, which will then allow me to parse/evaluate its contents and use the results to populate known fields? (I will figure out how to do all the parsing, maybe with some questions later.) If I can just OPEN the PDF, CTRL+A, then CTRL+C it, I can REPLACE the Table.OCR data with the contents of that copy, and THEN I can manipulate the memo field data.

Any ideas?
Many thanks,

Best Regards,
Scott
ATS, CDCE, CTIA, CTDC

"Everything should be made as simple as possible, and no simpler." [hammer]

Scott24x7 · Sep 2, 2015

Thought my code might be helpful:

Code:

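* Declare the variables the loop actually uses (fileName itself was never used).
LOCAL fileNameIMG AS STRING, fileNamePDF AS STRING
LOCAL lcSaveNameIMG AS STRING, lcSaveNamePDF AS STRING
LOCAL lcTextMergeStringIMG AS STRING, lcTextMergeStringPDF AS STRING
LOCAL i AS INTEGER
LOCAL hdib AS INTEGER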
*
TWAIN_SetHideUI(1)
TWAIN_SetFileAppendFlag(0)
OCR_SelectDefaultEngine()
TWAIN_SetAutoOCR(1)
TWAIN_SetJpegQuality(100)
IF TWAIN_OpenSource("Canon P-208II TWAIN")<>0
	TWAIN_EnableDuplex(1)
	TWAIN_SetPixelType(2)
	TWAIN_SetResolution(600)
	TWAIN_SetAutoDeskew(1)
	TWAIN_SetXferCount(-1)
	TWAIN_SetAutoScan(1)
	TWAIN_SetRegion(0.0, 0.0, 3.55, 2.15)
	TWAIN_SetMultiTransfer(1)
	i = 1
	DO WHILE .T.
* ADDBS() already supplies the trailing backslash, so "CARDS\" needs no leading one.
		lcTextMergeStringIMG = ADDBS(SYS(5)+SYS(2003))+"CARDS\"+;
			ADDBS(ALLTRIM(STR(COMPANY.COMPANYID)))+ADDBS(ALLTRIM(STR(CONTACT.CONTACTID)))+;
			ALLTRIM(ThisForm.BasepageFrame1.Page7.txtContactFirstname.Value)+;
			ALLTRIM(ThisForm.Basepageframe1.Page7.txtContactLastname.value)+"<<i>>.jpg"
*
		fileNameIMG = TEXTMERGE(lcTextMergeStringIMG)
*
		lcSaveNameIMG = TEXTMERGE("\CARDS\"+;
			ADDBS(ALLTRIM(STR(COMPANY.COMPANYID)))+ADDBS(ALLTRIM(STR(CONTACT.CONTACTID)))+;
			ALLTRIM(ThisForm.BasepageFrame1.Page7.txtContactFirstname.Value)+;
			ALLTRIM(ThisForm.Basepageframe1.Page7.txtContactLastname.value)+"<<i>>.jpg")
*
		lcTextMergeStringPDF = ADDBS(SYS(5)+SYS(2003))+"CARDS\"+;
			ADDBS(ALLTRIM(STR(COMPANY.COMPANYID)))+ADDBS(ALLTRIM(STR(CONTACT.CONTACTID)))+;
			ALLTRIM(ThisForm.BasepageFrame1.Page7.txtContactFirstname.Value)+;
			ALLTRIM(ThisForm.Basepageframe1.Page7.txtContactLastname.value)+"<<i>>.PDF"
*
		fileNamePDF = TEXTMERGE(lcTextMergeStringPDF)
*
		lcSaveNamePDF = TEXTMERGE("\CARDS\"+;
			ADDBS(ALLTRIM(STR(COMPANY.COMPANYID)))+ADDBS(ALLTRIM(STR(CONTACT.CONTACTID)))+;
			ALLTRIM(ThisForm.BasepageFrame1.Page7.txtContactFirstname.Value)+;
			ALLTRIM(ThisForm.Basepageframe1.Page7.txtContactLastname.value)+"<<i>>.PDF")
* If you can't get a Window handle, use 0:
		hdib = TWAIN_Acquire(THISFORM.HWND)
		IF hdib=0
			EXIT
		ENDIF
*
		SELECT (ThisForm.ActiveDBF)
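* Use the IN clause: REPLACE alias.field without IN is scoped to the selected
* work area and silently does nothing if that area is at EOF.
		IF i = 1
			REPLACE BUSINESSCARDFRONT WITH lcSaveNameIMG IN CONTACT
		ENDIF
		*
		IF i = 2
			REPLACE BUSINESSCARDBACK WITH lcSaveNameIMG IN CONTACT
		ENDIF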
*
* EZTWAIN picks the output format from the file extension (.jpg vs .PDF).
		DIB_WriteToFilename(hdib, fileNameIMG)
		DIB_WriteToFilename(hdib, fileNamePDF)
		DIB_Free(hdib)
		i = i+1
		IF TWAIN_IsDone()<>0 THEN
			EXIT
		ENDIF
	ENDDO
	TWAIN_CloseSource()
ENDIF
*
IF TWAIN_LastErrorCode()<>0
	TWAIN_ReportLastError("Unable to scan.")
ENDIF
*
SELECT (ThisForm.ActiveDBF)
*
IF NOT TABLEUPDATE()
	WAIT WINDOW "Unable to Save"
	TABLEREVERT()
ENDIF
*
ThisForm.LockScreen = .T.
ThisForm.Refresh()
ThisForm.LockScreen = .F.

I just need to manipulate the textual data from the PDF after it is written (despite writing front and back, I'm only interested in the front).
Any ideas, suggestions or specific examples would be awesome.

Best Regards,
Scott

Olaf Doschke · Sep 2, 2015

The PDFs are the result of OCR parsing of the JPG image scan? Do I understand that correctly? And Transym offers no other output of the recognized textual data?
You can of course read any file with FILETOSTR(), and you can APPEND MEMO FROM Filename. It makes me wonder, though, why you now intend to embed files into memo fields again?
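
For reference, a minimal sketch of both calls, assuming the OCR text already exists as a plain text file. The cursor crsCard, its memo field OcrText and the temp file name are made up for illustration. Pointing either call at the PDF itself would load the raw PDF bytes, which typically are not readable text.

```foxpro
* Sketch only: crsCard, OcrText and lcTxtFile are hypothetical names.
LOCAL lcTxtFile
lcTxtFile = ADDBS(SYS(2023)) + "card_front.txt"   && SYS(2023) = temp folder
STRTOFILE("Acme Corp" + CHR(13) + CHR(10) + "www.acme.example", lcTxtFile)

CREATE CURSOR crsCard (ContactId I, OcrText M)
APPEND BLANK

* Option 1: read the file into a string, then store it in the memo field.
REPLACE OcrText WITH FILETOSTR(lcTxtFile) IN crsCard

* Option 2: let VFP copy the file straight into the memo field.
* APPEND MEMO works on the current record of the selected work area.
SELECT crsCard
APPEND MEMO OcrText FROM (lcTxtFile) OVERWRITE

? LEN(crsCard.OcrText)
ERASE (lcTxtFile)
```

Either way the memo ends up holding only the text, so the image files themselves stay outside the table.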

Bye, Olaf.

Scott24x7 · Sep 3, 2015

Hi Olaf,
Yes, the PDFs are fully OCR'd versions of the scan. So what happens is this:

The card is scanned, and I save it twice: once as a PDF (which Transym has OCR'd) and once as a .JPG, which is the image that I keep in the record (it's just the file path where the image is stored; as we covered before, it's not physically stored in the table).

Now the next step, what I want to do is get the "text" data from the .PDF. That data I will shove in a Memo field (not the image itself). Just the text. Once I have the Text in the Memo, I can then run some "intelligence" on the individual lines and determine:
Is this a company name?
Is this a person's name?
Is this an Address?
Is this a URL?
Is this a Mobile phone?
Is this a Fax number?

(You see the idea).
Once I can "parse" that data I will use it to populate several other fields in multiple tables, so that the data does not have to be "hand typed" into the record.
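
As a rough starting point for that kind of line-by-line test, here is a sketch in VFP 9. The function name ClassifyCardLine, the keyword lists and the digit thresholds are all assumptions for illustration, not rules from any library. Company names, personal names and addresses are left as "UNKNOWN" because they need more context than a single line provides.

```foxpro
* Sketch only: ClassifyCardLine and its thresholds are hypothetical.
LOCAL lcOcrText, lnCount, lnI
LOCAL ARRAY laLines[1]
lcOcrText = "Acme Corp" + CHR(13) + CHR(10) + ;
	"www.acme.example" + CHR(13) + CHR(10) + ;
	"Mobile: +1 (555) 123-4567" + CHR(13) + CHR(10) + ;
	"Fax 555 123 4568"

* Flags 1+4: trim each line and skip empty lines (VFP 9 syntax).
lnCount = ALINES(laLines, lcOcrText, 1 + 4)
FOR lnI = 1 TO lnCount
	? ClassifyCardLine(laLines[lnI]), laLines[lnI]
ENDFOR

FUNCTION ClassifyCardLine(tcLine)
	LOCAL lcLine, lcUpper, lcDigits, lcNoSpace, llPhoneLike
	lcLine = ALLTRIM(tcLine)
	lcUpper = UPPER(lcLine)
	* Keep only the digits: the inner CHRTRAN lists every non-digit character,
	* and the outer CHRTRAN removes those characters from the line.
	lcDigits = CHRTRAN(lcLine, CHRTRAN(lcLine, "0123456789", ""), "")
	lcNoSpace = CHRTRAN(lcLine, " ", "")
	llPhoneLike = LEN(lcDigits) >= 7
	DO CASE
	CASE EMPTY(lcLine)
		RETURN "BLANK"
	CASE "WWW." $ lcUpper OR "HTTP" $ lcUpper
		RETURN "URL"
	CASE llPhoneLike AND "FAX" $ lcUpper
		RETURN "FAX"
	CASE llPhoneLike AND ("MOBILE" $ lcUpper OR "CELL" $ lcUpper)
		RETURN "MOBILE"
	* Unlabeled number: at least half of the non-space characters are digits,
	* which keeps most street addresses out of this branch.
	CASE llPhoneLike AND LEN(lcDigits) * 2 >= LEN(lcNoSpace)
		RETURN "PHONE"
	OTHERWISE
		RETURN "UNKNOWN"
	ENDCASE
ENDFUNC
```

The order of the CASE branches matters: labeled numbers (Fax, Mobile) are tested before the generic phone rule, so a labeled line is not swallowed by the catch-all.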
So that last part is my problem to figure out, but the piece I'm struggling with at the moment is: How do I get the OCR data out of a PDF (in short...)

Best Regards,
Scott

catabar · Sep 3, 2015

You can use PDFTOTEXT from the command line. It's free, good, quick and accurate.
Download it from

http://www.foolabs.com/xpdf/

You will find there other great tools for manipulating PDFs.
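
A possible way to call it from VFP, as a sketch under a few assumptions: the install path of pdftotext.exe, the sample PDF path and the CONTACT.OCRTEXT memo field are hypothetical, and the helper name PdfToText is made up. It relies on the Windows Script Host object WScript.Shell so that VFP waits until the conversion has finished.

```foxpro
* Sketch only: the paths and the OCRTEXT memo field are assumptions.
LOCAL lcText
lcText = PdfToText("C:\Tools\xpdf\pdftotext.exe", "C:\CARDS\sample1.PDF")
IF NOT EMPTY(lcText) AND USED("CONTACT")
	REPLACE OCRTEXT WITH lcText IN CONTACT
ENDIF

FUNCTION PdfToText(tcExe, tcPdf)
	LOCAL lcTxt, lcCmd, loShell, lnExit, lcResult
	lcResult = ""
	lcTxt = FORCEEXT(tcPdf, "txt")
	* -layout asks pdftotext to keep the physical layout of the page.
	lcCmd = '"' + tcExe + '" -layout "' + tcPdf + '" "' + lcTxt + '"'
	* Run(command, 0, .T.): window style 0 hides the console window, and .T.
	* waits for the program to end and returns its exit code.
	loShell = CREATEOBJECT("WScript.Shell")
	lnExit = loShell.Run(lcCmd, 0, .T.)
	* pdftotext returns 0 when no error occurred.
	IF lnExit = 0 AND FILE(lcTxt)
		lcResult = FILETOSTR(lcTxt)
	ENDIF
	RETURN lcResult
ENDFUNC
```

This only returns text if the PDF carries a text layer, which an OCR'd PDF should, since the text can be selected and copied by hand.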

Best regards
Cata

Olaf Doschke · Sep 3, 2015

From what I read at Transym, it can output as txt, so why not take that option? The PDF might keep the format, so it's worth keeping, too, but the final result of OCR should be txt anyway, shouldn't it?

Bye, Olaf.

Scott24x7 · Sep 3, 2015

Olaf,
That's an interesting notion. I just am not familiar enough with working with APIs to understand how to get that call to work... :/
Not sure what I need to do there. Will take a look and see if I can get it to do it. You may be right then, no need to keep the PDF, as I keep the .JPG for displaying the image.

Best Regards,
Scott

Olaf Doschke · Sep 3, 2015

Well, I said the PDF may be worth keeping for having the text in the format, but actually I haven't seen it.

The information about converting to a simpler txt file comes from the FAQ pages of Transym. If you expand the question about supported file formats, it's not just informing about the input format options.
It should be a setting, in the same manner as I assume you needed to set up the automatic hooking into TWAIN scan events, or was that a default anyway?

From my scanner/printer I see that the default configuration writes scans into the Scans subfolder of the own documents->pictures library, so txt files may already be written in your documents folder or some subfolder?
Even if the documentation does not guide you enough, you know that you could use Process Monitor to see what some Transym exe or dll does and what files it addresses or creates?

Bye, Olaf.

Scott24x7 · Sep 3, 2015

If you look at my code above, you can see that the outputs are specific, and they are determined (amazingly) by just putting the extension you want on them... but that's out of EZTWAIN. It doesn't do the native "write to .txt" from OCR, but the OCR call does OCR the PDF automatically. So when I tell it to write to PDF (this is an EZTWAIN call, not a Transym call), EZTWAIN takes back the OCR'd PDF from Transym and writes it out when I include the .PDF extension. Kind of "black box magic" from my perspective.

I have discovered that the higher resolutions are needed for good OCR. 200dpi was fast, and reasonable enough for the card image, but for OCR 400dpi is the minimum to get decent results. Higher if the card has fine print.

Best Regards,
Scott

Olaf Doschke · Sep 3, 2015

Anyway the option to get txt should be in there somewhere, too.

Bye, Olaf.

Scott24x7 · Sep 3, 2015

Let me try writing it out, and see if by magic it just works.

Best Regards,
Scott

Scott24x7 · Sep 3, 2015

Mmm, nope that didn't work.
It just gives me a giant binary file. When I change the extension from .TXT to .BMP, the files show as .BMP images... so it's still grabbing the whole raw image for that extension type...

Best Regards,
Scott
