OCR Zoning - Field data for auto-sorting is not being considered in 1 of 2 possible cases.

Status
Not open for further replies.

TomFra

Technical User
Aug 18, 2022
6
LU
Hi there,

Here is an explanation of the problem: what the script is supposed to do and what it actually does.

We use document scanners to digitize and sort documents automatically: OCR zoning plus a VBScript file the PDFs into the correct subfolders on the server.

The script is supposed to check whether that zone contains valid content. If so, it files the document in the matching folder; if not, it puts it in a "to be sorted" folder.

When documents are scanned, 99% get sorted perfectly.
The rejected documents land in a specific "to be sorted" folder as TIFF files and are re-injected manually. This happens when the initial process was interrupted by timeouts, network errors etc.

My guess: the problem lies in some step between "receiving Text Zone" and "Text zone after space trim". It seems to affect the paper scan and the processed TIFF file differently.

I'm attaching a RAR archive with the text files (the post only allows one attachment, but it helps to see the difference between both cases).

The same script is used for paper scans and for re-processing the TIFF files that fall into a "rejected" folder for unknown failure reasons.

Oh, for explanation: our references are built like this: 22-08-0155-01, meaning year, month, file number and version. Our server folders look like this:
\\blablabla\missions\2022\08\0155\ <=== the file is supposed to land here. 99% of the files land here after paper scanning.
\\blablabla\missions\to be sorted\ <=== this is where literally everything that is re-processed lands. I mean, 10% would be okay because of missing input in the OCR zone etc., but currently nothing gets sorted. BUT see the log file: you can see that the trimming works and the reference number is found, and yet... "no match found!".
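
To make the mapping from reference to folder concrete, here is a minimal sketch. It is not the attached script: the function name TargetFolder, the anchored pattern and the assumption that a two-digit year always belongs to the 2000s are purely illustrative.

```vbscript
Option Explicit

' Hypothetical helper: returns the target folder for a reference such as
' "22-08-0155-01", or an empty string when the text is not a valid reference.
Function TargetFolder(ByVal root, ByVal reference)
    Dim re, matches, parts
    Set re = New RegExp
    ' yy-mm-nnnn-vv, anchored so that partial or garbled OCR output is rejected
    re.Pattern = "^(\d{2})-(\d{2})-(\d{4})-(\d{2})$"
    TargetFolder = ""
    Set matches = re.Execute(reference)
    If matches.Count = 1 Then
        Set parts = matches(0).SubMatches
        ' Assumption: the two-digit year always belongs to the 2000s.
        ' The version (parts(3)) is not part of the folder path.
        TargetFolder = root & "\20" & parts(0) & "\" & parts(1) & "\" & parts(2) & "\"
    End If
End Function

' Valid reference: prints \\blablabla\missions\2022\08\0155\
WScript.Echo TargetFolder("\\blablabla\missions", "22-08-0155-01")

' Unreplaced placeholder: the result is empty, so the file goes to "to be sorted"
If TargetFolder("\\blablabla\missions", "~FRO::%OCRZone%.1~") = "" Then
    WScript.Echo "no match found!"
End If
```

WScript.Echo only works when the file is run with cscript or wscript; inside the scanning software the logging call would be whatever the host provides.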

Sorry for the tons of text - as the problem is very specific, I thought you might want plenty of information to help isolate the potential error source.

Contents of the RAR:
Error log tiff scan - the log I get when I re-process the document; look for the "no match found!" message. That's the issue I'm trying to figure out.
Success log paper scan - same script running on a paper scan; works like a charm.
Script - the magic happens here

Many thanks to those who took the time to read my post and still have the courage to look into the script & log files! ;)

TF
 
 https://files.engineering.com/getfile.aspx?folder=d5517207-9a52-47f4-b2de-becbdd760b72&file=script_and_logs.rar
Quote (TomFra): My guess: the problem lies in some step between "receiving Text Zone" and "Text zone after space trim". It seems to affect the paper scan and the processed TIFF file differently.
Have you looked into the script?
Between these 2 steps the script only trims the string ocrzone and removes the spaces from it:
Code:
	EKOManager.StatusMessage(">>> Received text zone: " & ocrzone)
	
	Dim sTxt
	Dim sTxt2
	Dim sTxt3
	Dim sTxt4
	
	ocrzone = Trim(ocrzone)
	ocrzone = Replace(ocrzone, " ", "")
	EKOManager.StatusMessage(">>> Text zone after space trim: " & ocrzone)

Quote (TomFra): BUT see the log file: you can see that the trimming works and the reference number is found, and yet... "no match found!".
If you look into the files, the result "match found" comes when a pattern like 22-08-0091-01 is found in the string ocrzone:
Code:
Information	>>> Received text zone: 22-08-0091-01	08/18/2022 14:55:58
Information	>>> Text zone after space trim: 22-08-0091-01	08/18/2022 14:55:58
Information	        - match found: 22-08-0091-01	08/18/2022 14:55:58
but the result "no match found!" comes when the string ocrzone contains no pattern like 22-08-0091-01 and instead contains the unreplaced placeholder ~FRO::%OCRZone%.1~:
Code:
Information	>>> Received text zone: ~FRO::%OCRZone%.1~	08/18/2022 15:01:58
Information	>>> Text zone after space trim: ~FRO::%OCRZone%.1~	08/18/2022 15:01:58
Information	        - no match found!	08/18/2022 15:01:58
I guess that somewhere else in your process the variable ~FRO::%OCRZone%.1~ should be replaced by a value like 22-08-0091-01, but the replacement failed. That is why your script reports "no match found!".
In the code you provided there is only one subroutine, but the error is not in this subroutine.
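
Wherever the replacement fails, the script could at least recognise this situation and log it explicitly instead of reporting a plain "no match found!". The following is only a sketch: the helper name IsUnresolvedPlaceholder and the pattern are assumptions derived from the log excerpts above, not from the product documentation.

```vbscript
Option Explicit

' Hypothetical helper: True when the text still looks like an unreplaced
' placeholder of the form ~XXX::something~ rather than real OCR output.
Function IsUnresolvedPlaceholder(ByVal text)
    Dim re
    Set re = New RegExp
    ' Assumed shape: tilde, letters, two colons, anything, closing tilde
    re.Pattern = "^~[A-Za-z]+::.*~$"
    IsUnresolvedPlaceholder = re.Test(Trim(text))
End Function

Dim sample
For Each sample In Array("~FRO::%OCRZone%.1~", "22-08-0091-01")
    If IsUnresolvedPlaceholder(sample) Then
        WScript.Echo sample & " -> placeholder was never replaced"
    Else
        WScript.Echo sample & " -> looks like real zone text"
    End If
Next
```

A log line of the first kind would show immediately that the problem sits before the script, not in the trimming or matching.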
 
In the successful case, with the result "match found", your log looks like this:
Code:
Information	OCR:  Performing zoned OCR...	08/18/2022 14:55:58
Information	OCR:  Adding zone RRT ~FRO::%OCRZone%.1~=22-08-0091-01...	08/18/2022 14:55:58

In the case with the result "no match found!", your log looks like this:
Code:
Information	OCR:  Performing zoned OCR...	08/18/2022 15:01:58
Information	OCR:  Adding zone RRT ~FRO::%OCRzone%.1~=LL—UO—UU7 1—ti...	08/18/2022 15:01:58

This implies that you should look for the error in the processing step that writes this to the log:
Code:
Information	OCR:  Adding zone
This processing step runs before the VBScript processing, see the logs:
Code:
...
...
Information	OCR:  Extracting data from document page 1...	08/18/2022 15:01:58
Information	OCR:  Performing zoned OCR...	08/18/2022 15:01:58
Information	OCR:  Adding zone RRT ~FRO::%OCRzone%.1~=LL—UO—UU7 1—ti...	08/18/2022 15:01:58
...
...
Information	>>> Entering VB script ===============================================>	08/18/2022 15:01:58
Information	>>> Received text zone: ~FRO::%OCRZone%.1~	08/18/2022 15:01:58
Information	>>> Text zone after space trim: ~FRO::%OCRZone%.1~	08/18/2022 15:01:58
Information	        - no match found!	08/18/2022 15:01:58
...
...
 
Hi Mikrom, thank you for the feedback !

I did some quick tutorials and read an ebook to pick up some basic notions of VBScript, as this is not my element at all. I now understand loops and essentials like Dim, FSO, Do While, Else etc., but this script goes a bit further, so my understanding is... well... moderate ;) On top of that, English is my 4th language and the one I learned last, and not so long ago, which adds some "challenge" to the whole situation. That is why I joined the forum: to look for some external knowledge and wisdom to help me out :)

I saw these lines; that's why I wrote that the paper scan results in "match found". When I use this exact file for the "re-processing" test, I get the second log file, which does not seem to obtain the same result in the OCR zone trimming and, as a consequence, gives a "no match found!" result. But... why? It's the same file, the same OCR zoning, the same script?

I'll check for the "adding zone" part and see what happens here and why.

Thank you for your precious time and help ! :)

TF
 
Short update to this post ==> I think I might have found a lead.

When the image is scanned, it is first digitized into a TIFF. OCR zoning is done and the file is sorted properly.

For my tests, I take the final file (which is now a PDF) and I wonder whether, when I re-process it, margins are redefined, which would slightly shift the OCR zone, which is initially at the bottom right of the page. If 1 cm is cropped during the first pass, another 1 cm would be cropped on re-processing and the zone would be "cut in half"; the PDF would then give a bad result where the initial process worked perfectly.

My conclusion, if this really is the reason: for re-processing I would need another script that leaves out the section where the document is first scanned, transformed, cropped etc.

I'll have to check this in the software application itself (Nuance Kofax Autostore) to see if there is some "image processing" I can remove in the re-processing sequence.

Some tests will be made and I'll let you know about the results.
 
update: I removed the "image processing" part from the "re-processing" to avoid cutting edges etc.

Result in the log file :

Code:
Information	OCR:  Extracting data from document page 1...	08/19/2022 08:13:32
Information	OCR:  Performing zoned OCR...	08/19/2022 08:13:32
Information	OCR:  Adding zone RRT ~FRO::%OCRzone%.1~=22-08-0091-01...	08/19/2022 08:13:32
Information	OCR:  Extracting data from document page 2...	08/19/2022 08:13:32
Information	OCR:  Performing zoned OCR...	08/19/2022 08:13:32
Information	OCR:  Replacing Zone RRTs...	08/19/2022 08:13:32
Information	OCR:  Replacing RRT ~FRO::%OCRzone%.*~=22-08-0091-01...	08/19/2022 08:13:32
Information	OCR:  Replacing RRT ~FRO::%OCRzone%..pages~=1...	08/19/2022 08:13:32
Information	OCR:  Replacing RRT ~FRO::PagesCount~=2...	08/19/2022 08:13:32
Information	OCR:  Completed processessing of file C:\AC\TEMP\{CB7FE1F4-802F-4c5f-9A56-BA0B3EFDD244}\recovered scan.pdf	08/19/2022 08:13:32
Information	OCR:  1 of 1 documents are successfully processed	08/19/2022 08:13:32
Information	OCR:  Exiting...	08/19/2022 08:13:32
Information	Memory status  WorkingSetS: 94425088, PageFileUsage: 54054912, PageFaultCount: 41300, PeakWSS: 94429184, PeakPfU: 54054912, HandleC: 1036.	08/19/2022 08:13:32
But then,
Code:
Information	>>> Entering VB script ===============================================>	08/19/2022 08:13:32
Information	>>> Received text zone: ~FRO::%OCRZone%.1~	08/19/2022 08:13:32
Information	>>> Text zone after space trim: ~FRO::%OCRZone%.1~	08/19/2022 08:13:32
Information	        - no match found!	08/19/2022 08:13:32

I mean... how... why... what? In part one it tells me that it finds the OCR zone, which is 22-08-0091-01. That's the match. And then it "enters VB script" again, and this time it replaces the value with the initial text, which is ~FRO::%OCRZone%.1~

As if there were a syntax issue, with too many "" or ~~, making it use the raw text instead of the value stored inside the variable.
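
One detail that is visible in the log excerpts in this thread: in the failing runs the OCR step registers the zone as ~FRO::%OCRzone%.1~ (lowercase z), while the script receives ~FRO::%OCRZone%.1~ (capital Z). In the successful paper-scan log both use the capital Z. Whether the software matches these names case-sensitively is an assumption that is not confirmed anywhere in this thread, but the sketch below shows how easily such a difference hides in a comparison.

```vbscript
Option Explicit

Dim registered, requested
registered = "~FRO::%OCRzone%.1~"   ' name written by the OCR step in the failing log
requested  = "~FRO::%OCRZone%.1~"   ' text the script received

' StrComp returns 0 when both strings are considered equal.
' vbBinaryCompare is case-sensitive, vbTextCompare is not.
If StrComp(registered, requested, vbBinaryCompare) = 0 Then
    WScript.Echo "binary compare: equal"
Else
    WScript.Echo "binary compare: different"
End If

If StrComp(registered, requested, vbTextCompare) = 0 Then
    WScript.Echo "text compare: equal"
Else
    WScript.Echo "text compare: different"
End If
```

Run with cscript, this prints "binary compare: different" and "text compare: equal". If the names are matched the first way, the field passed to the script would have to be spelled exactly like the zone name in the re-processing workflow.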

TL;DR - wasn't the cropping margins situation, still looking for the epiphany.

 
Hi TomFra,

The script you posted contains only 2 subroutines:
1) MyScript_OnLoad
2) MyScript_OnUnload - but this is empty.

The subroutine MyScript_OnLoad works with the variable ocrzone. However, it is not clear from the code where MyScript_OnLoad gets the variable ocrzone from ...

There should be another part of the VBScript that sets the variable ocrzone and then calls the subroutine MyScript_OnLoad; I cannot think of a different way in which MyScript_OnLoad could be called from another program during the scanning process.
 
Hi Mikrom,

I'll see if I can find out somehow.

Thank you for your time and efforts !

Tom
 
Hi TomFra,

I found some documentation here ... maybe it's the system you are using:
In the documentation there are screenshots: the script is defined in the tab General and the fields for the script are defined in the tab Fields. You can look there to see how your field ocrzone is defined.

But IMO the best option would be to ask the support of your software, or maybe there is a specialized forum (for example here) where you can ask your questions, or maybe they have a mailing list.
 
Hi Mikrom,

thank you very much!

I was looking here because it felt like a script problem, but even that is not certain.

I'll check the docs you linked and see if the community around that software might help, then.

Thank you !

Tom
 