Weird Val() Behavior

stanlyn · Apr 10, 2014

Hi,

If lcTempWord equals "6" (char 6) when hitting the code block below, what could cause "Val(lcTempWord)" to return as numeric 0 instead of numeric 6?

After the debugger runs through the code block below the debugger states the following:
lcTempWord has a value of "6", type=C
Val(lcTempWord) has a value of 0, type=N
Val(lcTempWord) > 2 has a value of .F., type=L

code block...
If Val(lcTempWord) > 0
   llMatchFound = .T.
Endif

VFP9sp2 on win7-32

Thanks,
Stanley

stanlyn · Apr 11, 2014

Hi Griff,

Would the JustDigits() function keep the digits together such as 2 or 222 or 2222? I'll actually try the function later on today.

Thanks,
Stanley

stanlyn · Apr 12, 2014

Hi Olaf,

>> ? ''+0h6F6633 displays as "of3". Are you sure that's displaying as "6" for you?
>> o and f surely are not invisible chars.

Yes, I reported it correctly. I always step through the debugger a line at a time, watching the values change, and it runs or skips code consistently with those values. In other words, if the debugger says a value is .T., it actually steps in and runs the .T. branch and skips the .F. branch.

>> What's your real world problem?
Working with less than ideal text from ocr generated from low quality (50 year old) documents... The older books are hand written, therefore there is no ocr processing done on them. They have to be hand keyed if full text search is to be achieved.

Thanks,
Stanley

Mike Lewis · Apr 12, 2014

lcTempWord is a result of foxtool's word() function

Are you aware that VFP has native GETWORDCOUNT() and GETWORDNUM() functions? What's more, those functions let you specify any delimiters you like for a word. They could perhaps save you a couple of steps.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads

Olaf Doschke · Apr 12, 2014

>Working with less than ideal text from ocr generated from low quality (50 year old) documents
Also "of3" is clearly not 6 and seems an OCR error missing a space.

I don't know why you would then use VAL() on words. To distinguish numbers from words? Well, as it turns out, you have a missing space there between "of" and "3", but no 6 anymore. Whatever that was, it hasn't repeated.

The overall solution would be to create a dictionary of words, start with eg the data of

Then put words that are not found, like "of3", in a list of errata to look over manually, including the source where they were found. And of course any number in a valid format is uninteresting to put into a dictionary. You can analyze numeric strings with regular expressions or with functions like ISDIGIT. If you have a word containing both digits and letters, you may not have a number but some product code, device name, or whatever. It's a candidate for manual inspection.

The mystery of the "6" not being a 6 still remains, but I guess you'll never step over it again. And by the way, Vilhelm-Ion Praisach, your example results in chr(0)+"6"; we had that already, and it also doesn't show as "6" in the debugger Locals window. It shows as " 6", so there is no guessing why it's not evaluated as 6. With GETWORDNUM you would eliminate all spaces anyway, so any remaining white space is not a tab, space, or anything else you'd get from OCR, but chr(0) or some other nonprintable, nonvisual character; the debugger then shows a space where, after GETWORDNUM trimming, there shouldn't be any. So GETWORDNUM is a solution.
I'd rather use ALINES with a "line" delimiter of Space, as that doesn't repeatedly scan the text from the beginning to the Nth word. It's faster at splitting a text into words.
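
As a rough illustration of the ALINES approach, here is a minimal sketch. The sample text and the names lcText, laWords and lnCount are made up for this example and are not taken from the actual routine; it assumes VFP 9, where ALINES accepts flags and several parse characters.

```foxpro
* Sketch: split a text into words in a single pass with ALINES().
* lcText, laWords and lnCount are illustrative names only.
LOCAL lcText, lnCount, lnI
LOCAL ARRAY laWords[1]
lcText = "Page 1 of3" + CHR(13) + CHR(10) + "6 Lynette"

* Flags: 1 = trim each element, 4 = leave out empty elements.
* Space, CR, LF and tab are all passed explicitly as word separators.
lnCount = ALINES(laWords, lcText, 1 + 4, " ", CHR(13), CHR(10), CHR(9))

* List every word together with its position.
FOR lnI = 1 TO lnCount
   ? lnI, laWords[lnI]
ENDFOR
```

Because the array is filled in one call, the loop can look at each word without rescanning the text, which is the point of preferring this over repeated GETWORDNUM calls.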

Also see how I did this and indexed a text here: thread184-1553613
Look in the lower third of the thread.

Bye, Olaf.

Olaf Doschke · Apr 12, 2014

I forgot to add:

start with eg the data of... http://www.sweetpotatosoftware.com/spsblog/CommentView,guid,8800bdb9-a9c2-484f-942f-6a08947d903a.aspx

The dictfinal.dbf inside the download has over 100.000 words in it.

You can count the frequency of words found and then put words just found once into an errata list.

Bye, Olaf.

stanlyn · Apr 13, 2014

Hi Olaf,

I'm already using Craig's dictionary from his spell check code. My version has a little over 144000 words in it.

The routine I'm using to determine if a string is really a number is failing at times and I need something that will not fail. Currently I'm adding 1 (one) to the string as in:
tt = Val(lcTempWord)+1
and if tt is greater than 1, it's a real number. And if tt equals 1.00, it has characters in it and is therefore not a number but a string that contains numbers. I can iterate through the string testing each position for a number, and if all positions are digits, then it's a number.

A recent case where this failed was processing the string "01E569177$434.00Surcharges", which raised a "Variable '00000001: Incorrect function' is not found" error message. Someone, maybe you, mentioned there are better ways to determine this using expressions. What would you consider a bulletproof method for dealing with this?

I'll be posting a link to the actual file that contains the "of3" issue, as I was able to reproduce it. The string is actually shown as "Page 1 of36Lynette BarnettFrom:1-80" in VFP's edit box. This result is after running this cleanup code:

m.cFileName = "C:\temp\ocr.txt"
m.lcOriginalText = '' + Filetostr(m.cFileName)
lcInput = m.lcOriginalText
* Collect the characters to clean up: control chars, quotes, and 127-255
lcNonPrintableChars = ""
For lnI = 0 To 31
   lcNonPrintableChars = lcNonPrintableChars + Chr(lnI)
Endfor
lcNonPrintableChars = lcNonPrintableChars + Chr(34)   && double quote
lcNonPrintableChars = lcNonPrintableChars + Chr(39)   && single quote
For lnI = 127 To 255
   lcNonPrintableChars = lcNonPrintableChars + Chr(lnI)
Endfor
* Only Chr(0) maps to the one-character replacement ' '; every other listed character is deleted
lcStripped = Chrtran(lcInput, lcNonPrintableChars, ' ')

When I first reported it here in this thread, it was as previously stated with no mention of the 6 as is being shown now after some cleanup processing. I can send the file exactly as the ocr engine created it. Is it still needed?

Oh, I'll be looking at that piece of code from the link you provided in just a few minutes. It was too much for me to examine the other night when you sent it, and I could tell it was rather complex and I'd need some sleep before I could make sense of it.

Thanks,
Stanley

stanlyn · Apr 13, 2014

Olaf,

A copy-paste from the original ocr file is:

Page 1 of3
‘6

And with a cr after "of3"... The cleanup code would have removed the cr. Let me know if original file is needed...

Thanks,
Stanley

Olaf Doschke · Apr 14, 2014

No need to upload the file. If you already strip off unprintable chars you end up with "Page 1 of36", but still not a separate "6". It doesn't matter much, as it's just one special example.

One thing you MUST do is replace CR (chr(13)) and LF (chr(10)) with " " (Space), not strip them off. Line breaks are word separations too. If you simply replace CR and LF with nothing, you get concatenated words whenever there is a line break. Replacing them with spaces would already get rid of some bad situations.
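
A minimal sketch of that replacement, assuming the text is held in a variable (lcRaw and lcFlat are illustrative names, and the sample string is made up from the lines quoted above):

```foxpro
* Sketch: turn line breaks into spaces instead of deleting them.
LOCAL lcRaw, lcFlat
lcRaw = "Page 1 of3" + CHR(13) + CHR(10) + "6"

* The replacement string has two spaces, so CHRTRAN maps
* CR to the first space and LF to the second one.
lcFlat = CHRTRAN(lcRaw, CHR(13) + CHR(10), "  ")

? lcFlat
```

The doubled space left by a CR+LF pair is harmless if the words are split afterwards with a function that skips empty elements.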

You would still have "of3" recognized as neither a word nor a number. But adding +1 doesn't gain any more information. VAL() also evaluates scientific notation, e.g. VAL("1e2") is 100; that's your other problem, as too high an exponent leads to errors. Simply DON'T use VAL; instead check whether a word contains a digit with ISDIGIT() (which only tests the first character, so you'd check each position), or remove all digits and see whether the word gets shorter. You don't really care what number you have there, you care about not having a valid word, don't you? Why use VAL at all?

Code:

IF LEN(chrtran(lcTempWord,"0123456789",""))<LEN(lcTempWord) 
   * There are digits inside lcTempWord
   * put this word in an errata list or ignore it
   * don't look it up in a dictionary, you won't find it.
ENDIF
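
Going one step further, the same CHRTRAN idea can tell an all-digit string from a mixed one without ever calling VAL(). This is only a sketch: WordKind is a hypothetical helper name, and the categories are an assumption about what the errata logic needs.

```foxpro
* Sketch: classify a word by its digits without calling VAL().
* WordKind is a hypothetical helper, not an existing VFP function.
? WordKind("6")                            && NUMBER
? WordKind("of3")                          && MIXED
? WordKind("01E569177$434.00Surcharges")   && MIXED
? WordKind("Surcharges")                   && WORD

FUNCTION WordKind(tcWord)
   LOCAL lcNoDigits
   * Remove every digit; what remains are the non-digit characters.
   lcNoDigits = CHRTRAN(tcWord, "0123456789", "")
   DO CASE
   CASE LEN(tcWord) = 0
      RETURN "EMPTY"
   CASE LEN(lcNoDigits) = 0
      RETURN "NUMBER"   && digits only
   CASE LEN(lcNoDigits) < LEN(tcWord)
      RETURN "MIXED"    && digits plus other characters
   OTHERWISE
      RETURN "WORD"     && no digits at all
   ENDCASE
ENDFUNC
```

A string such as "01E569177$434.00Surcharges" lands in the MIXED group and never reaches VAL(), so its exponent is never evaluated.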

The VFP like GOOGLE thread I pointed you to earlier doesn't focus on OCR text validation, but the aspect of indexing text for searching.

Bye, Olaf.

Olaf Doschke · Apr 14, 2014

...and just by the way: a tick or backtick at the start of a string doesn't show up in the debugger Locals window as obviously as a space or chr(0), so there you have your impression of using VAL() on "6"; you didn't. It's as simple as that.

Bye, Olaf.
