
Function to Detect Non-ASCII Characters Needed


stanlyn

Hi,

I need a routine that can quickly detect whether a string contains any non-ASCII characters that come from field corruption. Below is a copy-paste of a string that is killing ODBC. I'd like to test each string before handing it over to SPT.

These strings can be up to 4,000 characters long and there are millions of records. Iterating through the string character by character and testing each one for non-ASCII values would work, but it is highly impractical because it would take forever.

Actually, I'm getting about 100 such corrupted fields for every 200,000 records. Any suggestions?

Thanks, Stanley



Actual data from a failed command...

INSERT INTO [DEEDROOM].[dbo].[DriLog] ( [county_ID], [ENTRY_TYPE], [SESSION_ID], [CREATOR], [ENTRY_time], [ENTRY_PID], [county_PAGEKEY], [Page_Type], [Book_Num], [Page_Prefix], [Page_Num], [SubPage_NUM], [SCANERNAME], [RESOLUTION], [TOP],
, [WIDTH], [HEIGHT], [COLORDEPTH], [DUPLEX], [DESPECKLE], [DESKEW], [AUTO_TRIM], [OCR], [INVERT], [IMAGE_PATH], [COMPUTER], [sync_time], [is_deleted] ) VALUES ( 'KY025', '㄀', 0, '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', '01/01/0001 00:00:00 AM', '㄀㄀㄀㄀㄀㄀', 'KY025㄀㄀㄀㄀㄀㄀㄀㄀㄀', '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', '㄀㄀㄀㄀', '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', 0, 0, 0, 0, 0, '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', 0, 0, 0, 0, 0, 0, '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', '09/21/2012 16:31:40', 0 )​
 
Stanley,

Glancing at your code, it seems there is a big inefficiency in it.

If I've understood it right, you are looping through an array containing a list of the tables in the database. For each table, you are looping through the list of fields. And for each field, you are looping through each record in the table.

I think it would be better to reverse the nesting of the final two loops. In other words, loop through the records in the table. For each record, loop through the fields. That way, you only have to do one pass per table, rather than one pass per field per table.

One other small point: Do any of the tables contain memo fields? If so, then you need to look for "M" as well as "C" when testing the field type.
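
To make that concrete, here is a minimal sketch of the reversed nesting, assuming a free table opened under a placeholder alias (the table path, alias and variable names are illustrative, not your actual code):

Code:
* Sketch of the reversed nesting: gather the field list once per table,
* then make a single pass over the records.
LOCAL lcTable, laFields[1], lnFieldCount, lnI, lcValue
lcTable = "drilog.dbf"                && placeholder path
USE (lcTable) ALIAS curCheck
lnFieldCount = AFIELDS(laFields, "curCheck")

SCAN
	FOR lnI = 1 TO lnFieldCount
		* character and memo fields are the ones that can carry stray CHR(0)
		IF INLIST(laFields[lnI, 2], "C", "M")
			lcValue = EVALUATE("curCheck." + laFields[lnI, 1])
			IF CHR(0) $ lcValue
				? "CHR(0) in field", laFields[lnI, 1], "record", RECNO("curCheck")
			ENDIF
		ENDIF
	ENDFOR
ENDSCAN
USE IN curCheck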

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips, training, consultancy
 
As Mike pointed out, your code performs poorly. Even the code I gave you earlier (IF LEN(CHRTRAN(lcSQLString, lcUnwantedChars, "")) = LEN(lcSQLString)) would instead inspect the final SQL string you put together, and thereby test all fields of a record in one go before executing its INSERT. Even that would be faster than your current approach.
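
Spelled out as a runnable sketch (lcSQLString is the assembled statement from your code; lcUnwantedChars is whatever set of bytes you treat as bad):

Code:
* The earlier one-liner as a check before handing the statement to SPT.
* lcSQLString is assumed to hold the INSERT command you built elsewhere.
LOCAL lcUnwantedChars, llClean
lcUnwantedChars = CHR(0)              && extend with further unwanted chars as needed
llClean = (LEN(CHRTRAN(lcSQLString, lcUnwantedChars, "")) = LEN(lcSQLString))
IF llClean
	* safe to pass lcSQLString to SQLEXEC()
ELSE
	* log this record and repair or skip it before the insert
ENDIF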

Now, if you want to find 0-bytes in your whole database, another, more holistic approach would be much more efficient: reading the DBFs via low-level file functions. You need to skip the bytes belonging to non-character fields, which is not straightforward, but also not impossible. A simple way to avoid evaluating those fields is to extract all character fields into a separate DBF; then you can scan the whole DBF file except the header bytes in one go, without reading single field values. The HEADER() function tells you how many bytes of a DBF file to skip to get to the first record; after that it's RECSIZE() bytes for each record.

If records contain memos, the DBF stores 4-byte integers representing offsets into the FPT file associated with the DBF. You can handle that separately, but these 4-byte pointers need to be skipped, just as you need to skip integer or numeric fields, date and datetime fields, and others.

We can also take approaches that will not tell you the exact corrupt records, but just flag a corrupt DBF in the sense that unwanted characters are present. So the question is your overall goal: would you only like to skip single corrupt records, or find corrupt DBF files overall? I already warned you that partially inserting incoming data is not a good strategy. If a file is sent in via FTP or mail, for example, and for some reason is corrupted in some block, you are better off having the whole file resent than partially inserting the data that came in correctly.

If you have corruption due to network failures, you will more likely find lengthy bursts of 0-bytes than anything else.

So, depending on that, you could also use a rough-estimate algorithm that doesn't search for CHR(0) in single fields, but checks REPLICATE(CHR(0),32) $ FILETOSTR(dbffilename). That would be much, much faster at finding such larger bursts of 0-bytes, which would only seldom occur as correct data, e.g. in tables with many integer fields all being 0.
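
As a sketch, that rough check is just one line plus a report (the file name is a placeholder; tune the burst length to your table layout):

Code:
* A long run of 0-bytes almost always means corruption, unless the table has
* many adjacent integer/numeric fields legitimately holding 0.
IF REPLICATE(CHR(0), 32) $ FILETOSTR("drilog.dbf")
	? "drilog.dbf contains a burst of 0-bytes - inspect it further"
ENDIF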

You have the broken files, so I would suggest you open one of them in a hex editor (you have at least the hexedit.prg that comes with VFP to do that), look for bursts of 0-bytes in the DBFs, and see whether that is a pattern you find in them.

Also, if this is all about finding out whether the transfer of a file succeeded, there are much easier mechanisms using checksums at the file level. You wouldn't analyse the inner workings of a file to check for transfer errors. You would add MD5 checksums in one additional file to send, and check them against MD5 checksums computed from the files directly. Another very simple approach is to send zips: zips can be created and extracted programmatically and will fail to extract if they were not transferred intact, which also simply tells you when to do retransfers.
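
VFP has no native MD5 without API calls, but as a sketch of the same idea, SYS(2007) with its CRC32 flag (VFP 9) can serve as the file-level checksum; the file names here are assumptions:

Code:
* Compare a checksum shipped alongside the file against one computed locally.
* SYS(2007, cString, 0, 1) returns a 32-bit CRC in VFP 9.
LOCAL lcFile, lcCrcSent, lcCrcLocal
lcFile     = "drilog.dbf"                          && placeholder file name
lcCrcLocal = SYS(2007, FILETOSTR(lcFile), 0, 1)
lcCrcSent  = CHRTRAN(FILETOSTR(lcFile + ".crc"), CHR(13) + CHR(10) + " ", "")
IF lcCrcLocal == lcCrcSent
	? "Transfer OK"
ELSE
	? "Checksum mismatch - request a retransfer of " + lcFile
ENDIF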

Bye, Olaf.
 
Hi all,

For additional clarification...

I ran into this chr(0) issue while developing a synchronizing tool that syncs the VFP data to SQL at the record level. Therefore, I'm only interested in the records that fail to sync for whatever reason. I can then deal with them; for now it's the chr(0) and whatever else could be in a numeric field. In the problem records where chr(0) was found in almost all char fields, the numeric fields have some sort of strange character. I'll test for this later. Also, in this group one char field is OK and everything else contains unprintables.

At this point I'm not transferring them via FTP, HTTP or anything else, as I got complete copies of the databases from each location and I'm processing them locally for now (while developing and testing). I do not know how this select few records got the chr(0) embedded in them.

The code above that uses the AFIELDS() function was written as a simple utility that can sanitize the VFP database of this chr(0) stuff and be executed as needed; it was not intended to be part of the app. I do expect to use some of the bits and pieces discussed here in the final app. And of course the chr(0) check must be expanded to include other issues related to bad data.

Thanks for the "M" data type tip, as I have not gotten to it yet. Now that I can interrogate the underlying data, I will add the other bad-data checks.

In testing the numerics I spoke of above, it turns out they were not numerics; instead they are logicals. The browse displays the data strangely as a fat "I", but when printing the value, it prints as .F. Also note that it's only in the records whose char fields contain mostly chr(0)s. Here is a screenshot. Any suggestions?

More later...
Stanley

http://www.stanlyn.com/public/Picture0009.png
 
To all,

When trying to find the strange characters that cause logical fields to display weird characters in the browse window, I get a data-type error from the code below, which would be consistent with comparing char data to logical data. Take a look at the previous screenshot and you can see it all...

Code:
Filecontent = duplex
Clear
For i = 0 To 32
	?Chr(i) $ Filecontent
Endfor

Thanks, Stanley
 
Stanlyn,

at this stage it's totally clear your copies are just not correct. Don't try to heal a dead horse. All the code you are working on now can later be replaced by checking checksums and rejecting wrong copies, no matter whether they come in from manual file copies, FTP or HTTP.

Your file obviously was not copied correctly, or the original file already had these errors. It's unusable data.

To understand the issue with your file, look at it in a hex editor. In the uncorrupted logical fields, where FoxPro's browse displays .T. or .F., or perhaps .NULL. (if the field is nullable), or space (for blank logical fields), the hex editor (DO Home()+"Tools/Hexedit/Hexedit.prg") will show T, F or space (null is denoted by a set bit in the last byte(s) of the record, but that's not really important here). This means any char other than T, F or space causes the browse to display nonsense, while accessing the field prints .F., because the value is not T. So VFP doesn't really care whether you store F to display .F.

But that's just details. A blind person could see these records are corrupted.

So simply redo the copying of these databases. It will not help you to skip corrupt records or fields: if you sync from such a corrupted table, the data you insert is not complete, and therefore even the correct records would not help you. E.g. if the parent record of an order is corrupt, you can't insert the order-detail records, even if they are correct. Don't try to mend this problem if it's just because of a corrupt file copy. You can come back to this if a further copy of the data shows the same corruption again, meaning the source data is already corrupted. But then you would need further code to save the intact data of records related to each other by database relations, and that again needs a more holistic approach.

At the moment you're stuck on a detail that will probably never be a problem again, unless the source data is already corrupt.

>I got complete copies of the databases from each location
The thing to do for the databases not importing 100% is to ask for complete copies again. Then see if the second copies also show the same corruptions. It's likely you just had a problem while copying, or a corrupt transfer or zip file or whatever, because otherwise your external location would report problems when working in the application.

What do you think?

Bye, Olaf.

 
Another thing I noticed about the screenshot: bad records are marked with Creator="BAD". Assuming all bad records have such a marking, wouldn't that be much easier to check for?

Where does this drilog.dbf come from? It could also be a Clipper DBF instead of a Fox DBF. What's the first byte of this DBF? Again, DO Home()+"Tools/Hexedit/Hexedit.prg" and just tell us the value of the first byte (address 00000000).
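
If you'd rather check that programmatically than in the hex editor, a quick sketch (as far as I recall, 0x30 marks a Visual FoxPro table, while 0x03 is a dBASE/FoxBASE-style header of the kind Clipper writes):

Code:
* Read the DBF version byte without opening the table
? TRANSFORM(ASC(LEFT(FILETOSTR("drilog.dbf"), 1)), "@0")   && e.g. 0x00000030 for a VFP table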

Bye, Olaf.
 
Another thing to note: if you want to upload whole databases into SQL Server, there is an upsizing wizard in VFP. It doesn't help with your corruption, but once you have made sure your copies of the local stores are intact, you can import all the data without programming individual code. Data migration is such a common problem that it cries out for already having been solved. Just one hint: don't use the native upsizing wizard, use the improved one from VFPX:
Bye, Olaf.
 
Mike, that won't help with the error Stanlyn has in this loop, as the type of the duplex field is logical. Therefore Filecontent is also logical and the $ operator fails, no matter whether you go up to chr(31) or chr(32).
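
For completeness, the runtime error itself could be avoided by guarding on the field type, though as said that still won't reveal the bad byte stored in a logical field:

Code:
* $ needs a string on its right side; only run the loop for character content
IF VARTYPE(Filecontent) = "C"
	FOR i = 0 TO 31
		? CHR(i) $ Filecontent
	ENDFOR
ENDIF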

Bye, Olaf.
 
Then why write about such a detail? I was already writing about a much more general approach and want to stop Stanlyn from going down a route that only leads into more detail problems he wouldn't have in the first place, if he can just get intact copies of the databases and then also use standard data migration tools like the advanced VFPX upsizing wizard.

We really all led him down the wrong track, because we were not seeing the larger problem. But once you see a thread getting longer than average, you should ask about the larger-scale problem, get back to a larger-scale overview, and stop fiddling with details. I address your mistake because it would only lead to follow-up discussions about why limiting the loop to 31 also didn't solve the problem. I just want to nip this in the bud; it's not at all important if Stanlyn follows my advice to start again and go to the root of the problem instead of continuing from trunk to branch to twig to leaf.

Bye, Olaf.
 
After having calmed down a bit :)

The only thing I want to say, Stanlyn, is: if your database copies coming from the local stores are corrupt, first just repeat the copying of that data before writing any analysis code. It's understood that your current database copies have errors, but that doesn't mean the original files already had them.

Just copy the databases again and see whether the corruption really is permanent; ideally look into the original DBF files remotely, e.g. via TeamViewer. Do they already show irregularities in the same tables and records?

If there is no problem locally, you also don't need to dive into code for corruption analysis. The ideal way to ensure data has come over intact is to compute checksums via MD5, store them in e.g. Table.md5 per DBF or DBC, zip them together with all the files, then recheck these checksums after unzipping locally. If you detect a file with a checksum mismatch, copy it again.

If, and only if, it turns out the data at some local sites is already corrupt, then you can address this with analysis code. But then you will also need to decide what to do about these local stores, e.g. replace hardware or at least drives, and decide what data to save, what to write off as a loss, and what related data could reproduce some of the information. For now you're doing the last step first.

Bye, Olaf.

 
All,

This is what I know...

I've gotten multiple copies of each location's data by actually visiting the sites and copying the entire database to a portable drive. Also, in the past I've zipped up the databases and FTPed them to my site. No matter which way I do it, the corruption is there and it's consistent. I don't have a choice but to salvage these tables, as it's far better to mark 0.001 percent of the records bad, print them out, and deal with each one on an individual basis.

The good thing is that the tables with corruption are not part of any parent/child relationships; they are stand-alone. The approach I'm taking now will work with these non-parent/child tables. It involves doing all the work on the VFP side before sending anything to SQL. I'm iterating through each record and field, replacing bad data with a literal that indicates what the bad data was. For example, if a field has a chr(0) in it, I'm replacing that value with the literal "chr(0)", which tells me a chr(0) was found here originally, but now the field holds a valid string. Originally, when I was dealing only with chr(0), I was marking the field with the literal "BAD", which Olaf picked up on. Now that I'm dealing with the whole spectrum, I'm replacing the field's value with a literal version of what was wrong.
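
For illustration, a rough sketch of that replacement step (the field name is from the drilog table, but the log table and its columns are just placeholders, not my real schema):

Code:
* If a char field contains CHR(0), log it and overwrite the value with a
* literal marker so the record becomes a clean, syncable string.
IF CHR(0) $ drilog.scanername
	INSERT INTO fixlog (tablename, recno, fieldname, problem) ;
		VALUES ("drilog", RECNO("drilog"), "scanername", "chr(0)")
	REPLACE drilog.scanername WITH "chr(0)" IN drilog
ENDIF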

The tables were originally created by me in 1998 using VFP 3, so no Clipper here...

Olaf, you are saying that the corrupted logical values will always be .F., simply because they are not true, despite the corruption shown in the browse.

At this point I uncovered chr(0), chr(146), chr(151), chr(171), chr(163), chr(167), chr(147), chr(145) and others, and this is after some cleanup.

>> there is an upsizing wizard in VFP
Yes, I know. I tried it and had a lot of conversion issues, so I rolled my own where I have full control. And as you said, it cannot deal with the issues at hand, while the solutions presented here can, when coupled with my upsizing code.

And please do keep being detailed, as I've learned so much from the details...

Thanks, Stanley
 
>No matter which way I do it, the corruption is there and it's consistent
OK, then it's worth continuing to find a way to identify corrupt records.
I think, though, you should not do that by looping over records and fields; we should find a more general approach that handles the whole DBF file. It takes too long to loop over all records and fields, and the logical field type shows the problem: you can see wrong chars in the browse, but EVAL() of a logical will yield .F. anyway, so a wrong byte stored in the DBF file at the logical field position is not detected this way.

>Olaf, you are saying that the corrupted logical values will always be .F., simply because they are not true
Yes, that's what I see when I simply set a disallowed char in the logical field via the hex editor.

I don't know what you mean by "uncovering" chr(145) etc. in a logical field. Via the hex editor? (Then it would help to specify hex values instead.)

A logical field is actually stored with a length of one byte, not just one bit as logical .T./.F. would suggest. A byte is only a char in char-type fields; otherwise it's a portion of the file storing 8 bits with values from 0-255. So I wouldn't talk about chr(145), I would talk about the byte 0x91.

What you see in a browse window in the case of such an unexpected byte value is not what you read from the logical field, or what EVAL("field") returns. Your code only cares about the latter right now, which does not detect these faults. You would really need to read the bytes from the DBF file as the hex editor does, with no interpretation or evaluation of the bytes, taking them as is. Then you would not only visually see wrong data in the logical fields, but also detect wrong byte values.

To illustrate what I mean: I took the hex editor again and put hex 91 at the byte position of a logical field (named llogic) of a DBF. Now the browse window shows a backtick char (´). But ? llogic prints .F. on the screen, and EVAL("llogic") also evaluates to .F.

I also downloaded DBFViewer200 and let it read the "defective" DBF with 0x91 in the logical field. It also simply displays the backtick and reports no error in the field.

So this is what I said earlier: FoxPro does not care about these wrong and unexpected byte values. The field is "intact", as its field value is still the logical value .F., while the browse visually shows a defect.
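
For what it's worth, here is a sketch of reading that raw byte directly from the file instead of through the field (drilog and duplex are taken from your screenshot; the alias and variable names are placeholders):

Code:
* Read the byte stored at a logical field's position in a given record of the
* DBF without letting VFP interpret it. Per the above, only T, F or space
* should appear there.
LOCAL lcFile, lcDbf, laF[1], lnI, lnOffset, lnRecNo, lcByte
lcFile  = "drilog.dbf"
lnRecNo = 1                            && record to inspect
USE (lcFile) ALIAS rawLog
AFIELDS(laF, "rawLog")

lnOffset = 1                           && byte 1 of each record is the deletion flag
FOR lnI = 1 TO ALEN(laF, 1)
	IF UPPER(laF[lnI, 1]) == "DUPLEX"
		EXIT
	ENDIF
	lnOffset = lnOffset + laF[lnI, 3]  && add the width of each preceding field
ENDFOR

lcDbf  = FILETOSTR(lcFile)
lcByte = SUBSTR(lcDbf, HEADER("rawLog") + (lnRecNo - 1) * RECSIZE("rawLog") + lnOffset + 1, 1)
? IIF(lcByte $ "TF ", "byte looks valid", "unexpected byte " + TRANSFORM(ASC(lcByte), "@0"))
USE IN rawLog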

And that's not the only reason why looping over all data the normal way, accessing the DBF fields of its records, will not always work: you might later have more serious defects that prevent you from even opening a DBF file.

What I suggested, therefore, was to make a rough check via REPLICATE(CHR(0),16) $ FILETOSTR("your.dbf"), i.e. a burst of 16 0-bytes being present in the DBF.

That's just a rough estimate: 16 0-bytes most probably indicate an error, but they can also be correct, e.g. 4 consecutive integer fields all storing 0 give you 16 0-bytes, 4 from each integer field.

I hope this clears up some of the points I made. You can read a DBF your own way: if you have some metadata about your DBFs stored reliably in a safe place, you can read in a DBF fully via lcDBF = FILETOSTR("your.dbf") and then separate this into the header and the records. All you need for that is HEADER() and RECSIZE() of an uncorrupted DBF. LEFT(lcDBF, HEADER()) is the header; the rest of the lcDBF string is the records. You can check header validity without writing your own validation code, as there is SET TABLEVALIDATE, and just a USE of the DBF tests its header more or less thoroughly, depending on the TABLEVALIDATE level you set.

You can then check the records as substrings of the rest of lcDBF. This is the approach I would take so as not to depend on what VFP reads, e.g. from logical fields. You can really read the bytes stored in the DBF file and check them for validity.
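
A sketch of that split, assuming the header itself is still valid enough for USE to succeed ("your.dbf" as above; the alias is a placeholder):

Code:
* Validate the header via USE (per SET TABLEVALIDATE), grab the offsets,
* then scan the raw record area for 0-bytes, record by record. Remember that
* integer/datetime fields can legitimately contain 0-bytes, so hits are
* candidates for inspection, not proof of corruption.
LOCAL lcFile, lcDbf, lnHeader, lnRecSize, lnRec, lcRecord
lcFile = "your.dbf"
SET TABLEVALIDATE TO 3
USE (lcFile) ALIAS rawCheck            && fails here if the header is corrupt
lnHeader  = HEADER("rawCheck")
lnRecSize = RECSIZE("rawCheck")
USE IN rawCheck

lcDbf = FILETOSTR(lcFile)
FOR lnRec = 1 TO INT((LEN(lcDbf) - lnHeader) / lnRecSize)
	lcRecord = SUBSTR(lcDbf, lnHeader + (lnRec - 1) * lnRecSize + 1, lnRecSize)
	IF CHR(0) $ lcRecord
		? "Record", lnRec, "contains 0-bytes"
	ENDIF
ENDFOR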

Commercial tools like FoxFix do it that way, and in the end I suggest you don't roll your own here. After the data has been repaired, I'm quite confident you would be able to use the SQL Server upsizing wizard, at least as a supporting tool, and/or take code from it for your own import tool. Don't reinvent the wheel here, all the more so since you would need to learn a lot before you could reliably say which conditions point to corrupt data. You would need to know the low-level byte composition of all the different FoxPro types if going this route, and as I explained, it's the only reliable way.

You can of course stay with such simple approaches as checking CHR(0) $ charfield, but you have nothing but visual inspection via browse to detect wrong logical field contents, unless you read in the DBF bytes 1:1. If you have a low number of corrupt records, the chr(0) approach may be sufficient, as long as you have no DBF header defects and can still read the corrupt data. And then you perhaps won't need to check other field types like dates or numeric fields, because they would seldom contain the only defective bytes; a defect in a record will mostly also show up in a char field.

We don't know.

It's your decision whether to complete your solution as it stands or start with the newer ideas. You have many ingredients and ideas now, but completing this is beyond what a forum thread is for. Your initial question is answered, I think, and we identified the problem. At this point you can finish your own code or begin a new analysis tool.

I would also suggest that, if you have further detailed questions, you start a new thread now.

Bye, Olaf.

 
Thanks to ALL for your good detailed explanations in solving this problem.

Thanks for the low-level table troubleshooting info and ideas, which I'll be experimenting with today. One thing I did notice is that all records that had logical corruption also had char corruption, and in almost all cases the record had no useful data to determine what process created it. Of course, after doing a manual inspection of such a record, if there is enough data to recreate it we will; otherwise we'll just delete it, as there is no way to get the correct data for that record. That would also be true even if you took the low-level approach, as the data is corrupt.

Because its frequency is so low, I'm (1) logging it, (2) replacing the bad chars with good ones to create a stable record, and (3) manually correcting the data or deleting the record, working from the logs.

Also note that when I look at these corrupted records in natural order, they were introduced into the table around 2003, as suggested by the good records listed immediately before and after the corrupted ones. I've not seen any issues in data created since then.

And yes, I have enough info to complete this thread's original mission. Thanks to all.

I consider this thread closed now. Any other comments or questions will start a new thread...

Again thanks,
Stanley


 
OK, nice wrap up, Stanley.

As the situation is as you describe, I would also not go the route of low-level, byte-wise analysis of the data. Seeing that the last errors occurred in 2003, this should mainly be an initial issue only. Logging bad records will ensure you spot any problems in the future.

Bye, Olaf.
 