Function to Detect Non-ASCII Characters Needed 2


stanlyn

Programmer
Sep 3, 2003
945
US
Hi,

I need a routine that can quickly detect whether a string contains any non-ASCII characters, which come from field corruption. Below is a copy-paste of a string that is killing ODBC. I'd like to test it before handing it over to SPT.

These strings can be up to 4000 characters long and there are millions of records; iterating character by character through the string and testing each character for non-ASCII would work, but it is highly impractical because it would take forever.

Actually, I'm getting about 100 such corrupted fields for every 200,000 records. Any suggestions?

Thanks, Stanley



Actual data from a failed command...

INSERT INTO [DEEDROOM].[dbo].[DriLog] ( [county_ID], [ENTRY_TYPE], [SESSION_ID], [CREATOR], [ENTRY_time], [ENTRY_PID], [county_PAGEKEY], [Page_Type], [Book_Num], [Page_Prefix], [Page_Num], [SubPage_NUM], [SCANERNAME], [RESOLUTION], [TOP],
, [WIDTH], [HEIGHT], [COLORDEPTH], [DUPLEX], [DESPECKLE], [DESKEW], [AUTO_TRIM], [OCR], [INVERT], [IMAGE_PATH], [COMPUTER], [sync_time], [is_deleted] ) VALUES ( 'KY025', '㄀', 0, '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', '01/01/0001 00:00:00 AM', '㄀㄀㄀㄀㄀㄀', 'KY025㄀㄀㄀㄀㄀㄀㄀㄀㄀', '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', '㄀㄀㄀㄀', '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', 0, 0, 0, 0, 0, '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', 0, 0, 0, 0, 0, 0, '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', '09/21/2012 16:31:40', 0 )​
 
I am not sure exactly what you mean by "ASCII characters", since the definition of "ASCII characters" covers 128 different characters, a large percentage of which are not visible, like the BELL character, which is supposed to chime a bell. (And in the original 1963 version of the standard, none of the lower-case letters were included.)

Please definitively define exactly what you mean so we can point you in the correct direction.

Definition:

ASCII:
A code that assigns the numbers 0 through 127 to the letters of the alphabet, the digits 0 through 9, punctuation marks, and certain other characters. For example, the capital letter A is coded as 65 (binary 1000001). By standardizing the values used to represent written text, ASCII enables computers to exchange information. Basic, or standard, ASCII uses seven bits for each character code, giving it 2^7, or 128, unique symbols. Various larger character sets, called extended ASCII, use eight bits for each character, yielding 128 additional codes numbered 128 to 255.
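
You can verify those codes directly in VFP's Command Window with ASC() and CHR():

[tt]? ASC("A")   && prints 65, matching the definition above
? CHR(65)    && prints A
? CHR(7)     && the BELL character: a control code with no visible glyph
[/tt]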


That said, if there is an unwanted character that occurs EVERY time in the corrupted field but NEVER in the good fields, the simplest way I know would be to use something like this where 'n' is the ASCII code for the unwanted character:

[tt]IF CHR(n) $ field
DO something
ENDIF
[/tt]

mmerlinn


Poor people do not hire employees. If you soak the rich, who are you going to work for?

"We've found by experience that people who are careless and sloppy writers are usually also careless and sloppy at thinking and coding. Answering questions for careless and sloppy thinkers is not rewarding." - Eric Raymond
 
Hi mmerlinn,

OK, yes, you are right about the ASCII set. I'm calling a bad character any character that plain-text editors would have a problem with, or, in my case, that ODBC has a problem with in the string. Hope that explains it...

>> Poor people do not hire employees. If you soak the rich, who are you going to work for?
Right on, and I don't see why everyone doesn't get this fundamental truth.

Thanks, Stanley
 
While doing some tests, the hex editor shows them as 00, and when I copy and paste into VFP's Command Window, they show up as question marks...

I copied this from the file:
52011. ㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀

and pasting this into the command window I get this:
zz='52011. ???????????'
?CHR(00) $ zz (returns .F.)

Here is what it looks like in the hex editor:
35 32 30
31 31 2E 20 20 20 20 20 20 20 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

 
Stanlyn,

Most of us would understand "non-ASCII chars" to mean "non-printable chars", that is, those in the range 0 - 31, plus 127. These are the so-called "control characters", like tab, linefeed, form feed, and so on (127 is DEL).

Assuming that's what you mean, the following code should do what you want:

Code:
* Build a string containing all the non-ASCII chars
lcLowChars = ""
FOR lnI = 0 TO 31
  lcLowChars = lcLowChars + CHR(lnI)
ENDFOR 
lcLowChars = lcLowChars + CHR(127)

* Assume lcInput contains the original text

* Strip out the non-ASCII chars
lcStripped = CHRTRAN(lcInput, lcLowChars, "")

IF lcStripped <> lcInput
  * The original text does contain non-ASCII chars
ELSE
  * It doesn't
ENDIF

Hope this makes sense.

Mike



__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips, training, consultancy
 
Hi Mike,

It makes sense... I'm just thinking that it's going to be too slow, because 99.9% of the records are OK and checking them all would be a huge waste of resources. After thinking about it a bit, maybe the best way is to deal only with the bad ones after ODBC generates the error. Any ideas?
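
For instance, something like this rough sketch, where lnHandle and lcSql are placeholder names for the actual connection handle and statement:

Code:
* Rough sketch: only inspect the string when ODBC rejects it
IF SQLEXEC(lnHandle, lcSql) < 0
  * The insert failed - only now pay the cost of checking
  IF CHR(0) $ lcSql
    * Log it, strip the null bytes, and retry
    lcSql = CHRTRAN(lcSql, CHR(0), "")
    SQLEXEC(lnHandle, lcSql)
  ELSE
    * Some other failure - inspect the error array
    AERROR(laError)
  ENDIF
ENDIF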

Thanks, Stanley
 
Indeed, ASCII chars include both readable and non-readable chars; as Mike said, control characters.

Your sample looks like data was dutifully merged into a prepared statement like:
INSERT INTO [DEEDROOM].[dbo].[DriLog] ( [county_ID], [ENTRY_TYPE], [SESSION_ID], [CREATOR], [ENTRY_time], [ENTRY_PID], [county_PAGEKEY], [Page_Type], [Book_Num], [Page_Prefix], [Page_Num], [SubPage_NUM], [SCANERNAME], [RESOLUTION], [TOP],
, [WIDTH], [HEIGHT], [COLORDEPTH], [DUPLEX], [DESPECKLE], [DESKEW], [AUTO_TRIM], [OCR], [INVERT], [IMAGE_PATH], [COMPUTER], [sync_time], [is_deleted] ) VALUES ( '<<data1>>', '<<data2>>', <<data3>>, '<<data4>>', '<<data5>>', '<<data6>>', '<<data7>>', '<<data8>>', '<<data9>>', '<<data10>>', '<<data11>>', '<<data12>>', '<<data13>>', <<data14>>, <<data15>>, <<data16>>, <<data17>>, <<data18>>, '<<data19>>', <<data20>>, <<data21>>, <<data22>>, <<data23>>, <<data24>>, <<data25>>, '<<data26>>', '<<data27>>', '<<data28>>', <<data29>> )

I can't, of course, say this is the code to look for; the string might also be composed without using textmerge, but you get the idea. It's not the final strings you should analyse, it's the source data, which should not contain unwanted codes.

Overall, that's not a good way to compose SQL to send via SQLEXEC(). You should rather use ?param instead of merging data into a readable string; then you'd have just one SQL statement for all your data and would just need to set variables to repeat an insert. That could then also be done using SQLPREPARE(). I'll not go into details, as that alone is already an improvement in the way you use SQLEXEC() in this situation.
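
As a rough sketch (only two columns shown; the alias, variable, and field names are placeholders, not the real schema):

Code:
* One parameterised SQL string, re-used for every row
lcSql = "INSERT INTO [DEEDROOM].[dbo].[DriLog] ([county_ID], [ENTRY_TYPE]) " + ;
        "VALUES (?m.lcCountyId, ?m.lcEntryType)"

SELECT Upload
SCAN
   lcCountyId  = ALLTRIM(Upload.county_id)    && placeholder source fields
   lcEntryType = ALLTRIM(Upload.entry_type)
   IF SQLEXEC(lnHandle, lcSql) < 0
      * log the failed record
   ENDIF
ENDSCAN

With SQLPREPARE(lnHandle, lcSql) done once before the loop, the repeated calls inside it become plain SQLEXEC(lnHandle) and are even cheaper.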

But it's important, as it also addresses the problem in the sense that the error would go away, at the price of having the corrupted data land in SQL Server.

So there is one more important thing to do, which you want to do anyway: check for unwanted chars. You can do that via CHRTRAN():
Code:
* init phase
LOCAL lcUnwantedChars, lnCode
lcUnwantedChars = ""   && must be initialised as a string before concatenating
FOR lnCode = 0 TO 31
   lcUnwantedChars = lcUnwantedChars + CHR(lnCode)
ENDFOR

* loop over data to insert
IF LEN(CHRTRAN(lcSQLString, lcUnwantedChars, "")) = LEN(lcSQLString)
   * OK, send SQL
   SQLEXEC(...)
ELSE
   * unwanted chars found
ENDIF
* endloop
* endloop

That said, there are situations where such data is not corrupted, e.g. when it's binary data, or encoded/encrypted data. Any table, DBFs included, can contain such "control characters" that aren't really control characters, as the data isn't ASCII anyway; they're bytes that have meaning. I can see this is not the case in your sample; for example, a field you insert into called IMAGE_PATH should presumably not contain a binary data string but a readable path string. Nevertheless, you should watch what the field types are and what data in them is OK. If there are binary fields, a mechanism that creates literal INSERT INTO statement strings is not the way to transfer data to a backend.

For bulk insertion of data, go for CSV, XML, anything you can BULK INSERT from the SQL Server perspective; executing a massive list of insert statements has its limits with binary data and performance, and it's the least attractive way to insert data. I know, for example, that the very popular mysqldump creates scripts of inserts as a backup of data; this is really quite an amateurish solution, but it has the advantage of working for normal data and from the client side, without needing access to the directories where you would have to put CSV or XML files to be read in more elegantly.
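
A minimal sketch of the CSV route (the UNC path is illustrative, and BULK INSERT needs the file to be readable from the SQL Server machine, which is exactly the directory-access caveat just mentioned):

Code:
* Export the VFP table to CSV; VFP 9's CSV type writes a header row first
SELECT Upload
COPY TO "\\server\share\upload.csv" TYPE CSV

* Then let SQL Server ingest the whole file in one operation
SQLEXEC(lnHandle, "BULK INSERT [DEEDROOM].[dbo].[DriLog] " + ;
   "FROM '\\server\share\upload.csv' " + ;
   "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)")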

Parameters are also a way to bring binary data over ODBC, as these are transferred separately. The client-side ODBC driver will parse out the parameters specified by ?param and forward the variables, properties, or fields specified by name in a separate stream to the server-side ODBC endpoint. This also has many advantages when transferring datetime values: you don't have to care about finding a format the remote database understands; ODBC will take a FoxPro datetime value and it will arrive correctly as a SQL Server datetime.

Bye, Olaf.​
 
Hi Olaf,

>> I can't, of course, say this is the code to look for; the string might also be composed without using textmerge, but you get the idea. It's not the final strings you should analyse, it's the source data, which should not contain unwanted codes.

It was composed with textmerge's TEXT-ENDTEXT. You do make a valid point about dealing with it at the field level; at the very least I can empty the field and ODBC would be happy. Plus, that data is bad anyway. I'll create a script that iterates through all the fields in the VFP table, emptying bad fields and logging them.
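
A first draft of that script might look like this (the log-file name is arbitrary):

Code:
* Blank out any character/memo field containing control characters, log it
LOCAL lnField, lcField, lcValue, lcBadChars, lnCode
lcBadChars = ""
FOR lnCode = 0 TO 31
   lcBadChars = lcBadChars + CHR(lnCode)
ENDFOR

SELECT Upload
SCAN
   FOR lnField = 1 TO FCOUNT()
      lcField = FIELD(lnField)
      IF TYPE(lcField) $ "CM"   && character and memo fields only
         lcValue = EVALUATE(lcField)
         IF LEN(CHRTRAN(lcValue, lcBadChars, "")) < LEN(lcValue)
            STRTOFILE("Record " + TRANSFORM(RECNO()) + ", field " + lcField + ;
               CHR(13) + CHR(10), "badfields.log", 1)   && 1 = append
            REPLACE (lcField) WITH ""
         ENDIF
      ENDIF
   ENDFOR
ENDSCAN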

>> If there are binary fields a mechnaism to create literal INSERT INTO statement strings is not the way to go to transfer data to a backend.

How would you transfer binary data to the backend? With a ?param??? Please explain.


Thanks for the ?param idea. I have used it before, but was not totally aware of what was going on in the background. Thanks for bringing that back up. I still need to make a VFP data-sanitizer routine that replaces bad data fields with something appropriate that indicates bad data while keeping ODBC happy.

Thanks, Stanley

 
>How would you transfer binary data to the backend? With a ?param???
Yes, with a param. What should I explain?

You can store binary data in a variable, depending on the source. You can also pass ?alias.field from a DBF having a binary field; ? is not limited to variables, you can also pass fields, properties, any name. You will just fail when passing an object, as in ?_screen or ?loObject; that won't be passed to SQL Server, but that limitation is acceptable, isn't it?

You can still use the CHRTRAN() code to identify unwanted chars, and you can do that at the overall level or with the separate field values. Then you can sanitize those with empty values or default values; you might also pass NULL, either via param or as a string.

Is there still an open question?

Bye, Olaf.
 
Stanley,

You say that the hex editor shows this:

Code:
35 32 30
31 31 2E 20 20 20 20 20 20 20 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

But you also say that CHR(00) $ zz returns .F. Are you sure about that? The hex editor shows the actual ASCII values, and the data in your example definitely contains ASCII 0.

If CHR(0) $ zz really returns .F., you should find out what the 00s in the data actually contain. Do this:

? ASC(RIGHT(lcData, 1))

If it is 0, then your CHR(0) $ zz is all you need.

If it isn't zero, find out what it is, and whether that particular value is always present in the non-ASCII characters in your data. If it is, test for that value in place of CHR(0). There should be no performance issues with that, despite the large amount of data involved.


Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips, training, consultancy
 
Stanlyn,

If I copy 52011. ㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀ from your post, I also get question marks in VFP. But that's just because whatever editor you used to open the file containing 0 bytes represents them with ㄀, which is itself a char that is neither in the ANSI codepage nor CHR(0). Do ? CHR(0) in VFP and nothing prints on the screen, not even ㄀.

So if you copy from an editor, you have at least the translation of the file into the editor, of the editor into the clipboard, and of the clipboard into the VFP editor. If you want to read the file, then go the straight way, not via some editor: not Notepad, not WordPad, not any other. The hex editor shows you the truth for sure. Check via:
lcFile = FILETOSTR(getfile())
? chr(0) $ lcFile

There are really 0 bytes in there.

Bye, Olaf.

 
Hi,

Here is the actual text merge statement that does the insert:


TEXT to m.pcSqlStr textmerge NOSHOW PRETEXT 15
BEGIN TRANSACTION
insert into [Deedroom].[dbo].[Page]
(
[county_pagekey],
[county_id],
[page_type],
[page_time],
[BOOK_NUM],
[page_prefix],
[page_num],
[SubPage_num],
[SECURE_KEY],
[CREATOR],
[ENTRY_TYPE],
[ENTRY_PID],
[MONEY1],
[MONEY2],
[CREATION_time],
[IMAGE_PATH],
[KEYWORDS],
[OCRABLE],
[OCR_COLUMN],
[raw_OCRTEXT],
[OCR_DONE],
[OCR_ERROR],
[FILE_TYPE],
[OCR_time],
[CERTIFY_CODE],
[DESKEWED],
[DESPECKLED],
[MEDN_TODO],
[MEDN_TOTAL],
[UPD_BY],
[UPD_time],
[sync_time],
[SHORT_DESC],
[KEYW_AUTO],
[PVA_ID],
[EXCLUDE_WEB],
[LIB_SORT],
[SEC_ID],
[is_deleted]
)
Values
(
'<<cpk>>',
'<<pcCountyID>>',
'<<dt>>',
'<<dd>>',
'<<bn>>',
'<<dp>>',
'<<ALLTRIM(UPLOAD.Chr20Str01)>>',
'<<ALLTRIM(UPLOAD.Chr20Str02)>>',
'<<ALLTRIM(UPLOAD.Chr20Str03)>>',
'<<ALLTRIM(UPLOAD.Chr20Str04)>>',
'<<ALLTRIM(UPLOAD.Chr1Str01)>>',
'<<ALLTRIM(UPLOAD.Chr20Str05)>>',
<<UPLOAD.Currency01>>,
<<UPLOAD.Currency02>>,
'<<cd>>',
'<<ip>>',
'<<kw>>',
'<<ALLTRIM(UPLOAD.Chr1Str02)>>',
'<<ALLTRIM(UPLOAD.Chr1Str03)>>',
'<<rawt>>',
'<<ALLTRIM(UPLOAD.Chr1Str04)>>',
'<<ALLTRIM(UPLOAD.Chr1Str05)>>',
'<<ALLTRIM(UPLOAD.Chr1Str06)>>',
'<<od>>',
'<<ALLTRIM(UPLOAD.Chr20Str07)>>',
'<<ALLTRIM(UPLOAD.Chr1Str07)>>',
'<<ALLTRIM(UPLOAD.Chr1Str08)>>',
<<UPLOAD.Numeric01>>,
<<UPLOAD.Numeric02>>,
'<<ALLTRIM(UPLOAD.Chr20Str08)>>',
'<<ud>>',
'<<wd>>',
'<<sd>>',
'<<kb>>',
'<<ALLTRIM(UPLOAD.Chr99Str04)>>',
'<<ALLTRIM(UPLOAD.Chr1Str09)>>',
'<<ls>>',
'<<ALLTRIM(UPLOAD.Chr99Str05)>>',
<<lnIsDeleted>>
)
COMMIT TRANSACTION
ENDTEXT

And when an error occurs, I output the value of pcSqlStr to a file that I can paste into SSMS directly to determine what went wrong. The chr(00) is detected immediately, as a regular text editor has problems opening the file. I use EmEditor, and it suggests that I open the file in its hex editor; that's how I got the hex representation above. If I open it normally (by force, against EmEditor's suggestion), I get what I posted first.

So, because of all the translations involved, I zipped up the output file and have included it here; I'm assuming zipping it will preserve it. This file was created with VFP9 and STRTOFILE(), so I don't know whether any translation is going on here or not...

Thanks for looking and suggesting...
Stanley
 
 http://www.stanlyn.com/public/Page_Export_RN-0.rar
Olaf,

So, in a textmerge SQL update statement,

Instead of:
[sync_time] = '<<wd>>',
[is_deleted] = <<lnIsDeleted>>

I'd use:
[sync_time] = ?wd,
[is_deleted] = ?lnIsDeleted

Is this what you're saying?

Thanks,
Stanley
 
Olaf,

And the benefits would be that

1. ODBC would pass it as-is with no conversions,
2. good data or bad data, if it's legal on the SQL side, it goes..
3. ?
4. ??

Right?
Stanley

 
Stanley,

Did you read my post above? Even if my suggestion was of no interest to you, it would be useful to have your answer to the point about CHR(0). If I'm right about that, it would be a much simpler solution.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips, training, consultancy
 
The answer I wrote over 20 minutes is lost.

Ok recap in short:

Stanlyn, I already answered most of this.

1. In regard to EmEditor, this means what I already said: it's one of those editors that display non-printable chars with a box char, which in turn FoxPro can't display and turns into ?.
2. FoxPro string VARIABLES can contain any char, including CHR(0), but string literals (anything you put in source code between string delimiters) cannot. The way to get CHR(0) inside a string variable is to use the expression CHR(0) in code (see the short demo after this list).
3. TEXTMERGE has the same problem as source code in which you put together a string variable, but you could also see it as merging into a text file which is then read into a string. And it's a source-code text file.

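A quick demo of point 2, runnable in the Command Window:

Code:
* A string VARIABLE can hold CHR(0); a literal in source code cannot
lcVar = "A" + CHR(0) + "B"
? LEN(lcVar)        && 3 - the null byte really is in the variable
? CHR(0) $ lcVar    && .T.
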
Just as you can't put CHR(0) into an editor as a single char instead of the expression CHR(0), TEXTMERGE can't either. Think of CHR(13) = carriage return, for that matter: if that is inside a field of the update table, you create code containing a line break within a string. You threw in the term "sanitize"; if you really know what it means, then you should know it's there to prevent such problems. Not only would a single quote char inside data be a problem; control chars like CHR(13) or CHR(0), or anything unprintable, are also a problem you need to care about. ODBC blindly forwards binary chars too, so you won't have the problem of the SQLEXEC failing, but it will not filter unwanted data away.

My main point is the CHRTRAN, for the third time.

The parameterisation of SQL comes on top of that, to optimise performance and to avoid the technical problems you have with text merging.

ODBC passes a string as-is via parameter, and the SQL command itself will contain a placeholder parameter name. This is translated by ODBC mechanisms beyond the scope of this discussion; read books about that. ODBC will also translate to ODBC types, but as I already said, it will even work with binary data, so it will not interfere by translating text code pages or such. It will translate, e.g., a FoxPro integer to an ODBC integer, but their value ranges are identical, so no harm; the same goes for almost anything else. You'll see whether a datetime arrives 1:1 or is rounded to seconds instead of milliseconds. But the concerns about ODBC type translations are much smaller than the problems of textmerging partially binary or corrupted data into a SQL command string. VFP is not even standing in the way; it's SQL Server not accepting non-printable chars within a SQL command string. INSERT INTO table (stringfield) VALUES ('stringvalue') is not supposed to have any binary data as stringvalue, and SQL Server prevents that. So if you want to bring in that data, pass it via parameters.

But if your main point is about filtering that data out, I point you towards CHRTRAN() for the third time now. This whole parameterisation thing is good and helps prevent technical problems, exceptions, and errors, but without filtering it will just transfer the gibberish into SQL Server. It's still recommended, to avoid SQL injection, no matter whether it occurs randomly or is planned.

If you have a problem, that might be because of scope. You have code now that does a textmerge; if the variables you use there are not in scope when you do the SQLEXEC, the simple change to a parameterised SQL command string will not be sufficient. You will also need to either do the SQLEXEC in the same place or make the variables available as, e.g., object properties.

Bye, Olaf.

 
Sorry for the typos; I was a bit angry about my lost post and wrote this down quite fast.

Bye, Olaf.
 
Stanley, you are making this so much harder than it actually is.

We have already established that you have one or more instances of unwanted CHR(0) in your data and it appears that all corrupted data has that character in it.

Just go back to the place where you were going to loop through your field looking for bad characters.

Then instead of looping through the field, do as has been suggested twice before:

[tt]IF CHR(0) $ yourfield
DO something
ENDIF
[/tt]

Unless there is something that you have not told us, this should solve your original problem.

If for some reason it does not, then come back with up-to-date information to help us out.

mmerlinn


Poor people do not hire employees. If you soak the rich, who are you going to work for?

"We've found by experience that people who are careless and sloppy writers are usually also careless and sloppy at thinking and coding. Answering questions for careless and sloppy thinkers is not rewarding." - Eric Raymond
 
>> We have already established that you have one or more instances of unwanted CHR(0) in your data and it appears that all corrupted data has that character in it.

Exactly. That was the point I was trying to establish.

Mike


__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips, training, consultancy
 
I'm with both of you that solving the filter problem is the main issue; sorry that I distracted from it. But even if CHR(0) is in one or more samples of the corrupted data, I would strongly advise checking for other unwanted chars, too. The check for CHR(0) might filter out 99% of the cases, but if you can define more unwanted chars, or alternatively the wanted and accepted chars, you can check for them with CHRTRAN(); see my first post here.
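
For example, a whitelist variant of the same check (the accepted set below is only an illustration; which chars count as valid is something only you can define):

Code:
* Whitelist check: anything outside the accepted set flags the value
LOCAL lcWanted, lcValue
lcWanted = " ABCDEFGHIJKLMNOPQRSTUVWXYZ" + ;
           "abcdefghijklmnopqrstuvwxyz" + ;
           "0123456789.,;:!?'()[]/\-_"
lcValue = Upload.Chr20Str01   && placeholder field
IF LEN(CHRTRAN(lcValue, lcWanted, "")) > 0
   * lcValue contains at least one char outside the accepted set
ENDIF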

Bye, Olaf.
 