Function to Detect Non Ascii Characters Needed 2

stanlyn · Sep 21, 2012

Hi,

I need a routine that can quickly detect if a string contains any non-ascii characters which comes from field corruption. Below is a copy-paste of a string that is killing ODBC. I'd like to test it before handing it over to SPT.

These strings can be up to 4000 characters long and there are millions of records, therefore iterating character by character through the string and testing each character for non-ascii would work, but highly impractical because it would take forever.

Actually, I'm getting about 100 of such corrupted fields for every 200,000 records. Any suggestions?

Thanks, Stanley

Actual data from a failed command...

INSERT INTO [DEEDROOM].[dbo].[DriLog] ( [county_ID], [ENTRY_TYPE], [SESSION_ID], [CREATOR], [ENTRY_time], [ENTRY_PID], [county_PAGEKEY], [Page_Type], [Book_Num], [Page_Prefix], [Page_Num], [SubPage_NUM], [SCANERNAME], [RESOLUTION], [TOP],

, [WIDTH], [HEIGHT], [COLORDEPTH], [DUPLEX], [DESPECKLE], [DESKEW], [AUTO_TRIM], [OCR], [INVERT], [IMAGE_PATH], [COMPUTER], [sync_time], [is_deleted] ) VALUES ( 'KY025', '㄀', 0, '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', '01/01/0001 00:00:00 AM', '㄀㄀㄀㄀㄀㄀', 'KY025㄀㄀㄀㄀㄀㄀㄀㄀㄀', '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', '㄀㄀㄀㄀', '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', 0, 0, 0, 0, 0, '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', 0, 0, 0, 0, 0, 0, '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', '㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀㄀', '09/21/2012 16:31:40', 0 )

Olaf Doschke · Sep 23, 2012

And Stanlyn,

while the unwanted chars get problematic in the textmerge, the code of interest is the code filling your UPLOAD table, that should already filter out unwanted records. This is where you get your chr(0) and other gibberish out of your initial input file.

Bye, Olaf.

stanlyn · Sep 23, 2012

To all,

Sorry about all the confusion, of which I am more confused now that when I first started. Let me try to explain. This thread started out with a question about (1) quickly determining if a string has bad/unprintable characters in it. Then we start talking about (2) textmerge and parameters, then we turned to (3) all the translations that are possibly involved, and finally (4) filtering. The benefit for me is learning much more than what I originally asked for. If you see a better way of doing things, then please, DO go off on a tangent.

I get the part about "IF CHR(0) $ yourfield", but that would take forever to process. I mentioned this fact at the top of this thread. At this point we are thinking its chr(0), (has anyone confirmed that yet), and what about all the other possibilities that are unprintable, as that only adds to the processing time.

Early in this thread, I stated that the fastest method would be to deal with it once odbc complains, then look through the offending record's fields using Mike's code. No one has seconded this motion yet. I don't think that the variable Mike creates contains all possible bad characters and or corruption that may be found in the string, after assuming binary data has been kept out as binary data would certainly contain a much larger list of garbage characters that odbc would complain about.

I can easily handle the testing of each field on either the original table or the intermediate table as I know what data type each is to contain.

>> 3. TEXTMERGE has the same prroblem as source code you put together a string variable, but you could also see that as merging into a textfile, which then is read into a string. And it's a source code textfile.

>> The parametisation of SQL is on top of that to optimise performance and to avoid technical problems you have with text merging. ODBC would pass a string as is via parameter, and the sql command itself will contain a placeholder parameter name. This is translated by ODBC mechanisms beyond the scope of this discussion.

Olaf, I'm understanding you as saying that I should not use textmerge, instead I should use ?params some other way. I thought that I have to use textmerge as the assembler of the sql command that will be passed thru to odbc and finally to sql. If you are NOT using textmerge with your ?params to do it the way you have suggested, then be specific on how to do this as I don't have a clue...

Earlier, I asked, with no direct confirmation...
So, in an textmerge sql update statement,
Instead of:
[sync_time] = '<<wd>>',
[is_deleted] = <<lnIsDeleted>>

I'd use:
[sync_time] = ?wd,
[is_deleted] = ?lnIsDeleted

Yes or No? If no, please explain...

>> If you have a problem, that might bne because of the scope.

Cannot imagine it could be scope because it processes 199,900 without issue out of 200000 on average, (100 errors out of 200,000).

As a result of this thread, I'm leaning toward (1) creating a sanitize script that processes each VFP table looking for bad and data corruption, either nulling or whatever to indicate bad data was there as well as logging them. Then (2) send it SQL via sqlexec trapping errors associated with bad data, and output the string to a file for further investigation.

Thanks,
Stanley

Mike Lewis · Sep 24, 2012

Stanley,

This demonstrates the difficulties of using the same thread for several themes (albeit related).

Let's focus for a moment on the original question - and, in my opinion, the simplest solutions.

You say: I get the part about "IF CHR(0) $ yourfield",... top of this thread. At this point we are thinking its chr(0), (has anyone confirmed that yet), ....

We are waiting for you to confirm that. You have the data. We don't.

The question is: Can you rely on the fact that, where the data is corrupted, it will always contain a binary 0?

If yes, then use the $ or AT() function. It will instantaneous. End of story.

If no, then use the code I originally posted. You say the objection is that it will "take forever". It won't. I know how much data there is, but the processing time is likely to be very small compared to the time taken to retrieve the data in the first place.

You can make a couple of improvements to my code to speed it up:

1. Re the loop that's building the lcLowChars string: instead of doing this each time you do the test, do it once only, at initialisation, and keep lcLowChars in scope after that. (You probably already realised this.)

2. Instead of testing lcStripped <> lcInput, test LEN(lcStripped) <> LEN(lcInput). Comparing two integers is clearly much faster than comparing two long strings.

That still leaves the CHRTRAN(). I realise that involves iterating a very long string. But that's done in low-level, highly-optimised code. I haven't tried to time it, but I believe it would be extremely fast, even with the volumes that you are working with.

Finally, you say: "I don't think that the variable Mike creates contains all possible bad characters and or corruption that may be found in the string".

It's up to you to determine what constinutes a "bad character or corruption". At the outset, we talked about non-ASCII characters. If there are other characters that you know always result from corruption, then just add those to the lcLowChars string. Similarly, if you know there are any non-ASCII characters that will never occur in the corruption, remove those from the string.

I'll leave to others to discuss the ODBC and TEXTMERGE issues.

Whatever approach you finally take, I hope you manage to find a solution.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips, training, consultancy

Olaf Doschke · Sep 24, 2012

Thanks, Stanlyn

That get's several points straight and gives this thread a better aim again.

As a side note: If you this code is in use for a long time and you encountered the problematic chars just this time, the very simple explanation may be a file upload failed and just would need to be repeated, to eg get a uncorrupt UPLOAD.dbf. Nobody actually asked, if your code is old and just having problems with current input files or if the process is new and yet to be established.

I think we should address the detection of bad data first, then straighten the textmerge usage. So in this post I'll address the garbage problem.

You still didn't get that my CHRTRAN code will care for chr(0) and other chars in one go(!). Please reread my first post, especially it's code section. You see I add 32 chars as unwanted chars into a variable lcUnwantedChars, from chr(0) to chr(31). You may take out chr(9), chr(13) and chr(10), as they are TAB, CR and LF, not priuntable but surely allowed control characters. This may again cause a textmerge problem, but that topic again is postponed.

So when I talked of filtering (4) binary data, that was not meant as a new step to take, it merely was another word to (1) quickly determining if a string has bad/unprintable characters in it, so topic 4 = topic 1. Your points 2 and 3, the textmerge problem and 4 the translation of codepages will be addressed later. 4 is not a problem once you read in your data in ways not doing a translation, but all the explanations about translations only were made to show you how you get from your original file containing chr(0) to ? in vfp. As another exercise do _CLIPTEXT = chr(0) and then try to paste that in VFP. Nothing will get pasted, not a ? nor a box char, the text cursor isn't even moving a char. In contrast do STROTOFILE(chr(0),"test.bin") and then lcFilecontent = FILETOSTR("test.bin") and ? asc(lcFilecontent).

OK, now for the filtering /detecting unwanted chars: In short the above code will help you. It's not true that it "would take forever to process" even if you do it on all single fields instead of the sql command string after you put that together, as putting together the sql command string also takes time. You can simply loop from 1 to fcount() over field(i) and check with chrtran for unwanted chars, if the field type is cahracter, you don't even need to hardcode field names, you can test the type with TYPE(Field(i)), so you can make this generic code.

You can also invert the chrtran logic, and create a variable lcWantedChars to contain all allowed chars, Then check if chrtran(field,allowedchars,"") results in an empty string, because if there is something remaining it wouldn't be a wanted char. Don't forget to include space and other seldom but valid chars.

As soon as you have this and report back what you find, I will address the textmerge. As a first answer: Yes, that is the syntax, simply ?fieldname or ?variablename. You had that right, so I didn't answer because you had that right, I'm still wondering why you don't simply let foxpro answer that question? Simply do that and you will see. The paragraph about variable scope was not about your current code, but the parameterised insert, I thought you had already changed and was just addressing problems you could have with it, as with parameterised queries the scope of the passed variables is of importance, not with your current textmerge. With your current textmerge you just have the values in the resulting string, the variables coul be lost afterwards, it wouldn't matter, as their values are embedded, copied into the sql comand strin via textmerge. Enough, this is delayed.

Please first address the filtering/detection of unwanted chars and come back, if you have questions about the solution.

Bye, Olaf.

mmerlinn · Sep 24, 2012

[ ]

My first line of attack would be the test for CHR(0) as I suggested in my last post. That should solve most if not all of the problem that Stanley originally presented to us. That would also get the pressure off so he can solve the real problem without constantly being sidetracked with fires to put out. If perchance the first line of attack did not solve the whole problem, it would reduce the size of problem set and make it easier to locate other characters that could be giving problems.

After removing the pressure cooker, I would go looking for the issue of WHERE the corrupted data is coming from. Solving the cause of the problem would stop any future issues.

Basically, I see this as two issues. Remove as many distractions as possible, then solve the underlying problem.

mmerlinn

http://mmerlinn.com

Poor people do not hire employees. If you soak the rich, who are you going to work for?

"We've found by experience that people who are careless and sloppy writers are usually also careless and sloppy at thinking and coding. Answering questions for careless and sloppy thinkers is not rewarding." - Eric Raymond

stanlyn · Sep 24, 2012

Hi all,

>> if your code is old and just having problems with current input files or if the process is new and yet to be established.

The data has been accumulating since 1998 and is comming from 7 different locations, so how the corruption got there is anyone's guess.

>> You still didn't get that my CHRTRAN code will care for chr(0) and other chars in one go(!).

Yes Olaf, I did get that, sorry for not communicating that to you. I mentioned using Mike's suggestion, but I saw that your suggestion actually expands on his to include many chars. I was in doubt that this alone will catch all possible "non-printable characters".

Olaf, I'll do your exercises and get back.

Assuming all the errors are fixed, I'd still like to see you take on the textmerge vs ?params I asked about.

More later today,
Stanley

Olaf Doschke · Sep 24, 2012

Well, char 32 is space, chars after that are all printable, but still some may be unwanted. Using a loop from 0 to 31 is just an estimate, you can do as you like there, it's just an example. I already suggested to go the opposite way and check for wanted chars. It's up to you. Simply see if that catches all unwanted records and no further.

Bye, Olaf.

stanlyn · Sep 24, 2012

Olaf, I see that you answered my last question about the textmerge. Sorry for not finishing the read before posting, its that I'm under some other pressure and did not have the time to address all thats been said. I'll do that later today when I can concentrate on it. I'll report back...

Thanks,
Stanley

Olaf Doschke · Sep 24, 2012

>The data has been accumulating since 1998 and is comming from 7 different locations, so how the corruption got there is anyone's guess.

Then why don't you simply ask for a reupload of this days/weeks/moths data. Removing corrupted records is not solving the problem really, is it? If the data got corrupted in an upload removing the records not readable still means missing lots of data.

I would actually not care for the error then and redo with correct data.

What you're doing overall here may add to the stability of the import process, but you'd also not take the uncorrupt pixel rows of an image, you'd discard it overall and ask for resubmission.

Bye, Olaf.

Mike Lewis · Sep 24, 2012

Stanley said:
I mentioned using Mike's suggestion, but I saw that your suggestion actually expands on his to include many chars. I was in doubt that this alone will catch all possible "non-printable characters".

Both bits of code - Olaf's and mine - handles all non-printable characters, and any other characters that you wan to include or exclude.

Remember, non-printable characters are those in the range 0 - 31, and also 127. (Admittedly, that also includes line-feeds, carriage returns, and tabs, which are probably legitimate characters, but those can easily be excluded.)

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips, training, consultancy

stanlyn · Oct 4, 2012

Hi all,
How do get to the data for testing for chr(0) in this sanitize script?

Code:

Clear
Close Tables All
open database 'Books'
lnTables = Adbobjects(gaTables, "TABLE")
For i = 1 To lnTables
	*?gaTables(i)
	Use (gaTables(i)) In 0
	Select (gaTables(i))

	lnFields = Afields(gaFields)
	For x = 1 To lnFields
		*?gaFields(x, 1)
		*If Chr(0) $ (gaFields(x, 1))

			lnCount = Ascan(gaFields, Chr(0), gaFields(x, 1))
			If lnCount <> 0
				?(gaFields(x, 1))
			Endif
		*Endif
	Endfor
	Use In Select(Dbf())
	Wait
Endfor

Dan Freeman · Oct 4, 2012

The only thing data-specific about that code is the name of the database. Substitute one of your own.

stanlyn · Oct 4, 2012

Hi Dan,

>> The only thing data-specific about that code is the name of the database. Substitute one of your own.

Well, thats the point of using Adbobjects() which generates the tables, and afields() that generates the field names. They iterate thru the database generating the table names along with their field names. Now that I have the table and field names, I need to test the value of each field to see if it contains chr(0) or whatever. How do I test the data in those fields? Thats the part I'm missing... I'm also know that I will need to also test the field for character types as in gaFields(x, 2).

Thanks, Stanley

Dan Freeman · Oct 4, 2012

You've already posted code that does exactly that.

Code:

Chr(0) $ (gaFields(x, 1))

This will test whether Chr(0) is contained in the field named in gaFields(x,1).

You'll have to uncomment the lines that do the deed, and you may need the code to do other things as well, but you're asking for what you've already got.

stanlyn · Oct 4, 2012

Hi Dan,

>> You've already posted code that does exactly that.

Not true, that why I'm asking. The code:

Code:

if Chr(0) $ (gaFields(x, 1))

always returns .F. because the chr(0) is not contained in what (gaFields(x, 1)) returns which is the name of the field, NOT the value contained inside the field. How do I get to the data (values) inside the fields. I originally tried what I commented out and after seeing that it was only returning the field names in the array and not the data.

I've also tried several variations of the array elements in the debugger trying to find the data and I don't see it anywhere. I've also looked at all the array commands and the only one that looked like it may be used is the ascan() command. My code above shows me attempting to use it, but I kept getting the data type error on the chr(0) part of the command.

Here is the original code...

Code:

Clear
Close Tables All
lnTables = Adbobjects(gaTables, "TABLE")
For i = 1 To lnTables
	Use (gaTables(i)) In 0
	Select (gaTables(i))

	lnFields = Afields(gaFields)
	For x = 1 To lnFields
		If Chr(0) $ (gaFields(x, 1))
		   ?(gaFields(x, 1))
		Endif
	Endfor
	Use In Select(Dbf())
	Wait
Endfor

Mike Lewis · Oct 5, 2012

Stanley,

Instead of this:

Code:

if Chr(0) $ (gaFields(x, 1))

do this:

Code:

if Chr(0) $ [b]EVAL([/b]gaFields(x, 1)[b])[/b]

As you said, the gaFields array contains the names of the fields. By using EVAL(), you get the actual values.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips, training, consultancy

stanlyn · Oct 5, 2012

Hi Mike,

I just tried your suggestion and am getting "Function argument value, type, or count is invalid. error message. Any other suggestions? Here is my pasted version.

Code:

Clear
Close Tables All
lnTables = Adbobjects(gaTables, "TABLE")
For i = 1 To lnTables
	Use (gaTables(i)) In 0
	Select (gaTables(i))

	lnFields = Afields(gaFields)
	For x = 1 To lnFields
		If Chr(0) $ eval(gaFields(x, 1))
			?(gaFields(x, 1))
		Endif
	Endfor
	Use In Select(Dbf())
	Wait
Endfor

stanlyn · Oct 5, 2012

Hi Mike,

Your suggestion and code is correct... The type error I got was coming from a field that is not a character field. I found it by adding ?gaFields(x, 1) immediately before the if chr(0) line...

Thanks, Stanley

Mike Lewis · Oct 5, 2012

So, Stanley, are you saying that the original problem has finally been solved?

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips, training, consultancy

stanlyn · Oct 5, 2012

Hi Mike,

>> So, Stanley, are you saying that the original problem has finally been solved?

Not yet, but this will help and/or eliminate the original problem. Right now I'm saying your last suggestion on getting at the data from the afields() function is what I needed to complete a Sanitize.prg that will be run on the vfp tables looking for any non-printables. I have it working now using only the chr(0). Next, I'm adding the full unprintable char(0-31) less 10, 13, and some others.

Thanks again Mike,
Stanley

Here is what it looks like now:

Code:

Clear
Close Tables All

lnTables = Adbobjects(gaTables, "TABLE")
For i = 1 To lnTables
	?'TABLE: ' + gaTables(i)
	Use (gaTables(i)) In 0
	Select (gaTables(i))

	lnFields = Afields(gaFields)
	For x = 1 To lnFields
		?gaFields(x, 1)

		If (gaFields(x, 2)) = "C"  && test field type
			Go Top
			Do While !Eof()
				If Chr(0) $ Eval(gaFields(x, 1))
					?(gaFields(x, 1))
					?Eval(gaFields(x, 1))
					replace (gaFields(x, 1)) WITH 'BAD'
				Endif
				Skip 1
			Enddo
		Endif
	Endfor
	Use In Select(Dbf())
	Wait
Endfor

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Function to Detect Non Ascii Characters Needed 2

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Similar threads

Log in

Part and Inventory Search

Sponsor