Find and remove non UTF-8 characters

ifmo114 · Jun 5, 2007

I'm reading in data from a file which contains some non UTF-8 characters. I need to write this data to a program which can only handle UTF-8. Does anyone have any tips for finding the non UTF-8 characters and removing them from a string?

Thanks.

stevexff · Jun 5, 2007

UTF-8 is Unicode with all the simple characters mapped back to their original ASCII. So most 'normal' characters are 1 byte, and anything special or unusual is 2 bytes wide. So pretty much anything should be acceptable.

Can you give an example of one of the non-UTF-8 code points that is causing you problems?

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object:erlDesignPatterns)[/small]

ifmo114 · Jun 6, 2007

see the double quotes around the word second...

the “second” time

This is HEX value 93 and 94 (at least according to my TextPad editor.

MakeItSo · Jun 6, 2007

@ifmo: not quite.

They are values 93 and 94 in HEX, but 147 and 148 in DEZ.
;-)

I suggest a little workaround: if the tool that produced this utf-8 file work improperly or (which is what I suspect) the file has been improperly handled with Search/Replace after being saved as UTF-8, then you could use Search&Replace, searching for character 147, replacing it with "â€œ"
searching for character 148, replacing it with ... crap!
I can't post the code for the other one. best if you save some text with these quotes in utf-8 format, open it as ANSI and then copy over that combination.

Hope this helps.

Cheers,
Andy

[blue]Help us, join us, participate
IAHRA - International Alliance of Human Rights Advocates[/blue]

ifmo114 · Jun 6, 2007

Just to clarify, I'm pulling data with these characters out of an Access db table (actually a linked table to SQL Server) and writing it to a txt file.

How can I search for DEZ (I'm not sure what that actually stands for)?

stevexff · Jun 6, 2007

DEZ = decimal

We need to check the source, and work through the chain until we find where the bad conversion occurs.

Look at the SQL Server table definition. Are the columns with the problem data defined as NCHAR, NVARCHAR, or NTEXT?

Use Query Analyzer to select a few rows - do they look OK on the screen?

Then check the Access Table - do they look OK there?

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object:erlDesignPatterns)[/small]

ifmo114 · Jun 6, 2007

The data is written to the SQL Server by a custom tablet PC application and is beyond my control. All I have is read access to the table which I'm doing within a VBA program. I then have to write the data to a file is loaded by another external appliation that fails on non UTF-8.

stevexff · Jun 6, 2007

OK, so it's out of your control. But you can still look at the tables on SQL Server, right?

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object:erlDesignPatterns)[/small]

ifmo114 · Jun 6, 2007

First, thanks for your help with this.

No, I can't access the SQL Server tables directly (i.e. Query Analyzer) but I can access them through MS Access. Is there something I can do there?

CautionMP · Jun 7, 2007

ifmo114,
A couple of thoughts, depending on how your writting the text file you may be able to do a binary replacement. If the the 'fancy' quotes are the only offending characters and your writting the file in VBA ([tt]Write #[/tt] or [tt]Print #[/tt]) you might try something like this, assumes that [tt]strOut[/tt] contains a record from your table just before it is output to the destination file:

Code:

strOut = Replace(strOut, Chr(226) & Chr(128) & Chr(156), "", , , vbBinaryCompare) 'Chr(147)
strOut = Replace(strOut, Chr(226) & Chr(128) & Chr(157), "", , , vbBinaryCompare) 'Chr(148)

And just a point to clarify:

stevexff said:
anything special or unusual is 2 bytes wide

UTF-8 will support characters up to 6 bytes wide, although anything wider than 4 would probably be represented in UTF-16 or UTF-32.
HEX values 93 & 94 are both 3 bytes wide in UTF-8.

Hope this helps,
CMP

[small]For the best results do what I'm thinking, not what I'm saying.[/small]
(GMT-07:00) Mountain Time (US & Canada)

ifmo114 · Jun 7, 2007

So does Chr(226) & Chr(128) & Chr(156/7) represent the fancy beginning and end double quotes? It renders as other special characters.

Is there a long that represents those values so I can replace them?

Could I use the Asc function somehow to accomplish this? There are actually many special characters that could occur.

What if I found out the length of the character in bytes and just removed anything that was longer than 1 byte? How would I figure out the length?

Thanks.

CautionMP · Jun 7, 2007

ifmo114,
Yes, and yes it will render as other characters if viewed in ASCII (single byte structure).

You can do the conversion to a long value [tt]226 x (256)² + 128 x (256)¹ + 116 = 14844060[/tt] which I tried but didn't work since the [tt]Replace()[/tt] function expects the Replace value to be a string.

If you could get the value from the table field into a byte array without the unicode padding (a simple assignement statement will take ASCII text and pad it to two bytes) you could use a loop of the byte array to pull out only the ASCII characters using the value of the first byte to determine the character length in bytes.

In the following example [tt]b[/tt] is a byte array I'm populating from a text file using a Get statement.

Code:

Dim l As Long
Dim s As String
For l = LBound(b) To UBound(b)
  Select Case b(l)
    Case Is < 128
      'Grab only the ASCII characters
      'and append to s
      s = s & Chr(b(l))
    Case Is < 192
      l = l + 1
    Case Is < 224
      l = l + 2
    Case Is < 240
      l = l + 3
    Case Is < 248
      l = l + 4
    Case Is < 252
      l = l + 5
  End Select
Next l
'Output s
Debug.Print s

Hope this helps,
CMP

[small]For the best results do what I'm thinking, not what I'm saying.[/small]
(GMT-07:00) Mountain Time (US & Canada)

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Find and remove non UTF-8 characters

ifmo114

Programmer

stevexff

Programmer

ifmo114

Programmer

MakeItSo

Programmer

ifmo114

Programmer

stevexff

Programmer

ifmo114

Programmer

stevexff

Programmer

ifmo114

Programmer

CautionMP

Programmer

ifmo114

Programmer

CautionMP

Programmer

Similar threads

Part and Inventory Search

Sponsor