UTF-8 Encoding/Decoding in Java

plunkett · Jul 26, 2004

Hi.

I've created an application that communicates with a remote server. The creators of the server decided to UTF-8 encode parts of the data sent to my client, but other parts are left unchanged. I can read the UTF-8 encoded parts fine, but when non-ASCII characters appear in the non-encoded parts, I can't handle them correctly (I get question marks for these characters). I'm running Linux and my locale is UTF-8. I really don't know what I need to do... can anyone help?

JoelNaten · Jul 26, 2004

UTF-8 is identical to ASCII for the 128 characters that ASCII defines. There's nothing you need to do to convert those.


sedj · Jul 26, 2004

This sums it up nicely:

http://en.wikipedia.org/wiki/UTF-8


plunkett · Jul 27, 2004

Characters such as ¿Üƒ are not part of the US-ASCII character set. When they are sent without UTF-8 encoding, they don't display correctly in Java... or maybe I don't understand what you are trying to tell me?

chiph · Jul 27, 2004

Are you sure they're ASCII, and not ANSI? ASCII is a 7-bit encoding, and like JoelNaten said, it maps exactly onto the first 128 code points (0 to 127) of UTF-8.

However, if they're sending you ANSI, which is an 8-bit encoding, you've got a problem: in UTF-8, a byte with the high bit set is always part of a multi-byte sequence representing one character. So if the non-encoded data contains a byte value above 127 decimal, it will not decode as valid UTF-8.
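To illustrate the point about the high bit, here is a minimal sketch that decodes the same bytes two ways. It assumes the un-encoded part is ISO-8859-1, which the thread has not confirmed; the class name, method names, and sample bytes are made up for the example.

```java
import java.nio.charset.StandardCharsets;

public class HighBitDemo {
    // Decode as UTF-8: a lone byte above 127 is malformed and gets replaced.
    static String asUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decode as ISO-8859-1 (assumed charset): every byte maps to one character.
    static String asLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "jei", then 0xDC (U-umlaut in ISO-8859-1), then "ei"
        byte[] raw = { 'j', 'e', 'i', (byte) 0xDC, 'e', 'i' };
        System.out.println("as ISO-8859-1: " + asLatin1(raw));
        System.out.println("as UTF-8:      " + asUtf8(raw));
    }
}
```

The UTF-8 result contains a replacement character where the 0xDC byte was, which many consoles show as a question mark.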

Sounds like you need to read the data in chunks (after beating them upside the head for such a brain-dead idea). You'll need to read however many bytes make up the ANSI portion, then however many bytes make up the UTF-8 portion, looking for some signature to know when the UTF-8 portion ends (you can't count characters by counting bytes, as UTF-8 is a variable-length encoding). Ugly.

Chip H.


plunkett · Jul 27, 2004

No, I'm pretty sure they are not ASCII... how else would stuff like Üº¿ê show up?

But to answer your other question...

Data is sent delimited by "#"

for example... jeiÜº¿êei#fi3Üº¿êdif

The latter part is UTF-8 encoded, the first part is not...

In Java, I get... jei????ei#fi3Üº¿êdif

Therefore, finding which part is encoded and which part is not is not really a problem. I'm just trying to find out what to do to the first part to make it come out correctly...

stefanwagner · Jul 27, 2004

"but some parts of it are left unchanged."

Left unchanged as what?

And perhaps your locales differ?

seeking a job as java-programmer in Berlin:

http://home.arcor.de/hirnstrom/bewerbung

plunkett · Jul 27, 2004

Not sure... maybe Unicode?

stefanwagner · Jul 27, 2004

Or maybe ISO-8859-4 (Latin-4)?


byam · Jul 27, 2004

My guess is that the way the data is being read treats everything as UTF-8.

I would suggest treating the incoming data as bytes, looking for the '#' byte to split the data into separate UTF-8 and non-UTF-8 byte[] arrays, then converting each into a String separately.
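A rough sketch of that approach might look like the following. It assumes the record is already in a byte[] (read from an InputStream rather than a Reader), that there is exactly one '#' delimiter with the non-UTF-8 part first, as in plunkett's example, and that the caller supplies the charset of the first part. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MixedRecordDecoder {
    // Splits the raw record at the first '#' byte and decodes each half
    // with its own charset. Splitting on the byte is safe for the UTF-8 half
    // because every byte of a multi-byte UTF-8 sequence has the high bit set,
    // so 0x23 ('#') can only ever be a real '#'.
    static String[] decode(byte[] record, Charset firstPartCharset) {
        int split = -1;
        for (int i = 0; i < record.length; i++) {
            if (record[i] == '#') {
                split = i;
                break;
            }
        }
        if (split < 0) {
            throw new IllegalArgumentException("no '#' delimiter found");
        }
        String first = new String(record, 0, split, firstPartCharset);
        String second = new String(record, split + 1,
                record.length - split - 1, StandardCharsets.UTF_8);
        return new String[] { first, second };
    }
}
```

Usage would be something like `MixedRecordDecoder.decode(bytes, StandardCharsets.ISO_8859_1)`, with the charset swapped for whatever the server really uses for the first part.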

stefanwagner · Jul 27, 2004

The capital 'Ü' is a German letter, so perhaps it's ISO-8859-1 (Latin-1, Western European) or ISO-8859-15 (Latin-9, Western European plus the euro sign)? I don't know the Windows equivalents...
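Since nobody here knows the real charset yet, one way to narrow it down is to decode a captured sample with each candidate and see which output looks right. The sketch below does that; the candidate list (including windows-1252 as a possible Windows counterpart) and the sample bytes are assumptions, and the class and method names are made up.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

public class CharsetCandidates {
    // Candidate single-byte charsets to try; this list is a guess.
    static final String[] CANDIDATES = { "ISO-8859-1", "ISO-8859-15", "windows-1252" };

    // Decodes the same bytes with every candidate this JVM supports.
    static Map<String, String> decodeWithEach(byte[] raw) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String name : CANDIDATES) {
            if (Charset.isSupported(name)) {
                results.put(name, new String(raw, Charset.forName(name)));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Hypothetical sample: "jei", four high bytes, "ei"
        byte[] raw = { 'j', 'e', 'i', (byte) 0xDC, (byte) 0xBA,
                       (byte) 0xBF, (byte) 0xEA, 'e', 'i' };
        decodeWithEach(raw).forEach(
                (name, text) -> System.out.println(name + " -> " + text));
    }
}
```

These charsets agree on most letters, so a sample containing a euro sign or another character where they differ tells you more than one containing only 'Ü'.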

