UTF-8 Encoding/Decoding in Java

Status
Not open for further replies.

plunkett

Technical User
Jun 10, 2004
57
US
Hi.

I've created an application that communicates with a remote server. The creators of the server decided to UTF-8 encode parts of the data as it's sent to my client, but other parts are left unencoded. My problem is that I can read the UTF-8 encoded parts fine, but when non-ASCII characters appear in the non-encoded parts, I can't handle them correctly (I'm getting question marks for these characters). I'm running Linux and my locale is UTF-8. I really don't know what I need to do... can anyone help?
 
Characters such as ¿Üƒ etc. are not part of the US-ASCII character set. When they are sent without UTF-8 encoding, they don't display correctly in Java... or maybe I don't understand what you are trying to tell me?
 
Are you sure they're ASCII, and not ANSI? ASCII is a 7-bit encoding, and as JoelNaten said, it maps perfectly to the first 128 characters (values 0 to 127) in UTF-8.
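A quick way to check the point about ASCII is to encode the same text both ways and compare the bytes. This is only a sketch: the class and method names are made up for illustration, and it assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of any library.
final class AsciiSubsetDemo {
    // True when the text encodes to identical bytes in US-ASCII and UTF-8.
    static boolean sameBytesInAsciiAndUtf8(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, utf8);
    }

    public static void main(String[] args) {
        // Pure 7-bit text: the two encodings agree byte for byte.
        System.out.println(sameBytesInAsciiAndUtf8("jei#fi3")); // true
        // 'Ü' (U+00DC) is outside ASCII, so the encodings differ.
        System.out.println(sameBytesInAsciiAndUtf8("\u00DC"));  // false
    }
}
```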

However, if they're sending you ANSI (loosely, an 8-bit code page such as Windows-1252), you've got a problem, as the high bit in UTF-8 indicates that multiple bytes represent the one character. So if the data contains a byte with a value above 127 decimal, a UTF-8 decoder will try to read it as part of a multi-byte sequence and get it wrong.
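To see why a byte above 127 trips up a UTF-8 decoder, here is a small sketch. It assumes the 8-bit data is ISO-8859-1, which is a guess since nobody in the thread has named the actual code page, and the class name is hypothetical. A strict decoder rejects the lone byte; Java's lenient String constructor quietly substitutes U+FFFD, which is probably what ends up displayed as a question mark.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class HighBitDemo {
    // Strict check: true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // ASSUMPTION: ISO-8859-1 stands in for the unnamed 8-bit encoding.
        byte[] latin1 = {(byte) 0xDC};            // 'Ü' as one byte, high bit set
        byte[] utf8 = {(byte) 0xC3, (byte) 0x9C}; // 'Ü' as two UTF-8 bytes

        System.out.println(isValidUtf8(latin1)); // false
        System.out.println(isValidUtf8(utf8));   // true

        // Lenient decoding does not throw; it yields the replacement character.
        String lenient = new String(latin1, StandardCharsets.UTF_8);
        System.out.println(lenient.equals("\uFFFD")); // true
    }
}
```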

Sounds like you need to read the data in chunks (after beating them upside the head for such a brain-dead idea). You'll need to read however many bytes make up the ANSI portion, then read however many bytes make up the UTF-8 portion, looking for some signature to know when the UTF-8 portion ends (you can't count characters, as UTF-8 uses a variable number of bytes per character). Ugly.

Chip H.


 
No, I'm pretty sure they are not ASCII... how else would stuff like ܺ¿ê show up?

But to answer your other question...

Data is sent delimited by "#"

for example... jeiܺ¿êei#fi3ܺ¿êdif

The latter part is UTF-8 encoded, the first part is not...

In Java, I get... jei????ei#fi3ܺ¿êdif

Therefore, finding which part is encoded and which part is not isn't really a problem. I'm just trying to find out what to do to this first part to make it come out correctly....
 
I guess the way the data is being read decodes everything as UTF-8.

I would suggest treating the incoming data as bytes, looking for the '#' byte to split the data into separate non-UTF-8 and UTF-8 byte[] arrays, then converting each into a String separately with its own charset.
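Something along these lines might work as a starting point. It is a sketch under assumptions, not the server's real protocol: it assumes each record has exactly two fields split at the first '#', that the first field is ISO-8859-1 (the thread never says which 8-bit encoding is actually used, so swap in the right one), and the class name is invented. Splitting on the raw '#' byte is safe for the UTF-8 field because every byte of a multi-byte UTF-8 sequence has its high bit set, so 0x23 can only ever be a real '#'.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class MixedRecordDecoder {
    // ASSUMPTION: the non-UTF-8 field is ISO-8859-1; replace with the real charset.
    static final Charset LEGACY = StandardCharsets.ISO_8859_1;

    // Splits the raw record at the first '#' byte and decodes each half separately.
    static String[] decode(byte[] record) {
        int split = -1;
        for (int i = 0; i < record.length; i++) {
            if (record[i] == '#') {
                split = i;
                break;
            }
        }
        if (split < 0) {
            throw new IllegalArgumentException("no '#' delimiter in record");
        }
        // First field: legacy 8-bit encoding, one byte per character.
        String first = new String(record, 0, split, LEGACY);
        // Second field: UTF-8, possibly several bytes per character.
        String second = new String(record, split + 1,
                record.length - split - 1, StandardCharsets.UTF_8);
        return new String[] {first, second};
    }

    public static void main(String[] args) {
        // "jeiÜ" in ISO-8859-1, then '#', then "fi3Ü" in UTF-8.
        byte[] record = {'j', 'e', 'i', (byte) 0xDC, '#',
                         'f', 'i', '3', (byte) 0xC3, (byte) 0x9C};
        String[] parts = decode(record);
        System.out.println(parts[0].equals("jei\u00DC")); // true
        System.out.println(parts[1].equals("fi3\u00DC")); // true
    }
}
```

The important part is that the bytes come straight off an InputStream and are never passed through a Reader or a default-charset String conversion before the split.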



 