XML reader fails on foreign characters

jojo11 · Feb 8, 2005

I have an application that reads in XML files that at times contain German characters like the u with the two dots on top (ü). I am using the .NET XML reader. When I specify the encoding as UTF-8 in the header of the XML file, the foreign characters fail. When I tried windows-1252, it was OK.
Can someone explain why? I hate just putting a fix out there when I don't understand why it worked. Also, was this a proper solution?

Thanks

-------------------------------------------
Ummm, we have a bit of a problem here....

SgtPeppa · Feb 8, 2005

This is a very short explanation of ISO 8859-1, which you should have used! UTF-8 encoded documents don't understand letters like ÜÖÄ and so on, but the ISO 8859-1 encoding was specifically designed for the needs of Western European languages.

http://tlt.its.psu.edu/suggestions/international/web/encoding/04latin1.html

Stephan

jojo11 · Feb 8, 2005

Thank you very much.
I tried the iso-8859-1 reference and it worked great.
Can I assume that everything that was supported by UTF-8 will also be supported by iso-8859-1?


chiph · Feb 10, 2005

SgtPeppa is leading you a bit astray.

In increasing order of breadth of characters covered:
    ISO-8859-1
    Windows-1252
    Unicode (UTF-8, UTF-16, UTF-32, etc.)

It is true that 8859-1 is geared towards Western European languages, but 1252 supports more characters (such as the euro sign and curly quotation marks). Unicode supports almost the entire planet's character sets.
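
A quick way to see that ordering for yourself is to try encoding a few characters in each encoding. This is only a sketch in Python (not the .NET reader from the original question), and the helper name can_encode is made up for the illustration:

```python
def can_encode(char: str, encoding: str) -> bool:
    """Return True if `char` has a representation in `encoding`."""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# One character per "tier": Latin-1 range, Windows-1252 extra, Unicode only.
samples = ["ü", "€", "語"]

for encoding in ("iso-8859-1", "windows-1252", "utf-8"):
    results = [can_encode(ch, encoding) for ch in samples]
    print(encoding, dict(zip(samples, results)))
```

The ü is available everywhere, the euro sign appears from Windows-1252 upward, and the CJK character only encodes in UTF-8.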

What may be happening in your case is that the document was originally encoded in 8859-1 or a similar encoding. 8859-1 and 1252 are single-byte encodings, meaning that they can express at most 256 different characters. UTF-8 is a variable-length encoding that can express over 1.1 million characters, using between 1 and 4 bytes for each one. A ü saved as the single byte 0xFC is not a valid UTF-8 sequence, so a reader that has been told to expect UTF-8 will reject it.
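
To make the byte-level difference concrete, here is a small sketch in Python (an assumption on my part; the original question is about .NET, but the bytes are the same in any language). The sample word and the helper decodes_as_utf8 are just for illustration:

```python
text = "für"

# Single-byte encodings: the ü becomes the one byte 0xFC.
latin1_bytes = text.encode("iso-8859-1")
cp1252_bytes = text.encode("windows-1252")

# UTF-8: the same ü becomes the two bytes 0xC3 0xBC.
utf8_bytes = text.encode("utf-8")


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if `data` is a valid UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(latin1_bytes, cp1252_bytes, utf8_bytes)
print(decodes_as_utf8(latin1_bytes))        # False: a lone 0xFC is not valid UTF-8
print(decodes_as_utf8(utf8_bytes))          # True
print(len("\U0010FFFF".encode("utf-8")))    # 4: the highest code point needs four bytes
```

So a file saved in 8859-1 or 1252 but labelled UTF-8 breaks on the first accented letter, while plain ASCII text in the same file reads fine.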

The only sure way to tell if a document is encoded in UTF-8 is if you see the first three bytes are 0xEF 0xBB 0xBF. This is the UTF-8 Byte Order Mark:

http://www.unicode.org/faq/utf_bom.html

Unfortunately, it's an optional element of a UTF-8 file, so a missing BOM doesn't prove that the file isn't UTF-8.
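
A rough sketch of both checks in Python (again an assumption, not the .NET API; the helper names are made up). It looks for the BOM first, and when there is none it falls back to testing whether the bytes are at least valid UTF-8:

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the bytes 0xEF 0xBB 0xBF."""
    return data.startswith(codecs.BOM_UTF8)


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8 (necessary, not sufficient)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


xml_text = "<name>Müller</name>"
with_bom = codecs.BOM_UTF8 + xml_text.encode("utf-8")
without_bom = xml_text.encode("utf-8")
latin1 = xml_text.encode("iso-8859-1")

for label, data in (("with BOM", with_bom), ("no BOM", without_bom), ("latin-1", latin1)):
    print(label, has_utf8_bom(data), is_valid_utf8(data))
```

A failed decode tells you the file is definitely not UTF-8. A successful one is weaker evidence, since pure ASCII content decodes cleanly under all three encodings discussed here.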

In your case, it sounds like it wasn't a UTF-8 file. But you should still contact the person who sent you the file to determine what encoding they used to create it. It might not have been 8859-1.
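
The symptom can be reproduced with any XML parser. Here is a sketch using Python's xml.etree.ElementTree as a stand-in for the .NET reader (that substitution is my assumption, as are the sample document and the helper name). The same 8859-1 bytes are parsed twice, once under each declaration:

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Document body saved in ISO-8859-1: the ü is the single byte 0xFC.
body = "<name>Müller</name>".encode("iso-8859-1")


def parse_with_declared(encoding: str) -> Optional[str]:
    """Prepend an XML declaration naming `encoding`; return the text or None on failure."""
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>'.encode("ascii")
    try:
        return ET.fromstring(declaration + body).text
    except ET.ParseError:
        return None


print(parse_with_declared("UTF-8"))       # None: the parser rejects the 0xFC byte
print(parse_with_declared("ISO-8859-1"))  # Müller
```

The declaration has to match the bytes actually on disk. Changing only the header works when the file really is in that encoding, which is why it is worth confirming with whoever produced it.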

Chip H.

