XML reader fails on foreign characters

Status
Not open for further replies.

jojo11

Programmer
Feb 2, 2003
189
US
I have an application that reads in XML files that at times contain German characters, like the u with the two dots on top (ü). I am using the .NET XML reader. When I specify the encoding as UTF-8 in the header of the XML file, the foreign characters fail. When I tried windows-1252 instead, it was OK.
Can someone explain why? I hate just putting a fix out there when I don't understand why it worked. Also, was this a proper solution?

Thanks

-------------------------------------------
Ummm, we have a bit of a problem here....
 
Thank you very much.
I tried the iso-8859-1 reference and it worked great.
Can I assume that everything that was supported by UTF-8 will also be supported by iso-8859-1?

SgtPeppa is leading you a bit astray.

In increasing order of breadth of characters covered:
    ISO-8859-1
    Windows-1252
    Unicode (UTF-8, UTF-16, UTF-32, etc.)

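It is true that 8859-1 is geared towards Western European languages, but 1252 supports more characters: it uses the 0x80-0x9F range, which 8859-1 reserves for control codes, for printable characters such as the euro sign and curly quotes. Unicode supports almost the entire planet's character sets.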
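
To make the ordering above concrete, here is a minimal sketch in Python (not the .NET code from the original question) that checks which of the three encodings can represent a few sample characters. The names `SAMPLES`, `can_encode`, and `coverage_table` are made up for this illustration.

```python
# Sample characters: u-umlaut, euro sign, and a CJK ideograph.
SAMPLES = ["\u00fc", "\u20ac", "\u65e5"]
ENCODINGS = ["iso-8859-1", "windows-1252", "utf-8"]


def can_encode(ch: str, encoding: str) -> bool:
    """Return True if the character has a representation in the encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def coverage_table() -> dict:
    """Map each sample character to the encodings that can represent it."""
    return {ch: [enc for enc in ENCODINGS if can_encode(ch, enc)]
            for ch in SAMPLES}


if __name__ == "__main__":
    for ch, encodings in coverage_table().items():
        print(f"U+{ord(ch):04X}: {', '.join(encodings)}")
```

The u-umlaut fits in all three, the euro sign only in Windows-1252 and UTF-8, and the ideograph only in UTF-8, so moving a document from UTF-8 to 8859-1 is only safe if it never uses anything outside the 8859-1 repertoire.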

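What may be happening in your case is that the document was originally encoded in 8859-1 or a similar encoding. 8859-1 and 1252 are single-byte encodings, meaning that they can express at most 256 different characters. UTF-8 is a variable-length encoding that can express over 1.1 million code points, using between one and four bytes for each. A ü is the single byte 0xFC in 8859-1 and 1252 but the two bytes 0xC3 0xBC in UTF-8, so a file saved in a single-byte encoding but labeled as UTF-8 contains bytes that are not valid UTF-8, and the reader rejects them.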
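
The mismatch can be reproduced in a few lines. This is a sketch in Python rather than .NET, and the helper name `decodes_as_utf8` and the sample text are assumptions chosen for illustration.

```python
U_UMLAUT = "\u00fc"

# The same character, encoded two ways.
single_byte = U_UMLAUT.encode("windows-1252")  # one byte: 0xFC
multi_byte = U_UMLAUT.encode("utf-8")          # two bytes: 0xC3 0xBC


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes form valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A name saved by an editor using Windows-1252.
saved_as_1252 = "M\u00fcller".encode("windows-1252")

if __name__ == "__main__":
    print(single_byte.hex(), multi_byte.hex())
    # Labeling these bytes as UTF-8 does not make them UTF-8.
    print(decodes_as_utf8(saved_as_1252))
    # Decoding with the encoding that was really used recovers the text.
    print(saved_as_1252.decode("windows-1252"))
```

A lone 0xFC byte is never valid in UTF-8, which is consistent with the reader failing when the header says UTF-8 and succeeding when it says windows-1252.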

The strongest sign that a document is encoded in UTF-8 is that its first three bytes are 0xEF 0xBB 0xBF. This is the UTF-8 Byte Order Mark (BOM). Unfortunately, it's an optional element of a UTF-8 file, so a missing BOM doesn't definitely say it's not a UTF-8 encoding.
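
Checking for the BOM is a simple prefix test. The sketch below uses Python's standard `codecs.BOM_UTF8` constant; the function name `has_utf8_bom` and the sample documents are hypothetical.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the bytes 0xEF 0xBB 0xBF."""
    return data.startswith(codecs.BOM_UTF8)


# Two copies of the same small document: one written with a BOM, one without.
body = "<name>M\u00fcller</name>".encode("utf-8")
with_bom = codecs.BOM_UTF8 + body
without_bom = body

if __name__ == "__main__":
    print(has_utf8_bom(with_bom))
    print(has_utf8_bom(without_bom))  # still UTF-8, just not marked
    # The "utf-8-sig" codec strips a leading BOM if one is present.
    print(with_bom.decode("utf-8-sig"))
```

A True result is strong evidence of UTF-8, while False says nothing either way, which is why asking the sender remains the reliable route.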

In your case, it sounds like it wasn't a UTF-8 file. But you should still contact the person who sent you the file to determine what encoding they used to create it. It might not have been 8859-1.

Chip H.

