
Reading non-standard characters with StreamReader

Status
Not open for further replies.

mtbPierre

Programmer
Dec 15, 2003
4
GB
I'm using a StreamReader to process files from different countries in this simple way:
Code:
string row;
StreamReader sr = new StreamReader(@"c:\test", System.Text.Encoding.UTF8);
while ((row = sr.ReadLine()) != null)
{
    // process the line
}
sr.Close();
However, the reader simply ignores non-standard characters, such as the Danish Ø for example: the reader treats them as if they do not exist. If I change the encoding to ASCII they are replaced by question marks, and with Unicode they cause an error.
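For what it's worth, the byte-level view makes the symptom clearer. In a single-byte codepage such as Latin-1 or Windows-1252, Ø is the single byte 0xD8; to a UTF-8 decoder, 0xD8 is a lead byte that must be followed by a continuation byte, so on its own it is invalid. A minimal sketch of this (not from the post above; the byte values are the standard ones):

```csharp
using System;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        // 0xD8 is 'Ø' in Latin-1 and Windows-1252, followed by 'A' (0x41)
        byte[] bytes = { 0xD8, 0x41 };

        // In a single-byte encoding every byte maps to exactly one character
        string latin1 = Encoding.GetEncoding("iso-8859-1").GetString(bytes);
        Console.WriteLine(latin1); // "ØA"

        // Under UTF-8, 0xD8 promises a continuation byte; 0x41 is not one,
        // so the sequence is invalid. Newer runtimes substitute U+FFFD for
        // the bad byte; the decoder of the .NET 1.x era silently dropped it,
        // which matches the "treated as if they do not exist" symptom.
        string utf8 = Encoding.UTF8.GetString(bytes);
        Console.WriteLine(utf8.Length);
    }
}
```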

The program needs to run on a British server and process files from several different European countries.

Any ideas please?
 
Does your file have the UTF-8 BOM characters at the front of it? They might be necessary to tell the reader that you have a UTF-8 file.
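Checking for the BOM is a three-byte read at the start of the file. A quick sketch, assuming the same hypothetical path as the original snippet:

```csharp
using System;
using System.IO;

class BomCheck
{
    static void Main()
    {
        // Hypothetical path -- substitute the file being processed
        using (FileStream fs = new FileStream(@"c:\test", FileMode.Open, FileAccess.Read))
        {
            byte[] head = new byte[3];
            int n = fs.Read(head, 0, 3);

            // The UTF-8 byte order mark is EF BB BF
            bool hasUtf8Bom = n == 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF;
            Console.WriteLine(hasUtf8Bom ? "UTF-8 BOM present" : "no UTF-8 BOM");
        }
    }
}
```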

Chip H.


If you want to get the best response to a question, please check out FAQ222-2244 first
 
The problem is that the file is not UTF8 or any of the other formats that the StreamReader seems to support. The file is text from a UNIX system. The characters are all (I think) extended ASCII. The files will open fine in notepad or the .NET IDE or whatever.

It would seem .NET has a textreader that does not read text unless it is English text!
 
By using the Encoding.UTF8 class, you're telling it to expect Unicode text. While this should work (since Unicode knows about Unix-style linebreak characters), I don't know why it isn't working for you.

I suggest you look at the file with a hex editor and see what characters are in your file, and whether you can spot one or more that might be causing problems. You might also want to start with a small file and work your way up gradually to larger files that contain some of the characters you're seeing in the problem file.

Chip H.


 
I've done some research elsewhere and it would appear I'm not the only one to have discovered this problem. Just do a search on Google groups and you'll find loads - none with a fix.

You can also use, for example:

Code:
StreamReader sr = new StreamReader(@"c:\test", System.Text.Encoding.GetEncoding(1252));

to specify the encoding to use from a codepage (in this case Windows-1252, the Windows ANSI codepage, which does contain all the characters needed). This, however, still does not work with my files.

To get round the problem I am currently rewriting the part of my application that does this to read the files a byte at a time from a FileStream object. This works but is obviously more of a pain.
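Something along these lines works (a sketch of the workaround described above, not the poster's actual code; the path is hypothetical and ProcessLine is a placeholder). Because the source files are single-byte, each byte can be mapped straight to a character through the Latin-1 table, where the byte value is the code point:

```csharp
using System;
using System.IO;
using System.Text;

class ByteReader
{
    static void Main()
    {
        StringBuilder line = new StringBuilder();
        using (FileStream fs = new FileStream(@"c:\test", FileMode.Open, FileAccess.Read))
        {
            int b;
            while ((b = fs.ReadByte()) != -1)
            {
                if (b == '\n')               // Unix-style line break
                {
                    ProcessLine(line.ToString());
                    line.Length = 0;
                }
                else if (b != '\r')          // tolerate Windows line endings too
                {
                    // In Latin-1 the byte value IS the code point,
                    // so 0xD8 becomes 'Ø' directly
                    line.Append((char)b);
                }
            }
            if (line.Length > 0)
                ProcessLine(line.ToString());
        }
    }

    static void ProcessLine(string row)
    {
        Console.WriteLine(row);              // placeholder for real processing
    }
}
```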

I think the problem lies in the fact that UTF-8 is a multi-byte encoding while the files are single-byte. For the foreign letters a UTF-8 decoder expects a lead byte followed by one or more continuation bytes, but it only sees one byte, so it does not recognise the character.

We appear to have a text reader that only reads (US?) English text.

.NET which was supposed to make Globalization and cross platform apps so easy has failed me on my first scenario!
 
Have you tried opening the file with codepage 1252 encoding? You may not have the same problem as in the other post you cite. But I have a hunch this may just NOT be codepage 1252 encoding.

Try this:

1) Open the file in VS (if it reads fine there)
2) Select File > Save as
3) Using the down-arrow selector on the Save button, select Save with Encoding
4) The current encoding should be shown in the top selector of the Advanced Save Options dialog.

This may help you figure out what is going on. If indeed it is just a case of a different codepage being used, use an encoding built for it ( Encoding.GetEncoding(codepagenumber) ).

I don't know of any programmatic way to detect the encoding of a file. If you find one, let me know.


 
I don't know of a way to detect encoding either. I know that Internet Explorer examines the page and does some statistical analysis to make a guess at the encoding (sometimes it gets it wrong).
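A very crude first pass along those lines is easy to sketch, though: recognise the common BOMs and otherwise fall back to a single-byte codepage. This is nothing like IE's statistical analysis, just the cheap part of it (the fallback choice of Latin-1 here is an assumption, not a recommendation):

```csharp
using System;
using System.Text;

class EncodingGuesser
{
    // Crude heuristic: known BOMs first, single-byte fallback otherwise.
    // Real detection (as in IE) analyses byte patterns statistically.
    static Encoding GuessEncoding(byte[] head)
    {
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return Encoding.UTF8;
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return Encoding.Unicode;            // UTF-16 little-endian
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return Encoding.BigEndianUnicode;   // UTF-16 big-endian
        return Encoding.GetEncoding("iso-8859-1"); // single-byte fallback
    }

    static void Main()
    {
        Console.WriteLine(GuessEncoding(new byte[] { 0xEF, 0xBB, 0xBF, 0x41 }).WebName); // utf-8
        Console.WriteLine(GuessEncoding(new byte[] { 0xD8, 0x41 }).WebName);             // iso-8859-1
    }
}
```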

Chip H.


 
The 1252 codepage works for the one character you mentioned, but you would need to check it out on your files.
 
The 1252 encoding does not work on my files, as I mentioned above. Neither do any of the other encodings.

I tried djinn01's technique for finding the encoding of the file, and 1252 was what came up. Maybe VS makes a guess like IE does? It's weird: everything you open the file with works OK, except reading it with a StreamReader, which is supposed to be the tool for reading text files like this!

I've rewritten my app to read the files as bytes from a FileStream object and it is working OK.

Thanks for your input, folks.
 