Strange special characters in textfile

frag · Aug 10, 2005

Hi all,

I have a big problem with a small textfile.

I read a textfile that was downloaded from a webserver... like this:

Code:

String line;
FileInputStream fin =  new FileInputStream(strImportFile);
BufferedReader myInput = new BufferedReader(new InputStreamReader(fin));

try
{
    // read the file line by line
    while ((line = myInput.readLine()) != null)
    {  
      System.out.println(line);
    }
}
.
.
.

The output (in Eclipse) looks like this:

Code:

.H.e.l.l.o.

The "." are not really dots; they are just displayed as small dots in the Eclipse console. I would not care about this, but when I get the length of the string it is not 5 (as expected), it is 11!

I have already tried to convert the file with unix2dos (I am working with Windows 2000), but this will just fix the carriage-return/linefeed problem.

Does anyone have a clue what the problem is with this file??

I could filter these characters, but I don't know what kind of special character is used in this file. I also wonder where they come from.

If you want to reproduce my problem just download the file here:

https://mfi-assets.ecb.int/dla_EA.htm

The file I am talking about is the file under "All eligible assets", "Selection", "Full Database", "Uncompressed"... uhm... to make it easier... it is the file in the upper right corner of the page.

I appreciate any comment about this issue!

Cheers
frag

real_firestorm@gmx.de

frag · Aug 10, 2005

Ok, solved this problem on my own... the file contains NULL-characters. I now just filter them with:

Code:

// Strings are immutable, so the result has to be assigned back
line = line.replaceAll("\\x00", "");

Perhaps this might be handy for someone else.

Sorry for double post...

frag

real_firestorm@gmx.de

timw · Aug 10, 2005

Perhaps the text file uses a Unicode encoding such as UTF-16. Each character would then take two bytes, one of which is zero for the characters of the regular 8-bit character sets.
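
To illustrate the idea, here is a minimal sketch. It assumes the file is UTF-16 little-endian with CR/LF line endings, which the thread does not confirm, and the class name `MisdecodeDemo` and the sample text are made up for the example. Decoding such bytes with a single-byte charset turns every zero byte into a NUL character, and the zero byte belonging to each line feed ends up at the start of the next line. That would account for a length of 11 instead of 5.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MisdecodeDemo {

    // Reads all lines from the given bytes, decoding them with the given charset.
    static List<String> readLines(byte[] data, Charset charset) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumed sample content: two lines, stored as UTF-16LE with CR/LF.
        byte[] data = "Hi\r\nHello\r\n".getBytes(StandardCharsets.UTF_16LE);

        // Wrong charset: every line picks up NUL characters.
        for (String line : readLines(data, StandardCharsets.ISO_8859_1)) {
            System.out.println("single-byte decode, length " + line.length());
        }

        // Matching charset: "Hi" and "Hello" come back unchanged.
        for (String line : readLines(data, StandardCharsets.UTF_16LE)) {
            System.out.println("UTF-16LE decode, length " + line.length() + ": " + line);
        }
    }
}
```

With the single-byte decode, the "Hello" line comes back as a leading NUL followed by five letter/NUL pairs, so 11 characters in total.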

Tim
---------------------------
"Your morbid fear of losing,
destroys the lives you're using." - Ozzy

timw · Aug 10, 2005

Maybe if you tried

Code:

String line;
FileInputStream fin =  new FileInputStream(strImportFile);
BufferedReader myInput = new BufferedReader(new InputStreamReader(fin, "UTF-16"));

try
{
    // read the file line by line
    while ((line = myInput.readLine()) != null)
    {
      System.out.println(line);
    }
}
catch (IOException e)
{
    // needs import java.io.IOException
    e.printStackTrace();
}

This might deal with it more correctly than replacing zero chars.
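
A small sketch of why decoding is safer than filtering, assuming UTF-16 little-endian data and ISO-8859-1 as the single-byte charset used by mistake (both are assumptions, and the class and method names are made up for the example). Stripping NUL characters only recovers the text while every character fits in one byte; anything else, such as a euro sign, comes out as two unrelated characters.

```java
import java.nio.charset.StandardCharsets;

public class NulFilterDemo {

    // Hypothetical helper: mimics reading UTF-16LE bytes with a single-byte
    // charset (ISO-8859-1 is assumed) and then deleting the NUL characters.
    static String decodeByStrippingNuls(byte[] utf16le) {
        String misread = new String(utf16le, StandardCharsets.ISO_8859_1);
        return misread.replace("\0", "");
    }

    // Decodes the same bytes with the charset they were written in.
    static String decodeAsUtf16(byte[] utf16le) {
        return new String(utf16le, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        String ascii = "Hello";
        String withEuro = "5 \u20AC"; // "5 " followed by a euro sign

        byte[] asciiBytes = ascii.getBytes(StandardCharsets.UTF_16LE);
        byte[] euroBytes = withEuro.getBytes(StandardCharsets.UTF_16LE);

        // Pure ASCII text survives the NUL filter.
        System.out.println(decodeByStrippingNuls(asciiBytes).equals(ascii));

        // The euro sign (bytes AC 20) does not: it turns into two other characters.
        System.out.println(decodeByStrippingNuls(euroBytes).equals(withEuro));

        // Decoding as UTF-16 handles both cases.
        System.out.println(decodeAsUtf16(euroBytes).equals(withEuro));
    }
}
```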

Tim
---------------------------
"Your morbid fear of losing,
destroys the lives you're using." - Ozzy

frag · Aug 10, 2005

Hi timw,

thank you for the tip. But it won't do the trick for me in this case.

The file has tab-separated values. If I use UTF-16, the "empty" fields (two tabs in a row) disappear during the read process.

This file seems to come from a host-system with strange encoding... anyhow my solution is working for this file.

Cheers
frag

real_firestorm@gmx.de

PaulMcKay · Aug 22, 2005

I think you've most likely come across a "Microsoft Text File". I did myself last week. Text files used to be sequences of ASCII characters. No longer. Now we have Unicode. There are several ways of representing Unicode characters. You can often tell which way has been used by looking at the first 2 to 4 bytes, the byte order mark. (If they are all regular ASCII characters then it's probably an old-style normal text file.) You need to use the correct procedure to unscramble stuff. From the look of the output, I guess you may have a file in the simple format which just uses 2 bytes for each character. Ignoring alternate bytes will decode it, as long as the text contains only ASCII characters.
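
As a rough sketch of that check (the class `BomSniffer` and its methods are hypothetical names, not part of any library), the leading bytes can be compared against the well-known byte order marks: EF BB BF for UTF-8, FE FF for UTF-16 big-endian and FF FE for UTF-16 little-endian. The sketch deliberately ignores UTF-32, whose little-endian mark also starts with FF FE, and a file without a mark simply yields no answer.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {

    // Returns the charset suggested by a byte order mark at the start of
    // the data, or null if none of the three marks below is present.
    static Charset detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    // True if data begins with the given unsigned byte values.
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up sample: FF FE followed by "H" in UTF-16LE.
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 'H', 0};
        System.out.println(detectBom(sample));
    }
}
```

Java's own "UTF-16" charset already looks for FE FF or FF FE when decoding and falls back to big-endian if neither is there, so a manual check like this is mainly useful when the files may arrive in more than one encoding.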

What you should do next depends on where you will accept your text files from. Windows Notepad uses a different Unicode encoding, for example.

frag · Aug 22, 2005

You guys were all right!

It is indeed a Unicode-encoded file, and the code timw posted (using UTF-16) works.

My first mistake was that I thought it was in EBCDIC. My second mistake was that I did not know that StringTokenizer does not return an empty value when it hits two delimiters in a row; it just ignores the empty field and jumps to the next value. That's why I am now using the "new" String method "split", which returns empty fields.
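
For anyone who runs into the same thing, a minimal sketch of the difference (the class and method names are made up for the example, and the sample line is invented):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class EmptyFieldDemo {

    // StringTokenizer treats consecutive delimiters as one separator,
    // so empty fields are silently skipped.
    static List<String> withTokenizer(String line) {
        List<String> fields = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, "\t");
        while (st.hasMoreTokens()) {
            fields.add(st.nextToken());
        }
        return fields;
    }

    // String.split keeps empty fields between delimiters. The limit of -1
    // is an addition for this sketch: it also keeps empty fields at the
    // end of the line, which split("\t") alone would drop.
    static String[] withSplit(String line) {
        return line.split("\t", -1);
    }

    public static void main(String[] args) {
        String line = "a\t\tc"; // three fields, the middle one empty

        System.out.println(withTokenizer(line).size()); // 2 fields
        System.out.println(withSplit(line).length);     // 3 fields
    }
}
```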

real_firestorm@gmx.de

timw · Aug 22, 2005

frag said:
timw's posted code is working (UTF-16)

Glad it worked [wink]

Tim
---------------------------
"Your morbid fear of losing,
destroys the lives you're using." - Ozzy
