Strange special characters in textfile

Status
Not open for further replies.

frag

Programmer
Dec 7, 2000
321
GB
Hi all,

I have a big problem with a small textfile. :(

I read a textfile that was downloaded from a webserver... like this:

Code:
String line;
FileInputStream fin =  new FileInputStream(strImportFile);
BufferedReader myInput = new BufferedReader(new InputStreamReader(fin));

try
{
    // read file (line for line)
    while ((line = myInput.readLine()) != null)
    {  
      System.out.println(line);
    }
}
.
.
.


The output (in Eclipse) looks like this:

Code:
.H.e.l.l.o.

The "." are not really dots; they are just displayed as small dots in the Eclipse console. I would not care about this, but when I get the length of the string it is not 5 (as expected), it is 11!

I have already tried to convert the file with unix2dos (I am working with Windows 2000), but this will just fix the carriage-return/linefeed problem.


Does anyone have a clue what the problem is with this file??

I could filter these chars, but I don't know what kind of special character is used in this file. I also wonder where this stuff comes from.

If you want to reproduce my problem just download the file here: The file I am talking about is the file under "All eligible assets", "Selection", "Full Database", "Uncompressed"... uhm... to make it easier... it is the file in the upper right corner of the page.

I appreciate any comment about this issue!

Cheers
frag

real_firestorm@gmx.de
 
Ok, solved this problem on my own... the file contains NULL-characters. I now just filter them with:

Code:
// Strings are immutable, so the result has to be assigned back
line = line.replaceAll("\\x00", "");

Perhaps this might be handy for someone else.

Sorry for double post...

frag

real_firestorm@gmx.de
 
Perhaps the text file is using unicode encoding. Each character would then use two bytes, one of which is zero for the regular 8-bit character sets.
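
To illustrate the point about two bytes per character, here is a minimal sketch (not the original poster's code). It assumes Java 7 or later for `StandardCharsets`, and the class and method names are made up for the example. It encodes "Hello" as UTF-16 and then decodes the bytes with a single-byte charset, which is roughly what happens when the reader is given the wrong encoding. It does not try to reproduce the exact length of 11 seen in the thread, since that depends on what else is in the file.

```java
import java.nio.charset.StandardCharsets;

class MisdecodedUtf16Demo {
    // Encode the text as UTF-16 (big-endian, no byte order mark), then decode
    // those bytes with a single-byte charset, as a reader using the wrong
    // encoding would do.
    static String misdecode(String text) {
        byte[] utf16Bytes = text.getBytes(StandardCharsets.UTF_16BE);
        return new String(utf16Bytes, StandardCharsets.ISO_8859_1);
    }

    // Decode the same bytes with the matching charset.
    static String decodeCorrectly(String text) {
        byte[] utf16Bytes = text.getBytes(StandardCharsets.UTF_16BE);
        return new String(utf16Bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        String wrong = misdecode("Hello");
        System.out.println(wrong.length());          // prints 10: one char per byte
        System.out.println((int) wrong.charAt(0));   // prints 0: the zero high byte
        System.out.println(wrong.charAt(1));         // prints H
        System.out.println(decodeCorrectly("Hello")); // prints Hello
    }
}
```

Every second character of the misdecoded string is the zero byte, which is why the console shows a placeholder between the letters.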

Tim
---------------------------
"Your morbid fear of losing,
destroys the lives you're using." - Ozzy
 
Maybe if you tried
Code:
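String line;
try
{
    // decode the bytes as UTF-16 instead of the platform default charset
    FileInputStream fin = new FileInputStream(strImportFile);
    BufferedReader myInput = new BufferedReader(new InputStreamReader(fin, "UTF-16"));

    // read file (line for line)
    while ((line = myInput.readLine()) != null)
    {
      System.out.println(line);
    }
    myInput.close();
}
catch (IOException e)
{
    e.printStackTrace();
}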

This might deal with it more correctly than replacing zero chars.

Tim
---------------------------
"Your morbid fear of losing,
destroys the lives you're using." - Ozzy
 
Hi timw,

thank you for the tip. But it won't do the trick for me in this case.

The file has tab-separated values. If I use UTF-16, the "empty" fields (two tabs in a row) disappear during the read process.

This file seems to come from a host-system with strange encoding... anyhow my solution is working for this file.

Cheers
frag

real_firestorm@gmx.de
 
I think you've most likely come across a "Microsoft Text File". I did myself last week. Text files used to be sequences of ASCII characters. No longer. Now we have Unicode. There are several ways of representing Unicode characters. You can often tell which way has been used by looking at the first 2 to 4 bytes. (If they are all regular ASCII characters then it's probably an old-style normal text file.) You need to use the correct procedure to unscramble stuff. From the look of the output, I guess you may have a file in the simple format which just uses 2 bytes for each character. For plain ASCII text, ignoring alternate bytes will decode it.

What you should do next depends on where you will accept your text files from. Windows Notepad, for example, can save in several different Unicode encodings.
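
A rough sketch of the "look at the first few bytes" idea, for anyone who wants to try it. The class and method names are hypothetical, and the returned strings are only labels for this example. It checks for the well-known byte order marks; a file without one returns null and has to be identified some other way.

```java
class BomSniffer {
    // Returns an encoding label suggested by a leading byte order mark,
    // or null if no known mark is found.
    static String detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        // The UTF-32LE mark begins with the UTF-16LE mark, so test it first.
        // (FF FE 00 00 could also be UTF-16LE text starting with a NUL char.)
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        return null;
    }

    // True if data begins with the given unsigned byte values.
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf16be = { (byte) 0xFE, (byte) 0xFF, 0x00, 0x48 };
        byte[] plain = { 0x48, 0x65, 0x6C, 0x6C };
        System.out.println(detect(utf16be)); // prints UTF-16BE
        System.out.println(detect(plain));   // prints null
    }
}
```

In practice you would read the first four bytes of the file into the array and then pick the charset for the `InputStreamReader` from the result.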
 
You guys were all right!

It is indeed a Unicode-encoded file, and timw's posted code is working (UTF-16).

My first mistake was that I thought it was EBCDIC. My second mistake was that I did not know that StringTokenizer does not return an empty value when it hits two delimiters in a row... it just ignores the empty field and jumps to the next value. That's why I am now using the "new" String method "split", which returns empty fields.
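
For anyone hitting the same thing, a small sketch of the difference. The class and method names are made up for the example, and the sample line is not from the real file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class EmptyFieldDemo {
    // StringTokenizer treats a run of delimiters as one separator,
    // so empty fields are silently skipped.
    static List<String> withTokenizer(String line) {
        List<String> fields = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, "\t");
        while (st.hasMoreTokens()) {
            fields.add(st.nextToken());
        }
        return fields;
    }

    // String.split keeps empty fields between delimiters; the limit of -1
    // also keeps empty fields at the end of the line.
    static String[] withSplit(String line) {
        return line.split("\t", -1);
    }

    public static void main(String[] args) {
        String line = "a\t\tc"; // three fields, the middle one is empty
        System.out.println(withTokenizer(line).size()); // prints 2
        System.out.println(withSplit(line).length);     // prints 3
    }
}
```

Note that split without the second argument drops trailing empty fields, so a line ending in tabs would still lose columns.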


real_firestorm@gmx.de
 
frag said:
timw's posted code is working (UTF-16)

Glad it worked [wink]

Tim
---------------------------
"Your morbid fear of losing,
destroys the lives you're using." - Ozzy
 