Open File as Unicode (UTF-16) and Save as UTF-8

Status
Not open for further replies.

Swi

Programmer
Feb 4, 2002
1,966
US
Hi,

Are there any drawbacks to opening an incoming file as Unicode (UTF-16) and saving it as UTF-8 with the FSO? Would it cause any conversion issues?

Thanks.

Code:
Set InStream = fso.OpenTextFile(ProcessFolder & U_pageflexDataFile, ForReading)
intAsc1Chr = Asc(InStream.Read(1))
intAsc2Chr = Asc(InStream.Read(1))
InStream.Close
If intAsc1Chr = 255 And intAsc2Chr = 254 Then
   OpenAsUnicode = True
Else
   OpenAsUnicode = False
End If
If OpenAsUnicode = True Then
   Set OutStream = fso.CreateTextFile("C:\Swiler\D-AetMMS_9-00131896_00001-rev.csv", True, False)
   Set InStream = fso.OpenTextFile("C:\Swiler\Old.csv", ForReading, False, TristateTrue)
   OutStream.Write InStream.ReadAll
   OutStream.Close
   Set InStream = fso.OpenTextFile(ProcessFolder & U_pageflexDataFile, ForReading, False, TristateTrue)
   errLog = errLog & "Invalid data file format for <strong>" & U_pageflexDataFile & "</strong>, file is UTF-16 and should be UTF-8, please check the PageFlex integration<BR>"
Else
   Set InStream = fso.OpenTextFile(ProcessFolder & U_pageflexDataFile, ForReading)
End If

Swi
 
Erm... the FSO doesn't do UTF-8 (except in the limited case where you only want to write ASCII characters). Have a look at ADODB.Stream instead.
 
So something like this? Is this the best method to detect UTF-16 and convert to UTF-8?

Code:
Set InStream = fso.OpenTextFile(ProcessFolder & U_pageflexDataFile, ForReading)
intAsc1Chr = Asc(InStream.Read(1))
intAsc2Chr = Asc(InStream.Read(1))
InStream.Close
If intAsc1Chr = 255 And intAsc2Chr = 254 Then
   OpenAsUnicode = True
Else
   OpenAsUnicode = False
End If
If OpenAsUnicode = True Then
   Set InStream = fso.OpenTextFile("C:\Swiler\Old.csv", ForReading, False, TristateTrue)
   strText = InStream.ReadAll
   Const adTypeText = 2
   Const adSaveCreateOverWrite = 2

   With CreateObject("ADODB.Stream")
       .Type = adTypeText
       .Charset = "utf-8"
       .Open
       .WriteText strText
       .SaveToFile "C:\Swiler\New.csv", adSaveCreateOverWrite 
   End With

Else
   Set InStream = fso.OpenTextFile(ProcessFolder & U_pageflexDataFile, ForReading)
End If


Swi
 
Additional question. As a test I have a UTF-8 csv and a UTF-16 csv file.

FSO seems to read the UTF-8 encoded file (I checked the encoding in Notepad++). So am I to assume that FSO automatically converts it to ASCII characters when it reads UTF-8? When it reads a UTF-8 file with a BOM, it seems to open the file, but I get stray characters at the beginning because of the BOM. For UTF-16 I definitely need TristateTrue for FSO to read the file successfully. I just want to fully understand. Thanks.

Code:
Dim fso As New FileSystemObject
Dim UTF8Instream As TextStream
Dim UTF16Instream As TextStream

Set UTF8Instream = fso.OpenTextFile("C:\Swiler\UTF-8.csv", ForReading, False)
Set UTF16Instream = fso.OpenTextFile("C:\Swiler\UTF-16.csv", ForReading, False, TristateTrue)

MsgBox UTF8Instream.ReadAll
MsgBox UTF16Instream.ReadAll

UTF8Instream.Close
UTF16Instream.Close

Set fso = Nothing

Swi
 
1) The Unicode standard explicitly advises against using a BOM with UTF-8:

"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM"

2) UTF-8 is 100% backwards compatible with ASCII - for Unicode code points 0 to 127 an ASCII file and a UTF-8 file are indistinguishable. FSO assumes that it is reading ASCII* by default, and makes no attempt to look for or read a BOM.

*More accurately ANSI, which is also backwards compatible with ASCII
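
To illustrate the point about FSO's default, here is a minimal sketch of reading a file as UTF-8 by telling ADODB.Stream the charset explicitly, rather than letting the bytes be interpreted as ANSI. ReadUtf8File is a hypothetical helper name, not something from this thread, and the sketch assumes the file really is UTF-8.

```vbscript
' ReadUtf8File is a hypothetical helper name used for illustration only.
' Assumes the file at strPath is UTF-8 encoded.
Function ReadUtf8File(ByVal strPath)
    Const adTypeText = 2
    Dim stm

    Set stm = CreateObject("ADODB.Stream")
    stm.Type = adTypeText
    stm.Charset = "utf-8"          ' decode the bytes as UTF-8, not as the ANSI code page
    stm.Open
    stm.LoadFromFile strPath
    ReadUtf8File = stm.ReadText    ' no argument: reads the whole stream
    stm.Close
End Function
```

It could be called as strText = ReadUtf8File("C:\Swiler\UTF-8.csv"), using the test file mentioned above.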
 
Here's some code that may be of help (assumes you have a reference to ADODB)

Code:
Public Sub example()
    Dim instream As New ADODB.Stream
    Dim outstream As New ADODB.Stream
    Dim outstreamnobom As New ADODB.Stream

    instream.Open
    instream.LoadFromFile ("f:\deleteme\utf16.txt") ' figures out what kind of ASCII/ANSI/Unicode source file this is (not 100% perfect when dealing with code pages)
    
    
    outstream.Open
    outstream.Charset = "UTF-8" ' will result in a UTF-8 stream with a BOM

    instream.CopyTo outstream
    outstream.SaveToFile "f:\deleteme\utf8bom.txt", adSaveCreateOverWrite
    
    outstreamnobom.Type = adTypeBinary
    outstreamnobom.Open
    outstream.Position = 3 ' Move start to end of BOM
    outstream.CopyTo outstreamnobom
    outstreamnobom.SaveToFile "f:\deleteme\utf8nobom.txt"
    
End Sub
 
Thank you so much for the feedback, strongm. It makes a lot of sense that FSO assumes ASCII and therefore reads the UTF-8 file, but gives garbage at the beginning of a UTF-8 file that has a BOM.

Again, I appreciate the information.

This one is kind of difficult as users seem to be uploading many different types of file encodings. Thanks again.

Swi
 
strongm, worked great but I literally had to tell it whether it was UTF-16 or not. Otherwise it was not converting it properly.

Thanks.

Code:
Set Instream = fso.OpenTextFile(ProcessFolder & U_pageflexDataFile, ForReading)
intAsc1Chr = Asc(Instream.Read(1))
intAsc2Chr = Asc(Instream.Read(1))
Instream.Close
If intAsc1Chr = 255 And intAsc2Chr = 254 Then
    OpenAsUnicode = True
Else
    OpenAsUnicode = False
End If
If OpenAsUnicode Then
    ' Deal with UTF-8, UTF-8 BOM, UTF-16 and UTF-16 BOM and convert to UTF-8
    If fso.FileExists(UTFBackupFolder & U_pageflexDataFile) Then fso.DeleteFile UTFBackupFolder & U_pageflexDataFile, True
    fso.CopyFile ProcessFolder & U_pageflexDataFile, UTFBackupFolder & U_pageflexDataFile
    ADOInStream.Open
    ADOInStream.LoadFromFile (ProcessFolder & U_pageflexDataFile) ' figures out what kind of ASCII/ANSI/Unicode source file this is (not 100% perfect when dealing with code pages)
    ADOOutStream.Open
    ADOOutStream.Charset = "UTF-8" ' will result in a UTF-8 stream with a BOM
    ADOInStream.CopyTo ADOOutStream
    ADOOutStream.SaveToFile ProcessFolder & U_pageflexDataFile, adSaveCreateOverWrite
    ADOOutStreamNoBOM.Type = adTypeBinary
    ADOOutStreamNoBOM.Open
    ADOOutStream.Position = 3 ' Move start to end of BOM
    ADOOutStream.CopyTo ADOOutStreamNoBOM
    ADOOutStreamNoBOM.SaveToFile ProcessFolder & U_pageflexDataFile, adSaveCreateOverWrite
    ADOOutStream.Close
    ADOOutStreamNoBOM.Close
End If

Swi
 
>not converting it properly

Yeah, as I kind of hinted at in one of my REMarks, the built-in detection is not perfect, particularly if the input is UTF-8

(And I should probably point out here that you only seem to be checking for a UTF-16LE BOM; the UTF-16BE BOM has the bytes the other way around. And for completeness' sake, the UTF-8 BOM is EF BB BF.)
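
As a rough sketch of checking for all three BOMs mentioned here, the function below reads the first bytes of the file in binary mode through ADODB.Stream and compares them against FF FE, FE FF and EF BB BF. DetectBom and its return labels are hypothetical names chosen for illustration, and it is written as VBScript. A file with no BOM can still be UTF-8 or UTF-16, so this only answers "is there a BOM?", not "what is the encoding?".

```vbscript
' DetectBom is a hypothetical helper, not code from this thread.
' It returns a descriptive label only; the labels are not ADODB Charset names.
Function DetectBom(ByVal strPath)
    Const adTypeBinary = 1
    Dim stm, varBytes, b1, b2, b3

    b1 = -1: b2 = -1: b3 = -1          ' -1 means "byte not present"

    Set stm = CreateObject("ADODB.Stream")
    stm.Type = adTypeBinary
    stm.Open
    stm.LoadFromFile strPath           ' loads the whole file; acceptable for a sketch
    If stm.Size >= 2 Then
        varBytes = stm.Read(3)         ' returns up to three bytes
        b1 = AscB(MidB(varBytes, 1, 1))
        b2 = AscB(MidB(varBytes, 2, 1))
        If stm.Size >= 3 Then b3 = AscB(MidB(varBytes, 3, 1))
    End If
    stm.Close

    If b1 = 239 And b2 = 187 And b3 = 191 Then      ' EF BB BF
        DetectBom = "UTF-8 BOM"
    ElseIf b1 = 255 And b2 = 254 Then               ' FF FE
        DetectBom = "UTF-16LE BOM"
    ElseIf b1 = 254 And b2 = 255 Then               ' FE FF
        DetectBom = "UTF-16BE BOM"
    Else
        DetectBom = "no BOM found"
    End If
End Function
```

Reading the bytes in binary mode avoids depending on how Asc maps characters in the current code page, which the FSO-based check relies on.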
 
Thanks for the information.

Swi
 
strongm,

Sorry, I have been on other projects. So, to wrap this up: to check for both the UTF-16LE BOM and the UTF-16BE BOM, I would do the following?

Code:
If intAsc1Chr = 255 And intAsc2Chr = 254 Then ' Check for UTF-16LE BOM
  OpenAsUnicode = True
ElseIf intAsc1Chr = 254 And intAsc2Chr = 255 Then ' Check for UTF-16BE BOM
  OpenAsUnicode = True
Else
  OpenAsUnicode = False
End If

Thanks.

Swi
 
Yep, pretty much (although I'd be tempted to go with an If ... Or ... construct, but that's just me ...)
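
For reference, the If ... Or ... form could look something like the sketch below. The variable names are taken from the code above; the two sample values are placeholders standing in for the bytes read from the file.

```vbscript
Dim intAsc1Chr, intAsc2Chr, OpenAsUnicode

' Sample values only; in the real code these come from the first two bytes of the file
intAsc1Chr = 254
intAsc2Chr = 255

' UTF-16LE BOM is FF FE (255, 254); UTF-16BE BOM is FE FF (254, 255)
If (intAsc1Chr = 255 And intAsc2Chr = 254) Or _
   (intAsc1Chr = 254 And intAsc2Chr = 255) Then
    OpenAsUnicode = True
Else
    OpenAsUnicode = False
End If
```

Behaviour is the same as the If ... ElseIf version; it just states both BOM checks in one condition.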
 