Change encoding ANSI->UTF8


xisabella (Programmer)
Aug 25, 2002

Hi,

I have been looking for a way to change the encoding of a text file from ANSI to UTF-8 using VBS. Does anyone have any ideas?

Thanks!

Michelle


 
One easy way to do this is via an ADO Stream object.

You need to be very precise about what you mean by "ANSI" though. A Windows user typically means "extended ANSI": the 7-bit ASCII character set plus one of the extensions that add values 128-255. Most likely this would be the accented-Latin codepage, Windows-1252. For example, "é" is the single byte E9 in codepage 1252 but the two-byte sequence C3 A9 in UTF-8.

The ADO Stream object can accommodate a wide range of text encodings, using pretty much every value found under:

[tt]HKEY_CLASSES_ROOT\MIME\Database\Charset[/tt]

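If you want to see which names are valid on your machine, here is a minimal sketch that lists them via WMI's StdRegProv (assuming WMI is available, as it is on any stock Windows install):

Code:
Const HKEY_CLASSES_ROOT = &H80000000

Dim oReg, arrNames, strName
Set oReg = GetObject("winmgmts:{impersonationLevel=impersonate}!\\.\root\default:StdRegProv")
oReg.EnumKey HKEY_CLASSES_ROOT, "MIME\Database\Charset", arrNames

'Each subkey name is a value you can assign to Stream.Charset.
For Each strName In arrNames
  WScript.Echo strName
Next
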
Here is an example, written as a WSH script:

ANSI to UTF8.wsf
Code:
<job>
  <reference object="ADODB.Stream" />
  <object id="stmANSI" progId="ADODB.Stream" />
  <object id="stmUTF8" progId="ADODB.Stream" />
  <script language="VBScript">
    Option Explicit

    With stmANSI
      .Open
      .LoadFromFile "ANSI.txt"
      .Type = adTypeText
      .LineSeparator = adCRLF
      .Charset = "x-ansi" 'Same as Windows-1252.
      With stmUTF8
        .Open
        .Type = adTypeText
        .LineSeparator = adLF
        .Charset = "UTF-8"
        .WriteText stmANSI.ReadText(adReadAll)
        .SaveToFile "UTF8.txt", adSaveCreateOverWrite
        .Close
      End With
      .Close
    End With

    MsgBox "Done", vbOkOnly, "Conversion Complete!"
  </script>
</job>
If you are programming for some other script host such as classic ASP, or you prefer 1990s-style "naked" .vbs files instead of a .wsf, you will have to create instances of the Stream object on the fly via [tt]CreateObject()[/tt] calls and supply the ADO constants manually.

The constants needed can be found at Stream Object Properties, Methods, and Events.
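For instance, a "naked" .vbs version of the script above might look like this (the Const values come straight from the ADO type library; the file names are just placeholders):

Code:
Option Explicit

'ADO constants, normally supplied by the .wsf type library reference.
Const adTypeText = 2
Const adReadAll = -1
Const adSaveCreateOverWrite = 2

Dim stmANSI, stmUTF8
Set stmANSI = CreateObject("ADODB.Stream")
Set stmUTF8 = CreateObject("ADODB.Stream")

With stmANSI
  .Open
  .LoadFromFile "ANSI.txt"
  .Type = adTypeText
  .Charset = "x-ansi" 'Same as Windows-1252.
  With stmUTF8
    .Open
    .Type = adTypeText
    .Charset = "UTF-8"
    .WriteText stmANSI.ReadText(adReadAll)
    .SaveToFile "UTF8.txt", adSaveCreateOverWrite
    .Close
  End With
  .Close
End With

MsgBox "Done", vbOkOnly, "Conversion Complete!"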
 
Star for dilettante!

This thread just saved my day. I've been trying to export a text file from DTS as UTF-8 with no luck thus far. I'd searched all over the web for ideas, then figured I'd try the trusty Tek-Tips. I found this thread and now my DTS package is a success.

Thanks dilettante.
 
I do have a question about this that I hope you can help me with, now that I've had a chance to see how it reacts with the target system.

When the file is translated from ANSI into UTF-8, it adds a space between every character. Do you know of any way to prevent this? Apparently, the system that I'm uploading my file to does not recognize the spaces and rejects the file.

What I'm sending them is a delimited text file, with a double pipe as the delimiter. I'm receiving errors stating that my column length is not as expected, which leads me to believe it is no longer seeing the pipes as double pipes, but as pipe-space-pipe.

Also, I've got another issue. When I submit this file immediately after generation, the submission fails, stating that I "can't submit a binary file (ex. Word, Excel, etc.)" and that it must be UTF-8. Yet when I open the file in Notepad and click Save As..., the Encoding defaults to UTF-8, as it should, since it was created as UTF-8. If I continue with the Save As as UTF-8 and replace the existing file, it submits with no problems. Any ideas why this may be happening?
 
When I read your comments above I looked over my example and I thought "Wow! And nobody else has called me out on those gaffes before now?"

Let's see if I can unravel some of the mess I left here.


To begin with, the code I showed above is slightly buggy. The Stream's LineSeparator property is meaningless unless you are using line-oriented operations, so to translate the line ends you are forced to loop through the data line by line.

The second issue is that the Stream object likes to use Byte Order Marks (BOMs) in Text type Streams. If you don't want the BOM you have to go through an extra step in Binary mode to skip over it. A UTF-8 BOM is 3 bytes long, preceding the body text of the file.

Many systems that process UTF-8 text don't look for the optional BOM, and may gag on it. Notepad is Unicode-aware and can be used to do several encoding translations: when opening a file it respects any BOM it encounters, and when saving as UTF-8 it writes a BOM of its own.
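
For reference, here is a minimal sketch that tests whether a file already starts with the 3-byte UTF-8 BOM (EF BB BF); the file name is just a placeholder:

Code:
Const adTypeBinary = 1

Dim stm, bytes, hasBOM
Set stm = CreateObject("ADODB.Stream")
stm.Open
stm.Type = adTypeBinary
stm.LoadFromFile "UTF8.txt"
bytes = stm.Read(3) 'First three bytes, as a byte array.
stm.Close

hasBOM = False
If Not IsNull(bytes) Then
  If LenB(bytes) = 3 Then
    'MidB/AscB operate byte-wise on the array returned by Read.
    hasBOM = AscB(MidB(bytes, 1, 1)) = &HEF _
         And AscB(MidB(bytes, 2, 1)) = &HBB _
         And AscB(MidB(bytes, 3, 1)) = &HBF
  End If
End If

WScript.Echo "UTF-8 BOM present: " & CStr(hasBOM)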

This gets us the following, somewhat more generalized example:

ANSI to UTF-8.wsf
Code:
<job>
  <reference object="ADODB.Stream" />
  <object id="stmANSI" progId="ADODB.Stream" />
  <object id="stmUTF8" progId="ADODB.Stream" />
  <object id="stmNoBOM" progid="ADODB.Stream" />
  <script language="VBScript">
    Option Explicit

    With stmANSI
      .Open
      .Type = adTypeBinary
      .LoadFromFile "ANSI.txt"
      .Type = adTypeText
      .LineSeparator = adCRLF
      .Charset = "x-ansi" 'Same as Windows-1252.
      With stmUTF8
        .Open
        .Type = adTypeText
        .LineSeparator = adLF
        .Charset = "UTF-8"

        'Use Line operations to obtain LineSeparator action.
        Do Until stmANSI.EOS
          .WriteText stmANSI.ReadText(adReadLine), adWriteLine
        Loop
        stmANSI.Close

        'Skip over UTF-8 BOM.
        .Position = 0 'Enable Type change.
        .Type = adTypeBinary
        .Position = 3 'UTF-8 BOM is in positions 0 through 2.

        With stmNoBOM
          .Open
          .Type = adTypeBinary
          stmUTF8.CopyTo stmNoBOM
          stmUTF8.Close

          .SaveToFile "UTF8.txt", adSaveCreateOverWrite
          .Close
        End With
      End With
    End With

    MsgBox "Done", vbOkOnly, "Conversion Complete!"
  </script>
</job>
The astute observer will note that the original example could simply have used the CopyTo method instead of feeding the result of ReadText to WriteText. That is just a small optimization though; all of the other flaws would remain.
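
In other words, the whole-stream copy in the first example could have been written as (a sketch using the same stream names):

Code:
'Let ADO convert between the two Charsets internally, without
'round-tripping the text through a VBScript string.
stmANSI.CopyTo stmUTF8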


Now this should work great... if your input file is truly ANSI and not "Windows Unicode" (which is little-endian UTF-16)! In UTF-16LE every ASCII character is stored as two bytes ("A" becomes 41 00), so a file read as ANSI appears to have an extra character after every letter. Your "adds a space between every character" symptom leads me to suspect exactly that in your situation.

Instead you might want to pattern your script on something like:

Unicode to UTF-8.wsf
Code:
<job>
  <reference object="ADODB.Stream" />
  <object id="stmUnicode" progId="ADODB.Stream" />
  <object id="stmUTF8" progId="ADODB.Stream" />
  <object id="stmNoBOM" progid="ADODB.Stream" />
  <script language="VBScript">
    Option Explicit

    With stmUnicode
      .Open
      .Type = adTypeText
      .LineSeparator = adCRLF
      .Charset = "Unicode"
      .LoadFromFile "Unicode.txt"
      With stmUTF8
        .Open
        .Type = adTypeText
        .LineSeparator = adLF
        .Charset = "UTF-8"

        'Use Line operations to obtain LineSeparator action.
        Do Until stmUnicode.EOS
          .WriteText stmUnicode.ReadText(adReadLine), adWriteLine
        Loop
        stmUnicode.Close

        'Skip over UTF-8 BOM.
        .Position = 0 'Enable Type change.
        .Type = adTypeBinary
        .Position = 3 'UTF-8 BOM is in positions 0 through 2.

        With stmNoBOM
          .Open
          .Type = adTypeBinary
          stmUTF8.CopyTo stmNoBOM
          stmUTF8.Close

          .SaveToFile "UTF8.txt", adSaveCreateOverWrite
          .Close
        End With
      End With
    End With

    MsgBox "Done", vbOkOnly, "Conversion Complete!"
  </script>
</job>

Note that the LoadFromFile and SaveToFile methods always perform a simple binary read (or write). They move the exact binary image between the Stream and the file, with no attempt at translation or formatting, and they neither look for nor create BOMs. A BOM must either be present in the data loaded, or be created (implicitly, according to the Charset value) by WriteText operations on a Text-type Stream where Position = 0.

The CopyTo method likewise works on binary bytes, and it starts copying from the current Position. That is how we are "deleting" the BOM in the two examples here. You can only truncate from the end of a Stream; you can't trim data from the beginning or the middle of a Stream's contents.
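
Truncating from the end, for completeness, is done by positioning the Stream and calling SetEOS. A sketch, assuming stm is an open Binary-type Stream:

Code:
'Discard everything beyond the first 100 bytes.
stm.Position = 100
stm.SetEOS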

Are we confused yet?


I hope this fixes most of the glitches in my original example and also helps address your own situation better. If there are still issues after you try this I'm sure we can try to help you further here.
 
Another star for dilettante!

Thank you for taking the time to write up such an insightful post. I may not understand exactly what is going on at every stage (like the skip-over-BOM part), but it solved my problems, at least at first glance. It resolved both of the questions I asked, so Bravo Zulu to you.

The target system now accepts the file as UTF-8 from initial generation, without having to resave it in Notepad. This was one of my biggest obstacles, as we are trying to use this DTS package in an automated system.

You were right about the file being in Unicode. I probably should've noted before that it was being generated by SQL Server's DTS DataPump. Once I changed the code to work with Unicode, all of the excess spaces in the file disappeared.

Hopefully the target system will like the changes. It doesn't load the files but once a day, after midnight, so I won't know how successful it really is until Monday.

Either way, from what I can tell, if it does encounter problems they should be pretty easy to fix, as it's just a matter of formatting from here, like whether it prefers an LF or a CRLF. We'll see.

Again, Thank you. Your help has been invaluable.

Ryan
 
Good luck with it!

I'm sorry I didn't catch my errors and omissions earlier. I got lucky before, because the BOM prefix and CRLFs were not a problem for the software I was using this with.

Copying the data twice (once as text to convert the encoding, a second time as binary to strip the BOM) isn't optimal, but for anything short of huge files it shouldn't be a problem.


Maybe someone else can improve on this or offer some alternative approach?
 