Codepage conversion - HEX value problem 2

MakeItSo · Aug 28, 2008

Hi friends,

Once again, I am stuck with a little problem:

I am trying to create formatted RTFs from UTF-8 encoded text files.
However, as I have to be able to process thousands of files and I want my app to be performant, I am not using the MS Word engine. Instead, I am writing a VB exe which will write an RTF header plus the RTF-encoded text stream directly to a text file.

My problem now lies with characters off my locale codepage, e.g. Russian.

What I have found out so far, is that Word RTFs do not use Unicode but rather an ANSI encoding based on the respective codepage; so the Russian character with unicode value 41F will be encoded as "\'cf" in the RTF, plus an "ansicpg1251" in the RTF header to specify Russian codepage.

Now that's all fine besides one small detail: how the heck can I find out this "cf" value?

I have tried using the ADO.Stream object to convert from UTF-8 to KOI-8 (Russian), but I still cannot get the correct values!

Is there a way to "look up" the respective value of a certain character with respect to a specific codepage?

I know it does not work this way, but I mean something comparable to:

Code:

myHex=Hex$(ASC$(myChar, codepage:="1251"))

Thanks a lot for any hint!

Cheers,
Andy

[navy]"We had to turn off that service to comply with the CDA Bill."[/navy]
- The Bastard Operator From Hell

dilettante · Aug 28, 2008

I'd have to guess that using the Stream approach you'd load the UTF-8 into the Stream, read the result out into a String as UTF-16LE, then clear the Stream and set the codepage Charset and write the String back into the Stream. Finally change Type to adTypeBinary and remove the translated characters as Byte values.

One wrinkle is probably going to be how you write to your RTF output file. You should be doing this by using binary I/O from Byte arrays, not text I/O from String values. Otherwise you're back to working out the implications of all of the implied translations done when you insert into Strings and write Strings as text.

Or see WideCharToMultiByte?
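
On the WideCharToMultiByte route: the function lives in kernel32, so from VB6 it needs a Declare rather than a project reference. What follows is only a minimal sketch, not tested against the poster's files. The helper name ToCodePage is made up for illustration, and it assumes the requested code page is installed on the machine. It converts a VB String (UTF-16) to bytes in a given Windows code page, which is one way to "look up" the byte for a character.

```vb
Option Explicit

' kernel32 API, declared with Long pointers so StrPtr/VarPtr can be passed.
Private Declare Function WideCharToMultiByte Lib "kernel32" ( _
    ByVal CodePage As Long, ByVal dwFlags As Long, _
    ByVal lpWideCharStr As Long, ByVal cchWideChar As Long, _
    ByVal lpMultiByteStr As Long, ByVal cbMultiByte As Long, _
    ByVal lpDefaultChar As Long, ByVal lpUsedDefaultChar As Long) As Long

' Hypothetical helper: returns Text encoded in the given code page.
Private Function ToCodePage(ByVal Text As String, ByVal CodePage As Long) As Byte()
    Dim cb As Long
    Dim buf() As Byte

    If Len(Text) = 0 Then Exit Function     ' returns an unallocated array

    ' First call: ask how many bytes the result needs.
    cb = WideCharToMultiByte(CodePage, 0, StrPtr(Text), Len(Text), 0, 0, 0, 0)
    If cb = 0 Then Err.Raise vbObjectError + 1, , "Conversion failed for code page " & CodePage

    ReDim buf(0 To cb - 1)
    ' Second call: convert into the buffer.
    WideCharToMultiByte CodePage, 0, StrPtr(Text), Len(Text), VarPtr(buf(0)), cb, 0, 0
    ToCodePage = buf
End Function

Private Sub Demo()
    Dim b() As Byte
    b = ToCodePage(ChrW$(&H41F), 1251)          ' the character from the first post
    Debug.Print Right$("0" & Hex$(b(0)), 2)     ' the hex pair to put after \'
End Sub
```

The two-call pattern (size first, then convert) avoids guessing the buffer length, which matters for multibyte code pages where one character can take more than one byte.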

MakeItSo · Aug 29, 2008

Hi dilettante,

I do load the UTF-8 into the Stream and read the result into a string, although I have no idea what you mean by "into a String as UTF-16LE".

I write this string into another stream with Codepage set to 1251 (=KOI-8, =Russian).

But even reading the result back in with codepage 1251 and then reading out the hex value will not give me the same result that is used in the RTF.
???

Also please explain what you mean by "remove the translated characters as Byte values."

So far I have not worked with byte arrays yet, so everything you said from this line on is like Chinese to me.

Also: how would WideCharToMultiByte help me - and above all: how can I use it at all? Which library do I have to reference for that? MSDN fails to tell...

Thanks a lot!
MiS


MakeItSo · Aug 29, 2008

I think there is an easy solution to this:
I could use a richtextbox, set its charset property, read the file contents into it and then save the richtext contents to a file.
Problem now is only that it won't read the contents properly, I only get question marks all the way.
(Real question marks, not just a display problem).

This is how I attempt that:

Code:

Dim ts As ADODB.Stream, ts2 As ADODB.Stream

Set ts = New ADODB.Stream
    With ts
        .Charset = "utf-8"
        .Type = adTypeBinary
        .Open
        .Type = adTypeText
        .LoadFromFile List1.List(i)
        RichTextBox1.TextRTF = .ReadText(adReadAll)
        RichTextBox1.SaveFile "c:\tmp.rtf"
    .Close
    End With

The RichtextBox charset is set to 204, which is Russian.

I thought it might be the Unicode stream which the control does not like, so I took another approach:

Code:

Set ts = New ADODB.Stream
    With ts
        .Charset = "utf-8"
        .Type = adTypeBinary
        .Open
        .Type = adTypeText
        .LoadFromFile List1.List(i)
    End With
Set ts2 = New ADODB.Stream
    With ts2
        .Charset = AdoCPs(Langmrk)
        .Type = adTypeBinary
        .Open
        .Type = adTypeText
        .WriteText ts.ReadText(adReadAll)
        .SaveToFile "C:\tmp.txt"
        ts.Close
        RichTextBox1.TextRTF = .ReadText(adReadAll)
        RichTextBox1.SaveFile "c:\tmp.rtf"
    .Close
    End With

With "Langmrk" being the ListIndex of a combobox from which I choose the language, and AdoCPs being the respective codepage identifiers for the ADO Stream:

Code:

AdoCPs = Array("iso-8859-1", "iso-8859-4", [b]"koi8-r"[/b], "gb2312", "big5", "shift-jis")

So I read UTF-8, write to Windows-1251 encoding, read the 1251 and try to write it into the RichTextBox.

Won't work yet.

Where am I going wrong?

Thanks!
miS


BobRodes · Aug 29, 2008

<Word RTFs do not use Unicode but rather an ANSI encoding based on the respective codepage

I believe that's only the case with RTF 1.0. If you're using the VB Rich Text control to read your directly written text file, you won't get Unicode support. See

http://blogs.msdn.com/murrays/default.aspx?p=3

(scroll past the math stuff at the beginning) for a good summary of RTF history and versions.

Then, perhaps you'll have a look at thread222-1383606 (strongm's comments about halfway down) to see why strongm has been "banging away about" TOM for the last while. You should be able to use the techniques in here to access later versions of RTF that have unicode support, which versions are not directly available in VB6. I would think that this would represent a more elegant solution to your requirement than attempting to work around the lack of support for it in earlier versions.

HTH

Bob

dilettante · Aug 30, 2008

The more recent comments above shed some doubt on your approach. However if your "ANSI" RTF files actually work and you proceed in this manner you might still want to explore translating your input text.

Since neither VB6 native I/O nor FSO I/O handles UTF-8 anyway, you'll still need to do a conversion in any case. VB6 was a transitional product, and contains many vestiges left over from the transition of Windows from its DOS ANSI roots to the "Windows Unicode" world. VB6's Strings are normally encoded as "Windows Unicode" which is variously termed UCS-2 or UTF-16LE, etc. The rules are a little funkier under Win9x, but under an NT 4.0 or later Windows they are fairly stable.

In a great many places VB6 will translate String values from Windows Unicode to/from the current 8-bit codepage. One of these places is in the process of native I/O operations against text (non-binary) files.

The ADO Stream object handles text translation in a similar manner, using a selected Charset MIME type to allow you to choose the codepage or encoding type. This gives it a broader range of encoding conversions than VB6 offers alone, including multibyte encodings.

However when Type = adTypeText the WriteText() method expects a Unicode (UTF-16LE) String value, and the ReadText() method returns a UTF-16LE String value. Always. No exceptions.

To extract encoded data back out of the Stream you must set Type = adTypeBinary and use the Read() method. This returns a Byte array.

I have an example that appears to work properly. The biggest headache here was munging up a UTF-8 Russian sample file. ;-)

Code:

Option Explicit

Private Function To_KOI8(ByVal File As String) As Byte()
    Dim stmXlate As ADODB.Stream
    Dim strText As String
    Dim bytKOI8() As Byte

    Set stmXlate = New ADODB.Stream
    With stmXlate
        .Open
        
        'Get UTF-8 Russian text as UTF-16LE VB String.
        .Type = adTypeText
        .Charset = "utf-8"
        .LoadFromFile File
        strText = .ReadText(adReadAll)
        
        'Empty Stream.
        .Position = 0
        .SetEOS
        
        'Translate UTF-16LE Russian text to KOI-8.
        .Charset = "koi8-r"
        .WriteText strText, adWriteChar
        
        strText = ""
        
        'Fetch KOI-8 text as Byte array.
        .Position = 0
        .Type = adTypeBinary
        bytKOI8 = .Read(adReadAll)
        
        .Close
    End With
    To_KOI8 = bytKOI8
End Function

Private Sub Form_Load()
    Dim bytKOI8() As Byte
    Dim lngMax As Long, lngRow As Long, lngCol As Long
    Dim bytChar As Byte
    Const COLS = 20
    
    bytKOI8 = To_KOI8("UTF-8-KOI-8.txt")
    
    'Dump KOI-8 values in hex for examination.
    lngMax = UBound(bytKOI8)
    Do
        With rtbDump 'MultiLine RichTextBox, monospaced font.
            'Dump next Byte.
            bytChar = bytKOI8(COLS * lngRow + lngCol)
            .SelStart = Len(.Text)
            'Highlight digits, punctuation, etc.
            If bytChar < &H80 Then .SelColor = vbRed
            .SelText = Right$("0" & Hex$(bytChar), 2)
            
            'Next (and finish formatting Byte).
            lngCol = lngCol + 1
            .SelStart = Len(.Text)
            .SelColor = vbBlack
            If lngCol < COLS Then
                .SelText = " "
            Else
                lngCol = 0
                lngRow = lngRow + 1
                .SelText = vbNewLine
            End If
        End With
    Loop Until COLS * lngRow + lngCol > lngMax
End Sub

Here the file is in the program's current directory but you could supply a fully qualified path.

Note that the result of translation to KOI-8 is a Byte array. There is a reason for this: it has to be.

As far as VB6 is concerned we no longer have String characters, but instead some binary data. While you may get away with coercing it back to a String value and then using native text I/O to coerce it back to the original bit pattern on output... you take a small risk.

Since your current locale and codepage won't change between operations though you may well luck out. I.e. it isn't proper but you'll probably get away with it... or not.

Instead you should really be using Byte arrays and binary I/O to write your "ANSI" (actually mixed-codepage) RTF file to disk. Yes, this means all of your "truly ANSI" text needs to be explicitly converted as well. This is quite easy, and VB6 provides a handy StrConv(x, vbFromUnicode) function for just such purposes.
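
To make the Byte array and binary I/O suggestion concrete, here is a rough sketch. The procedure name WriteRtfBytes, the header string and the file layout are assumptions for illustration only; Body is assumed to hold bytes that are already valid RTF body text.

```vb
Option Explicit

' Hypothetical helper: writes header, body and closing brace as raw bytes.
Private Sub WriteRtfBytes(ByVal Path As String, ByRef Body() As Byte)
    Dim f As Integer
    Dim bytHead() As Byte
    Dim bytTail() As Byte

    ' Plain ASCII markup: convert the UTF-16 String to single bytes
    ' using the current system code page.
    bytHead = StrConv("{\rtf1\ansi\ansicpg1251\deff0{\fonttbl{\f0\fnil\fcharset204 Tahoma;}}\f0 ", vbFromUnicode)
    bytTail = StrConv("\par}", vbFromUnicode)

    ' Binary mode does not truncate an existing file, so remove it first.
    If Len(Dir$(Path)) > 0 Then Kill Path

    f = FreeFile
    Open Path For Binary Access Write As #f
    Put #f, , bytHead       ' Byte arrays go to disk as-is, no text translation
    Put #f, , Body
    Put #f, , bytTail
    Close #f
End Sub
```

Because nothing passes through Print # or a String on the way out, the bytes that reach the file are exactly the bytes in the arrays.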

I'll have to reverse the translation and see if I get the original results back now.

dilettante · Aug 30, 2008

Ok, one glaring mistake. I forgot to skip over the BOM bytes when retrieving the resulting KOI-8 values.

Unicode BOMs are a whole other subject, but they are an optional 2 to 4 byte preamble used in storing Unicode files and streams. UTF-8 uses a 3-byte BOM.

Code:

Option Explicit

Private Function To_KOI8(ByVal File As String) As Byte()
    Dim stmXlate As ADODB.Stream
    Dim strText As String
    Dim bytKOI8() As Byte

    Set stmXlate = New ADODB.Stream
    With stmXlate
        .Open
        
        'Get UTF-8 Russian text as UTF-16LE VB String from File.
        .LoadFromFile File
        .Type = adTypeText
        .Charset = "utf-8"
        strText = .ReadText(adReadAll)
        
        'Empty Stream.
        .Position = 0
        .SetEOS
        
        'Translate UTF-16LE Russian text to KOI-8.
        .Charset = "koi8-r"
        .WriteText strText, adWriteChar
        
        strText = ""
        
        'Fetch KOI-8 text as Byte array.
        .Position = 0
        .Type = adTypeBinary
        .Position = 3 'Skip BOM bytes.
        bytKOI8 = .Read(adReadAll)
        
        .Close
    End With
    To_KOI8 = bytKOI8
End Function

Private Sub From_KOI8(ByVal File As String, ByRef bytKOI8() As Byte)
    Dim stmXlate As ADODB.Stream
    Dim strText As String

    Set stmXlate = New ADODB.Stream
    With stmXlate
        .Open
        
        'Store KOI-8 bytes into Stream
        .Type = adTypeBinary
        .Write bytKOI8
        
        'Retrieve text as UTF-16LE String translated from KOI-8.
        .Position = 0
        .Type = adTypeText
        .Charset = "koi8-r"
        strText = .ReadText(adReadAll)
        
        'Empty Stream.
        .Position = 0
        .SetEOS
        
        'Save UTF-16LE VB String as UTF-8 to File.
        .Charset = "utf-8"
        .WriteText strText, adWriteChar
        strText = ""
        .SaveToFile File, adSaveCreateOverWrite
        
        .Close
    End With
End Sub

Private Sub Form_Load()
    Dim bytKOI8() As Byte
    Dim lngMax As Long, lngRow As Long, lngCol As Long
    Dim bytChar As Byte
    Const COLS = 20
    
    bytKOI8 = To_KOI8("UTF-8-KOI-8.txt")
    
    'Dump KOI-8 values in hex for examination.
    lngMax = UBound(bytKOI8)
    Do
        With rtbDump 'MultiLine RichTextBox, monospaced font.
            'Dump next Byte.
            bytChar = bytKOI8(COLS * lngRow + lngCol)
            .SelStart = Len(.Text)
            'Highlight digits, punctuation, etc.
            If bytChar < &H80 Then .SelColor = vbRed
            .SelText = Right$("0" & Hex$(bytChar), 2)
            
            'Next (and finish formatting Byte).
            lngCol = lngCol + 1
            .SelStart = Len(.Text)
            .SelColor = vbBlack
            If lngCol < COLS Then
                .SelText = " "
            Else
                lngCol = 0
                lngRow = lngRow + 1
                .SelText = vbNewLine
            End If
        End With
    Loop Until COLS * lngRow + lngCol > lngMax
    
    From_KOI8 "UTF-8-KOI-8-dup.txt", bytKOI8
End Sub

dilettante · Aug 30, 2008

Note also that UTF-8 files often use just vbLf line delimiters.
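
This matters if the text is later read back with Line Input, which only treats CR or CR+LF as the end of a line, so an LF-only file comes back as one long line. A small sketch of normalizing the String returned by ReadText before further processing (the function name NormalizeNewlines is hypothetical):

```vb
Option Explicit

' Hypothetical helper: turns lone LF or lone CR into CR+LF
' without doubling line ends that are already CR+LF.
Private Function NormalizeNewlines(ByVal Text As String) As String
    Text = Replace(Text, vbCrLf, vbLf)      ' collapse CR+LF to LF first
    Text = Replace(Text, vbCr, vbLf)        ' then any stray CR
    NormalizeNewlines = Replace(Text, vbLf, vbCrLf)
End Function
```

Something like strText = NormalizeNewlines(.ReadText(adReadAll)) would then feed the rest of the conversion.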

MakeItSo · Sep 1, 2008

Sorry for posting back so late, I was offline for the weekend.

@Bob: Yes, I already found these posts before when searching TT for related topics. I had also already tried encoding the characters with \uN, with N being the character's value.
The outcome was not satisfactory either.
As I was not interested in actually displaying the contents in a richtextbox, I was hoping it could suffice as a "text to RTF translator/interpreter". Alas, it obviously doesn't.
[sadeyes]

@dilettante: Thank you very much! Your posts do clarify a lot!
Now I know why I was struggling so hard. I'll try out your binary approach as soon as possible and will post back about the outcome.
[thumbsup]


MakeItSo · Sep 1, 2008

This thing is driving me nuts!
[banghead]

OK, good news first: thanks dilettante, you are no dilettante at all, but a wise and knowledgeable Tek-Tipper indeed! *bows down*
I only adjusted your read/write code slightly (col-width, filenames, not confined to KOI-8 but including several codepages, etc.), and it runs like olive oil!
[thumbsup2]

I was now able to create an RTF which seemingly holds the correct characters.
Seemingly.
Alas, it still does not open correctly.
[ponder]

What I am doing now is this:
Instead of outputting to an RTF box (was only a workaround attempt anyway), I am writing directly to a text file; either the plain characters or - if special characters - their decimal unicode value:

Code:

Do
            'Dump next Byte.
            bytChar = bytOtherCP(COLS * lngRow + lngCol)
            If bytChar < &H80 Then 'normal text
                es = Chr(bytChar)
            Else
                es = "\u" & bytChar & "?"
            End If
            s = s & es
            lngCol = lngCol + 1
            If lngCol >= COLS Then
                lngCol = 0
                lngRow = lngRow + 1
                s = s & vbNewLine
            End If
            
    Loop Until COLS * lngRow + lngCol > lngMax
    
    a = FreeFile
    Open List1.List(i) & ".tmp" For Output As a
        Print #a, s
    Close a

After that, I open the newly created tmp file for reading, do some regexp replacements and then create the RTF from it, with an RTFHeader suiting the chosen codepage settings.

Now, the outcome is somewhat strange: The RTF opens in Word, I see all normal text as normal text, alas the Russian characters appear as the 1252 equivalents of codepage 1251.
If I copy all the Å, Î, Ð... junk to a plain text file, save it, and open this text file in Word, Word will detect that it is KOI-8 encoding and display the wonderfully correct Russian text!
Why doesn't it display that correctly right away? In my RTF file?
[3eyes]

I even have now set a language tag "\lang" followed by the respective language code (1049 for russian) before each encoded Russian character!!

Code:

    RTFHead = Replace(RTFHead, "ansicpg1252", "ansicpg" & CPs(Langmrk))
    RTFHead = Replace(RTFHead, "\fcharset0", "\fcharset" & CharSets(Langmrk))
    RTFHead = Replace(RTFHead, "deflang1031", "deflang" & LangCodes(Langmrk))

a = FreeFile
    Open List1.List(i) & ".tmp" For Input As a
    
    b = FreeFile
    Open List1.List(i) & ".rtf" For Output As b
        Print #b, RTFHead
    Do Until EOF(a)
        Line Input #a, es
        With rx
            If .Test(es) Then
                s = .Replace(es, myFormat & es)
            Else
                s = "{\lang" & LangCodes(Langmrk) & " " & es
            End If
        End With
        Print #b, s & "\par}"
    Loop
    Print #b, s & "\par}}"

This is a typical output file as created by my program:

{\rtf1\ansi\ansicpg1251\deff0\deflang1049{\fonttbl{\f0\fnil\fcharset204 Tahoma;}}
\viewkind4\uc1\pard\f0\fs17

{\lang1049 \u123?\u234?\u222?\u111?...\par}}

According to RTF specs, characters can be encoded by a "\u" followed by the decimal unicode value and a concluding "?".
And they ARE the correct characters - just not displayed correctly...

Why?

I am beginning to hate RTF...
[flush2]
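
One thing worth checking in output like this: \uN expects the character's Unicode value, whereas a Byte taken from the translated array is a code page value, so the two numbers differ for Cyrillic. A minimal sketch of producing \uN from the VB String itself (that is, from the ReadText result, before any code page translation) might look like the following. The function name RtfFromUnicode is illustrative, and it assumes \uc1 is in effect so that a single "?" serves as the fallback character.

```vb
Option Explicit

' Hypothetical helper: escapes a VB (UTF-16) String for use as RTF body text.
Private Function RtfFromUnicode(ByVal Text As String) As String
    Dim i As Long
    Dim ch As String
    Dim code As Integer
    Dim strOut As String

    For i = 1 To Len(Text)
        ch = Mid$(Text, i, 1)
        code = AscW(ch)                     ' signed 16-bit, the form \uN uses
        Select Case code
            Case 92, 123, 125               ' \ { } must be escaped
                strOut = strOut & "\" & ch
            Case 32 To 126                  ' printable ASCII passes through
                strOut = strOut & ch
            Case 10                         ' line feed ends a paragraph
                strOut = strOut & "\par" & vbCrLf
            Case 13                         ' carriage return: ignored here
            Case Else                       ' everything else as \uN?
                strOut = strOut & "\u" & CStr(code) & "?"
        End Select
    Next i
    RtfFromUnicode = strOut
End Function
```

For the letter with Unicode value 41F (decimal 1055) this yields \u1055? regardless of which code page the header names.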


dilettante · Sep 1, 2008

I'm at a loss.

Hmm... can you find an example of an RTF file of the precise type you want to create? Then examine it with a hex viewer/editor program?

Maybe there is some subtlety that escapes you here. Comparing working examples against the specs can sometimes clear things up a whole lot.

MakeItSo · Sep 2, 2008

I have looked at the code of several Russian RTFs. There ARE differences, but they don't offer any clue to me:

=>The ansicpg value stays at 1252 (western european), the switching over is done by the Word engine as soon as the \lang1049 marker appears.

=>The encoding of the characters is not saved in unicode but in ANSI values, so \u240? turns into \'cf; which is decimal 207. [ponder]

This is why I was wondering from the beginning how RTF determines the character as 207, although it is 240 in the Russian character set.
[3eyes]
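
The two numbers look like the same letter in two different Russian code pages: 240 (hex F0) is its position in KOI8-R, while 207 (hex CF) is its position in Windows-1251, and its Unicode value is 1055 (hex 41F). That suggests the \'hh values in those files are Windows-1251 bytes. As a sketch only, bytes already translated to a single-byte code page could be turned into that form like this. The function name RtfFromBytes is made up, and the approach is not safe for multibyte code pages such as Shift-JIS, where a trail byte can be below 128.

```vb
Option Explicit

' Hypothetical helper: emits single-byte code page data as RTF text,
' writing every byte >= 128 as \'hh.
Private Function RtfFromBytes(ByRef Bytes() As Byte) As String
    Dim i As Long
    Dim b As Byte
    Dim strOut As String

    For i = LBound(Bytes) To UBound(Bytes)
        b = Bytes(i)
        Select Case b
            Case 92, 123, 125               ' \ { } must be escaped
                strOut = strOut & "\" & Chr$(b)
            Case Is >= 128                  ' code page byte as two hex digits
                strOut = strOut & "\'" & LCase$(Right$("0" & Hex$(b), 2))
            Case 10                         ' line feed ends a paragraph
                strOut = strOut & "\par" & vbCrLf
            Case 13                         ' carriage return: ignored here
            Case Else
                strOut = strOut & Chr$(b)
        End Select
    Next i
    RtfFromBytes = strOut
End Function
```

The bytes passed in would have to be in the code page that the RTF header and font charset announce, otherwise the reader maps them to the wrong letters.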


strongm · Sep 2, 2008

But all you need to do is translate from a good, correct UTF-8 file to an RTF file, is that correct?

MakeItSo · Sep 2, 2008

Almost, yes. The UTF-8 file is XML. I also need to apply some formatting to its tags, which I do by using a regexp and adding a "{\cs18\f2\cf15 " formatting info.

What I am basically trying to do is to replace one of my Word macros by a word-independent VB process.
The macro works perfectly fine, but I sometimes have to process thousands of files. The Word engine is simply too slow for my gusto.

I thought "Hey, RTF is a text format. Can't be that hard now, can it?"
Well, it IS...
[tongue]


strongm · Sep 2, 2008

Well, this works for me on a UTF8 text file containing a mix of Russian and English text. You'll need to add a reference to tom (the Text Object Model)...

Code:

[blue]Option Explicit

Private Declare Function SendMessage Lib "user32" Alias "SendMessageA" (ByVal hwnd As Long, ByVal wMsg As Long, ByVal wParam As Long, lParam As Any) As Long
Private Const WM_USER = &H400&
Private Const EM_GETOLEINTERFACE = (WM_USER + 60)

Private Sub Command1_Click()
    Dim myIUnknown As Object
    Dim TextRange As ITextRange
    Dim tomDoc As ITextDocument
    
    SendMessage RichTextBox1.hwnd, EM_GETOLEINTERFACE, 0&, myIUnknown
    Set tomDoc = myIUnknown
    tomDoc.Open "c:\russian.utf8.txt", tomText, 1251
    tomDoc.Save "c:\russian2.rtf", tomRTF + tomCreateAlways, 1251
    RichTextBox2.LoadFile "c:\russian2.rtf"
End Sub[/blue]

MakeItSo · Sep 2, 2008

Thanks a lot, strongm!
Looks really promising!
Alas: the outcome is wrong...

This is a word in the KOI-8 text file:
?????

If I open the RTF created with TOM in MS Word, this is what the word turns into:
?????

[sadeyes]


MakeItSo · Sep 2, 2008

Got it solved!!!!!!!!!!!!!

Strongm: The problem is the codepage. I just had an idea and BANG!
Codepages have a different numbering than their "codepage" value.
==>I recorded a macro in Word saving a text to KOI-8.

The TOM encoding must be set to 20866, not 1251.
:-D

Thanks to all for sticking through!


strongm · Sep 2, 2008

Sorry, I thought KOI8 was just an interim format. My next advice was going to be to use the alternative 'InternetEncoding' code page numbers - but I see you've figured that bit out for yourself; 21866 should also be valid.

strongm · Sep 2, 2008

For those interested, the codepage values being used here (and the InternetEncoding values) can be found under the following Registry key: HKEY_CLASSES_ROOT\MIME\Database\Charset
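
A quick way to read those values from VB6 without registry API declares is the Windows Script Host shell object. This is a sketch under assumptions: the function name InternetEncodingFor is hypothetical, it assumes the charset has its own subkey with an InternetEncoding value (alias entries may not), and it returns 0 when the value cannot be read.

```vb
Option Explicit

' Hypothetical helper: looks up the InternetEncoding number registered
' for a MIME charset name such as "koi8-r". Returns 0 if it is not found.
Private Function InternetEncodingFor(ByVal Charset As String) As Long
    Dim sh As Object

    Set sh = CreateObject("WScript.Shell")
    On Error Resume Next    ' RegRead raises an error for a missing key or value
    InternetEncodingFor = sh.RegRead( _
        "HKEY_CLASSES_ROOT\MIME\Database\Charset\" & Charset & "\InternetEncoding")
    On Error GoTo 0
End Function
```

Called with the same charset names used for the ADO Stream, this could supply the number to pass to the TOM Open and Save calls instead of a hard-coded code page.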
