
UTF-8 Katakana 1

Status
Not open for further replies.

klamerus

Programmer
Jun 23, 2003
71
US
Help.

We have an application that will generate PDF documents from input data. We have been sending this program Latin-1 (which requires turning the upper 128 byte values into multi-byte UTF-8 sequences), but we now need to send it katakana data in UTF-8 representation.

We've looked around and see the Unicode values running from U+30A0 through U+30FF, but we understand that UTF-8 katakana takes 3 bytes, not 2.

Can anyone provide the specific byte values (in order) for a number of katakana characters in true UTF-8? I think the ideal format would be a snippet from an HTML file so that we can view them, but we really need to understand how many bytes UTF-8 katakana should take and to get some sample values (if not the entire character set) showing both the values and the characters. We're going to have to create a replacement table to convert our source file into this.

Thanks,
 
Now looking around...

According to what I can find, katakana run from U+30A0 to U+30FF, as you wrote. That's where they show up in my IME pad, but I'm not sure whether the code is UTF-8.

Another page shows a chart of full-width katakana and codes, covering this range.

A further page shows half-width katakana from FF71 to FF9D.

But I'm not finding a 3-byte form. Sorry to be unhelpful. Is there a reference that claims the codes should have three bytes?
 
Yeah, I have the pdf document.

My understanding is that all multi-byte UTF-8 data has the high bit set (and if it doesn't, it's ASCII data passed through unchanged).

My further understanding was that the bits following the high bit indicated how many additional bytes followed in the mode of:

110xxxxx indicated 1 follow-on byte
1110xxxx indicated 2 follow-on bytes
11110xxx indicated 3 follow-on bytes

and so on.

A 30 is only 00110000, so it doesn't include anything introducing it as multi-byte UTF-8. At least that's my understanding of UTF-8.
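For what it's worth, that lead-byte rule can be checked mechanically. Here is a short Python sketch (mine, purely illustrative) that classifies a first byte by its bit pattern:

```python
def sequence_length(first_byte: int) -> int:
    """Return the total byte count of a UTF-8 sequence, given its first byte."""
    if first_byte < 0x80:      # 0xxxxxxx: plain ASCII, a single byte
        return 1
    if first_byte >= 0xF0:     # 11110xxx: three follow-on bytes
        return 4
    if first_byte >= 0xE0:     # 1110xxxx: two follow-on bytes
        return 3
    if first_byte >= 0xC0:     # 110xxxxx: one follow-on byte
        return 2
    raise ValueError("10xxxxxx is a continuation byte, not a lead byte")

# 0x30 ('0') is 00110000: a one-byte ASCII character, as noted above
print(sequence_length(0x30))  # 1
print(sequence_length(0xE3))  # 3 - the lead byte of the katakana block
```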
 
You have me there!

This huge page claims to have UTF-8 codes.


If you scroll down to where the first column contains "05", you will find katakana: 3 bytes long, starting with 0xE, so it seems to fit your definition.

Hope this helps,
Walter
 
Hi klamerus,

I only know a little about this so am happy to be shot down.

All that Unicode does is define unique code points for each character. How code points are represented in a file depends on the encoding used, and there are many different encodings. UTF-8 is an encoding method which uses single bytes (hence the 8, for 8 bits). As single bytes are clearly not enough, an extra level of encoding has to be introduced to say how many bytes should be considered as a group representing each individual character. So there are, effectively, two separate sets of bit codes within UTF-8, the first simply indicating how the second should be read.

The first code is actually two different codes, which one applies being indicated by the first bit. If the first bit is zero there is no information about the number of bytes in use, so it is deemed to be a single byte and the remaining seven bits are the actual character encoding (values 0 through 127). When the first bit is a one, multiple bytes are being used, the number of bytes being indicated by the number of leading ones. The only possible way to end a variable-length run of ones is with a zero.

So your examples and your understanding are correct - a first byte starting 1110 uses three bytes in total (i.e. has two follow-on bytes). Also, for no reason I really know, when a byte is part of a sequence of bytes (as opposed to being a single-byte character) it always begins with a (redundant) 1, which must be followed by a terminating zero.

Knowing how many bytes are used, the second code, that is the bits representing the character, can then be extracted. This is a concatenation of the bits from each byte which follow the first zero in that byte. So, for example, look at the sequence:

[tt]1110 0011 1000 0010 1010 0000[/tt]

The first byte begins with a 1 so this is a multi-byte code.

There are three ones at the start of the first byte so this character uses 3 bytes.

The bits following the first zero in each byte are: [tt]0011 00 0010 10 0000[/tt]

Rearranging this gives: [tt]0011 0000 1010 0000[/tt] which is 30A0 - the first of your Katakana alphabet.

Similarly, the last character in the alphabet (30FF) is: [tt]1110 0011 1000 0011 1011 1111[/tt]
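That bit arithmetic can be cross-checked against any UTF-8 encoder; a quick Python sketch (illustrative, not from the thread):

```python
# Encode the two ends of the katakana block (U+30A0 and U+30FF) to UTF-8
first = chr(0x30A0).encode("utf-8")
last = chr(0x30FF).encode("utf-8")

print(first.hex(" "))  # e3 82 a0
print(last.hex(" "))   # e3 83 bf
```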

I'm not going to go into a lengthy explanation of the reverse of the above process. For the 90 or so characters you want, a translation table is all you really need (and Walter's link above includes the katakana alphabet) and, yes, each character does take three bytes.

Enjoy,
Tony

--------------------------------------------------------------------------------------------
We want to help you; help us to do it by reading this: Before you ask a question.
Excel VBA Training and more Help at VBAExpress
 
Tony,

Thanks! Who knew that encoding characters could be so interesting?

It does seem wasteful that all of the bytes in a multibyte code begin with 1, but there must be some reason.

Best Regards,
Walter
 
I think I'm just sad [smile] - I surprised myself as I wrote the post and it seemed to make logical sense - glad you found it interesting.

My best guess at the reason (for the leading one) is simply to differentiate the byte so that, in isolation, it is not mistaken for a single byte encoded character - and the overhead in having it is fairly small as only a small range of characters would actually take a byte less without it.

Enjoy,
Tony

 
Yes, I'm very familiar with Unicode and wrote the company standard around it. The difficulty is always creating the UTF-8 from the Unicode data itself. The algorithm is obtuse as far as I know.

Regardless, one of the people working for me found an absolutely marvelous web site.


It generates the UTF-8 for the Unicode (and other stuff too). It's exactly what I was looking for. Too bad there's no tool you can download for this.
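For what it's worth, any language with built-in Unicode support can serve as that downloadable tool; a minimal Python sketch (my own, not related to the site mentioned above):

```python
def utf8_bytes(code_point: int) -> str:
    """Return the UTF-8 encoding of a Unicode code point as spaced hex."""
    return chr(code_point).encode("utf-8").hex(" ").upper()

print(utf8_bytes(0x30A2))  # E3 82 A2  (katakana 'a')
```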
 
Well, I think the specification is that the indicator for multi-byte Unicode must have the 8th bit set, since anything else is single-byte ASCII.

The several bits that follow indicate the number of follow-on bytes, though that only covers a few extra bytes. It's always been the remaining bits that have confused me.

The other bytes are simply what someone chose and are easily looked up in the Unicode standard.

The first byte in UTF-8 has always been a bit of a puzzle though.
 
One thing I can recommend is the half-width katakana.

While the other (full-width) is nicer looking, the half-width takes a smaller character set and appears to be what the computer systems here (like SAP) are using. I believe this is because it allows for the use of both Latin-1 and katakana (since the katakana takes less of the character set). Anyway, I prefer it.
 
klamerus,

Neat web site.

Judging from the example in Tony's post, if the Unicode value is 2 bytes long, then UTF-8 encoding expands it to 3 bytes. The extra byte is entirely determined and added by UTF-8: 4 bits on the first byte (1110), 2 bits on the second (10), and 2 bits on the third (10). That accounts for all of the non-Unicode bits.



Best Regards,
Walter
 

Walter,

The number of bits is the significant factor. Characters (above U+7F) up to U+7FF need at most 11 bits, and with a 3-bit prefix (110) on the first byte and a 2-bit prefix (10) on the second byte there are 11 spare bits, so these characters need two bytes. Code points U+800 and above (up to U+FFFF) take 3 bytes: first-byte prefix 1110, second and third byte prefixes 10, leaving 16 bits for the code point encoding.
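Those ranges amount to a simple rule; sketched in Python (illustrative only):

```python
def utf8_length(code_point: int) -> int:
    """Bytes needed to encode a code point in UTF-8, per the ranges above."""
    if code_point <= 0x7F:
        return 1  # 0xxxxxxx: 7 payload bits
    if code_point <= 0x7FF:
        return 2  # 110xxxxx 10xxxxxx: 11 payload bits
    if code_point <= 0xFFFF:
        return 3  # 1110xxxx 10xxxxxx 10xxxxxx: 16 payload bits
    return 4      # 11110xxx plus three continuation bytes: 21 payload bits

print(utf8_length(0x30A0))  # 3 - the katakana block
```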

klamerus,

It shouldn't be too difficult to write a VBA routine which would take a Unicode value and return a UTF-8 encoding as a byte array - I'll try and knock something up. There are complications with strings as VBA strings are 2-byte Unicode UTF-16 and there are all sorts of automatic conversions done to hide this fact. I don't know exactly what you propose to do with your results so it may not matter but it is worth being aware of it.

Enjoy,
Tony

 
Tony,

I see. There are a lot of wasted bits in characters that are just a little too big for 1 byte.

As for a VBA routine, it looks interesting. In VBA's help, I found the function

AscW(s as string) as integer.

It claims to return the Unicode value for the string s. The integer value could be processed with logical operators and integer arithmetic. Is that the way you are thinking?

Best Regards,
W
 
Yes, Walter, that is the way I'm thinking.

Some code follows, but note:[ul][li]Word itself (never mind VBA) doesn't fully cope with code points >= U+10000. It will reproduce them in a document if you have appropriate fonts installed, but working with them in other ways may or may not produce correct results.[/li][li]The AscW function returns an Integer, and some odd results for code points above U+FFFF, which an Integer cannot possibly represent. Negative Integer values that are returned also have to be adjusted so that code points 32768 to 65535 are treated properly.[/li][/ul]
This function (not fully tested - but looks good to me) will return a byte array of the UTF-8 representation of a supplied character
Code:
[blue]Function UnicodeToUTF8(Char As String)

Dim ByteArray() As Byte
Dim CodePoint As Long
Dim CodePointBinary As String
Dim iByte As Integer
Dim i As Integer

CodePoint = AscW(Char)
If CodePoint < 0 Then CodePoint = 65536 + CodePoint

Do While CodePoint >= 2
    CodePointBinary = CodePoint Mod 2 & CodePointBinary
    CodePoint = CodePoint \ 2
Loop
CodePointBinary = CodePoint & CodePointBinary

If Len(CodePointBinary) <= 7 Then
    
    ReDim ByteArray(0) [green]' single byte; ReDim initialises it to 0[/green]
    
    For i = 0 To Len(CodePointBinary) - 1
        ByteArray(0) = ByteArray(0) + Right(CodePointBinary, 1) * 2 ^ i
        CodePointBinary = Left(CodePointBinary, Len(CodePointBinary) - 1)
    Next
    
    
Else
    
    CodePointBinary = Right("0000" & CodePointBinary, ((Len(CodePointBinary) - 2) \ 5) * 5 + 6)
    ReDim ByteArray((Len(CodePointBinary) - 2) \ 5)
    CodePointBinary = String(5 - (Len(CodePointBinary) Mod 6), "1") & "0" & CodePointBinary
    
    iByte = UBound(ByteArray)
    Do While Len(CodePointBinary) > 0
        For i = 0 To 5
            ByteArray(iByte) = ByteArray(iByte) + Right(CodePointBinary, 1) * 2 ^ i
            CodePointBinary = Left(CodePointBinary, Len(CodePointBinary) - 1)
        Next
        ByteArray(iByte) = ByteArray(iByte) + 128
        iByte = iByte - 1
    Loop
    ByteArray(0) = ByteArray(0) + 64

End If

UnicodeToUTF8 = ByteArray

End Function[/blue]
You can test it with something like this
Code:
Sub Test()
a = UnicodeToUTF8(Selection.Range.Text) [green]' or a = UnicodeToUTF8(ChrW(12288)) etc.[/green]
For i = LBound(a) To UBound(a)
    stra = stra & " " & Hex(a(i))
Next
MsgBox stra
End Sub
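For anyone who wants to sanity-check the algorithm outside VBA, the same bit-packing can be sketched in Python (my own translation, not part of Tony's code; covers code points up to U+10FFFF):

```python
def unicode_to_utf8(code_point: int) -> list[int]:
    """Pack a code point into UTF-8 bytes: lead byte plus continuation bytes."""
    if code_point <= 0x7F:
        return [code_point]  # ASCII passes through as a single byte
    n = 2 if code_point <= 0x7FF else 3 if code_point <= 0xFFFF else 4
    result = []
    for _ in range(n - 1):
        result.insert(0, 0x80 | (code_point & 0x3F))  # 10xxxxxx continuation
        code_point >>= 6
    lead_prefix = (0xF00 >> n) & 0xFF  # 0xC0, 0xE0 or 0xF0 for n = 2, 3, 4
    result.insert(0, lead_prefix | code_point)
    return result

print(unicode_to_utf8(0x30A0))  # [227, 130, 160], i.e. E3 82 A0
```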

Enjoy,
Tony

 
Tony,

Wow, let me study this.

As for Unicode values higher than 0xFFFF, it looks like they do not occur in Japanese; my "IME pad" only goes up to 0xFFFD.

Thanks,
Walter
 

I'm getting way outside what I know here but I believe that some 'extended ideographs' have code points up to U+20000 and beyond, and in what seems like a kludge reminiscent of early DBCS, there are what are called surrogate pair encodings which use two UTF-8 'characters' for a single code point. Windows only has partial support for surrogate pair encodings and it doesn't extend as far as VBA - I don't know if there are any API routines which would help but I suspect not.

Again, to the best of my knowledge, most normal Japanese, and the Katakana, etc. alphabets have low enough code points for this not to be an issue for most people - and it shouldn't be one for klamerus. I have enough interest to dig a little and if I find any way of improving it I will post back but, at the moment, my code is limited and I felt I should say so.

Enjoy,
Tony

 

After some reading up, I have discovered that all code points (up to U+10FFFF) are available in VBA and will try to modify my code to deal with it.

For the record, if AscW(Char) is in the range U+D800 to U+DBFF then it is the first of a surrogate pair and AscW(right(Char,1)) (or similar construct to get the second 'character') should give a value in the range U+DC00 to U+DFFF for the second of the pair.

U+D800, U+DC00 = U+10000
U+D800, U+DC01 = U+10001
etc., up to
U+DBFF, U+DFFF = U+10FFFF
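The pairing arithmetic, sketched in Python (the helper name is mine):

```python
def surrogate_pair_to_code_point(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair (high, low) into one code point."""
    assert 0xD800 <= high <= 0xDBFF, "first of the pair"
    assert 0xDC00 <= low <= 0xDFFF, "second of the pair"
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)

print(hex(surrogate_pair_to_code_point(0xD800, 0xDC00)))  # 0x10000
print(hex(surrogate_pair_to_code_point(0xDBFF, 0xDFFF)))  # 0x10ffff
```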

I mistakenly said UTF-8 in my last post when I meant UTF-16. I have been looking at, and working in, Word (and Windows), which use UTF-16. I am assuming that the same logic holds true for UTF-8, and I can't think of any reason why it shouldn't.

Enjoy,
Tony

 

UTF-8 has the capacity to encode code points all the way up to a theoretical U+7FFFFFFF and so doesn't need to resort to the two-character kludge of UTF-16. This makes it (relatively) easy to change my code to cope with the higher values (extra code in red) ...
Code:
[blue]Function UnicodeToUTF8(Char As String)

[red]Const HD800 As Long = 55296 [green]' Avoid problems with hex literals ..[/green]
Const HDC00 As Long = 56320 [green]' .. returning negative integers[/green][/red]

Dim ByteArray() As Byte
Dim CodePoint As Long
Dim CodePoint2 As Long
Dim CodePointBinary As String
Dim iByte As Integer
Dim i As Integer

CodePoint = AscW(Char)
If CodePoint < 0 Then CodePoint = 65536 + CodePoint
[red]If CodePoint >= HD800 And CodePoint < HDC00 Then
    CodePoint = &H10000 + (CodePoint - HD800) * 1024
    CodePoint2 = &H10000 + AscW(Right(Char, 1))
    CodePoint = CodePoint + CodePoint2 - HDC00
End If[/red]

Do While CodePoint >= 2
    CodePointBinary = CodePoint Mod 2 & CodePointBinary
    CodePoint = CodePoint \ 2
Loop
CodePointBinary = CodePoint & CodePointBinary

If Len(CodePointBinary) <= 7 Then
    
    ReDim ByteArray(0) [green]' single byte; ReDim initialises it to 0[/green]
    
    For i = 0 To Len(CodePointBinary) - 1
        ByteArray(0) = ByteArray(0) + Right(CodePointBinary, 1) * 2 ^ i
        CodePointBinary = Left(CodePointBinary, Len(CodePointBinary) - 1)
    Next
    
    
Else
    
    CodePointBinary = Right("0000" & CodePointBinary, ((Len(CodePointBinary) - 2) \ 5) * 5 + 6)
    ReDim ByteArray((Len(CodePointBinary) - 2) \ 5)
    CodePointBinary = String(5 - (Len(CodePointBinary) Mod 6), "1") & "0" & CodePointBinary
    
    iByte = UBound(ByteArray)
    Do While Len(CodePointBinary) > 0
        For i = 0 To 5
            ByteArray(iByte) = ByteArray(iByte) + Right(CodePointBinary, 1) * 2 ^ i
            CodePointBinary = Left(CodePointBinary, Len(CodePointBinary) - 1)
        Next
        ByteArray(iByte) = ByteArray(iByte) + 128
        iByte = iByte - 1
    Loop
    ByteArray(0) = ByteArray(0) + 64

End If

UnicodeToUTF8 = ByteArray

End Function[/blue]
I haven't included any error checking which, theoretically, I should have in case the second character falls outside the range, but with the source being a glyph in a Word document it won't happen.

I have learnt a lot from this little exercise. Thank you.

Enjoy,
Tony

 
Definitely stay away from UTF-16. MS likes it (so long as they're defining the byte order).

The code definitely looks interesting. I'll need to give it a spin. Seems like something good to have in one of those PDA calculators that are available (next to the computer math and other tech specific views).

While most of the characters people use can be handled in UCS2 (2 byte character set), there are some extended ones and the UTF-8 specification goes from 3-6 bytes I believe (although I've never seen anything in use > 3).
 
UTF-8 actually goes from 1 to 6 bytes but the (so far) highest defined Unicode code point (U+10FFFF) only requires 4 bytes.
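A quick Python check (illustrative) that the highest code point really does fit in four bytes:

```python
# U+10FFFF, the highest defined code point, encodes to four UTF-8 bytes
encoded = chr(0x10FFFF).encode("utf-8")
print(len(encoded), encoded.hex(" "))  # 4 f4 8f bf bf
```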

And don't start me on MS and byte order [wink]

Enjoy,
Tony

 