
wchar_t and Unicode help


mrampton
Programmer
Jan 11, 2001
I'm having some trouble getting a line of Unicode printed to the screen properly.

I'm calling:
wc_message(L">¡¢£¤¥¦§¨©ª«­®¯");

Where wc_message() is:

#include <stdio.h>
#include <wchar.h>

extern FILE *test_log;   /* log file opened elsewhere */

void wc_message(wchar_t *text)
{
    wchar_t mstring[256];
    /* MSVC 6's swprintf takes no size argument; %s means a wide string here */
    swprintf(mstring, L"MESSAGE: %s\n", text);
    wprintf(L"%s", mstring);             /* don't pass the data as the format string */
    fwprintf(test_log, L"%s", mstring);
}

When the text is printed, I get:
MESSAGE: >íóúñѪº¿¬¬½¡«»
instead of:
MESSAGE: >¡¢£¤¥¦§¨©ª«­®¯

I've never worked with Unicode before, so maybe I'm making a simple mistake; sorry if that happens to be the case. Does anyone have any ideas on this, though?

Thanks in advance.
-mark
 
I would think it is more of a font issue. Extended ASCII characters can look different in other fonts.

Matt
 
I don't really understand what you're trying to do.
If you want the exact same characters you input to be printed, why not just use printf()?
I've never used Unicode either, but I thought wide characters were used for non-ASCII characters like Chinese...

But it looks like Zyrenthian is right. Open your Windows Character Map program and compare the characters you typed in System font to those in Terminal font.
 
I don't know that Unicode characters are valid in C++ source files. Try reading the Unicode characters in from another file, then printing them out; see if that works any better.


cpjust: wchar_t is typically used to represent characters in the Universal Character Set, or Unicode. While Unicode does have Chinese characters, as well as characters from Hebrew, Japanese, Cyrillic, and several other languages, it also contains all of the characters in the ASCII character set. It would be hard to have a Universal Character Set that ignored the characters widely used in the USA and in most of Europe.
 
I'm a little more familiar with UTF-8, where a string containing only English characters is exactly the same as a char* string, while characters outside the ASCII range use 2 or more bytes each.
Is that the way it works with Unicode also, or are English characters also multiple bytes?
 
Well, UTF-8 is a particular encoding of Unicode.


Unicode is a character set. That means it is a list of characters and their "positions" within the list.

ASCII is also a character set; it, for example, specifies that the character 'A' is at "position" 0x41.


An encoding is a way of representing characters from a particular character set in a stream of data.

Now, with ASCII, you only have 128 positions. You can represent each position as an integer in a single byte, so you use a sorta "one-to-one" encoding: each byte in the data stream represents a single ASCII character.

With Unicode, there are enough "positions" to require two bytes to represent one as an integer. You could use a "one-to-one" (or "two-to-two"?) encoding of Unicode that uses two bytes to represent each character. Such an encoding exists, and it is called UCS-2.


However, there are some problems with that encoding. First of all, there's the size: it would essentially double the size of any "normal" document to create it using UCS-2. Second, if you wanted to use it for all your files, you'd have to translate all your ASCII files to UCS-2.

UTF-8 is a Unicode encoding (whose mechanics you already seem familiar with) that remains backward-compatible with ASCII. That is, the Unicode characters that are also ASCII characters are represented by the corresponding ASCII bytes, but everything else is represented by a combination of more than one byte.

This has the advantage that you can read an ASCII file as if it were encoded in UTF-8 and get the correct result. It also means that, for documents that mostly use ASCII characters and occasionally use, say, Greek characters for equations, it can represent the most commonly used characters as single bytes.
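To make that concrete, here's a quick sketch of my own (not from any library, and handling only code points below U+0800) showing how the bytes get built; notice the ASCII range passes through unchanged while everything else becomes two bytes:

Code:
#include <stdio.h>

/* Encode one code point (< 0x800) as UTF-8.
   Returns the number of bytes written (1 or 2). */
int to_utf8(unsigned int cp, unsigned char *out)
{
    if (cp < 0x80) {            /* ASCII range: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    /* two bytes: 110xxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
}

int main(void)
{
    unsigned char buf[2];
    int i, n;

    n = to_utf8(0x41, buf);     /* 'A' */
    for (i = 0; i < n; ++i) printf("%02X ", buf[i]);
    printf("\n");

    n = to_utf8(0xA3, buf);     /* POUND SIGN */
    for (i = 0; i < n; ++i) printf("%02X ", buf[i]);
    printf("\n");
    return 0;
}

'A' (U+0041) comes out as the single byte 41, while the pound sign (U+00A3) comes out as C2 A3; UCS-2 would store them as 00 41 and 00 A3.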


I believe that other languages have similar encodings that encode their most commonly used characters as single bytes as well. KOI-8 is an encoding commonly used in Russian-speaking areas. I believe it represents Cyrillic characters as single bytes, making Cyrillic documents encoded in KOI-8 shorter than corresponding UTF-8 ones.


That's my understanding, anyway.
 
And from what I can see, Tek-Tips didn't seem to want to attach my name to that post...

The previous post/ramble is by me, chipperMDW, in case anyone cares.
 
chipperMDW: I've been using UTF-8 encoding for all the strings in the project I'm currently working on, which is how I became familiar with UTF-8, and now I know a little more about it. Thanks!
 
No problem.


As for the original question, I checked a draft of the C++ Standard, and it seems Unicode characters are valid in C++ source.

In that case, I second the suggestion that the font is the problem.
 
I have never tried to output any Unicode to the screen either, but here are some thoughts anyway.

1. Windows has certain problems with its regional settings. The MSVC editor is an 8-bit editor. Special characters you enter will be converted by the compiler to their Unicode equivalents using the current regional settings. However, in the editor window they are displayed using the initial regional settings chosen when Windows was installed. So you could have problems if you have changed your regional settings after installing Windows. I think you should instead use an additional UTF-16 little-endian Unicode file containing all the messages you want to use, and load them into your program before use (see the sketch at the end of this post).

chipperMDW:
As for the original question, I checked a draft of the C++ Standard, and it seems Unicode characters are valid in C++ source.

Which encoding? UTF-16 or UTF-8? In the case of the latter, how would the compiler distinguish between UTF-8 and plain 8-bit files? The MSVC compiler is simply an 8-bit compiler, and neither UTF-16 nor UTF-8 can simply be used.

cpjust:
I've been using UTF-8 encoding for all the strings in the project I'm currently working on, which is how I became familiar with UTF-8, and now I know a little more about it.

Where? In L"" literals? What compiler are you using?

I think UTF-8 characters can be used only in ordinary char string literals, and they would then have to be converted to wchar_t strings using some conversion routines.

2. If your output goes to the console, it will be converted according to the correspondence between the Windows and DOS code pages.

3. Windows system raster fonts have no Unicode support, so if you output your message to a window using a raster font instead of a TrueType font, the output will also be converted according to your regional settings.

4. The correct output should be in your log file. Try loading it into any Unicode-aware editor.
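Here is the sketch promised in point 1. It is only an illustration, assuming a Windows compiler where wchar_t is 16 bits; messages.txt is a hypothetical file saved as UTF-16 little-endian, and WriteConsoleW (NT-based Windows) is used to avoid the code page conversion from point 2:

Code:
#include <stdio.h>
#include <wchar.h>
#include <windows.h>

int main(void)
{
    wchar_t buf[256];
    wchar_t *msg;
    size_t n;
    DWORD written;
    FILE *f = fopen("messages.txt", "rb");  /* hypothetical UTF-16 LE message file */

    if (f == NULL)
        return 1;

    /* read the raw 16-bit units straight into the wide buffer */
    n = fread(buf, sizeof(wchar_t), 255, f);
    fclose(f);
    buf[n] = L'\0';

    /* skip the byte order mark (0xFEFF) if the editor wrote one */
    msg = (buf[0] == 0xFEFF) ? buf + 1 : buf;

    /* WriteConsoleW takes the wide characters as-is, with no
       Windows-to-DOS code page conversion in between */
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), msg,
                  (DWORD)wcslen(msg), &written, NULL);
    return 0;
}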
 
I'm using Visual Studio 6.0.
I'm using English text most of the time, so it's the same as a normal char* string.
For high-ASCII and Japanese text, I use the hex values. Ex.
Code:
const char* HIGH_ASCII = "\xC3\x8A\xC3\xAF\xC3\x8F\xC3\xB4\xC3\x94\xC3\xB9\xC3"
                         "\x99\xC3\xA4\xC3\x84\xC3\xB6\xC3\x96\xC3\x9F\xC3\xBC"
                         "\xC3\x9C\xC2\xA7\xC2\xB5\xC3\xA0\xC3\x80\xC3\xA8\xC3"
                         "\x88\xC3\xA9\xC3\x89\xC3\xB2\xC3\x92\xC3\xA1\xC3\x81"
                         "\xC3\xAD\xC3\x8D\xC3\xB1\xC3\x91\xC3\xB3\xC3\x93\xC3"
                         "\xA5\xC3\x85\xC3\xA4\xC3\x84\xC3\xB6\xC3\x96";
which is equivalent to "åÅæÆøØàÀâÂçÇéÉêÊïÏôÔùÙäÄöÖßüܧµàÀèÈéÉòÒáÁíÍñÑóÓåÅäÄöÖ"

When I output to the console it obviously gets messed up for high-ASCII and Japanese, but I only test programmatically, so I don't need to print to the screen.
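When I do need a wide version of one of those strings, I believe something like MultiByteToWideChar with CP_UTF8 should do the conversion (a sketch, not code from my project, and it assumes an OS where CP_UTF8 is supported):

Code:
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* shortened stand-in for the HIGH_ASCII string above */
    const char *utf8 = "\xC3\x8A\xC3\xAF";  /* UTF-8 for Ê and ï */
    wchar_t wide[64];

    /* convert the UTF-8 bytes to wchar_t; -1 means the source is NUL-terminated */
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, 64);
    if (n == 0)
        return 1;

    /* wide[] now holds the code points 0x00CA and 0x00EF */
    printf("%d wide chars (including the NUL)\n", n);
    return 0;
}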
 
> Which encoding? UTF-16 or UTF-8?

The encoding used would depend on what the compiler was written to do, I guess. It could be affected by options or environment.


> In the case of the latter, how would the compiler distinguish between UTF-8 and plain 8-bit files?

I'm not so sure it would need to. An 8-bit file should be readable as a UTF-8-encoded one, shouldn't it? Unless you're really using the 8th bit.


> The MSVC compiler is simply an 8-bit compiler, and neither UTF-16 nor UTF-8 can simply be used.

I don't see why being an 8-bit compiler should prevent it from interpreting data as UTF-8 or UTF-16. There may be another reason it cannot use them, though.

Keep in mind that I don't normally use Microsoft's compiler (or operating system, for that matter), so I don't know a lot of detailed information about it.


The C++ Standard (the draft, at least) says that during the compilation process, the compiler should turn all characters not in the source character set (which is mostly the ASCII characters) into some internal representation. The compiler gets to pick.

Whether or not MSVC does that is another issue, I guess, but I thought Microsoft was using UTF-8 in most things by now...
 
That was me again.

I guess when I take too long writing a post, I get logged out and it puts me as a blank when I finally submit it.

Cool; I found a bug.
 
The encoding used would depend on what the compiler was written to do, I guess. It could be affected by options or environment.

UTF-16 can be recognised automatically: the file starts with a two-byte byte-order mark. As for UTF-8 versus plain 8-bit files, you are right; they could be distinguished using some compiler switches.

Microsoft was using UTF-8 in most things by now...

Where, for example? Long file names seem to be simply 16-bit coded. Text in MS Word .doc files is stored in chunks, some of them ASCII-coded, some 16-bit; I didn't recognize any UTF-8. The MSVC 6.0 compiler by default interprets L"" literals as 8-bit (non-Unicode) strings and converts them to Unicode according to the current Windows 8-bit code page. "" literals with 8-bit characters are left unchanged.

I'm using UTF-8 for output to XML files. I'm surprised: even accented characters that have no equivalents in the current TrueType font are displayed by Explorer more or less correctly; they are composed from other characters with suitable accents added over them.
 
> As for UTF-8 versus plain 8-bit files, you are right; they could be distinguished using some compiler switches.

Yes, but I thought the whole point of UTF-8 was so you could read a file containing only ASCII characters and have it be interpreted correctly as UTF-8 data.

When you say 8-bit encoded, are you talking about plain ASCII where only the first 7 bits matter, or are you talking about something that actually uses the 8th bit for extra, non-ASCII characters or formatting or whatnot?

In the latter case, then yeah, you'd need some way to distinguish between the two encodings. But in the former case, from my understanding, it should work fine as a UTF-8-encoded file.


> Where, for example?

I don't have any examples; sorry. I just thought I remembered reading somewhere that Microsoft's preferred encoding was UTF-8, and I know they're pretty good with internationalization support.

Actually, I think whatever I was reading at the time was talking about a filesystem, but you already mentioned that file names are not UTF-8. Perhaps it was NTFS? I dunno; I'll see if I can find what it was that gave me that impression.
 
Yes, a plain ASCII file and a UTF-8 file containing only ASCII characters are identical.

When you say 8-bit encoded, are you talking about plain ASCII where only the first 7 bits matter, or are you talking about something that actually uses the 8th bit for extra, non-ASCII characters or formatting or whatnot?

The second one. By plain 8-bit encoding I mean a one-byte-per-character encoding, where national letters are represented by codes above 127 using some encoding table like Windows-1252, IBM-437, or KOI-8.
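To see how the same byte means different letters under different tables, you can ask Windows to widen one byte under several code pages (a sketch; 20866 is, as far as I know, the Windows identifier for KOI8-R, and the code pages must be installed on the system):

Code:
#include <windows.h>
#include <stdio.h>

int main(void)
{
    const char byte = '\xE4';              /* code 228, above 127 */
    UINT pages[3] = { 1252, 437, 20866 };  /* Windows-1252, IBM-437, KOI8-R */
    int i;

    for (i = 0; i < 3; ++i) {
        wchar_t wc[2];
        /* map the single byte to Unicode under each table */
        if (MultiByteToWideChar(pages[i], 0, &byte, 1, wc, 2) == 1)
            printf("code page %u: byte E4 -> U+%04X\n", pages[i], (unsigned)wc[0]);
    }
    return 0;
}

If I read the tables right, that prints U+00E4 (ä) for Windows-1252, U+03A3 (Σ) for IBM-437, and U+0414 (Д) for KOI8-R, all from the very same byte.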

you already mentioned that file names are not UTF-8. Perhaps it was NTFS?

The filesystem where I saw two-byte-per-character file names was FAT in early Windows 95. No idea about NTFS, but the Windows APIs that manipulate long file names also use 16-bit encoding.
 
> The second one. By plain 8-bit encoding I mean a one-byte-per-character encoding, where national letters are represented by codes above 127 using some encoding table like Windows-1252, IBM-437, or KOI-8.

Ok, gotcha.
 