
Reading Multibyte Characters Using C

Status
Not open for further replies.

yasin007

Programmer
Oct 14, 2003
13
SG
Hi all,

How do I read files containing multibyte characters, such as Chinese, Japanese, and Korean text?
I am new to C programming. Any help is highly appreciated!


 
If you are on VC++, you can use the wchar_t library.
Otherwise, you will have to read the characters into a short rather than a char.
 
Don't know much about multibyte characters, but check out the functions:

Code:
mbstowcs
mbtowc
Presumably, you read a string from the multibyte file as a char*, then convert it to a wchar_t*.

Just my guess. Hopefully looking at the documentation will help you more.
 
Hi guys. Thanks for the responses.

I am working in Visual C++ 6.0, but I also need to test this on Unix using the gcc compiler. I tried to find the mbstowcs and mbtowc functions. Are they built-in functions or from an external library?

From a file, I am able to read a char with getc() or an int with getw(). But how can I read a short? Can you post a sample?
Please advise.

Thanks.
 
Maybe to avoid compilation errors you can do something like this,

typedef short wchar_t;

short buf;
fscanf(filePointer, "%c%c",&((char *)&buf)[0],&((char *)&buf)[1]);

etc etc...

If you are on HP-UX, it has a wchar_t library, so you will be able to use it directly.

Otherwise, you can also search Google to see whether free code for this is available.
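
For what it's worth, here is a rough sketch of reading two bytes into a 16-bit value without the pointer casts, so the result does not depend on the machine's byte order. It assumes the file stores fixed two-byte units, low byte first, which is only true for encodings such as UTF-16LE; Shift-JIS, EUC and Big5 files are variable-width and need the mb* functions instead. The name read_u16_le is made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read one little-endian 16-bit unit from fp.
   Returns 1 on success, 0 on end of file or a truncated unit. */
int read_u16_le(FILE *fp, unsigned short *out)
{
    int lo, hi;

    assert(fp != NULL && out != NULL);

    lo = getc(fp);              /* getc returns int so EOF is distinguishable */
    if (lo == EOF)
        return 0;
    hi = getc(fp);
    if (hi == EOF)
        return 0;

    /* Assumption: low byte comes first in the file. */
    *out = (unsigned short)((hi << 8) | lo);
    return 1;
}
```

Open the file in binary mode ("rb") before calling this, otherwise Windows may translate line endings and stop early at a 0x1A byte.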

 
I was going through the Solaris man pages and found the following APIs related to wide characters. (Sorry for the format; I just copied them from the net.) Maybe you can go through their man pages for details.

watoll(3C) - convert wide character string to long integer
wcrtomb(3C) - convert a wide-character code to a character (restartable)
wcscat(3C) - wide-character string operations
wcschr(3C) - wide-character string operations
wcscmp(3C) - wide-character string operations
wcscoll(3C) - wide character string comparison using collating information
wcscpy(3C) - wide-character string operations
wcscspn(3C) - wide-character string operations
wcsetno(3C) - get information on EUC codesets
wcsftime(3C) - convert date and time to wide character string
wcslen(3C) - wide-character string operations
wcsncat(3C) - wide-character string operations
wcsncmp(3C) - wide-character string operations
wcsncpy(3C) - wide-character string operations
wcspbrk(3C) - wide-character string operations
wcsrchr(3C) - wide-character string operations
wcsrtombs(3C) - convert a wide-character string to a character string (restartable)
wcsspn(3C) - wide-character string operations
wcsstr(3C) - find a wide-character substring
wcstod(3C) - convert wide character string to double-precision number
wcstok(3C) - wide-character string operations
wcstol(3C) - convert wide character string to long integer
wcstombs(3C) - convert a wide-character string to a character string
wcstoul(3C) - convert wide character string to unsigned long
wcstring(3C) - wide-character string operations
wcswcs(3C) - wide-character string operations
wcswidth(3C) - number of column positions of a wide-character string
wcsxfrm(3C) - wide character string transformation
wctob(3C) - wide-character to single-byte conversion
wctomb(3C) - convert a wide-character code to a character
wctrans(3C) - define character mapping
wctype(3C) - define character class
wcwidth(3C) - number of column positions of a wide-character code
WIFEXITED(3UCB) - wait for process to terminate or stop
WIFSIGNALED(3UCB) - wait for process to terminate or stop
WIFSTOPPED(3UCB) - wait for process to terminate or stop
windex(3C) - wide-character string operations
wmemchr(3C) - find a wide-character in memory
wmemcmp(3C) - compare wide-characters in memory
wmemcpy(3C) - copy wide-characters in memory
wmemmove(3C) - copy wide-characters in memory with overlapping areas
wmemset(3C) - set wide-characters in memory
wordexp(3C) - perform word expansions
wordfree(3C) - perform word expansions
wprintf(3C) - print formatted wide-character output
wrindex(3C) - wide-character string operations
wscanf(3C) - convert formatted wide-character input
wscasecmp(3C) - Process Code string operations
wscat(3C) - wide-character string operations
wschr(3C) - wide-character string operations
wscmp(3C) - wide-character string operations
wscol(3C) - Process Code string operations
wscoll(3C) - wide character string comparison using collating information
wscpy(3C) - wide-character string operations
wscspn(3C) - wide-character string operations
wsdup(3C) - Process Code string operations
wslen(3C) - wide-character string operations
wsncasecmp(3C) - Process Code string operations
wsncat(3C) - wide-character string operations
wsncmp(3C) - wide-character string operations
wsncpy(3C) - wide-character string operations
wspbrk(3C) - wide-character string operations
wsprintf(3C) - formatted output conversion
wsrchr(3C) - wide-character string operations
wsscanf(3C) - formatted input conversion
wsspn(3C) - wide-character string operations
wstod(3C) - convert wide character string to double-precision number
wstok(3C) - wide-character string operations
wstol(3C) - convert wide character string to long integer
wstostr(3C) - code conversion for Process Code and File Code
wstring(3C) - Process Code string operations
wsxfrm(3C) - wide character string transformation
 
The mbstowcs and mbtowc functions are Standard C and have been since the original C89 standard, so they are built in and need no external library. They are declared in <stdlib.h>.

On a UNIX-ish machine, type "man mbtowc" on the command line to get info about it.


If I understand correctly, all the wc stuff is what you use to manipulate Unicode strings within your program. A multibyte character file is not in Unicode, but a set of (often) single-byte characters and escape codes, where the escape codes change "states" of the file, and state determines the meaning of each byte. In short, it's a more compressed representation than straight Unicode, especially when your data has large sections that use the same character set.

So you read the multibyte character file into a char buffer. You use mbstowcs to convert the multibyte character string (char*) into a wide character string (wchar_t*).

Once you have a wchar_t representation of your file, you can work with it in your program using the wc library stuff mentioned above (a lot of which is Standard).

And, of course, there are the wcstombs and wctomb functions for changing a wchar_t* into a char* of multibyte characters you can write out.
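
A minimal sketch of that round trip, in case it helps. The helper names mb_to_wide and wide_to_mb are made up for this example; only mbstowcs and wcstombs are the real library calls. It assumes the program has already called setlocale(LC_ALL, "") (from <locale.h>) and that the locale's encoding matches the file's; in the default "C" locale only plain ASCII is guaranteed to convert.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a NUL-terminated multibyte string to a
   newly allocated wide string. Returns NULL on an invalid sequence or
   allocation failure. The caller frees the result. */
wchar_t *mb_to_wide(const char *mb, size_t *out_len)
{
    size_t cap, n;
    wchar_t *wide;

    assert(mb != NULL);
    /* Every wide character comes from at least one byte, so strlen(mb)
       wide characters plus a terminator is always enough room. */
    cap = strlen(mb) + 1;
    wide = malloc(cap * sizeof *wide);
    if (wide == NULL)
        return NULL;

    n = mbstowcs(wide, mb, cap);
    if (n == (size_t)-1) {          /* invalid multibyte sequence */
        free(wide);
        return NULL;
    }
    if (out_len != NULL)
        *out_len = n;               /* wide characters, terminator excluded */
    return wide;
}

/* Hypothetical helper: convert a wide string back to a newly allocated
   multibyte string in the current locale's encoding. */
char *wide_to_mb(const wchar_t *wide)
{
    size_t cap, n;
    char *mb;

    assert(wide != NULL);
    /* No wide character needs more than MB_CUR_MAX bytes. */
    cap = wcslen(wide) * MB_CUR_MAX + 1;
    mb = malloc(cap);
    if (mb == NULL)
        return NULL;

    n = wcstombs(mb, wide, cap);
    if (n == (size_t)-1) {          /* character not representable */
        free(mb);
        return NULL;
    }
    return mb;
}
```

Read the file into a char buffer first (fread in "rb" mode, then add a terminating '\0'), pass that buffer to mb_to_wide, work on the wchar_t string, and use wide_to_mb when you need to write it back out.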
 
Hi,

I have tried the mblen() function in Visual C++ on Windows.
It compiles and builds, but it throws a runtime exception (exits abruptly) exactly when executing mblen().

Does this function work on Windows?

This is the code:

#include <stdio.h>
#include <stdlib.h>
int main()
{
char v;
int vi =0;
FILE *fp;
printf("\n Multi Bytes operation starts here.. \n");

fp = fopen("test.htm","r");

while ((v = getc(fp)) != EOF)
{
vi = mblen((const char *)v, MB_CUR_MAX );
if(vi==0) printf("\n String Error \n");
else if (vi>0) printf("\n Length: %d\n",vi);
else if (vi==-1) printf("\n Invalid multibyte char \n");
}

printf("\n finished here!!! \n");
}
 
You're casting your char to a char* and passing it to mblen. That just gets you a random pointer, which does normally give a runtime error if you try using it.
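
Something along these lines should behave better: give mblen a pointer into a buffer that actually holds the bytes, and step forward by whatever length it reports. This is only a sketch; count_mb_chars is a name I made up, and it assumes the whole file (or line) has already been read into buf and that setlocale(LC_ALL, "") has selected a locale matching the file's encoding.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: count the multibyte characters in buf[0..len).
   Returns -1 if an invalid or truncated sequence is found. */
long count_mb_chars(const char *buf, size_t len)
{
    const char *p = buf;
    const char *end = buf + len;
    long count = 0;

    assert(buf != NULL);
    mblen(NULL, 0);                 /* reset any leftover shift state */

    while (p < end) {
        /* Pass a real pointer and the number of bytes still available. */
        int n = mblen(p, (size_t)(end - p));
        if (n < 0)
            return -1;              /* invalid or truncated sequence */
        if (n == 0)
            n = 1;                  /* a null character occupies one byte */
        p += n;
        count++;
    }
    return count;
}
```

Separately, getc returns an int, so storing its result in a char and comparing it with EOF can end the loop early or never end it; keep the result in an int if you read byte by byte.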
 