Using Arabic Script in a C program

Status
Not open for further replies.

marxs

Programmer
Feb 19, 2009
6
Hi, I would like to write a program to keep track of a fairly large number of Arabic vocabulary words and their English equivalents. The required functionality is quite simple, but I am having major difficulties with Arabic script. I have never used Unicode in a C program before, and although I have done a lot of research, I have not yet been able to display a single Arabic character on the screen. I have tried wide and multibyte characters under UTF-8, UTF-16 and UCS-2, and many of the functions in wchar.h of course, but there is something I have misunderstood or am missing altogether. I have been able to print wide characters using their Unicode code points for the Basic Latin alphabet and about 100 or 200 symbols thereafter, but at a certain point the characters begin to repeat the same sequence over and over again. Can anyone lend me some help with this problem? Thank you.
 
> but at a certain point the characters begin to repeat the same sequence over and over again.
Sounds like a bug in your code, nothing more, nothing less.

Bugs in memory allocation or string handling are common.


--
If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
 
You've misunderstood the problem I'm having. I don't have any code; I haven't begun to write the program yet. Before I can do that, I have to figure out how to display Arabic characters in C, which I have not yet been able to do. Not because of any particular bug, but because I lack experience with anything other than the Latin alphabet in a program. I have made numerous attempts, and I am asking here for advice and pointers, perhaps from someone who has used C to display international characters and knows the area better than I do.
 
> but at a certain point the characters begin to repeat the same sequence over and over again.
This implies you've got code which doesn't work.
So post it.

The fact that you've managed to display some characters successfully means you're probably on the right track. The rest is detail.


--
If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
 
Where are you printing them? On the console, or in a graphical program?

Is this for Linux or Windows?

Each Arabic character has 4 forms: standalone, buddy on the left, buddy on the right and buddy on both sides. These should all be in the Unicode character set. Normally when you type them in, they will be in the standalone form (0x061F-0x06DF). The printing forms are at 0xFE70-0xFEFC.
 
Oh yes, this is on the console, in Windows or Linux; I've tried both. Here is my code, or what is left of it after a large amount of experimentation:

Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <stddef.h>
#include <wchar.h>

#define NUM_ARGS 1
#define INPUT_LENGTH 256
FILE *popen(const char *command, const char *type);
FILE *input = NULL;

int main(int argc, char * argv[])
{
  
  setlocale(LC_ALL, "ar_AE");
  
  //setlocale(LC_ALL, "ar_AE.utf8");
  //fwide(stdout, 1);
  //wchar_t c = \u0639;
  
  //1607  When wchar is used instead of an int, many of the symbols are
  // displayed as the same 'u' character with an umlaut.
  
  int b = 2290;
  int i = 0;

  // This displays (most of) the basic latin alphabet 0x0000-0x007F
  for(i = 33; i <= 126; i++){
    fprintf(stdout, "%c  ", i);
  }

  fprintf(stdout, "\n\n\nBreak\n\n\n\n");

  // This displays most of the Latin-1 supplement 0x0080-0x00FF
  for(i = 49825; i <= 50000; i++){
    fprintf(stdout, "%c  ", i);
  }

  fprintf(stdout, "\n\n\nBreak\n\n\n\n");

  // This character does not display
  fprintf(stdout, "\n\n\n%c\n", 639);

  // This loop merely tries to display characters above basic latin-1
  // supplement, but repeats the same string of basic latin.
  for(i = 50342; i <= 60000; i++){
    fprintf(stdout, "%c  ", i);
  }
  
  return EXIT_SUCCESS;
  
}

The repeating sequence of characters and the repeating umlauted u's led me to believe that I lack a certain font or something, but the same thing happens when I try printing Arabic characters in the same way as above, and the computers I'm working on have properly working Arabic fonts and capabilities.
 
Using a text editor, I created a simple file containing a ج character and saved it in a variety of encodings.
Code:
test2-ucs2.txt
000000 48 00 65 00 6c 00 6c 00 6f 00 0a 00 2c 06 0a 00  >H.e.l.l.o...,...<
000010
test2-utf-16be.txt
000000 00 48 00 65 00 6c 00 6c 00 6f 00 0a 06 2c 00 0a  >.H.e.l.l.o...,..<
000010
test2-utf-16le.txt
000000 48 00 65 00 6c 00 6c 00 6f 00 0a 00 2c 06 0a 00  >H.e.l.l.o...,...<
000010
test2-utf8.txt
000000 48 65 6c 6c 6f 0a d8 ac 0a                       >Hello....<
000009
$ cat test2-utf8.txt
Hello
ج
$

I then replicated the UTF-8 file using program code.
Code:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main ( ) {
    int     n;
    wchar_t myChar = 0x062c;
    FILE    *fp;

    // I don't have AE, but it's just the utf8 we're after
    if ( setlocale(LC_ALL, "en_GB.utf8") == NULL ) {
        fprintf(stderr,"Failed to set locale\n" );
        return 1;
    }

    fp = fopen("test2-byprog.txt","w");
    if ( fp == NULL ) {
        perror("Unable to open file");
        return 1;
    }

    n = fwide( fp, 1 );
    if ( n <= 0 ) {
        fprintf(stderr,"Failed to set wide mode, result=%d\n", n );
        return 1;
    }

    //fwprintf(fp, L"Hello\n\u062c\n" );      // universal character encoded in string
    //fwprintf(fp, L"Hello\n%C\n", myChar );  // uppercase-C for a single wide char
    fwprintf(fp, L"Hello\n%lc\n", myChar ); // modifier on lowercase-c

    fclose( fp );
    return 0;
}
Any of the three fwprintf() lines will have the same effect, so you have a number of choices as to how to generate your output strings.

The resulting file matches the UTF8 encoded file from step 1.
Code:
$ gcc -std=c99 prog.c ; ./a.out ; od -Ax -t x1z test2-byprog.txt
000000 48 65 6c 6c 6f 0a d8 ac 0a                       >Hello....<
000009
$ cat test2-byprog.txt 
Hello
ج
$
I think the key thing to note is that you need to use the 'w' (wide) output functions. The old narrow functions with, say, %c convert their argument to unsigned char, so anything larger is truncated to a single byte.

--
If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
 
Thank you very much for the help. I ran the code on Linux as well as Windows XP, but neither displayed the jeem character. Under Linux, with either "en_GB.utf8" or "ar_AE.utf8", 'd8 ac' was displayed as Ø¬, as 'd8' and 'ac' in the Latin-1 Supplement. Windows did not recognize the same locales, but under the "arabic" locale, the following was displayed:

Code:
Hello
u062c
Hello
,
Hello
Ì

I'm confused as to why this is; do you have any idea? Thanks again for the help. Also, I should mention that I am running Linux on Windows XP; I'm not sure if that makes a difference.
 
> was displayed as Ø¬, as 'd8' and 'ac' in the Latin-1 Supplement.
Well, it certainly isn't being interpreted as a UTF-8 stream in that case.

For your Linux box, enter the 'locale' command at the prompt. I'm guessing it's just the 'C' locale. Here, it's en_US.utf8.

My Linux rig at the moment is a vmware instance of Ubuntu 8.04.
gcc is 4.2.4

Which compiler/version are you using on windows?


--
If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
 
Do you have "Install files for complex scripts" from Regional Settings installed? You can't display joined-up Arabic without it.

 
It's using the POSIX locale, and 'locale -a' returns a million locales, including all those used above. I'm also using a vmware image; gcc is 3.4.6.

On Windows I'm using Microsoft Visual C++ 6.0. Also, yes, the files for complex scripts have been installed under Regional Settings, and Arabic works properly on both Linux and Windows.
 
I'm pretty sure you're going to need a UTF8 locale to be able to simply print UTF8 encoded text streams.

As for windows, VC6 is pretty old (it was released over a decade ago).

Free "Express" editions of more up-to-date Microsoft compilers are available, the 2008 version being the latest. I would guess it has much better locale support.


--
If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
 
I've downloaded the 2008 Express edition, but with only a slightly different result. When running the same code under the 'arabic' locale, the following is printed to the file:

Code:
Hello
Ì
Hello
,
Hello
Ì

It still fails to load any .utf8 locale I can think of; I don't know how to find a list of the available locales on Windows. What I don't understand, however, is why this fails on Linux as well, where the locales (utf8 or not) are definitely available and setting a utf8 locale does not fail, yet the characters are still printed strangely and not as UTF-8.
 