Convert Japanese to Ascii

Status
Not open for further replies.

MikeAJ

Programmer
May 22, 2002
108
US
Hi, I have a database that contains both Japanese and English characters. I'm trying to convert the Japanese characters into their ASCII equivalent so that I would have something like "&#26082;" instead of the literal Japanese character; that way, when I print it to a browser, I can simply declare UTF-8 and everything will work fine. Here's the code I've tried so far to get this done.

DECLARE
v_word VARCHAR2(200);
v_len NUMBER;
v_result VARCHAR2(5000);
BEGIN
SELECT Customer_Name
INTO v_word
FROM Contacts
WHERE Language='JA';
v_len := LENGTH(v_word);
FOR i IN 1..v_len LOOP
v_result := v_result || '&#' || ASCII(SUBSTR(v_word,i,1)) || ';';
END LOOP;
dbms_output.put_line(v_result);
END;

This is returning code that, when viewed in the browser, appears to be half of each word followed by a ?. For example, the first ASCII code I get back is "�"; if you view it through a browser you will see what I mean. I know that Japanese is a double-byte language, but I don't know how to return the correct ASCII code for each character. I tried increasing the substring to two characters in length, but that didn't help either. Does anyone know how to accomplish this?

Thanks.
 
Good Day,

ASCII specifies only 128 characters. Unfortunately, the Japanese character set is not included ;-)

The first step in your case would probably be to find out what the DB character set is.
You can use:

SELECT SYS_CONTEXT ('USERENV', 'LANGUAGE') FROM DUAL

Regards,
Dan
 
If you put this "&#20837;&#21147;&#12434;&#30465;&#30053;&#12377;&#12427;" into the source of a webpage and view it in a browser, you will see Japanese. Maybe I was mistaken and this isn't the ASCII code, but how can I convert the Japanese in my tables to the code above?
 
Hi Mike,

No doubt, I saw the Japanese characters. These entities (&#20837;) are the Unicode values of the characters.

When I view the character � in the browser I see only the question mark.

Can you please be kind enough to run the above SQL statement?

Thanks,
Dan
 
Thanks Dan, The query above returns "AMERICAN_AMERICA.UTF8".
 
Mike,

OK, so your database is configured to hold UTF-8 data. UTF-8 is one encoding of the Unicode character set.

You are right that Japanese is a multi-byte language. The thing is that with UTF-8 the number of bytes per character is not fixed. It can be one byte for an ASCII character and several bytes in other cases.
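To make the variable byte count concrete, here is a minimal sketch in Python. Python is not used anywhere in this thread; it is chosen here only because it makes the bytes easy to inspect. The sample characters are assumptions picked for illustration, apart from the kanji U+65E2, which is the one behind the "&#26082;" example in the first post.

```python
# Three sample characters: plain ASCII "A", Latin-1 e-acute (U+00E9),
# and the kanji U+65E2 (decimal 26082) mentioned in the first post.
samples = ["A", "\u00e9", "\u65e2"]


def utf8_length(ch: str) -> int:
    """Return how many bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


byte_lengths = {ch: utf8_length(ch) for ch in samples}

for ch, n in byte_lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")
```

The same idea is what LENGTH versus LENGTHB reports on the database side: characters versus bytes.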

Can you please show us the output of the following query?

SELECT dump(Customer_Name)
FROM Contacts
WHERE Language='JA'

BTW, Oracle 9i introduced serious support for Unicode.
One of the features is the NVARCHAR2 datatype, which from 9i on is dedicated to storing Unicode variable-length data. Oracle also introduced several conversion functions, such as UNISTR, COMPOSE and DECOMPOSE.

-- Dan
 
It should be 3 bytes per character, because I got 36 for lengthb(customer_name) and 12 for length(customer_name), but here's the result of the dump:

Typ=1 Len=36: 230,160,170,229,188,143,228,188,154,231,164,190,227,128,128,230,156,172,231,148,176,230,138,128,232,161,147,231,160,148,231,169,182,230,137,128
 
Hi Mike,

The data looks good to me. Please note that every three-byte group starts with a byte whose value is greater than 127 (here always between 224 and 239, which marks the start of a three-byte sequence). If the first byte held a value of 127 or less, it would be a one-byte character, since in that range UTF-8 and ASCII have the same bit representation. But obviously that's not our case.

The problem of not displaying these characters correctly in the browser might be with the web server configuration or with the browser set-up.

About the browser: if it is IE, View > Encoding should show Unicode (UTF-8).
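As a rough cross-check of this reading of the dump, the following Python sketch (again, Python is only an illustration and not part of the original setup) groups the posted byte values by their lead byte and then decodes them. The byte list is copied from the DUMP output above; the function name `split_utf8` is made up for this example.

```python
# Byte values copied from the DUMP(Customer_Name) output posted above.
dump_bytes = [
    230, 160, 170, 229, 188, 143, 228, 188, 154, 231, 164, 190,
    227, 128, 128, 230, 156, 172, 231, 148, 176, 230, 138, 128,
    232, 161, 147, 231, 160, 148, 231, 169, 182, 230, 137, 128,
]


def split_utf8(data):
    """Group raw byte values into one list per character, based on the lead byte."""
    groups = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:              # 0xxxxxxx: single-byte (ASCII) character
            size = 1
        elif lead >> 5 == 0b110:     # 110xxxxx: two-byte sequence
            size = 2
        elif lead >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
            size = 3
        elif lead >> 3 == 0b11110:   # 11110xxx: four-byte sequence
            size = 4
        else:
            raise ValueError(f"byte {lead} at offset {i} is not a valid lead byte")
        groups.append(data[i:i + size])
        i += size
    return groups


groups = split_utf8(dump_bytes)
text = bytes(dump_bytes).decode("utf-8")   # fails loudly if the data is not valid UTF-8
code_points = [ord(ch) for ch in text]

print(len(groups), "characters in", len(dump_bytes), "bytes")
print(code_points)
```

The 36 bytes split into 12 groups of three, which agrees with the LENGTH and LENGTHB figures reported above.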

Regards,
Dan
 
The data does show up in a browser as long as we pass it the correct charset in the page headers. This is a legacy system that I inherited, and we no longer want to store the Japanese characters themselves, but rather the Unicode equivalent. So I'm trying to see if there's an update command that can be run to change the characters in the database, or if at delivery time to a webpage I could change them to Unicode with a select statement. That way we could always keep our charset as UTF-8 and we would no longer have to change our DAD and headers. So will any of the functions you mentioned above help in converting these characters? Thanks for your help.

Mike
 
> and we no longer want to store the Japanese characters themselves, but rather the Unicode equivalent.

OK, but it seems that your DB is set for UTF-8 and the data looks like correctly encoded UTF-8 data.
As I said before, UTF-8 is one encoding of the Unicode character set. You can't store abstract Unicode codes in the DB as such. You need to choose an encoding: UTF-8, UTF-16 or UCS-2 (the one used by older Microsoft systems).

About changing the headers – either you specify the encoding you use in the headers, or you set it up in the browser. But, you need to do one of them.
 
Here is a record exactly as it's stored that appears to have unicode and html in it, so maybe I'm not understanding you.

&#12497;&#12473;&#12527;&#12540;&#12489;<br>
<input type="Password" size="15" name="password">

However, we also have records that store the Japanese characters and HTML. So, is there a way to convert the Japanese into something similar to what you see above?
 
I think I understand you now. You have the DB configured for utf-8 but you want to use the Unicode character references which the browsers support.
It looks to me like a waste. Look how many bytes you need to use in order to represent one Japanese letter: &#12497; - 8 bytes. But it is your choice. BTW, What about user input forms?
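The size difference can be checked with a few lines of Python (used here purely as an illustration; the variable names are made up for this sketch). It compares the reference from the stored record with the UTF-8 encoding of the character it stands for.

```python
reference = "&#12497;"     # numeric character reference taken from the stored record
character = chr(12497)     # the single character that reference stands for (U+30D1)

ncr_bytes = len(reference.encode("ascii"))    # bytes needed to store the reference text
utf8_bytes = len(character.encode("utf-8"))   # bytes needed to store the character in UTF-8

print(f"reference: {ncr_bytes} bytes, UTF-8: {utf8_bytes} bytes")
```

So for this character the reference form takes 8 bytes against 3 for the UTF-8 form.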

One good function for conversion would be CONVERT.

-- Dan
 
Mike,

One reason we have terminology issues here is that these Unicode character references are often called HTML escape sequences.
About the conversion: you can probably do it with the programming language you use to generate the web pages.
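As one possible shape for such a conversion, here is a minimal Python sketch. It assumes the page generator already has the text as a properly decoded Unicode string; Python and the function name `to_numeric_references` are assumptions for illustration, not what the system in this thread uses. ASCII characters, including any HTML markup, are passed through untouched.

```python
def to_numeric_references(text: str) -> str:
    """Replace every non-ASCII character with a decimal reference such as &#12497;."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


# Five katakana characters followed by a tag, written with escapes so this
# source file stays pure ASCII.
sample = "\u30d1\u30b9\u30ef\u30fc\u30c9<br>"

converted = to_numeric_references(sample)

# The standard library offers the same behaviour through an error handler.
builtin = sample.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(converted)
```

For this sample the output has the same form as the record quoted earlier in the thread, with the markup left as it was.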

-- Dan
 
Mike,

I looked around a bit and here is what I found:

Conversion tables and formulas.
I took your first UTF-8 character and applied the formula to it: 230,160,170

ud = (230-224)*4096 + (160)*64 + (170-128) = 34,858

It looks good: &#34858; - [&#34858;] !

Some code converters:

PHP, Java, C.
Please read the following. The fellow says:
Just when all western Europe (and the US) had ad-hoc agreed on 8859-1 and we could all get a big push so applications at least didn't strip the eighth bit from strings (this is still in common use, we're far from achieving that goal on Linux distributions), someone decided that it needs to be PERFECT.

Isn't the internet amazing?

Regards,
Dan
 
Just tried the Java program:

1) 34,858 -> hex 882a
2) Placed \u882a in the Java program and got the utf-8 hex value: e8 a0 aa which is 232 160 170 based on this calculator.

So, somehow 230 became 232. Go figure ;-)
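The 230 versus 232 mismatch appears to come from the arithmetic rather than from the Java program: in the formula as posted, the middle byte (160) was multiplied by 64 without first subtracting 128. With that subtraction the result is 26,666 (hex 682a) instead of 34,858 (hex 882a). The Python sketch below works through both versions; the function names are made up for this illustration, and the formula covers only three-byte sequences.

```python
def decode_three_byte(b1: int, b2: int, b3: int) -> int:
    """Code point of a three-byte UTF-8 sequence (lead byte in 224..239)."""
    return (b1 - 224) * 4096 + (b2 - 128) * 64 + (b3 - 128)


def encode_three_byte(code_point: int) -> list:
    """UTF-8 byte values for a code point in the three-byte range (0x0800..0xFFFF)."""
    return [
        224 + (code_point >> 12),          # top 4 bits go into the lead byte
        128 + ((code_point >> 6) & 63),    # middle 6 bits
        128 + (code_point & 63),           # low 6 bits
    ]


correct = decode_three_byte(230, 160, 170)
as_posted = (230 - 224) * 4096 + 160 * 64 + (170 - 128)   # 128 not subtracted from 160

print(correct, hex(correct), encode_three_byte(correct))
print(as_posted, hex(as_posted), encode_three_byte(as_posted))
```

Encoding 26,666 gives back 230,160,170, while encoding 34,858 gives 232,160,170, which is exactly the value the Java test produced. So the reference for the first character of this record would be &#26666; rather than &#34858;.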

-- Dan
 
Dan, thanks for all of your replies. I'm now in the right direction to get this done.
 