Convert Japanese to Ascii

MikeAJ · Dec 28, 2004

Hi, I have a database that contains both Japanese and English characters. I'm trying to convert the Japanese characters into their ASCII equivalents, so that I would have something like "&#26082;" instead of the literal Japanese character. That way, when I print it to a browser, I can simply declare UTF-8 and everything will work fine. Here's the code I've tried so far:

DECLARE
  v_word   VARCHAR2(200);
  v_len    NUMBER;
  v_result VARCHAR2(5000);
BEGIN
  SELECT Customer_Name
    INTO v_word
    FROM Contacts
   WHERE Language='JA';
  v_len := LENGTH(v_word);
  FOR i IN 1..v_len LOOP
    v_result := v_result || '&#' || ASCII(SUBSTR(v_word,i,1)) || ';';
  END LOOP;
  dbms_output.put_line(v_result);
END;

This returns code that, when viewed in the browser, appears to be half of each word followed by a ?. For example, the first ASCII code I get back is "&#15114410;"; if you view it in a browser you will see what I mean. I know that Japanese is a double-byte language, but I don't know how to return the correct ASCII code for each word. I tried increasing the substring to two characters in length, but that didn't help either. Does anyone know how to accomplish this?

Thanks.
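
The value 15114410 looks like the three UTF-8 bytes of one character packed into a single integer, which is what Oracle's ASCII function appears to return for a multibyte character in a UTF8 database. A minimal Python sketch of that arithmetic, assuming the bytes 230, 160, 170 that open the DUMP output posted later in the thread (the helper name pack_bytes is made up for illustration):

```python
def pack_bytes(byte_values):
    """Combine byte values into one big-endian integer."""
    result = 0
    for b in byte_values:
        result = result * 256 + b
    return result

packed = pack_bytes([230, 160, 170])
print(packed)              # 15114410
print(packed > 0x10FFFF)   # True: beyond the largest Unicode code point
```

Because the number is larger than any Unicode code point, a browser cannot map "&#15114410;" to a character.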

ddrillich · Dec 29, 2004

Good Day,

ASCII specifies 128 characters as you can see at

http://www.asciitable.com.

Unfortunately the Japanese character set is not included ;-)

The first step in your case would probably be finding out what the DB character set is.
You can use:

SELECT SYS_CONTEXT ('USERENV', 'LANGUAGE') FROM DUAL

Regards,
Dan

MikeAJ · Dec 29, 2004

If you put this "&#20837;&#21147;&#12434;&#30465;&#30053;&#12377;&#12427;" into the source of a webpage and view it in a browser, you will see Japanese (入力を省略する). Maybe I was mistaken and this isn't the ASCII code, but how can I convert the Japanese in my tables to the code above?

ddrillich · Dec 29, 2004

Hi Mike,

No doubt, I saw the Japanese characters. These entities (&#20837;, etc.) are the Unicode values of the characters.

When I view the character &#15114410; in the browser I see only the question mark.

Can you please be kind enough to run the above SQL statement?

Thanks,
Dan
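
To illustrate the difference between the two references, here is a small Python sketch. It uses the standard html module only as a stand-in for what a browser does when it resolves a numeric character reference; the variable names are illustrative.

```python
import html

valid = html.unescape("&#20837;")        # a real Unicode code point
invalid = html.unescape("&#15114410;")   # larger than any code point

print(valid, ord(valid))      # 入 20837
print(invalid == "\ufffd")    # True: replaced by U+FFFD, the replacement character
```

The second reference cannot be resolved, which matches the question mark seen in the browser.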

MikeAJ · Dec 29, 2004

Thanks, Dan. The query above returns "AMERICAN_AMERICA.UTF8".

ddrillich · Dec 29, 2004

Mike,

OK, so your database is configured to hold UTF-8 data. UTF-8 is one encoding of the Unicode character set.

You are right that Japanese is a multi-byte language. The thing is that with UTF-8 the number of bytes per character is not fixed: it can be one byte for an ASCII character and several bytes in other cases.

Can you please show us the output of the following query?

SELECT dump(Customer_Name)
FROM Contacts
WHERE Language='JA'

BTW, Oracle 9i introduced serious support for Unicode.
One of the features is a new datatype called NVARCHAR2, which is used to store Unicode variable-length data. Oracle also introduced several conversion functions, such as UNISTR, COMPOSE and DECOMPOSE.

-- Dan
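
As a rough illustration of the variable length, the following Python sketch encodes three sample characters and prints how many bytes each one takes. The sample characters and the helper name utf8_bytes are chosen for illustration only.

```python
def utf8_bytes(ch):
    """Return the UTF-8 encoding of one character as a list of byte values."""
    return list(ch.encode("utf-8"))

for ch in ["A", "é", "入"]:
    encoded = utf8_bytes(ch)
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s) {encoded}")

# U+0041: 1 byte(s) [65]
# U+00E9: 2 byte(s) [195, 169]
# U+5165: 3 byte(s) [229, 133, 165]
```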

MikeAJ · Dec 29, 2004

It should be 3 bytes per character, because I got 36 for lengthb(customer_name) and 12 for length(customer_name). Here's the result of the dump:

Typ=1 Len=36: 230,160,170,229,188,143,228,188,154,231,164,190,227,128,128,230,156,172,231,148,176,230,138,128,232,161,147,231,160,148,231,169,182,230,137,128
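
One way to sanity-check a dump like this outside the database is to decode the byte list as UTF-8 and print the resulting code points as numeric character references. This is a sketch in Python, not something the thread itself ran; the variable names are illustrative.

```python
dump_bytes = [
    230, 160, 170, 229, 188, 143, 228, 188, 154, 231, 164, 190,
    227, 128, 128, 230, 156, 172, 231, 148, 176, 230, 138, 128,
    232, 161, 147, 231, 160, 148, 231, 169, 182, 230, 137, 128,
]

name = bytes(dump_bytes).decode("utf-8")
references = "".join(f"&#{ord(ch)};" for ch in name)

print(len(dump_bytes), len(name))   # 36 12
print(references)                   # starts with &#26666;&#24335;
```

The counts agree with the lengthb and length results quoted above.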

ddrillich · Dec 30, 2004

Hi Mike,

The data looks good to me. Note that every three-byte sequence starts with a byte whose value is greater than 127. If the first byte held a value of 127 or less, it would be a one-byte character, which is the case where UTF-8 and ASCII share the same bit representation. That is clearly not our case.

The problem of not displaying these characters correctly in the browser might lie with the web server configuration or with the browser set-up.

As for the browser: if it is IE, View > Encoding should show Unicode (UTF-8).

Regards,
Dan
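
The rule about the first byte can be made more precise: in UTF-8 the leading byte also tells you how long the sequence is. A minimal Python sketch of that check follows; the function name sequence_length is made up for illustration.

```python
def sequence_length(lead_byte):
    """Return how many bytes a UTF-8 sequence occupies, judged by its first byte."""
    if lead_byte < 0x80:
        return 1          # 0xxxxxxx: same as ASCII
    if 0xC0 <= lead_byte < 0xE0:
        return 2          # 110xxxxx
    if 0xE0 <= lead_byte < 0xF0:
        return 3          # 1110xxxx
    if 0xF0 <= lead_byte < 0xF8:
        return 4          # 11110xxx
    raise ValueError(f"{lead_byte} is a continuation byte or an invalid lead byte")

print(sequence_length(65))    # 1
print(sequence_length(230))   # 3
```

Every lead byte in the dump falls between 224 and 239, so each character occupies three bytes.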

MikeAJ · Dec 30, 2004

The data does show up in a browser as long as we pass it the correct charset in the page headers. This is a legacy system that I inherited, and we no longer want to store the Japanese characters themselves, but rather the Unicode equivalent. So I'm trying to see whether there's an update command that can be run to change the characters in the database, or whether I could convert them to Unicode with a select statement at delivery time to a webpage. That way we could always keep our charset as UTF-8 and we would no longer have to change our DAD and headers. So will any of the functions you mentioned above help in converting these characters? Thanks for your help.

Mike

ddrillich · Dec 30, 2004

> and we no longer want to store the Japanese characters themselves, but rather the Unicode equivalent.

OK, but it seems that your DB is set for UTF-8, and the data looks like correctly encoded UTF-8.
As I said before, UTF-8 is an implementation of the Unicode character set. You can't store Unicode codes in the DB as such; you need to choose between UTF-8, UTF-16 or UCS-2 (Microsoft's implementation).

About changing the headers: either you specify the encoding you use in the headers, or you set it up in the browser. But you need to do one of them.

MikeAJ · Dec 30, 2004

Here is a record exactly as it's stored that appears to have Unicode and HTML in it, so maybe I'm not understanding you.

&#12497;&#12473;&#12527;&#12540;&#12489;<br>
<input type="Password" size="15" name="password">

However, we also have records that store the Japanese characters and HTML. So, is there a way to convert the Japanese into something similar to what you see above?

ddrillich · Dec 30, 2004

I think I understand you now. You have the DB configured for UTF-8, but you want to use the Unicode character references that browsers support.
It looks like a waste to me. Look how many bytes you need in order to represent one Japanese letter: &#12497; takes 8 bytes. But it is your choice. BTW, what about user input forms?

One good function for conversion would be CONVERT:

http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96540/functions22a.htm#SQLRF00620

-- Dan
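
The size difference is easy to check. This Python sketch compares the numeric character reference for the katakana letter パ with its UTF-8 encoding; the variable names are illustrative.

```python
ch = "パ"
reference = f"&#{ord(ch)};"

reference_size = len(reference.encode("ascii"))   # bytes used by the reference
utf8_size = len(ch.encode("utf-8"))               # bytes used by the raw character

print(reference, reference_size, utf8_size)       # &#12497; 8 3
```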

ddrillich · Dec 30, 2004

Mike,

One reason we have terminology issues here is that these Unicode character references are often called HTML escape sequences:

http://www.bbsinc.com/symbol.html

About the conversion - you can probably do it with the programming language you use to generate the web pages.

-- Dan
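
As a sketch of what such a conversion might look like in the page-generating language, here is a Python version. It assumes the text has already been read from the database as a Unicode string; the function name to_char_refs is hypothetical.

```python
def to_char_refs(text):
    """Replace every non-ASCII character with a decimal numeric character reference."""
    parts = []
    for ch in text:
        code_point = ord(ch)
        if code_point < 128:
            parts.append(ch)                  # ASCII, including HTML markup, passes through
        else:
            parts.append(f"&#{code_point};")
    return "".join(parts)

converted = to_char_refs("パスワード<br>")
print(converted)   # &#12497;&#12473;&#12527;&#12540;&#12489;<br>

# Python's built-in error handler produces the same result.
builtin = "パスワード<br>".encode("ascii", "xmlcharrefreplace").decode("ascii")
print(builtin == converted)   # True
```

The markup is left untouched, so records that mix Japanese text and HTML can be passed through the same function.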

ddrillich · Dec 31, 2004

Mike,

I looked around a bit and here is what I found:

Conversions tables and formulas -

http://www1.tip.nl/~t876506/utf8tbl.html

I took your first utf-8 character and applied the formula on it: 230,160,170

ud = (230-224)*4096 + (160)*64 + (170-128) = 34,858

It looks good: &#34858; gives [蠪]!

Some code converters:

PHP -

http://www.zend.com/codex.php?id=835&single=1

Java -

http://www.jguru.com/faq/view.jsp?EID=137049

C -

http://developer.novell.com/ndk/doc/samplecode/clib_sample/intl_utf8/utf8.c.html

Please read the following -

http://gnomedesktop.org/node/1483

The fellow says:

Just when all western Europe (and the US) had ad-hoc agreed on 8859-1 and we could all get a big push so applications at least didn't strip the eighth bit from strings (this is still in common use, we're far from achieving that goal on Linux distributions), someone decided that it needs to be PERFECT.

Isn't the internet amazing?

Regards,
Dan

ddrillich · Jan 1, 2005

Just tried the Java program:

1) 34,858 -> Hex 882a, using

http://www.afineride.com/dechex.html

2) Placed \u882a in the Java program and got the utf-8 hex value: e8 a0 aa which is 232 160 170 based on this calculator.

So, somehow 230 became 232. Go figure ;-)

-- Dan
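
The jump from 230 to 232 has a simple explanation: in the formula used earlier, the second byte (160) was multiplied by 64 without first subtracting 128, although it is a continuation byte just like the third one. A Python sketch of the three-byte formula with that correction, using a made-up function name:

```python
def decode_three_byte(b1, b2, b3):
    """Decode one 3-byte UTF-8 sequence into its Unicode code point."""
    return (b1 - 224) * 4096 + (b2 - 128) * 64 + (b3 - 128)

code_point = decode_three_byte(230, 160, 170)
print(code_point, hex(code_point), chr(code_point))   # 26666 0x682a 株

# Round trip: the corrected value encodes back to the original bytes,
# while 34858 encodes to the bytes the Java program reported.
print(list(chr(code_point).encode("utf-8")))   # [230, 160, 170]
print(list(chr(34858).encode("utf-8")))        # [232, 160, 170]
```

So the first character in the dump is &#26666; (株) rather than &#34858;.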

MikeAJ · Jan 4, 2005

Dan, thanks for all of your replies. I'm now headed in the right direction to get this done.
