Latin-1 -> UTF8

ddrillich · Aug 20, 2007

Good Day,

We have a DB defined to be encoded as Latin-1. We would like to convert it to al32utf8. However, we are starting to realize that even though it's defined as Latin-1, we have within it characters from different encodings. For example, cp1252.

Any ideas how to proceed with this task?

Regards,
Dan

ddrillich · Aug 24, 2007

Good Day,

Our DBA ran an Oracle conversion utility that attempted to convert a sample DB. The conversion was defined as ISO-8859-1 to UTF8. During this conversion, Oracle reported about thousands of characters as exceptions.

Here is part of the report -

User : IKMAPP
Table : IKS_BINARYDOCUMENT
Column: IKS_FILENAME
Type : VARCHAR2(1000)
Number of Exceptions : 5
Max Post Conversion Data Size: 137

ROWID Exception Type Size Cell Data(first 30 bytes)
------------------ ------------------ ----- ------------------------------
AAABKvAAIAAAAfYAAB lossy conversion Rollout instructions for Lindi
AAABKvAAnAAABcnAAI lossy conversion Project Manager – 040003588.pd
AAABKvAAnAAABcnAAL lossy conversion Business Infrastructure Techno
AAABKvAAnAAABcoAAA lossy conversion Business Continuity Relationsh
AAABKvAAnAAABcoAAB lossy conversion Infrastructure and Controls Im
------------------ ------------------ ----- ------------------------------

User : IKMAPP
Table : IKS_BINARYDOCUMENT
Column: IKS_MD5CONTENTHASH
Type : VARCHAR2(1000)
Number of Exceptions : 1568
Max Post Conversion Data Size: 31

ROWID Exception Type Size Cell Data(first 30 bytes)
------------------ ------------------ ----- ------------------------------
AAABKvAAIAAAAdqAAD lossy conversion ¹]]’Ì . èå^>[Å@a
AAABKvAAIAAAAehAAB lossy conversion &¨´¦é÷P2È1i’In"
AAABKvAAIAAAAgDAAG lossy conversion nÁ "ø°&®sa’mûÜ0
AAABKvAAIAAAAgmAAD lossy conversion ’= y»ý é*Õ$¤0³

What does it mean - 'lossy conversion' and how do I find the exact character that caused the exception?

Regards,
Dan

ddrillich · Aug 24, 2007

I picked up one of the ROWIDs and ran -

select count(*) from IKMAPP.IKS_BINARYDOCUMENT where ROWID = chartorowid('AAABKvAAnAAABcnAAI');

Got back -

ERROR at line 1:
ORA-01410: invalid ROWID

Maybe the conversion changed the ROWIDs...

ddrillich · Aug 27, 2007

Dan,

Things got a bit clearer when I pointed to the right instance ;-)

We realized two things using the CSSCAN utility -

1) That most of the exceptions are due to binary data stored as varchar2.

2) The rest of the exceptions seem to be cp1252 related. Meaning, cp1252 characters reached our DB, even though it's defined to be Latin-1.

We found the values using queries such -

select ascii(substr(IKS_FILENAME,31,1)) from IKMAPP.IKS_BINARYDOCUMENT
where rowidtochar(ROWID) = 'AAABKvAAIAAAAfYAAB';

ASCII(SUBSTR(IKS_FILENAME,31,1))
--------------------------------
146

Any suggestions how to proceed? One thing that I wonder about is whether we can redefine the DB to have the cp1252 character set. After all, cp1252 is a Microsoft extension of Latin-1.

Regards,
Dan

SantaMufasa · Aug 31, 2007

Dan,

As I have mentioned in several other character-set-conversion threads, the only way that I have ever found success in converting an existing database is to:

1) create a new, completely separate database that has al32utf8 character-set definition.
2) do user exports from your Latin-1 database.
3) import into the al32utf8 database the dump files from the Latin-1 database.

Let us know your findings.

[santa]

Mufasa
(aka Dave of Sandy, Utah, USA)
[I provide low-cost, remote Database Administration services: www.dasages.com]

ddrillich · Sep 3, 2007

Dave,

Thank you for the information!

One of the issues we have is that even though the character set of the DB is defined as Latin-1, we have within it characters from the cp1252 character set and we don't know how to deal with these characters.

It seems that the root cause of our problem is explained at -

http://www.stanford.edu/dept/itss/docs/oracle/10g/server.101/b10749/ch3globenv.htm#sthref205

In our case, we have the following setting -
NLS_LANG - AMERICAN_AMERICA.WE8ISO8859P1
NLS_CHARACTERSET - WE8ISO8859P1

Meaning, both are Latin-1. The document says the following about this case - When these settings are the same, Oracle assumes that the data being sent or received is encoded in the same character set as the database character set, so validation or conversion is not guaranteed.

Any way out of this situation?

Regards,
Dan

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Latin-1 -> UTF8

ddrillich

Technical User

ddrillich

Technical User

ddrillich

Technical User

ddrillich

Technical User

SantaMufasa

Technical User

ddrillich

Technical User

Similar threads

Part and Inventory Search

Sponsor

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Latin-1 -&gt; UTF8

ddrillich

Technical User

ddrillich

Technical User

ddrillich

Technical User

ddrillich

Technical User

SantaMufasa

Technical User

ddrillich

Technical User

Similar threads

Log in

Part and Inventory Search

Sponsor

Latin-1 -> UTF8