Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations IamaSherpa on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Encoding issue on upgrade

Status
Not open for further replies.

Crundy

Programmer
Jul 20, 2001
305
GB
Hi all,
I moved a site across from one server running MySQL 4 to another running MySQL 5 and forgot to check the collation / encoding defaults on the servers. I'm therefore getting odd characters showing up in my web pages. As data has been entered into the new system since the restore I don't think I can redo the import. Is there a way to fix the encoding issue without losing any new data?

C:\DOS:>
C:\DOS:>RUN
RUN DOS RUN!!
 
I would say probabbly.
Off the top of my head you would need to identify what the collation/encoding sequence is on the old and the new.
You only see odd characters and not complete garbage which is good, in general ASCII is ASCII (e,g, I bet the text fred looks the same on both machines)just a few characters change.
Get hold of the representations in both lists, this might not be as easy as I think and might need some google dtyle research. Once you have the two lists compare them, side by side and identify which characters are differecnt. Once you have this we can move onto putting in right.
I think you have three methods avaiable to you. Firstly you could write a program (PHP might do if a little messy with loe level stuff, but it should be possible.) which reads each table and byte for byte in each text field replace the relevant bytes. You might be able to create a generic script rather than a script for each table. The logic would be something like (for each text based value)
Code:
for i = each byte in string
 if char(string[i]) = "an old collation value which is different"
 then 
    char(string[i]) = "the new collation value which is the old value"
next i
In essence you know the bytes which have changed, and you know what it was and what it now needs to be so you should be able to swap them.

Secondly you could dump the database and use the output files to do the same logc as above.
Which method you use depends on how long you can have the database doen, how big it is, how many tables/strings there are.
Thridly you might be able to manipulate the values using some SQL or a stored proc, I would have to research that.
Do you have aby numeric fields which are spoilt ?
.
If you can post the collation/char sets you have I'd be interested to see the differecnes.
And as ever everyone will be happy to help out.
 
Hello,
The original dump was in latin1, and I did specify this in the mysqldump command:

/usr/bin/mysqldump -pmypassword --default-character-set=latin1 --port=3306 --host= -u myuser databasename

And in the past this has been fine when restoring. There are loads of different characters, in particular foreign accents, so I can't really go through character by character.

Is there an easier way to change the encoding of the table and retain the characters or is the data screwed now?

C:\DOS:>
C:\DOS:>RUN
RUN DOS RUN!!
 
Hmm, using SHOW VARIABLES on both servers shows the same character sets and collations (latin1 and latin1_swedish_ci).

So now I'm a little confused :\

C:\DOS:>
C:\DOS:>RUN
RUN DOS RUN!!
 
locate a duff record in the new server and find the equivilent record in the dump file on the original machine and the new one (assuming different machines and you FTP the file accross). Do you see anything unusual?
I wonder if you have encoding issues within the op system, what are the two op systems ?
 
I've tried dumping and restoring one of the tables specifying the encoding on both and ommitting it on both and still get the same problem, so I suspect you are right.

How do I check the system character sets and install any missing ones?

C:\DOS:>
C:\DOS:>RUN
RUN DOS RUN!!
 
I'm therefore getting odd characters showing up in my web pages.

And it is a long way from your database to the browser. First, let me tell you where it can go wrong. I make a difference between texts (character strings that should be human readable and therefore have an encoding) and strings (just sequences of bytes). The problem is that texts are almost always sent and stored as strings and the encoding travels separate. All systems have their own way of communicating the encoding.

[ul][li]The database field (, table, database. those merely act as default values for the fields).[/li]
[li]the database connection. If you do not send the encoding directly after connecting, the default is latin-1. So if you want to run everything in utf-8, send the command: SET NAMES utf8; directly after connecting to the database. It is a common beginner's fault to omit this. Alas, this can not be configured in my.cnf or my.ini. The encoding there is only for the command-line client.[/li]
[li]your website itself. For instance, PHP does not send an encoding by default, so it is latin-1. You can change that in php.ini or send a header yourself (Content-type: text/html; charset=utf-8)[/li]
[/ul]

"odd characters" can go both ways. If you send utf-8 encoded texts with a marker "this is latin-1", then you get à and ã characters.
If you send latin-1 encoded text marked as "this is utf-8", then you usually get squares or question marks.

But what is in your database? All latin-1 characters are one byte long, but special characters in utf-8 are longer. So if you use the HEX function and count the bytes, you can see what is actually stored.

I think there is a CONVERT USING function in MySQL that allows you to run UPDATE queries to change the contents of the fields. If you want to change an entire table to utf-8, including its contents and its structure defaults, use ALTER TABLE ... CONVERT TO utf8;

Good luck!


+++ Despite being wrong in every important aspect, that is a very good analogy +++
Hex (in Darwin's Watch)
 
OK, that makes sense. If I put:
header('Content-Type: text/html; charset=windows-1252');
at the top of the PHP page then it displays fine. I'm guessing that either Apache or PHP on the old server was configured to use the above charset but the new server is set to use UTF-8 by default. I'll have to find out which one it is and change it.

Thanks for the help.

C:\DOS:>
C:\DOS:>RUN
RUN DOS RUN!!
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top