Automatic Unicode mangling?

Status
Not open for further replies.

AdaHacker

Programmer
Sep 6, 2001
392
US
Greetings! I'm trying to figure out why PHP is escaping Unicode characters in POST data. My goal here is to find a non-hacky way to sanitize form data regardless of locale.

Here's what I'm seeing. I have some Russian text that a user sent me. If I copy and paste it as a literal string into a PHP file, then it shows the Unicode characters as expected. However, if I copy and paste it into an HTML form and then echo back the $_POST variable, the browser shows the Unicode characters, but the HTML source shows numeric escape sequences like &#1080; (for и).

Also, this seems to happen on only two of the three servers I've tested on. On the Red Hat and WinXP systems, the Unicode gets escaped, but on my Kubuntu Breezy workstation, I get the raw Unicode characters. This leads me to believe that this is controlled by some configuration setting or build option, but I don't know what it could be.

So, does anyone know how to turn this off? Or, failing that, is there a good way to remove HTML from a string regardless of whether it contains plain ASCII or Unicode escape sequences? I've been using htmlentities() to sanitize the input and then running the resulting string through a regular expression to fix the escape sequences, but that just doesn't feel right.
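
For anyone landing here with the same question, one possible alternative to the htmlentities()-plus-regex approach is sketched below. It assumes the page is handled as UTF-8 throughout, and sanitize_for_html() is a made-up helper name, not a PHP built-in. It decodes any references back to real characters first, then escapes only the HTML-significant characters. It escapes markup rather than stripping it.

```php
<?php
// Hypothetical helper (not a PHP built-in). Assumes the input is UTF-8.
function sanitize_for_html(string $raw): string
{
    // Turn references such as &#1080; back into the characters they stand for.
    $decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Escape only &, <, >, " and '; non-ASCII text passes through untouched.
    return htmlspecialchars($decoded, ENT_QUOTES, 'UTF-8');
}

// A numeric reference followed by a tag that should not be rendered as markup.
echo sanitize_for_html('&#1080; <b>bold</b>'), "\n";
```

One trade-off to be aware of: because references are decoded first, a user who literally typed "&amp;" will see "&" when the text is displayed again.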
 
I think I found my answer. After some experimentation and reading the multi-byte string documentation, it appears that the problem is on the client side.

I made a very simple script to just echo POST data from a form. When I pasted the Cyrillic characters into the form, using my default encoding, i.e. ISO-8859-1, and POSTed the data, the HTML source of the resulting page showed numeric Unicode escape sequences. When I changed the encoding in my browser to UTF-8 and POSTed the same data, the resulting HTML source showed raw Unicode characters. This was on WinXP/IIS with both Opera 8.5 and Firefox 1.0.6.
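
A test page along these lines might look like the sketch below. It is not the exact script I used: the field name "text" and the function name render_echo_page() are assumptions, and it declares UTF-8 explicitly so the browser has no reason to fall back to ISO-8859-1. The echoed value is escaped, so any references that arrive in the POST data show up literally on the page.

```php
<?php
// Hypothetical echo page; the field name "text" is an assumption.
function render_echo_page(array $post): string
{
    $text = isset($post['text']) ? (string) $post['text'] : '';
    // Escape the value so references like &#1080; are visible as typed.
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n"
        . "<html><head><meta charset=\"UTF-8\"><title>Echo</title></head><body>\n"
        . "<form method=\"post\" accept-charset=\"UTF-8\">\n"
        . "<textarea name=\"text\"></textarea><input type=\"submit\">\n"
        . "</form>\n"
        . "<pre>" . $safe . "</pre>\n"
        . "</body></html>";
}

// When served by a web server, declare the encoding in the HTTP header too.
if (PHP_SAPI !== 'cli') {
    header('Content-Type: text/html; charset=UTF-8');
    echo render_echo_page($_POST);
}
```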

So, in other words, I got the wrong forum. It's not a PHP problem at all. Apparently the browser detected a character set conflict and converted the offending characters to their escape codes in the correct character set. Or something like that. Maybe somebody who knows more about character encoding could explain it.
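
As far as I can tell, the browser submits the form in the page's encoding and swaps any character that encoding cannot represent for a numeric character reference. The sketch below imitates that on the PHP side for ISO-8859-1, which covers U+0000 through U+00FF. It assumes the mbstring extension is loaded, and simulate_latin1_submission() is a made-up name for illustration.

```php
<?php
// Hypothetical helper (not a PHP built-in); requires the mbstring extension.
// Replaces every character outside the ISO-8859-1 range with a decimal
// numeric character reference, roughly as the browser appears to do.
function simulate_latin1_submission(string $utf8): string
{
    // Convmap entries: first code point, last code point, offset, mask.
    $convmap = [0x100, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($utf8, $convmap, 'UTF-8');
}

echo simulate_latin1_submission('и'), "\n"; // prints &#1080;
```

Characters that ISO-8859-1 can represent, such as é, are left alone, which would explain why plain Western European text never showed the problem.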
 