Greetings! I'm trying to figure out why PHP is escaping Unicode characters in POST data. My goal is to find a non-hacky way to sanitize form data regardless of locale.
Here's what I'm seeing. I have some Russian text that a user sent me. If I paste it as a literal string into a PHP file, the Unicode characters show up as expected. However, if I paste it into an HTML form and then echo back the $_POST variable, the browser shows the Unicode characters, but the HTML source shows numeric escape sequences like &#1080; (which renders as и).
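To make the setup concrete, here is a minimal sketch of the kind of test page involved. It is not my exact page: the field name `text` and the function name `render_echo_page` are placeholders, and the value is echoed without escaping only so that the HTML source shows exactly what PHP received (not something to do on a live site).

```php
<?php
// Hypothetical reproduction page; the field name "text" is an assumption.
// Builds a form plus the submitted value, echoed back untouched.
function render_echo_page(array $post): string
{
    // Deliberately unescaped, so "view source" shows the received data as-is.
    $received = isset($post['text']) ? $post['text'] : '';

    return "<form method=\"post\">\n"
         . "<input type=\"text\" name=\"text\">\n"
         . "<input type=\"submit\">\n"
         . "</form>\n"
         . "<pre>" . $received . "</pre>\n";
}

echo render_echo_page($_POST);
```

Pasting the Russian text into the form and viewing the source of the response is how I see either raw characters or the numeric sequences, depending on the server.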
Also, this happens on only two of the three servers I've tested. On the Red Hat and WinXP systems, the Unicode gets escaped, but on my Kubuntu Breezy workstation, I get the raw Unicode characters. This leads me to believe that the behavior is controlled by some configuration setting or build option, but I don't know which one.
So, does anyone know how to turn this off? Or, failing that, is there a good way to remove HTML from a string regardless of whether it contains vanilla ASCII or Unicode escape sequences? I've been using htmlentities() to sanitize the input and then running the resulting string through a regular expression to restore the escape sequences, but that just doesn't feel right.
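For reference, a rough sketch of that kind of workaround is below. It is not my exact code: it assumes the input is UTF-8, it uses htmlspecialchars() in place of htmlentities(), it only restores decimal sequences such as &#1080;, and the function name `sanitize_keep_numeric_refs` is made up for illustration.

```php
<?php
// Sketch of the escape-then-repair workaround; the function name is hypothetical.
function sanitize_keep_numeric_refs(string $input): string
{
    // Escape HTML special characters, treating the input as UTF-8 (an assumption).
    $escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // Escaping turns "&#1080;" into "&amp;#1080;". Undo that for decimal
    // numeric sequences only, so they render as characters again.
    return preg_replace('/&amp;#(\d+);/', '&#$1;', $escaped);
}

echo sanitize_keep_numeric_refs('<b>&#1080;</b>'), "\n";
// prints: &lt;b&gt;&#1080;&lt;/b&gt;
```

The tags come out escaped while the numeric sequence survives, which works, but the repair step is the part that feels hacky to me.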