Encoding Foreign Characters

PCHomepage · Dec 5, 2014

I've had a function on my site for years that has worked reasonably well at converting foreign characters from a form's textarea field to the corresponding HTML entities. The problem is that the inserted text comes from a wide variety of sources and I have no control over its actual encoding, so the function occasionally crashes or inserts codes that were not in the original, making the resulting page unreadable. All of the HTML and PHP pages, the database and the connection are already set to UTF-8.

Because the problem was to do with encoding, I added some PHP functions to the WHILE loop that should sort it out. I don't want to replace all characters - only certain ones that exist in a table but I cannot get it to work properly. If I return $CharacterName it gives the last character in the table as expected due to the looping. If I return $Replacement it gives the proper HTML entity as expected but the characters are not being replaced in $BodyText.

I can't change $DBcharacters->next_record() or other database functions as these are custom functions used elsewhere and they are working well but I can change the looping so if anyone has a thought about this, I would appreciate hearing it! It should be something quite simple that I have missed. Thank you.

PHP:

function cleanHTML($BodyText) {
// replace foreign characters
	$SQL = "SELECT CharacterName FROM charactercodes";
	$DBcharacters->query($SQL);
	while ($DBcharacters->next_record()) {
		$CharacterName = htmlspecialchars_decode(htmlentities($DBcharacters->f("CharacterName"), ENT_QUOTES, "UTF-8"));
		$Replacement = htmlentities($CharacterName);
		$BodyText = str_replace($CharacterName, $Replacement, $BodyText);
	}
  return $BodyText;
}

PCHomepage · Dec 5, 2014

I have it working, although it's a bit of a kludge because I don't think the WHILE loop is really what's doing the work. I suspect this line is, because without it the function does nothing:

PHP:

$BodyText = htmlspecialchars_decode(htmlentities($BodyText, ENT_QUOTES, "UTF-8"));

Also, without the following line it creates HTML entities for all the ampersands, even though the ampersand is not a character in the charactercodes table. It apparently does not create HTML entities for < and >, so any HTML that happens to be embedded in the text still works.

PHP:

$BodyText = str_replace("&amp;", "&", $BodyText);

If anyone can tell me what to do to make it work as it should, I would appreciate it:

PHP:

function cleanHTML($BodyText) {
// replace foreign characters
	$SQL = "SELECT CharacterName FROM charactercodes";
	$DBcharacters->query($SQL);
	$BodyText = htmlspecialchars_decode(htmlentities($BodyText, ENT_QUOTES, "UTF-8"));
	while ($DBcharacters->next_record()) {
		$OriginalCharacter = $DBcharacters->f("CharacterName");
		$CharacterName = htmlspecialchars_decode(htmlentities($OriginalCharacter, ENT_QUOTES, "UTF-8"));
		$Replacement = htmlentities($CharacterName);
		$BodyText = str_replace($CharacterName, $Replacement, $BodyText);
	}
	$BodyText = str_replace("&amp;", "&", $BodyText);
  return $BodyText;
}

jpadie · Dec 6, 2014

forgive me for hijacking this thread but two similar threads in as many days suggests that other readers might benefit from some quick reminders

a wide variety of sources where I have no control of its actual encoding

does that mean that they are not all uploaded over a web form via a text box?
if they are, then you do have control over the encoding as you receive it.

if they are uploaded as files, then again you have the ability to detect the encoding at that time, and take steps to normalise it.

if you are storing the uploaded file in raw form (as a blob) then again, you have the ability to detect the encoding of the file and convert it to whatever form your display can handle.

if the text is being extracted from (say) a pdf and then uploaded, it depends on the process. if you are just taking the text out (no OCR) then you can get the encoding of the PDF at the same time and make appropriate manipulations at that moment. if OCR is involved then things are more difficult, although you would still have to extract the encoding of the PDF to tell the OCR suite what character set is being represented on the page. most good OCR suites will then output in whatever encoding you specify, or you can convert the output yourself from the known output encoding of the OCR suite.

the core take home is that at all points you should try to ensure that you do have control over/knowledge of the incoming encoding and take steps at that time to convert and/or preserve knowledge of the encoding. Otherwise you will be forever relying on guess work.
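As a rough illustration of normalising at the point of receipt, here is a minimal sketch. It assumes the mbstring extension is available; `toUtf8()` and its `$declared` parameter are hypothetical names made up for this example, not something from the original function.

```php
<?php
// Hypothetical helper: normalise incoming text to UTF-8 as it is received.
// $declared is whatever encoding the source claims (form charset, HTTP
// header, file metadata); null means "unknown".
function toUtf8(string $text, ?string $declared = null): string
{
    // Nothing declared and the bytes are already valid UTF-8: leave them alone.
    if ($declared === null && mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Otherwise fall back to strict detection over a short candidate list.
    // Detection is only a guess; a declared encoding is always preferable.
    $from = $declared ?? mb_detect_encoding($text, ['UTF-8', 'ISO-8859-1'], true);
    if ($from === false) {
        throw new InvalidArgumentException('Could not determine input encoding');
    }
    return mb_convert_encoding($text, 'UTF-8', $from);
}

$latin1 = "caf\xE9";                       // "café" encoded as ISO-8859-1
echo toUtf8($latin1, 'ISO-8859-1'), "\n";  // café, now as UTF-8 bytes
```

The point is that the conversion happens once, at intake, while the source encoding is still known, so everything downstream can assume UTF-8.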

another good lesson for other readers is (apart from transliteration for encoding purposes), don't manipulate data to be stored with non-idempotent actions like htmlspecialchars etc. store in the raw form and manipulate on display. If you must store a manipulated form because your server is too slow for realtime operations, ensure you ALSO store the raw.
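To show why non-idempotent escaping is risky on stored data, here is a small sketch. The sample string and the `displayText()` helper are made up for illustration; the idea is simply that the database keeps the raw text and escaping happens exactly once, on output.

```php
<?php
$raw = 'Fish & Chips <b>£5</b>';

// Escaping is not idempotent: applying it a second time corrupts the text.
$once  = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
$twice = htmlspecialchars($once, ENT_QUOTES, 'UTF-8');

echo $once, "\n";   // Fish &amp; Chips &lt;b&gt;£5&lt;/b&gt;
echo $twice, "\n";  // Fish &amp;amp; Chips &amp;lt;b&amp;gt;£5&amp;lt;/b&amp;gt;

// Hypothetical display helper: storage keeps $raw untouched and the
// escaping is applied only at the moment the text is rendered.
function displayText(string $raw): string
{
    return htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
}

echo displayText($raw), "\n";
```

If the escaped form had been stored and then escaped again on display, the page would show the `$twice` version.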

---

anyway, back to your actual question!

questions for you

1. did you manipulate the data before inserting into the database? if so can you post the manipulation code?
2. did you also store the raw data? if so can you post a before and after version of some troublesome text.

as i decode your function, my understanding is that it does the following (I have put it in a code block to make it 'easier' to read)...

Code:

$BodyText = htmlspecialchars_decode(htmlentities($BodyText, ENT_QUOTES, "UTF-8"));

encode the bodytext with htmlentity substitution.  this means that _every_ character in 
the text that has an html entity equivalent will be substituted. [consider adding ENT_HTML5 as an 
additional flag; also consider using ENT_COMPAT unless you are certain that single quotes 
are not being used as double escape characters anywhere]  

then, with that output, reverse out part of the work that you have just done, so that 
&amp; becomes ampersand
&quot; becomes double quotes
&#039; is left untouched and not converted back to a single quote (you have not passed 
the quotes flag to htmlspecialchars_decode, and before PHP 8.1 its default is ENT_COMPAT)
&lt; becomes the less than symbol
&gt; becomes the greater than symbol

I don't see the value in that exercise but trust that you have a reason.

$CharacterName = htmlspecialchars_decode(htmlentities($OriginalCharacter, ENT_QUOTES, "UTF-8"));

you then do the same thing on a character by character basis.  take, for example, the pound 
sign.  that will be converted to an html entity and then left intact (as it is not an html 
special character).  

so $CharacterName at that point will be &pound;

$Replacement = htmlentities($CharacterName);

this confuses me.  at this point you are reconverting something that has either just been 
converted or just been converted and unconverted.  and importantly you are not specifying 
a strategy for quote handling nor a character set. potentially a recipe for disaster.  

the reconversion will completely break our pound example, as the ampersand at the start of 
the entity will itself be converted.  

so $CharacterName is (at the moment) &pound;
$Replacement will be &amp;pound;

is that intended?  no browser will display that as a pound sign. 
it will show the literal text &pound;

next you do a search and replace of $BodyText such that any 'properly' formed &pound;(s) 
are turned into the broken &amp;pound; form.

$BodyText = str_replace("&amp;", "&", $BodyText);

and then lastly, once those transformations are done, you go back through the whole 
string and transform those &amp;pound; back to how they should be

BUT you also transform PROPER ampersand character entities back to a pure ampersand.  
hopefully with modern browsers and a proper page encoding that won't matter.  but it does 
rather undo a good part of what you are intending. 

I ran this through an interactive php session so you could see what is happening. here is the trace

Code:

php > $OriginalCharacter = '£';
php > $CharacterName = htmlentities($OriginalCharacter, ENT_QUOTES, 'UTF-8');
php > echo $CharacterName ."\n";
&pound;
php > $CharacterName = htmlspecialchars_decode($CharacterName);
php > echo $CharacterName ."\n";
&pound;
php > $Replacement = htmlentities($CharacterName);
php > echo $Replacement ."\n";
&amp;pound;
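The same steps can be followed for a whole body string rather than a single character. This is only a sketch: the sample text is made up, the database loop is reduced to one pass for a row containing the pound sign, and the file is assumed to be saved as UTF-8.

```php
<?php
$body = 'Tea & cake: £3';

// Step 1: entity-encode everything, then undo only the special characters.
$body = htmlspecialchars_decode(htmlentities($body, ENT_QUOTES, 'UTF-8'));
echo $body, "\n";   // Tea & cake: &pound;3

// Step 2: one pass of the loop, for a table row containing the pound sign.
$characterName = htmlspecialchars_decode(htmlentities('£', ENT_QUOTES, 'UTF-8'));
$replacement   = htmlentities($characterName);
$body = str_replace($characterName, $replacement, $body);
echo $body, "\n";   // Tea & cake: &amp;pound;3

// Step 3: the final clean-up undoes exactly what step 2 did.
$body = str_replace('&amp;', '&', $body);
echo $body, "\n";   // Tea & cake: &pound;3
```

The output of step 3 is identical to the output of step 1, which suggests the loop contributes nothing and the first line is doing all of the work.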

my suggestion is to take a step back and rearticulate the original problem, then work out how to solve it. Unless I have missed the point, these iterative steps are not the way to go.
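If the problem is restated as "replace only the characters listed in the table, exactly once", one possible approach is to build a character-to-entity map and apply it in a single pass. This is a sketch under assumptions: the `$characters` array stands in for the rows of the charactercodes table (in the real function each entry would come from `$DBcharacters->f("CharacterName")` inside the existing loop), `encodeListedCharacters()` is a hypothetical name, and the input is assumed to be valid UTF-8 already.

```php
<?php
// Hypothetical stand-in for the rows of the charactercodes table.
$characters = ['£', 'é', 'ü'];

// Build a character => entity map once.
$map = [];
foreach ($characters as $char) {
    $map[$char] = htmlentities($char, ENT_QUOTES, 'UTF-8');
}

function encodeListedCharacters(string $bodyText, array $map): string
{
    // strtr() with an array replaces every key in a single pass and never
    // rescans its own output, so nothing gets encoded twice.
    return strtr($bodyText, $map);
}

echo encodeListedCharacters('<p>Café & bar: £3, naïve</p>', $map), "\n";
// <p>Caf&eacute; & bar: &pound;3, naïve</p>
```

Characters that are not in the list (the ampersand, the angle brackets, the ï) pass through untouched, so embedded HTML keeps working and no clean-up replace is needed afterwards.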

and beware - these functions are only safe if the input really is in the character set you tell them it is. if there is a chance that it is not, then there is a great chance of garbling the output.
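A small sketch of what a character set mismatch can look like in practice, assuming the mbstring extension is available. The sample bytes are made up: a pound sign encoded as ISO-8859-1 and then handed to `htmlentities()` as if it were UTF-8.

```php
<?php
$latin1Pound = "\xA3" . '5';   // "£5" as ISO-8859-1: not valid UTF-8

// Told the input is UTF-8, htmlentities() rejects the invalid byte
// sequence and returns an empty string (no ENT_SUBSTITUTE/ENT_IGNORE set).
$failed = htmlentities($latin1Pound, ENT_QUOTES, 'UTF-8');
var_dump($failed);                                   // string(0) ""

// Checking first makes the mismatch visible instead of silently losing text.
var_dump(mb_check_encoding($latin1Pound, 'UTF-8'));  // bool(false)

// Converting from the real encoding first gives the expected entity.
$utf8 = mb_convert_encoding($latin1Pound, 'UTF-8', 'ISO-8859-1');
$ok   = htmlentities($utf8, ENT_QUOTES, 'UTF-8');
var_dump($ok);                                       // string(8) "&pound;5"
```

An empty result from a whole block of body text may be the sort of "crash" described in the first post, although that is a guess without seeing the troublesome input.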
