Problem with diacritical folding - help before all hair gone

Status
Not open for further replies.

sen5241b

IS-IT--Management
Sep 27, 2007
199
US
I have a string entered into an HTML form in a webpage. PROGRAM1.html passes the string to a PHP script, PROGRAM3.PHP, via an intermediary PHP script, PROGRAM2.PHP.

Pseudo: PROGRAM1.html string--> PROGRAM2.PHP string--> PROGRAM3.PHP

I use the following method to perform what unicode.org calls diacritical folding: getting rid of all the accent marks and replacing accented letters with unaccented letters.

Code:
function FoldDiacritics($changestr)
{
    echo '<BR> BEFORE AAA $changestr=' . $changestr . '<BR>';
    $changestr = strtr($changestr, "àáâåãä", "aaaaaa");
    $changestr = strtr($changestr, "èéêë", "eeee");
    $changestr = strtr($changestr, "ß", "b");
    $changestr = strtr($changestr, "ìíîï", "iiii");
    $changestr = strtr($changestr, "òóôõö", "ooooo");
    $changestr = strtr($changestr, "š", "s");
    $changestr = strtr($changestr, "ç", "c");
    $changestr = strtr($changestr, "ùúûü", "uuuu");
    $changestr = strtr($changestr, "ƒ", "f");
    $changestr = strtr($changestr, "ñ", "n");
    echo '<BR> AFTER BBB $changestr=' . $changestr . '<BR>';
    return $changestr;
}

But this just doesn't work, no matter how simple I make the code. I also tried just about all the other diacritical folding solutions shown on the PHP strtr man page. None work for me. So I started three new files from scratch: PROGRAM1B.html, PROGRAM2B.PHP and PROGRAM3B.PHP. Note the 'B'. I used Notepad++. Bit by bit I copied a little code from each of the original three to the new three, establishing a little functionality in the new three scripts as I went. When all the code had been copied across, the above code worked perfectly in the new three scripts, but the original three still don't work. It's all the same code running on the same server, and the only difference is that the new three scripts have a 'B' in their names!!!!! Arrrgh!

So I said: I'll just get rid of the original three and rename the new three to take the 'B' out. After renaming, the new three now produce a bizarre translation of the accented letters into some weird character set not represented in the function above.

I guess the answer here is perhaps a whole topic I am unfamiliar with.

By the way, as you may know, native Unicode support is not expected until version 6 of PHP.
 
the naming of a file is irrelevant. there will be something in your script that is causing the glitch.

i can't really see why you're using strtr rather than str_replace or its case-insensitive sibling str_ireplace, but i guess it does not matter. i would tend to use strtr when i needed to be sure that i was not going to undo changes performed by a previous translation. with your example (i) this is unlikely and (ii) you are not taking advantage of that functionality, because you are not passing the whole translation to strtr as a single array.
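To illustrate the point about a single array, here is a minimal sketch of how the two functions differ. The swap map is hypothetical and chosen only to make the difference visible; it is not taken from the scripts in this thread.

```php
<?php
// Hypothetical swap map, used only to show the difference in behaviour.
$map = ['a' => 'b', 'b' => 'a'];

// strtr with an array scans the input once, so text that has already
// been replaced is never translated a second time.
$onePass = strtr('ab', $map);

// str_replace with arrays applies each pair in turn to the running
// result, so the second pair rewrites the output of the first.
$chained = str_replace(array_keys($map), array_values($map), 'ab');

echo $onePass, "\n"; // ba
echo $chained, "\n"; // aa
```

With a plain accent-to-letter map the two approaches would normally give the same result, since no replacement produces another accented letter.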

if you post your original scripts with a pure cut and paste (don't edit anything more than a password) then we may be able to diagnose the original error.

can you also be certain that these accented characters are actually what you intend to change? i think it would be more sensible to use the character codes.

the user-contributed notes for strtr in the php manual have some useful-looking examples too.

another alternative is to convert each accented character into its html entity with htmlentities, then strip out the leading ampersand and the accent name, leaving the naked letter behind. preg_replace can do this for you. remember to decode any remaining entities back to plain characters afterwards.
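A rough sketch of that entity route follows. The function name foldViaEntities is made up for illustration, the list of accent names in the pattern is not exhaustive, and the code assumes the input is UTF-8; the real charset would have to be passed to htmlentities if it is something else.

```php
<?php
// Sketch only. foldViaEntities is a hypothetical name, and the input
// is assumed to be UTF-8.
function foldViaEntities(string $text): string
{
    // Accented letters become named entities, e.g. &eacute; or &ntilde;.
    $encoded = htmlentities($text, ENT_QUOTES, 'UTF-8');

    // Keep the base letter; drop the ampersand and the accent name.
    $folded = preg_replace(
        '/&([A-Za-z])(?:acute|grave|circ|tilde|uml|ring|cedil|caron);/',
        '$1',
        $encoded
    );

    // Turn any remaining entities (&amp;, &lt;, ...) back into characters.
    return html_entity_decode($folded, ENT_QUOTES, 'UTF-8');
}

// "\xC3\xA9" and "\xC3\xB1" are the UTF-8 bytes for e-acute and n-tilde.
echo foldViaEntities("caf\xC3\xA9 ma\xC3\xB1ana"), "\n"; // cafe manana
```

Writing the test input as byte escapes keeps the example independent of the encoding the script file itself is saved in.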

 
I can't see any reason in the world why a file rename could affect the script output, but the scripts ran fine, I renamed them, and then they broke! I've seen problems with text files saved as Unicode in MS Notepad and later opened with a non-Unicode editor. Maybe there was some hidden garbage character in the code.
 
opening and saving in different charsets may well have changed the underlying chr value of the accented characters. they might look the same (as the opening program is doing some fancy footwork) but they won't be the same as you intended.
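As a small illustration of how the underlying bytes differ, the sketch below writes the same letter (e with an acute accent) as raw bytes in two common encodings and shows that a byte-wise lookup built for one does not match the other. The variable names are illustrative only.

```php
<?php
// The same letter, e-acute, as raw bytes in two encodings.
$latin1 = "\xE9";      // ISO-8859-1 / Windows-1252: one byte
$utf8   = "\xC3\xA9";  // UTF-8: two bytes

echo bin2hex($latin1), "\n"; // e9
echo bin2hex($utf8), "\n";   // c3a9

// A byte-wise lookup written for Latin-1 matches the Latin-1 string
// but leaves the UTF-8 string untouched.
$latin1Folded = strtr($latin1, "\xE9", "e");
$utf8Folded   = strtr($utf8, "\xE9", "e");

var_dump($latin1Folded === "e"); // bool(true)
var_dump($utf8Folded === "e");   // bool(false)
```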

as per previous: my advice is to avoid the problem and either go for the htmlentities option or address the underlying chr values directly.
 
I tried using hex and decimal codes instead of the actual accented letters, and this method also did not work. But yes, using htmlentities or numeric codes is definitely better than embedding accented letters. thx!
 
ok, how about this:

Code:
function FoldDiacritics($changestr) // also called accent removal
{
    $changestr = strtr($changestr, "\xDF", "b");
    $changestr = strtr($changestr, "\x83", "f");
    // lower-case accented letters
    $changestr = strtr($changestr, "\xE0\xE1\xE2\xE5\xE3\xE4\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF2\xF3\xF4\xF5\xF6\x9A\xE7\xF9\xFA\xFB\xFC\xF1", "aaaaaaeeeeiiiioooooscuuuun");
    // upper-case accented letters
    $changestr = strtr($changestr, "\xC0\xC1\xC2\xC5\xC3\xC4\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD2\xD3\xD4\xD5\xD6\x8A\xC7\xD9\xDA\xDB\xDC\xD1", "AAAAAAEEEEIIIIOOOOOSCUUUUN");
    return $changestr;
}

The hex values are, of course, the accented letters. The third strtr call handles lower case and the fourth handles upper case. strtolower will not work for accented letters.

QUESTION: I guess I have a knowledge deficit with character sets, but if the character set changes from UTF-8 to something else, does the hex (or decimal) value representing a given letter change? I realize a web developer can declare the character set in a meta tag, but suppose for some reason you cannot predict the character set in use?
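One possible way to cope with an unpredictable character set is sketched below, as a heuristic rather than a definitive answer. It checks whether the input is valid UTF-8 and picks a map to match, falling back to a single-byte Latin-1 style table otherwise. The function name foldKnownEncoding and the four-letter map are hypothetical; a real map would need every letter of interest, and some single-byte text can also happen to be valid UTF-8.

```php
<?php
// Sketch only: hypothetical function name and a deliberately short map.
function foldKnownEncoding(string $text): string
{
    // Keys are UTF-8 byte sequences; array-form strtr matches whole
    // keys, so multi-byte characters are handled as units.
    $utf8Map = [
        "\xC3\xA0" => 'a', // a-grave
        "\xC3\xA9" => 'e', // e-acute
        "\xC3\xB1" => 'n', // n-tilde
        "\xC3\xBC" => 'u', // u-umlaut
    ];

    // With the /u modifier, preg_match returns false when the subject
    // is not valid UTF-8, and 1 when the empty pattern matches.
    if (preg_match('//u', $text) === 1) {
        return strtr($text, $utf8Map);
    }

    // Assumption: anything else is a single-byte Latin-1 style encoding.
    return strtr($text, "\xE0\xE9\xF1\xFC", "aenu");
}

echo foldKnownEncoding("ma\xC3\xB1ana"), "\n"; // UTF-8 input
echo foldKnownEncoding("ma\xF1ana"), "\n";     // Latin-1 input
```

Both calls should print manana. Where the form page declares its own charset, normalising the input to that one encoding up front is likely simpler than guessing.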
 