Match REGEX and ignore unicode diacritics -HOW?

Status
Not open for further replies.

sen5241b

IS-IT--Management
Sep 27, 2007
199
US
I want to match letters and ignore diacritics (accents, umlauts, circumflexes, etc.).

I want to match the letter u in my REGEX against u in a string regardless of how the u looks, i.e. the letter u in the REGEX should match ù or ú or û or ü in the input string.

Example: 'arger' or 'facade' should match 'Ärger' or 'façade'.

(Yes, I have spent hours looking for a simple PHP example of this on a gazillion websites and I have also experimented with much code to try this).

Code:
<?PHP
echo ' <BR> begin';
$str1 = "get thè hellò of here";
$pattern1 = "/the\shello/ui";
echo preg_match($pattern1, $str1);
?>

Output from this is: begin0

I have PHP Version 5.1.4
and PCRE Library Version 6.6 06-Feb-2006.

Please help --I've begun to slap my monitor.
 
The only thing I can think of is to write a subroutine that converts each accented character to its base character before running the regex.

I don't know a way of doing this other than hard-coding the associations. There may be a pattern in the Unicode code points that you could derive and use, but I don't know.
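
A minimal sketch of the subroutine described above. The helper names fold_with_map and fold_with_nfd and the contents of the map are illustrative assumptions, not something from this thread, and the map covers only a handful of Latin letters. The second helper uses one "pattern in the Unicode" that does exist: canonical decomposition (NFD) splits an accented letter into a base letter plus combining marks, which \p{Mn} can then strip. It relies on the Normalizer class from the intl extension, which is not part of PHP 5.1.4, so it falls back to the map when that class is missing. Both helpers assume the source file and the input are UTF-8.

```php
<?php
// Hypothetical helper: fold accented letters to their base letters using
// a hand-written map. The map is deliberately small; extend it as needed.
function fold_with_map($text)
{
    $map = array(
        'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ä' => 'A',
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a',
        'Ç' => 'C', 'ç' => 'c',
        'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'Ì' => 'I', 'Í' => 'I', 'Î' => 'I', 'Ï' => 'I',
        'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
        'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Ö' => 'O',
        'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'ö' => 'o',
        'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'U',
        'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
    );
    // strtr() with an array replaces every key it finds in a single pass.
    return strtr($text, $map);
}

// Hypothetical helper: same goal, but derived from Unicode decomposition.
// Assumes the intl extension; falls back to the map when it is absent.
function fold_with_nfd($text)
{
    if (!class_exists('Normalizer')) {
        return fold_with_map($text);
    }
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return fold_with_map($text);
    }
    // \p{Mn} matches the combining marks left behind by the decomposition.
    return preg_replace('/\p{Mn}+/u', '', $decomposed);
}

$str1 = "get thè hellò of here";
$pattern1 = '/the\shello/i';
echo preg_match($pattern1, fold_with_map($str1)), "\n"; // prints 1
```

Folding the subject first means the regex itself can stay plain ASCII, at the cost of maintaining the map (or requiring intl).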
 
Found this info:

"Built-in Unicode support is one of the features promised for PHP 6.
Earlier PHP versions can manipulate Unicode strings using the multi-byte string extension. However, this extension is not always available in every PHP installation.
This class implements a clever alternative to manipulate Unicode text encoded as UTF-8. It uses the PCRE extension PHP functions.
This extension can perform regular expression manipulation functions on UTF-8 strings and is available since PHP 3."

Manuel Lemos
 
I'm still not sure that this is going to help, as an e-acute is a different letter from e. I think to make it work you would need an extension to PCRE rather than to PHP, i.e. PCRE would need not only a case-insensitive switch but also a diacritic-insensitive switch.

I wonder whether this is supported in Perl? If so, you could use PHP to call a Perl script with the regex in it.
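
Lacking such a switch, one workaround is to build the diacritic tolerance into the pattern itself. The sketch below expands each plain letter of a search word into a character class of its accented variants. The function name accent_insensitive_pattern and the class table are illustrative assumptions and cover only a few Latin letters; the source file and the subject string are assumed to be UTF-8. Upper-case variants are listed explicitly so the result does not depend on how the installed PCRE handles caseless matching of non-ASCII letters.

```php
<?php
// Hypothetical helper: turn a plain ASCII word into a regex in which each
// letter also matches its accented variants.
function accent_insensitive_pattern($word)
{
    $classes = array(
        'a' => '[aàáâãäåÀÁÂÃÄÅ]',
        'c' => '[cçÇ]',
        'e' => '[eèéêëÈÉÊË]',
        'i' => '[iìíîïÌÍÎÏ]',
        'o' => '[oòóôõöÒÓÔÕÖ]',
        'u' => '[uùúûüÙÚÛÜ]',
    );
    $pattern = '';
    foreach (str_split($word) as $ch) {
        $key = strtolower($ch);
        // Letters with known variants become a class; anything else is
        // escaped so it is matched literally.
        $pattern .= isset($classes[$key]) ? $classes[$key] : preg_quote($ch, '/');
    }
    // u: treat pattern and subject as UTF-8; i: ignore ASCII case.
    return '/' . $pattern . '/ui';
}

echo preg_match(accent_insensitive_pattern('arger'), 'Ärger'), "\n";   // prints 1
echo preg_match(accent_insensitive_pattern('facade'), 'façade'), "\n"; // prints 1
```

This only handles literal search words: regex syntax such as \s in the input would be escaped and matched literally rather than interpreted.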
 
Sorry, the "class" referred to in my earlier post is called miscstring at phpclasses.org.

From what I understand so far, until PHP 6 arrives you have two choices: 1) Use the mbstring functions, which take multi-byte Unicode characters into account. The regular string functions will miscount a string that contains multi-byte Unicode characters. 2) Use regular expressions to replace Unicode characters with plain old ASCII, or vice versa.

The bottom line is that programmers need to jump through some Unicode hoops until PHP 6 arrives.

Comments, questions, disagreements?
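
Both choices can be illustrated in a few lines. This is a rough sketch that assumes UTF-8 source and input; the function name replace_accented_u is made up for the example, and the mbstring call is guarded because that extension is not always installed.

```php
<?php
$word = "hellò"; // assumed UTF-8, where ò is stored as two bytes

// Choice 1: strlen() counts bytes, mb_strlen() counts characters.
echo strlen($word), "\n"; // prints 6
if (function_exists('mb_strlen')) {
    echo mb_strlen($word, 'UTF-8'), "\n"; // prints 5
}

// Choice 2: replace accented characters with plain ASCII via a regex.
// Hypothetical helper covering only the lower-case variants of u.
function replace_accented_u($text)
{
    // The u modifier makes PCRE treat the class members as whole
    // UTF-8 characters rather than as individual bytes.
    return preg_replace('/[ùúûü]/u', 'u', $text);
}

echo replace_accented_u("für"), "\n"; // prints fur
```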
 