Match REGEX and ignore unicode diacritics -HOW?

sen5241b · Dec 13, 2007

I want to match letters and ignore diacritics (accents, umlauts, circumflexes, etc.).

I want to match the letter u in my REGEX against u in a string regardless of how the u looks, i.e. the letter u in the REGEX should match ù or ú or û or ü in the input string.

Example: 'arger' or 'facade' should match 'Ärger' or 'façade'.

(Yes, I have spent hours looking for a simple PHP example of this on a gazillion websites, and I have also experimented with a lot of code trying to get this to work.)

Code:

<?PHP
echo ' <BR> begin';
$str1 = "get thè hellò of here";
$pattern1 = "/the\shello/ui";
echo preg_match($pattern1, $str1);
?>

Output from this is: begin0

I have PHP Version 5.1.4
and PCRE Library Version 6.6 06-Feb-2006.

Please help --I've begun to slap my monitor.

jpadie · Dec 13, 2007

the only thing that i can think of is to write a subroutine to convert each diacritic to its base character before running the regex.

i do not know a way of doing this other than by hard coding the associations. there may be a pattern in the unicode that you could derive and use, i don't know.
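A minimal sketch of that hard-coded approach, applied to the example from the first post. The function name strip_diacritics is made up for illustration, the map is hand-written and far from complete, and it assumes both the script file and the input are UTF-8 with precomposed characters (an accent stored as a separate combining mark would not be caught):

```php
<?php
// Hypothetical helper: fold common accented Latin letters to their base letter.
// The table is hand-written and incomplete; extend it for your own data.
function strip_diacritics($text)
{
    $map = array(
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a',
        'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A', 'Å' => 'A',
        'ç' => 'c', 'Ç' => 'C', 'ñ' => 'n', 'Ñ' => 'N',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E',
        'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
        'Ì' => 'I', 'Í' => 'I', 'Î' => 'I', 'Ï' => 'I',
        'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o',
        'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O', 'Ö' => 'O',
        'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
        'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'U',
    );
    // strtr() with an array replaces every key it finds with its value.
    return strtr($text, $map);
}

$str1 = "get thè hellò of here";
$pattern1 = "/the\shello/ui";

// Fold the subject to plain letters first, then run the unchanged regex.
echo preg_match($pattern1, strip_diacritics($str1)); // prints 1
```

The regex itself stays exactly as posted; only the subject string is normalized before matching.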

sen5241b · Dec 13, 2007

Found this info:

"Built-in Unicode support is one of the features promised for PHP 6.
Earlier PHP versions can manipulate Unicode strings using the multi-byte string extension. However, this extension is not always available in every PHP installation.
This class implements a clever alternative to manipulate Unicode text encoded as UTF-8. It uses the PCRE extension PHP functions.
This extension can perform regular expression manipulation functions on UTF-8 strings and is available since PHP 3."

Manuel Lemos

jpadie · Dec 13, 2007

I'm still not sure that this is going to help as an e-acute is a different letter to e. i think to make it work you'll have to have an extension to PCRE rather than php. i.e. PCRE will need to have not only a case insensitive switch but also a diacritical switch.

i wonder whether this is supported in perl? if so you could use php to call a perl script with a regex in it.
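Without such a switch, one workaround inside PHP is to generate the pattern so that each plain letter becomes a character class listing its accented forms. This is only a sketch: accent_insensitive_pattern is a hypothetical name, the variant table is hand-written and incomplete, and it assumes the search word is plain ASCII and the script file is saved as UTF-8.

```php
<?php
// Hypothetical helper: turn a plain ASCII word into a regex in which each
// listed vowel (and c) also matches its accented forms.
function accent_insensitive_pattern($word)
{
    // Hand-written, incomplete table: base letter => accented variants.
    // Both cases are listed so the classes do not depend on how PCRE
    // handles case folding for non-ASCII characters.
    $variants = array(
        'a' => 'àáâãäåÀÁÂÃÄÅ',
        'c' => 'çÇ',
        'e' => 'èéêëÈÉÊË',
        'i' => 'ìíîïÌÍÎÏ',
        'o' => 'òóôõöÒÓÔÕÖ',
        'u' => 'ùúûüÙÚÛÜ',
    );

    $pattern = '';
    $length = strlen($word); // byte count is fine here: $word is assumed ASCII
    for ($i = 0; $i < $length; $i++) {
        $char  = $word[$i];
        $lower = strtolower($char);
        if (isset($variants[$lower])) {
            $pattern .= '[' . $lower . strtoupper($lower) . $variants[$lower] . ']';
        } else {
            // Escape anything that is special in a regex.
            $pattern .= preg_quote($char, '/');
        }
    }
    // u = treat pattern and subject as UTF-8, i = ignore case for ASCII letters.
    return '/' . $pattern . '/ui';
}

echo preg_match(accent_insensitive_pattern('arger'), 'Ärger');   // prints 1
echo preg_match(accent_insensitive_pattern('facade'), 'façade'); // prints 1
```

This keeps the input string untouched, at the cost of a longer generated pattern.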

sen5241b · Dec 14, 2007

Sorry, the "class" referred to in my earlier post is called miscstring at phpclasses.org.

From what I understand so far, until PHP 6 arrives you have two choices:

1) Use the mbstring functions, which take multi-byte Unicode characters into account. The regular string functions count bytes, so they will miscount a string that contains multi-byte characters.
2) Use regular expressions to replace Unicode characters with plain old ASCII, or vice versa.
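A small sketch of both choices, assuming UTF-8 text. The accented character is written as an explicit byte escape so the result does not depend on how the script file is saved, and the mbstring call is guarded because that extension may not be installed:

```php
<?php
// "thè" with è spelled out as its two UTF-8 bytes, C3 A8.
$word = "th\xC3\xA8";

// Choice 1: byte count versus character count.
echo strlen($word), "\n";                 // counts bytes: prints 4
if (function_exists('mb_strlen')) {       // mbstring is optional
    echo mb_strlen($word, 'UTF-8'), "\n"; // counts characters: prints 3
}

// Choice 2: replace è (C3 A8) and é (C3 A9) with plain e.
// The u modifier makes PCRE read the class as two characters, not four bytes.
$ascii = preg_replace("/[\xC3\xA8\xC3\xA9]/u", 'e', $word);
echo $ascii, "\n"; // prints the
```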

The bottom line is that programmers need to jump through some Unicode hoops until PHP 6 arrives.

Comments, questions, disagreements?
