Bug with preg_match_all ??

Status
Not open for further replies.

sen5241b

IS-IT--Management
Sep 27, 2007
199
US
preg_match_all is returning inconsistent results.

I have 3 REGEXs in an array each of which looks for a word where any one vowel may be replaced by an asterisk.

Here the REGEX tries to find the word beep: /((?!.*[\*].*[\*].*)b[e\*][e\*]p)/ixu

The first part of the REGEX uses negative lookahead, (?!.*[\*].*[\*].*), to ensure the word cannot have 2 asterisks. The second part, b[e\*][e\*]p, says look for the word, where any vowel can be replaced by an asterisk. Thus the overall rule: find the word, where any one vowel may be replaced by an asterisk.

Given the string "beep kook rise", my 3 REGEXs should find all 3 words. The REGEXs DO find the 3 words.

Given the string "beep kook r*se", my 3 REGEXs should find all 3 words. The REGEXs DO find the 3 words.

Given the string "beep k*ok r*se", my 3 REGEXs should find all 3 words. The REGEXs find only r*se. Weird!

Given the string "b*ep k*ok r*se", my 3 REGEXs should find all 3 words. Again the REGEXs find only the word r*se. Weirder!

Given the string "r*se kook b*ep", my 3 REGEXs should find all 3 words. The REGEXs find only b*ep and kook. Ultra-weird!

preg_match_all seems to have a problem when more than one word in the string has a vowel replaced with an asterisk.

What's weird is that I am calling preg_match_all 3 separate times to find the 3 words. One call should not affect the other but it seems to.

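One other thing: I added the 'u' REGEX modifier believing it means UN-greedy. (In PHP's PCRE the UN-greedy modifier is actually uppercase 'U'; lowercase 'u' turns on UTF-8 mode.)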

Code:
<?PHP
//
//  Find the words beep, kook or rise. Any ONE vowel can be replaced by an asterisk and the REGEX should still find the word.
//
echo 'RUN THE REGEX WITH NEXT STRING==========================================';
$str1 = 'beep kook rise';		
FindTheWords($str1);
echo 'RUN THE REGEX WITH NEXT STRING==========================================';
$str1 = 'beep kook r*se';
FindTheWords($str1);
echo 'RUN THE REGEX WITH NEXT STRING==========================================';
$str1 = 'beep k*ok r*se';
FindTheWords($str1);
echo 'RUN THE REGEX WITH NEXT STRING==========================================';
$str1 = 'b*ep k*ok r*se';
FindTheWords($str1);
echo 'RUN THE REGEX WITH NEXT STRING==========================================';
$str1 = 'r*se kook b*ep';
FindTheWords($str1);
exit();

/* ========================================================================================================== */
function FindTheWords($str1)
{				
echo '<br>  INPUT STRING USED=';
print_r($str1);
// Rather than read the following array of REGEXs, it may be easier to just look at the output  to understand the REGEXs. 
$REGEXs_array = array("/((?!.*[\*].*[\*].*)b[e\*][e\*]p)/ixu", "/((?!.*[\*].*[\*].*)k[o\*][o\*]k)/ixu", "/((?!.*[\*].*[\*].*)r[i\*]s[e\*])/ixu");
$iOffset = 0;
$TempArray = array("");
$array_elements = count($REGEXs_array);	  // 3 REGEXs in above array
for ($i=0; $i<$array_elements; $i++)  //For each REGEX, loop & look for word in passed string
{
	echo '<br> REGEX used=';
	var_dump($REGEXs_array[$i]);
	preg_match_all($REGEXs_array[$i], $str1, $TempArray, PREG_OFFSET_CAPTURE, $iOffset);
	// $TempArray[0][0] holds the first full match as array(text, offset); it is not set when nothing matched
	$found = isset($TempArray[0][0]) ? $TempArray[0][0] : array('', '');
	echo "<br>  MATCH #$i was FOUND=" . $found[0];
	echo '<br>  Position in string=' . $found[1];
	// $TempArray[1] holds the capture-group matches - don't care about 'em
	echo '<br>-----------';
}
}
 ?>


The output from the above code:

Code:
RUN THE REGEX WITH NEXT STRING==========================================
INPUT STRING USED=beep kook rise
REGEX used=string(37) "/((?!.*[\*].*[\*].*)b[e\*][e\*]p)/ixu"
MATCH #0 was FOUND=beep
Position in string=0
-----------
REGEX used=string(37) "/((?!.*[\*].*[\*].*)k[o\*][o\*]k)/ixu"
MATCH #1 was FOUND=kook
Position in string=5
-----------
REGEX used=string(37) "/((?!.*[\*].*[\*].*)r[i\*]s[e\*])/ixu"
MATCH #2 was FOUND=rise
Position in string=10
-----------RUN THE REGEX WITH NEXT STRING==========================================
INPUT STRING USED=beep kook r*se
REGEX used=string(37) "/((?!.*[\*].*[\*].*)b[e\*][e\*]p)/ixu"
MATCH #0 was FOUND=beep
Position in string=0
-----------
REGEX used=string(37) "/((?!.*[\*].*[\*].*)k[o\*][o\*]k)/ixu"
MATCH #1 was FOUND=kook
Position in string=5
-----------
REGEX used=string(37) "/((?!.*[\*].*[\*].*)r[i\*]s[e\*])/ixu"
MATCH #2 was FOUND=r*se
Position in string=10
-----------RUN THE REGEX WITH NEXT STRING==========================================
INPUT STRING USED=beep k*ok r*se
REGEX used=string(37) "/((?!.*[\*].*[\*].*)b[e\*][e\*]p)/ixu"
MATCH #0 was FOUND=
Position in string=
-----------
REGEX used=string(37) "/((?!.*[\*].*[\*].*)k[o\*][o\*]k)/ixu"
MATCH #1 was FOUND=
Position in string=
-----------
REGEX used=string(37) "/((?!.*[\*].*[\*].*)r[i\*]s[e\*])/ixu"
MATCH #2 was FOUND=r*se
Position in string=10
-----------RUN THE REGEX WITH NEXT STRING==========================================
INPUT STRING USED=b*ep k*ok r*se
REGEX used=string(37) "/((?!.*[\*].*[\*].*)b[e\*][e\*]p)/ixu"
MATCH #0 was FOUND=
Position in string=
-----------
REGEX used=string(37) "/((?!.*[\*].*[\*].*)k[o\*][o\*]k)/ixu"
MATCH #1 was FOUND=
Position in string=
-----------
REGEX used=string(37) "/((?!.*[\*].*[\*].*)r[i\*]s[e\*])/ixu"
MATCH #2 was FOUND=r*se
Position in string=10
-----------RUN THE REGEX WITH NEXT STRING==========================================
INPUT STRING USED=r*se kook b*ep
REGEX used=string(37) "/((?!.*[\*].*[\*].*)b[e\*][e\*]p)/ixu"
MATCH #0 was FOUND=b*ep
Position in string=10
-----------
REGEX used=string(37) "/((?!.*[\*].*[\*].*)k[o\*][o\*]k)/ixu"
MATCH #1 was FOUND=kook
Position in string=5
-----------
REGEX used=string(37) "/((?!.*[\*].*[\*].*)r[i\*]s[e\*])/ixu"
MATCH #2 was FOUND=
Position in string=
-----------
 
Could the fact that my UN-greedy negative lookahead sees two asterisks in some of the strings above have something to do with this?
 
[1] The results all conform to the regex. They have nothing to do with the modifiers "xu".

[2] The lookahead, positioned as you scripted it, scans text that includes the pattern you intend to match and everything after it. If you want the negative lookahead to apply "after" the intended pattern is found, you should put (?!...) after the beep or kook or rise pattern.

[3] Here is how to read the pattern and why the output comes out as it does.
[3.1] >Given the string "beep k*ok r*se", my 3 REGEXs should find all 3 words. The REGEXs find only r*se. Weird!
[3.1.1] Precisely, your 3 REGEXs should find only r*se. This is the mechanism.
[3.1.1.1] The parsing starts linearly at the start of the string (where ^ would anchor) and finds two stars ahead of it. So no match.
[3.1.1.2] The parsing continues, reaches the character b, and finds two stars ahead of it. So no match.
[3.1.1.3] The parsing continues, reaches the boundary between the characters b and e, and finds two stars ahead of it. So no match.
[3.1.1.4] The parsing continues, reaches the character e, and finds two stars ahead of it. So no match.
... etc etc ... until
[3.1.1.5] The parsing continues, reaches the boundary after the p of beep, and finds two stars ahead of it. So no match. By this time there is no hope left of matching beep.
... etc
[3.1.1.6] The parsing continues, reaches the character k of k*ok, and finds two stars ahead of it (one within the word k*ok). So no match.
[3.1.1.7] (I skip the character boundaries.) Once the parsing has moved past the * of k*ok, only one * remains ahead, so the negative lookahead now succeeds. But the character at that position is o, not k, hence the second part of the pattern is not matched. In fact, from this position on, there is no hope of matching k*ok.
...
[3.1.1.8] Finally, when the parsing reaches the r of r*se, the match is found. Hence the return is "r*se".
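
The walk-through in [3.1.1] can be checked mechanically. What follows is only a minimal sketch, not code from the thread: it tests the negative lookahead by itself at every offset of the string, using \G to pin the test to the offset passed to preg_match. The simplified lookahead (without the redundant brackets and trailing .*) and the variable names are my own assumptions.

```php
<?php
// Input string from case [3.1].
$str = 'beep k*ok r*se';

// For every offset, test only the negative lookahead, anchored there with \G.
$passes = array();
for ($pos = 0; $pos < strlen($str); $pos++) {
    $ok = preg_match('/\G(?!.*\*.*\*)/', $str, $m, 0, $pos);
    $passes[$pos] = ($ok === 1);
    echo $pos . ' ' . $str[$pos] . ' '
        . ($passes[$pos] ? 'lookahead passes' : 'two stars ahead') . "\n";
}
```

If this runs as I expect, the lookahead should fail at every offset up to and including the * of k*ok and pass from the following o onwards, which would explain why only r*se can still be matched.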

[4] If, in the case analyzed in [3], you want to match k*ok as well as r*se, your negative lookahead's position should be after the pattern.
[tt]
$REGEXs_array = array("/(b[e\*][e\*]p[blue](?!.*[\*].*[\*].*)[/blue])/ixu", "/(k[o\*][o\*]k[blue](?!.*[\*].*[\*].*)[/blue])/ixu", "/(r[i\*]s[e\*][blue](?!.*[\*].*[\*].*)[/blue])/ixu");
[/tt]
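
To see what the patterns in [4] do with the string from [3.1], here is a small sketch. It assumes the patterns are used exactly as posted once the [blue] colour markup is taken out; the array keys and variable names are mine.

```php
<?php
// Patterns from [4] without the forum colour markup: lookahead placed after the word.
$patterns = array(
    'beep' => '/(b[e\*][e\*]p(?!.*[\*].*[\*].*))/ixu',
    'kook' => '/(k[o\*][o\*]k(?!.*[\*].*[\*].*))/ixu',
    'rise' => '/(r[i\*]s[e\*](?!.*[\*].*[\*].*))/ixu',
);

$str = 'beep k*ok r*se';
$found = array();
foreach ($patterns as $word => $pattern) {
    // Store the matched text, or null when the pattern finds nothing.
    $found[$word] = preg_match($pattern, $str, $m) ? $m[0] : null;
    echo $word . ' => ' . var_export($found[$word], true) . "\n";
}
```

Note that beep should still not be found here, because two stars follow it in the string; only k*ok and r*se have fewer than two stars after them.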
[5] I fear that if I continue, it all becomes verbiage.





 
Further notes
[3.2] In the three "weird" "Given..." lines, the third may not be reporting the observation correctly.
>Given the string "r*se kook b*ep", my 3 REGEXs should find all 3 words. The REGEXs find only b*ep and kook. Ultra-weird!
In fact, I suspect, and a quick run confirms, that it should find 2 words, kook and b*ep. (The analysis goes the same way.) Can you check this again and confirm your statement?
 
Thanks for your analysis, Tsuji. Last night, just after I posted my problem, I ran the REGEXs through "Regex Coach" step by step and I understood why I was getting the results. Turning greediness on and off in Regex Coach had no effect on the result. (Regex Coach is a good program and it's free.) I'll try putting the negative lookahead after the word pattern. Also, I'll check [3.2].
 
[3.2-Rev] Please ignore [3.2]. My description is exactly the same as your description, only with kook and b*ep in reverse order! Don't know what had come over me. The description is just fine... (and thanks for the feedback and vote!)
 
I changed the rule slightly: If any one vowel was replaced by an asterisk I want to match but if no vowels were replaced I do NOT want to match.

Putting the look ahead after the pattern does not give me the result I need. Also using negative look behind gives me an error about variable length look behind not implemented. (I understand there is some plan to implement this). I'm thinking the simple solution is to use the cumbersome REGEX: be\*p|b\*ep
The problem here is when I have a word with many vowels the REGEX gets very cumbersome like:
s\*perior|sup\*rior|super\*or|superi\*r

Maybe using look around, conditionals, capturing groups is not always the best way.
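
If the only objection to the alternation is that it is tedious to write by hand, it could be generated. This is just a sketch under my own assumptions: the helper name one_vowel_starred_pattern is hypothetical, "vowel" means a, e, i, o, u, the word contains at least one vowel, and words in the text are delimited by whitespace.

```php
<?php
// Hypothetical helper (not from the thread): build the "one vowel starred"
// alternation for a word. Assumes the word contains at least one vowel.
function one_vowel_starred_pattern($word)
{
    $alternatives = array();
    for ($i = 0; $i < strlen($word); $i++) {
        if (stripos('aeiou', $word[$i]) !== false) {
            // Replace just this vowel with a literal asterisk.
            $variant = substr($word, 0, $i) . '*' . substr($word, $i + 1);
            $alternatives[] = preg_quote($variant, '/');
        }
    }
    // (?<!\S) and (?!\S) require whitespace or the string edge around the word.
    return '/(?<!\S)(?:' . implode('|', $alternatives) . ')(?!\S)/i';
}

$pattern = one_vowel_starred_pattern('superior');
echo $pattern . "\n";
echo preg_match($pattern, 'a sup*rior idea') . "\n";
```

The generated pattern never contains the unstarred word, so plain superior is not matched, and a word with two asterisks matches none of the alternatives.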

 
[6]
>If any one vowel was replaced by an asterisk I want to match but if no vowels were replaced I do NOT want to match.
This implies a word as the unit to match. In that case, you can use a pattern like this.
[tt] "/\b((?!([\S]*?\*[\S]*?){2,})s[aeiou*]p[aeiou*]r[aeiou*][aeiou*]r\b)/i"[/tt]
I deliberately put all the vowels in the character set for illustration only. Also, this time I do not escape the asterisk in the set, to make the point that it can be written as such without escaping in that context (of course, you can continue to escape it, which does no harm either). The pattern would match a word in a form like superior, sup*rior, superi*r, or the like, but it won't match a word like sup*r*or, s*p*rior etc...
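
A quick way to try the pattern from [6] against the example words, as a rough sketch (the word list and variable names are my own, the pattern is copied as posted):

```php
<?php
// Pattern from [6], copied as posted.
$pattern = '/\b((?!([\S]*?\*[\S]*?){2,})s[aeiou*]p[aeiou*]r[aeiou*][aeiou*]r\b)/i';

$words = array('superior', 'sup*rior', 'superi*r', 'sup*r*or', 's*p*rior');
$result = array();
foreach ($words as $w) {
    // preg_match returns 1 for a match and 0 for no match.
    $result[$w] = preg_match($pattern, $w);
    echo str_pad($w, 10) . ($result[$w] ? 'match' : 'no match') . "\n";
}
```

As far as I can tell, the words with two asterisks are rejected, while the word with no asterisk at all is still accepted, since the lookahead only forbids two or more.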
 
I tried putting your REGEX into 'REGEX Coach' and it matches the word with no asterisks in it.
 
Tsuji,

I added a negative lookbehind to your suggested REGEX and it works.

((?!([\S]*?\*[\S]*?){2,})(s[u*]p[e*]r[i*][o*]r)(?<!superior))
But is there an easier way to do this, or can the above REGEX be simplified? I am open to non-REGEX PHP functions that might accomplish the rule: If any one vowel in a specific word was replaced by an asterisk I want to match, but if no vowels were replaced I do NOT want to match.
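
On the non-REGEX side, one possible approach is to compare a candidate word with the target using plain string functions. This is only a sketch under assumptions of mine: the helper name is_one_vowel_starred is hypothetical, the text has already been split into words, the target word itself contains no asterisk, and "vowel" means a, e, i, o, u.

```php
<?php
// Hypothetical helper (not from the thread): true when $candidate equals $word
// except that exactly one vowel has been replaced by an asterisk.
function is_one_vowel_starred($candidate, $word)
{
    // Same length and exactly one asterisk, otherwise it cannot qualify.
    if (strlen($candidate) !== strlen($word) || substr_count($candidate, '*') !== 1) {
        return false;
    }
    $pos = strpos($candidate, '*');
    // The replaced character must be a vowel in the original word.
    if (stripos('aeiou', $word[$pos]) === false) {
        return false;
    }
    // Put the vowel back and compare case-insensitively.
    $restored = substr_replace($candidate, $word[$pos], $pos, 1);
    return strcasecmp($restored, $word) === 0;
}

var_dump(is_one_vowel_starred('sup*rior', 'superior'));
var_dump(is_one_vowel_starred('superior', 'superior'));
```

The trade-off is that the text has to be tokenized first (for example with preg_split on whitespace), and match offsets have to be tracked separately if they are needed.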
 
I see. I missed that part, thinking only of the criteria that needed some effort.

Sure, requiring the asterisk to appear exactly once, no more and no less, is a positive condition. Hence, a single positive lookahead should be sufficient. This is one realization along the same line of deduction.
[tt] /\b((?=([^*\s]*?\*[^*\s]*?(\s|$)))s[u*]p[e*]r[i*][o*]r)\b/i
[/tt]
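
Here is a small sketch for trying the positive-lookahead pattern on a line that contains all the cases. The sample line and variable names are my own; the pattern is copied as posted.

```php
<?php
// Positive-lookahead pattern from the post above, copied as posted.
$pattern = '/\b((?=([^*\s]*?\*[^*\s]*?(\s|$)))s[u*]p[e*]r[i*][o*]r)\b/i';

// Sample line: no asterisk, one asterisk, two asterisks, one asterisk.
$subject = 'superior sup*rior s*p*rior superi*r';

preg_match_all($pattern, $subject, $m);
$matches = $m[0];   // full matches only
print_r($matches);
```

One thing to keep in mind: the lookahead expects whitespace or the end of the string right after the word, so a word followed directly by punctuation would presumably not match.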
 
I had one more question about the above REGEX. You have *? between the character classes in the lookahead. The asterisk following the char. class means zero or more times but why follow that with a question mark (0 or 1 times)?
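
For what it's worth, in PCRE a ? placed directly after a quantifier such as * is not a second "0 or 1" quantifier; it makes the preceding quantifier lazy, so it takes as few characters as it can. A minimal sketch with a made-up sample string:

```php
<?php
// Hypothetical sample string, chosen only to make the difference visible.
$subject = 'a*b*c';

// Greedy: .* grabs as much as it can before the final literal asterisk.
preg_match('/.*\*/', $subject, $m);
$greedy = $m[0];

// Lazy: .*? grabs as little as it can before the first literal asterisk.
preg_match('/.*?\*/', $subject, $m);
$lazy = $m[0];

echo "greedy: $greedy\n";
echo "lazy:   $lazy\n";
```

Inside a lookahead that only asks whether some match exists, laziness should not change the yes/no answer, only the order in which the engine tries the possibilities, which would fit the earlier observation that toggling greediness in Regex Coach made no difference.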
 