Regex Question

Status
Not open for further replies.

AndrewRR

Programmer
Oct 5, 2008
MD
Hello

Could you help me with a regex-related question?

I have code:
--------------
$s = '-html=-head=head data-head=-body=BODY-body=-html=';
while ($s =~ /=(.+?)-/gxis) {
print "$1\n";
}
--------------
when I run this code I see results:
---------
-head=head data
-body=BODY
---------
but I expect to see:
-----------
head data
BODY
-----------

Could you explain what the problem is?

thanks
 
It's because "." matches any character, including "-", so the "-" right after the "=" becomes part of what (.+?) captures. What you want is:

Code:
while ($s =~ /=([^-]+)-/g)

That finds everything between "=" and "-", as long as the text between them does not itself contain a "-".
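For reference, here is the loop above as a complete script, a minimal sketch using the string from the original question. The @found array is only added here for illustration so the captures can be inspected afterwards.

```perl
use strict;
use warnings;

my $s = '-html=-head=head data-head=-body=BODY-body=-html=';

# [^-]+ matches one or more characters that are not '-', so the
# capture can never run past a '-' the way .+? can.
my @found;    # illustrative only: collects the captures
while ($s =~ /=([^-]+)-/g) {
    push @found, $1;
    print "$1\n";
}
# prints:
# head data
# BODY
```

An "=" that is immediately followed by "-" simply fails to match here, because [^-]+ needs at least one non-dash character.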

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Thanks for the reply.

Could you explain again why "-" is part of the found string?

I thought that using .+? with the ? means it finds the smallest possible string...
 
You're right, it's trying to find the smallest possible match when you use '?'. Look at it from left to right:

Code:
$s = '-html=-head=head data-head=-body=BODY-body=-html=';

DISCLAIMER
I'm sure this isn't the actual way Perl goes about pattern matching, but this is what I say to myself as I'm going through the pattern.
END OF DISCLAIMER

1) Find '='.
# finds the '=' in -html=-head
2) Find the first '-' that occurs after 1 or more wildcards (the wildcard matches any character, including '-' and '=').
# finds the '-' in head data-head
3) Grab the stuff in the middle
# gets -head=head data

I think your confusion is coming from the fact that you're using '.+?' instead of '.*?'. '.+?' looks for 1 or more, while '.*?' looks for 0 or more.

If you change it to '.*?' then you'll get what you want (but you'll also get some blanks, because every '=' that is immediately followed by '-' now matches with an empty capture). The new code could look like:

Code:
my $s = '-html=-head=head data-head=-body=BODY-body=-html=';
while ($s =~ /=(.*?)-/gxis) {
    # skip the empty captures; length() also keeps a capture of "0"
    print "$1\n" if length $1;
}

I would actually use KevinADC's code. I only put this code up to hopefully de-confuse you.

 
Could you explain again, why - is part of found string?

sycoogtit has pretty much explained it. Your regexp is matching stingy (as opposed to greedy), but it still has to match at least one character because of the + quantifier. So it finds the shortest strings that satisfy the pattern you have defined. Blue shows the "=" and "-" that delimit each match, red shows what's captured:

$s = '-html[blue]=[/blue][red]-head=head data[/red][blue]-[/blue]head[blue]=[/blue][red]-body=BODY[/red][blue]-[/blue]body=-html=';

So the regexp finds the first "=" symbol. The next character is "-", which gives you "=-". But .+? has to match at least one character, and "." matches anything, so that "-" gets used up as the first captured character instead of ending the match. Perl then moves on to the next character, the "h", and has to keep matching until it finds the next "-". So the shortest match is "-head=head data". The same thing happens with the next match in the string.
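To see the three patterns from this thread side by side, something like the following should work. It is only a sketch: it relies on a /g match in list context returning all the captures at once, and the array names are made up for illustration.

```perl
use strict;
use warnings;

my $s = '-html=-head=head data-head=-body=BODY-body=-html=';

# In list context, a /g match with one capture group returns
# the captured text of every match.
my @lazy_plus = $s =~ /=(.+?)-/g;      # original pattern
my @lazy_star = $s =~ /=(.*?)-/g;      # allows empty captures
my @negated   = $s =~ /=([^-]+)-/g;    # negated character class

# Quote each capture so the empty ones are visible.
print "lazy_plus: ", join(', ', map { "'$_'" } @lazy_plus), "\n";
print "lazy_star: ", join(', ', map { "'$_'" } @lazy_star), "\n";
print "negated:   ", join(', ', map { "'$_'" } @negated),   "\n";
# prints:
# lazy_plus: '-head=head data', '-body=BODY'
# lazy_star: '', 'head data', '', 'BODY', ''
# negated:   'head data', 'BODY'
```

Only the negated character class gives the expected output without any extra filtering.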

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 