
help with regexp 1

Status
Not open for further replies.

anuktac

Technical User
Aug 1, 2002
IN
Hi,
I have some rather complicated pattern-matching code that I am not able to figure out.

open (LU, "> lookup.txt");
open (NONLU, "> lookup.rej");
open (WISLU, "wisdom.qsl");
while (<WISLU>)
{
    if (/n01/)
    {
        if (/n01\s*((?:\(*\w*\)*\s*\(*-*&*'*\/*\.*:*,*"*#*%*\+*£*\s*\)*)*);c=(\w+)\(?(\d*,?\d*)\)?\.(?:eq|in)\.\(?(-?\d+:?\d*)\)?/)
        {
            print LU "$2~$3~$4~$1\n";
        } else
        {
            print NONLU "zz $_\n";
        }
    }
}

The wisdom.qsl file looks something like this:


n01 Pipex;c=A140D(1).eq.1
n01 LineOne;c=A140D(2).eq.1
n01 America On Line (AOL);c=A140D(3).eq.1
n23Prompted awareness of Internet Service Providers;norow;unl1
n03
n01 1;c=A150.eq.1
n01 2;c=A150.eq.2
n01 3;c=A150.eq.3
n01 4;c=A150.eq.4
n01 5;c=A150.eq.5
n01 6;c=A150.eq.6
n01 7-8;c=A150.eq.7
n01 9-10;c=A150.eq.8
n01 Yahoo!;c=ilik11(35).eq.1
n01 11 to 15 minutes;c=INET10(1,4).in.(11:15)


The output expected in lookup.txt is like this:


A140D~1~1~Pipex
A140D~2~1~LineOne
A140D~3~1~America On Line (AOL)
A150~~1~1
A150~~2~2
A150~~3~3
A150~~4~4
A150~~5~5
A150~~6~6
A150~~7~7-8
A150~~8~9-10
INET10~1,4~11:15~11 to 15 minutes

Can someone explain to me what the pattern-matching expression means?
-Anukta
 
Basically, the $1 $2 $3 $4 in the print LU "$2~$3~$4~$1\n"; are matches from the regex. Wherever you see parentheses ( ... ), whatever matches inside them is stored in one of these variables. $1 is the first match in parentheses, $2 the second, etc., counting opening parentheses from left to right.
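To make the numbering concrete, here is a minimal sketch. The pattern is a simplified, hypothetical one (not the regex from the question), and the input is the first line of the wisdom.qsl excerpt.

```perl
use strict;
use warnings;

# Sample line taken from the wisdom.qsl excerpt in the question.
my $line = 'n01 Pipex;c=A140D(1).eq.1';

# Simplified, hypothetical pattern with three capture groups.
my @fields;
if ($line =~ /^n01\s+(\w+);c=(\w+)\((\d+)\)/) {
    # $1 = label, $2 = variable name, $3 = index inside the parentheses
    @fields = ($1, $2, $3);
    print "label=$1 var=$2 index=$3\n";   # prints: label=Pipex var=A140D index=1
}
```

The escaped \( and \) match literal parentheses in the data and do not count as groups; only the unescaped ( ... ) pairs get a number.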

That's a pretty big regex so a bit of a pain to explain!

Hope this reply is of some help...

Duncan
 
Thanks Duncan.
I know about the $1 $2 $3 $4 concept.
I was looking for an explanation of the regexp.

How is :? interpreted?
Is there any special significance of &, #, % and £?
There is a :?, a ?: and -? towards the end.
What do they signify?

-Anukta
 
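:? is 0 or 1 ":" characters, and -? is likewise 0 or 1 "-" characters.
There isn't any special significance to &, #, % and £; they just match themselves.

An unescaped . matches any one character (except a newline), and *, ? and + are modifiers on the directly preceding character or group. In your regex the dot is written \. so it matches a literal "."

Basically, any time you see a * it means 0 or more of the previous thing, ? means 0 or 1, and + means 1 or more...

*? means non-greedy * (meaning match as little as possible).
So \+* means zero or more literal + characters (the backslash makes the + literal).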
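A quick way to check these quantifiers is to try them on small strings. The sketch below uses made-up test strings, chosen only for illustration; the pattern fragments :?, -? and \+* are the ones discussed above.

```perl
use strict;
use warnings;

# Hypothetical test strings; 1 means the pattern matched, 0 means it did not.
my @results;
push @results, ('11:15' =~ /^\d+:?\d*$/ ? 1 : 0);  # ":" present once
push @results, ('11'    =~ /^\d+:?\d*$/ ? 1 : 0);  # ":" absent, still matches
push @results, ('-7'    =~ /^-?\d+$/    ? 1 : 0);  # optional leading minus
push @results, (''      =~ /^\+*$/      ? 1 : 0);  # \+* accepts zero "+" characters
push @results, ('+++'   =~ /^\+*$/      ? 1 : 0);  # ...and any number of them

# Greedy versus non-greedy on the same input.
my ($greedy) = 'aaa' =~ /^(a*)/;   # takes as much as it can
my ($lazy)   = 'aaa' =~ /^(a*?)/;  # takes as little as it can

print "@results greedy='$greedy' lazy='$lazy'\n";
# prints: 1 1 1 1 1 greedy='aaa' lazy=''
```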

for example

\(*-*&*'*\/*\.*:*,*"*#*%*\+*£*\s*\)*

looks like it's just trying to say you might find a string like:
(-&'/.:,"#%+£ )

...and any one of those items may or may not be there, but within one pass they must be in that order (which is what makes the match strange). In your regex the whole sequence sits inside (?: ... )*, so it can repeat, and across repeats the characters can turn up in any order. If you wanted to match on a possible (stuff), where stuff could be any of the items in that list, then go:


(\([-&'\/\.:,"#%+£\s]*\))*

is this what you're asking?
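The difference between the fixed-order sequence and the character class can be seen with two short strings. This is only a sketch: both patterns are anchored so the whole string has to match, the £ is left out to keep the example ASCII-only, and the test strings are invented.

```perl
use strict;
use warnings;

# The pound sign is left out of both patterns to keep the example ASCII-only.
my $ordered = qr/^\(*-*&*'*\/*\.*:*,*"*#*%*\+*\s*\)*\z/;  # one pass, fixed order
my $class   = qr/^(\([-&'\/\.:,"#%+\s]*\))*\z/;           # any order inside ( )

my %result;
for my $s ('(-&)', '(&-)') {
    $result{$s} = [ ($s =~ $ordered ? 1 : 0), ($s =~ $class ? 1 : 0) ];
    print "$s ordered=$result{$s}[0] class=$result{$s}[1]\n";
}
# prints:
# (-&) ordered=1 class=1
# (&-) ordered=0 class=1
```

The second string fails the fixed-order pattern because "-" comes after "&", while the character class accepts the same characters in any order.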

 
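I've written one that I think is a bit simpler:

open (WISLU, "wisdom.qsl") or die "Cannot open wisdom.qsl: $!";

while (<WISLU>) {

    # text label with an indexed variable, e.g. Pipex;c=A140D(1).eq.1
    print "$2~$3~$4~$1\n" if m/^n01 (\D+);c=(A\d+[A-Z])\((\d+)\)\.eq\.(\d)/;

    # plain numeric label, e.g. 1;c=A150.eq.1
    print "$2~~$3~$1\n" if m/n01 (\d+);c=(A\d+)\.eq\.(\d+)/;

    # numeric range label, e.g. 7-8;c=A150.eq.7 (the hyphen is required so that
    # a multi-digit label such as 10 is not printed a second time by this rule)
    print "$2~~$3~$1\n" if m/n01 (\d+-\d+);c=(A\d+)\.eq\.(\d+)/;

    # label with an index pair and a range, e.g. c=INET10(1,4).in.(11:15)
    print "$2~$3~$4~$1\n" if m/^n01 (.+);c=(INET\d+)\((\d+,\d+)\)\.in\.\((\d+:\d+)\)/;

}

print "\n";

close WISLU;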


Duncan
 
Thanks CadGuyExtraOrdinaire,

yes that was the answer I was looking for.

One more question:
since patterns are grouped together in (), to me it looks like $4 should be the pattern grouped as (?:eq|in).

But this is not so. From the output it is obvious that the group in
\(?(-?\d+:?\d*)\)? is $4.

What am I missing here?
-Anukta
 
( ... ) is used for grouping and capturing
(?: ... ) is used for grouping only

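This is a Perl regular expression extension. Because (?: ... ) does not capture, it is skipped when the groups are numbered, which is why (?:eq|in) is not $4.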
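As an illustration, here is the tail of the regex from the question, spread out with the /x modifier so each group can be labelled. This is a sketch only: the leading label group is dropped, so the groups that are $2, $3 and $4 in the full regex come out as $1, $2 and $3 here.

```perl
use strict;
use warnings;

# Tail of the regex from the question; /x allows whitespace and comments.
my $tail = qr/
    ;c=(\w+)               # $1: variable name, e.g. INET10
    \(? (\d*,?\d*) \)?     # $2: optional index or index pair, e.g. 1,4
    \. (?:eq|in) \.        # grouping only: this alternation gets no number
    \(? (-?\d+:?\d*) \)?   # $3: value or range, e.g. 11:15
/x;

# In list context the match returns the captured groups in order.
my @groups = 'n01 11 to 15 minutes;c=INET10(1,4).in.(11:15)' =~ $tail;
print join('~', @groups), "\n";   # prints: INET10~1,4~11:15
```

Only three values come back even though the pattern contains four unescaped pairs of parentheses, because the (?:eq|in) pair groups without capturing.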

Regards
Duncan
 