Display number of matching regex components

Status
Not open for further replies.

hydrocomputer

Programmer
Nov 29, 2004
6
US
I want to write a generic web extraction subroutine in perl. To do so, I want to:
1. find the first pattern (easy)
2. toss everything before it (easy)
3. iteratively find successive occurrences of a pattern of interest (easy)

Then comes the hard part. Given an arbitrary pattern containing parens, I want to build a list of what was found, like:

if ($line =~ /$pat/) {
$list = [$1,$2,$3,...];
push(@found, $list);
}

Except that I'd like the number of matches to be variable. Short of building dynamic strings and evaluating them, is there any clean way of doing this in Perl? Is there a regex variable for the maximum number of patterns extracted?

thanks!
 
Hmm, you mean something like this?
Code:
#!perl
use strict;
use warnings;

my $line = "1.2 blah 147 3.5 hey 9999 14.75";
my $pattern = qr/\d+\.\d+/;
my @found;

if (my @list = $line =~ /($pattern)/g) {
    push(@found, [ @list ]);
}

for (@found) {
    print join("\t", @$_), "\n";
}


 
No, that's not really using the dynamic capture of the pattern. You're not really even using the parens.
Here's an example of what I want:

splitAfterPattern(" .
"nwis/uv?dd_cd=07&format=rdb&period=2&site_no=01407081",
"^5s\s+", "(\d*.\d+)\s+(\d+)\s+(\w+)", "</b>");

This example shows a pattern with three parens. What if I use 4 or 5? I want the corresponding number of columns in an array returned to me:

[
[1.23,222,'blah'],
[4.55,915,'foo']
]


I don't know how to find the number of matches in a pattern, though I could just scan the pattern itself, either naively for parens:

@pat =~ /\(/;
$n = @pat; # count number of occurrences

or do something more sophisticated that ignores parens that are escaped, etc. If so, then what can I do?

$extract = "";
for ($i = 1; $i <= $n; $i++) {
$extract .= '$' . $i;
}

Now, $extract contains the variables I want to print. The only way I know to trigger the evaluation is something like:

$a = 5;
$b = 6;
$c = 17;
$x = '"$a$b$c"';
print eval($x);


This is pretty ugly. Is there anything simple that will do it?
 
clarification and correction:

The above example shows a subroutine call where I pass in a URL, read it, get the resulting text, skip until the first pattern is found, take everything until the last pattern is found, and in between, for each line extract according to a rule and pull the pieces into an array.

This line:
@pat =~ /\(/;


should actually be:

@pat = $pat =~ /\(/g;

 
are you wanting to count the number of "(" found in the string/pattern?

my $count = $string =~ tr/(/(/;
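
For what it's worth, here is a rough sketch comparing that paren count with what the regex engine itself reports. The helper name count_groups is made up for illustration. It relies on the documented behaviour that $#+ gives the number of capture groups in the last successful match; the empty alternative added after the pattern is only there so the match always succeeds, even against an empty string. Counting "(" with tr overcounts when the pattern has escaped parens or non-capturing groups, so treat that one as an approximation.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): ask the regex engine how many
# capture groups a pattern has, without needing any real input text.
sub count_groups {
    my ($pat) = @_;
    # The trailing empty alternative always matches, so @+ gets populated.
    "" =~ /(?:$pat)|/;
    # $#+ is the number of capture groups in the last successful match.
    return $#+;
}

my $simple = '(\d*.\d+)\s+(\d+)\s+(\w+)';   # pattern from the question
my $tricky = '\((\d+)\)\s+(?:x|y)';         # escaped parens + non-capturing group

for my $pat ($simple, $tricky) {
    my $naive = ($pat =~ tr/(//);           # counts every "(" character
    printf "%-28s naive=%d groups=%d\n", $pat, $naive, count_groups($pat);
}
```

For the first pattern both counts agree; for the second the tr count is 3 while the engine reports a single capture group.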
 
Try this sample. I think you'll find it gives you exactly what you want:
Code:
#!/usr/bin/perl -w
use strict;

my $text = "Blah blah blah, this is a test text string for regex testing.";
my $regex1 = qr/(blah).*(test).*(string)/;
my $regex2 = qr/(Blah).*(blah).*(,).*(test).*(string).*(regex)/;

$text =~ /$regex1/;
my @result = ($1, $2, $3, $4, $5, $6, $7, $8, $9)[0 .. $#+ -1];
print "Results are [@result]\n";

$text =~ /$regex2/;
@result = ($1, $2, $3, $4, $5, $6, $7, $8, $9)[0 .. $#+ -1];
print "Results are [@result]\n";

Trojan
 
To be honest, I don't much like this method, as it runs the regex twice for every found match, but it does do what you ask reasonably cleanly. Working off Mike's:
Code:
use strict;
use warnings;

my $line = "1.2 blah 147 3.5 hey 9999 14.75";
my $pattern = qr/(\d+)\.(\d+)/;
my @found;

while ($line =~ /$pattern/g) {
        my @list = $& =~ /$pattern/;
        push(@found, [ @list ]);
}

for (@found) {
        print join("\t", @$_), "\n";
}
That should work for just about anything. When global searching in list context, it matches until it can't match any more. To get each match, you have to run it in scalar context, but then you don't get your backreferences in an array. So, once in scalar context to get each match and once in list context only on what was matched to get the backreferences for the match. Should work for an arbitrary number of backreferences (i.e., not limited to 9).
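
As a possible alternative to running the regex twice, here is a minimal sketch that pulls the backreferences out of the @- and @+ offset arrays after each scalar-context match. The sub name capture_rows is invented for this example, and the approach assumes the text is not modified while the loop runs. It is one way to do it under those assumptions, not necessarily the best one.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns a reference to an array
# of rows, one row per match, one column per capture group in the pattern.
sub capture_rows {
    my ($text, $pattern) = @_;
    my @rows;
    while ($text =~ /$pattern/g) {
        # $#+ is the number of capture groups in the pattern just matched.
        # $-[$n] and $+[$n] are the start and end offsets of group $n;
        # a group that did not take part in the match is left as undef.
        push @rows, [
            map { defined $-[$_] ? substr($text, $-[$_], $+[$_] - $-[$_]) : undef }
            1 .. $#+
        ];
    }
    return \@rows;
}

my $rows = capture_rows("1.2 blah 147 3.5 hey 9999 14.75", qr/(\d+)\.(\d+)/);
print join("\t", @$_), "\n" for @$rows;
```

With the sample line this gives three rows of two columns each (1 and 2, 3 and 5, 14 and 75), and the column count follows the pattern without any fixed limit on the number of groups.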

________________________________________
Andrew

I work for a gift card company!
 
Nice idea.
But surely you don't actually need to execute the regex at least twice for each record.
The fact that the regex in list context gives the required result should surely be enough.

Ala:

Code:
my $regex1 = qr/(blah) (blah)/;
my $regex2 = qr/(another) (dummy) (regex)/;

     .....

my @r1 = $record =~ /$regex1/; # will hold 2 fields 
my @r2 = $record =~ /$regex2/; # will hold 3 fields
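
A runnable version of the same idea, with a made-up $record string filled in purely for illustration. It only shows that a single (non-global) match in list context returns one element per capture group, so the array length gives the column count.

```perl
use strict;
use warnings;

# Sample record invented for this example.
my $record = "blah blah another dummy regex";

my $regex1 = qr/(blah) (blah)/;
my $regex2 = qr/(another) (dummy) (regex)/;

my @r1 = $record =~ /$regex1/;   # one element per capture group
my @r2 = $record =~ /$regex2/;

print scalar(@r1), " fields: @r1\n";
print scalar(@r2), " fields: @r2\n";
```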
 
Unless I'm misunderstanding the OP, that's still not the same thing. I thought the idea was to match a given pattern against a string as many times as possible, and with each match retrieve a list of the backreferences from the match. This way you get your two-dimensional array: rows are matches, columns are each backreference in the match. There are as many rows as the pattern matches and as many columns as backreferences in the pattern. Counts of either are available with simple array length tests.

If the regex doesn't have to be repeated or repeated matches don't have to be broken up, this whole thing is as simple as Mike's first response and we're trying to complicate matters unnecessarily.

________________________________________
Andrew

I work for a gift card company!
 