Match on every array element 2


dlabdo (Programmer), Sep 6, 2005
@x = ("in the", "skipping along");

@y = ("we walk in the park", "we walk in the dark", "we walk holding hands", "we walk skipping along");

I need to create an if statement that basically does this to search the y array...

if (any whole element in the x array is within any element of the y array, then go on)

can someone help me out with this syntax? thanks.
 
not really sure what you are asking:

Code:
my @x = ("in the", "skipping along");
my @y = ("we walk in the park", "we walk in the dark", "we walk holding hands", "we walk skipping along");

for (@x) {
   foreach my $line (@y) {
      next if ($line !~ /$_/);
      print "$line\n";
   }
}

prints:

we walk in the park
we walk in the dark
we walk skipping along
 
yeah, that's what I thought I was going to do initially, but I wondered whether there was a more efficient way. Thank you for your help!
 
"a more efficient way" seems a little unfair on KevinADC.

Since I shouldn't criticise without contributing,
Code:
my @x = ("in the", "skipping along");
my @y = ("we walk in the park", "we walk in the dark", "we walk holding hands", "we walk skipping along");

map { $_ = qr/$_/ } @x; # precompile regexes

Y: foreach my $line (@y) {
  foreach my $re (@x) {
    if ( $line =~ $re ) {
      print "$line\n";
      next Y;
} } }

f

["]As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs.["]
--Maur
 
benchmarking reveals which is more efficient, at least for the sample data used:

(This chart is sorted from slowest to fastest, and shows the percent speed difference between each pair of tests.)

Code:
Benchmark: timing 50000 iterations of Fish, Kevin...
      Fish:  2 wallclock secs ( 2.14 usr +  0.00 sys =  2.14 CPU) @ 23364.49/s (n=50000)
     Kevin:  1 wallclock secs ( 1.43 usr +  0.00 sys =  1.43 CPU) @ 34965.03/s (n=50000)
         Rate  Fish Kevin
Fish  23364/s    --  -33%
Kevin 34965/s   50%    --

here is the code to benchmark:

Code:
#!perl -w
use strict;
use Benchmark qw( timethese cmpthese ) ;

my @x = ("in the", "skipping along");
my @y = ("we walk in the park", "we walk in the dark", "we walk holding hands", "we walk skipping along");

my $results = timethese(50000, 
        {
            'Kevin' => \&Kevin,
            'Fish' => \&Fish,
        },
    );
cmpthese( $results ) ;

sub Kevin {
for (@x) {
   foreach my $line (@y) {
      next if ($line !~ /$_/);
      #print "$line\n";
   }
}
}

sub Fish {

map { $_ = qr/$_/ } @x; # precompile regexes

Y: foreach my $line (@y) {
  foreach my $re (@x) {
    if ( $line =~ $re ) {
      #print "$line\n";
      next Y;
} } }
}

the print lines are commented out (I don't want to see the output 100,000 times!) but should not affect the results since they both do the exact same thing.
 
this might be a fairer test but I am not entirely sure:

Code:
#!perl -w
use strict;
use Benchmark qw( timethese cmpthese );

my @x = ("in the", "skipping along");
my @w = @x;
my @y = ("we walk in the park", "we walk in the dark", "we walk holding hands", "we walk skipping along");
map { $_ = qr/$_/ } @w; # precompile regexes


my $results = timethese(50000, 
        {
            'Kevin' => \&Kevin,
            'Fish' => \&Fish,
        },
    );
cmpthese( $results ) ;



sub Kevin {
for (@x) {
   foreach my $line (@y) {
      next if ($line !~ /$_/);
      #print "$line\n";
   }
}
}

sub Fish {

Y: foreach my $line (@y) {
      foreach my $re (@w) {
         if ( $line =~ $re ) {
            #print "$line\n";
            next Y;
         }
      }
   }
}

in which case the results are very different:

Code:
Benchmark: timing 50000 iterations of Fish, Kevin...
      Fish:  2 wallclock secs ( 1.32 usr +  0.00 sys =  1.32 CPU) @ 37878.79/s (n=50000)
     Kevin:  4 wallclock secs ( 4.29 usr +  0.00 sys =  4.29 CPU) @ 11655.01/s (n=50000)
         Rate Kevin  Fish
Kevin 11655/s    --  -69%
Fish  37879/s  225%    --
 
(been away)

Your first test is broken because the map changes the contents of @x, so the second and subsequent invocations use the already-compiled regex as input to the compilation.
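
A quick way to see what's happening (an untested sketch; the exact stringified form depends on the perl version):
Code:
my @x = ("in the");
map { $_ = qr/$_/ } @x;            # first pass: @x now holds compiled Regexp objects
print ref( $x[0] ), ": $x[0]\n";   # prints something like "Regexp: (?-xism:in the)" on a 5.8-era perl
# a second pass over @x would therefore compile that wrapped string, not the original pattern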

The second one is unfair because the regexes only get compiled once per run rather than once per test cycle.

I think a fair test would include
Code:
sub Fish {
    my @X = map { qr/$_/ } @x; # precompile regexes
    Y: foreach my $line (@y) {
        foreach my $re (@X) {
            if ( $line =~ $re ) {
                #print "$line\n";
                next Y;
    } } }
}
which gives
Code:
Benchmark: timing 50000 iterations of Fish, Kevin...
      Fish: 10 wallclock secs ( 4.73 usr +  0.00 sys =  4.73 CPU) @ 10570.82/s (n=50000)
     Kevin:  7 wallclock secs ( 3.76 usr +  0.01 sys =  3.77 CPU) @ 13262.60/s (n=50000)
         Rate  Fish Kevin
Fish  10571/s    --  -20%
Kevin 13263/s   25%    --

I'm not sure why my two "common sense" optimisations prove so expensive with this problem set. I've added counters that showed that, in benchmark code modified as above, Kevin() has 400,000 regex match evaluations against 300,000 for Fish(). More significantly, Kevin() has 400,000 regex compilations whereas Fish() only has 100,000.

I'm still confident that short-circuiting the testing on a match and pre-compiling the regexes would provide a more measurable saving with a larger problem set, but I'm fascinated by its failings in this instance.

Can anyone shed light?

f

PS - thanks for the idiot's intro to Benchmarking - I'd not explored it because I'd not expected it to be so simple to use. Another tool in the box [smile2]

["]As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs.["]
--Maur
 
I am also puzzled by the results; I was sure your way (using qr//) was going to be faster, or at least more efficient.

I am still not sure whether even your revised test is fair, although I now see why my tests were not. I think compiling the @X array just once, before the iterations, would be fairer.
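
Something like this, perhaps (untested - assumes the @x, @y and benchmark harness from above):
Code:
my @X = map { qr/$_/ } @x;   # compile once, before the iterations

sub Fish {
    Y: foreach my $line (@y) {
        foreach my $re (@X) {
            if ( $line =~ $re ) {
                #print "$line\n";
                next Y;
    } } }
}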

The Benchmark module is very easy to use, the documentation is quite readable and has several examples of how to use various functions.
 
Something to add to the ``when to use regexps'' debate! Here's a benchmark against Kevin's original code and fish's ``fair'' test. The code I've used (avoiding regexps altogether) is:
Code:
sub ishnid {
   for ( @y ) {
      for my $substr ( @x ) {
         next unless ( index( $_, $substr ) + 1 );
#        print "$_\n";
         last;
      }
   }
}
Again, I've commented out the print line. The reason I'm not using regexps is that the strings we're matching are only substrings, not patterns. I got the ``too few iterations'' warning for 50,000 iterations, so I ran it over 100,000 too - here are both results:
Code:
Benchmark: timing 50000 iterations of Fish, Kevin, ishnid...
      Fish:  2 wallclock secs ( 1.44 usr +  0.00 sys =  1.44 CPU) @ 34722.22/s (n=50000)
     Kevin:  1 wallclock secs ( 1.13 usr +  0.00 sys =  1.13 CPU) @ 44247.79/s (n=50000)
    ishnid:  0 wallclock secs ( 0.31 usr +  0.00 sys =  0.31 CPU) @ 161290.32/s (n=50000)
            (warning: too few iterations for a reliable count)
           Rate   Fish  Kevin ishnid
Fish    34722/s     --   -22%   -78%
Kevin   44248/s    27%     --   -73%
ishnid 161290/s   365%   265%     --

Code:
Benchmark: timing 100000 iterations of Fish, Kevin, ishnid...
      Fish:  3 wallclock secs ( 2.85 usr +  0.01 sys =  2.86 CPU) @ 34965.03/s (n=100000)
     Kevin:  3 wallclock secs ( 2.16 usr +  0.00 sys =  2.16 CPU) @ 46296.30/s (n=100000)
    ishnid:  1 wallclock secs ( 0.64 usr +  0.00 sys =  0.64 CPU) @ 156250.00/s (n=100000)
           Rate   Fish  Kevin ishnid
Fish    34965/s     --   -24%   -78%
Kevin   46296/s    32%     --   -70%
ishnid 156250/s   347%   237%     --
 
Point taken.

Given the seductive nature of regexes in perl (resulting from the tight integration), I wonder whether the regex mechanism shouldn't degrade to using index internally when the pattern contains no metacharacters.

I can't yet prove to myself that it would always provide a performance gain but it "feels" right. Whether it's philosophically better is moot. Pragmatically, I'd like the extra performance when I've missed a trick or been lazy but, on the other hand, when I've got the right tool in the box I feel I shouldn't be trying to bend another into inappropriate service.

f

["]As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs.["]
--Maur
 
You could also argue that using regexps for matching substrings is ``inappropriate service'' too! :) Such is the beauty of TIMTOWTDI. An internal fallback on `index' would indeed be a good idea though - particularly since the programmer isn't always necessarily going to know in advance whether the pattern is a true pattern or just a string.

As an aside, this also brings another Perl motto into play: ``Don't do nothing when you can do something useful''. By returning a numeric value (-1) when a substring isn't found, the `index' function makes itself extremely useful and helps cut down the number of functions needed in the language. Take Java, for example: it has contains(), indexOf() and startsWith() methods in the String class. They can all be achieved using the single `index' function in Perl:
Code:
# indexOf (obviously!)
return index( $string, $substring );

# contains
return index( $string, $substring ) + 1;

# startsWith
return ! index( $string, $substring );

# or even better (though more complex)
return ! rindex( $string, $substring, 0 );
 
I'm probably losing my marbles, but I can't see this approach suggested above?

i.e. only one loop (not nested loops)

Code:
#!/usr/bin/perl

@compare = ("in the", "skipping along");
@walking = ("we walk in the park", "we walk in the dark", "we walk holding hands", "we walk skipping along");

foreach $compare_item (@compare) {
  print join("\n", grep /$compare_item/, @walking);
}


Kind Regards
Duncan
 
Technically, the use of grep introduces another loop, without the benefit of being able to break out of the loop when a match is found - i.e. it'll continue making comparisons after it needs to.
 
It might be interesting to test it for performance though, since one would imagine that the "grep" processing should be optimised and run much faster than a hand-coded loop.



Trojan.
 
That's still a loop, just different syntax. It's quicker to type but it prevents you from easily short-circuiting the inner loop, if you wanted to do that.

I think we've all been ignoring the OP:
"if any whole element in the x array is within any element of the y array..."
which suggests that all further tests should be abandoned after the first match. Kevin and you test every phrase against every pattern, and I stopped testing each phrase after the first match (but carried on redundantly testing subsequent phrases).
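
Something along these lines would honour that, abandoning all testing at the first hit (a rough, untested sketch, using index as ishnid suggests, and assuming the OP's @x and @y):
Code:
my $found = 0;
Y: foreach my $line (@y) {
    foreach my $substr (@x) {
        if ( index( $line, $substr ) >= 0 ) {
            $found = 1;
            last Y;   # first match found - stop all further testing
        }
    }
}
if ( $found ) {
    # go on...
}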


Ishnid:
"using regexps for matching substrings is inappropriate"
That was (meant to be) exactly my point. I sometimes struggle to express myself clearly on Monday mornings!

["]As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs.["]
--Maur
 
Again, I've removed the `print' statement (so the join is in a void context) for comparability.
Code:
Benchmark: timing 100000 iterations of Duncan, Fish, Kevin, ishnid...
    Duncan:  2 wallclock secs ( 1.90 usr +  0.00 sys =  1.90 CPU) @ 52631.58/s (n=100000)
      Fish:  4 wallclock secs ( 2.85 usr +  0.01 sys =  2.86 CPU) @ 34965.03/s (n=100000)
     Kevin:  2 wallclock secs ( 1.98 usr +  0.00 sys =  1.98 CPU) @ 50505.05/s (n=100000)
    ishnid:  1 wallclock secs ( 0.66 usr +  0.01 sys =  0.67 CPU) @ 149253.73/s (n=100000)
           Rate   Fish  Kevin Duncan ishnid
Fish    34965/s     --   -31%   -34%   -77%
Kevin   50505/s    44%     --    -4%   -66%
Duncan  52632/s    51%     4%     --   -65%
ishnid 149254/s   327%   196%   184%     --
 
very interesting stuff, and a great example of why using index() when looking for substrings is so much faster than using regexps, which should be used for searching for patterns.

Too bad most of the Perl resource materials and online tutorials don't stress this enough. Regular expressions (for good reason) get lots of attention, but string functions like index(), substr() and rindex() hardly get more than a very short explanation.

I've also noted how freaking slow the join() function is. That alone can make a script slow to a crawl. I avoid it as much as possible myself. That really should be fixed; I don't understand why such a seemingly simple function is so slow.

a star for ishnid for the excellent information.
 