grep vs regular expr


Corwinsw
Programmer - Sep 28, 2005
Hi guys.
Directly to the point: I have about 1 GB of log files containing, among other things, dates, keys and some attributes. On the rows where a given date and key match, I need to check for an attribute and, if it's present, extract its value.
I've done it with regular expressions that check every row for the date, key and attribute and, on a match, extract the value.
The problem is that it's too slow: 50 MB takes over 15 minutes, which is totally unacceptable.
So I'm thinking of using grep first for the date, then for the keys on the matching rows, and only after that using regexes for my attributes.
So my question is: which is faster, regexes or grep? Also, somebody told me that the shell's grep is faster and that I should use it. I'll benchmark all the variants, but I want to know your opinions - and do you have any other ideas for optimising the search?
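Roughly what I have in mind (just a sketch - the file name, date format and attribute name here are invented):
Code:
#!/usr/bin/perl -w
use strict;

my $date = '28/09/2005';
my $key  = 'KEY42';

# stage 1: let the shell grep pre-filter by date,
# so perl only ever sees candidate rows
open GREP, "grep '$date' big.log |" or die "can't run grep: $!";

while (<GREP>) {
   next unless /\Q$key\E/;            # stage 2: filter by key
   if (/Attr =\[([^\]]*)\]/) {        # stage 3: extract the attribute value
      print "$1\n";
   }
}
close GREP;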
PS. I'm really a newbie, so I don't know if this is a stupid question, and I'm not even sure about my English, but I'd appreciate any help.

Corwin
 
If you post the log file, I'm sure you'll get plenty of suggestions to benchmark. Can you also give an example of exactly what you're trying to do? I can quite safely say that we enjoy this type of challenge on this forum!

Your existing code would be useful too - maybe improvements can be suggested.
 
You didn't specify which language you use for your regexes.
Grep uses regexes, as do Perl and many other languages and tools.
If you post some sample data from your source and show us what you want the output to look like, we'll have a go at writing something for you that performs well.



Trojan.
 
While I am regularly proved wrong (mostly by ishnid), I'll stick my neck out and say that clean perl code tends to beat unix grep hands down wherever I have tried both - to the point that I seldom bother trying grep-based solutions anymore (although now that I've discovered [tt]use Benchmark;[/tt] I might look again).

Having said that, post some data and we'll have a go.

fish

["]As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs.["]
--Maur
 
I don't have time now, so I'll post the code later, but I have another quick question: how can I interpolate an array element in a regex?
1.
Code:
      if (/\[$IP[$count]\[$PID[$count]\]/) {
         print "$_\n";
      }
2.
Code:
      if (/\[$IP\[$count\]\[$PID\[$count\]\]/) {
         print "$_\n";
      }

Neither of them works.




Corwin
 
Assign each element to a scalar first and use the scalar in the regex, e.g.
Code:
      my $ip  = $IP[$count];
      my $pid = $PID[$count];
      if (/\[$ip\]\[$pid\]/) {
         print "$_\n";
      }


Trojan.
 
Yes, it could be done that way, but actually my first way was right - I had just missed one more \].
So:
Code:
      if (/\[$IP[$count]\]\[$PID[$count]\]/) {
         # do something with the matching line
      }

Corwin
 
Do you have regular expression patterns in your @IP and @PID arrays? If not, then you shouldn't be using regular expressions to identify which lines you want. Since you're trying to find exactly matching strings, you should use the index() function rather than firing up the regular expression engine:
Code:
my $to_match = "[$IP[$count]][$PID[$count]]";
# index() returns -1 when the substring is absent,
# so adding 1 yields 0 (false) only on a miss
if ( index( $_, $to_match ) + 1 ) {
   # found it!
}
We've benchmarked this recently and this should be faster for you.
 
Ah, I see.
I had something like:
Code:
if (/\[$IP[$count]\]\[$PID[$count]\] Attr =\[([^\]]*)\]/) {
   # some use of $1
}
but it could be done with index. Actually, why is index the better choice?


Corwin
 
index() is considerably faster than using regular expressions when you are looking for a fixed substring. Regular expressions are for searching for patterns.
 
for 100 iterations:

Code:
Benchmark: timing 100 iterations of Index, LinuxGrep, PerlGrep, RegExpr...
     Index:  6 wallclock secs ( 5.40 usr +  0.25 sys =  5.65 CPU) @   17.70/s (n=100)
 LinuxGrep:  1 wallclock secs ( 0.06 usr  0.02 sys +  0.38 cusr  0.42 csys =  0.88 CPU) @ 1250.00/s (n=100)
  PerlGrep: 12 wallclock secs (11.41 usr +  0.54 sys = 11.95 CPU) @    8.37/s (n=100)
   RegExpr:  5 wallclock secs ( 5.25 usr +  0.25 sys =  5.50 CPU) @   18.18/s (n=100)

              Rate PerlGrep  Index RegExpr LinuxGrep
PerlGrep    8.37/s       --   -53%    -54%      -99%
Index       17.7/s     112%     --     -3%      -99%
RegExpr     18.2/s     117%     3%      --      -99%
LinuxGrep   1250/s   14837%  6962%   6775%        --

for 1000 iterations:

Code:
Benchmark: timing 1000 iterations of Index, LinuxGrep, PerlGrep, RegExpr...
     Index:  57 wallclock secs ( 53.59 usr + 2.90 sys =  56.49 CPU) @  17.70/s (n=1000)
 LinuxGrep:  10 wallclock secs (  0.55 usr  1.20 sys +  3.31 cusr  5.24 csys = 10.30 CPU) @ 571.43/s (n=1000)
  PerlGrep: 120 wallclock secs (114.41 usr + 5.52 sys = 119.93 CPU) @   8.34/s (n=1000)
   RegExpr:  52 wallclock secs ( 48.88 usr + 2.99 sys =  51.87 CPU) @  19.28/s (n=1000)

              Rate PerlGrep  Index RegExpr LinuxGrep
PerlGrep    8.34/s       --   -53%    -57%      -99%
Index       17.7/s     112%     --     -8%      -97%
RegExpr     19.3/s     131%     9%      --      -97%
LinuxGrep    571/s    6753%  3128%   2864%        --

I didn't think Linux grep was so unbelievably fast...
So my question is: why would anybody use Perl's grep?
And also, why is Index slower than RegExpr?

Corwin
 
Without seeing the code you're using in each instance, we can only speculate as to the reasons. Can you post the 4 subroutines you're running?
 
Sorry.

Code:
#!/usr/bin/perl -w
use strict;
use Benchmark qw(:all);

cmpthese(100,{
      'RegExpr'  => sub {
         my @arr;
         open FH, "< dcp-wce-error.log.1" or die $!;
         while (<FH>) {
            if (/commit/) {
               push @arr, $_;
            }
         }
         close FH;
      },

      'PerlGrep' => sub {
         open FH, "< dcp-wce-error.log.1" or die $!;
         my @gr = grep(/commit/, <FH>);
         close FH;
      },

      'Index'    => sub {
         my @arr;
         open FH, "< dcp-wce-error.log.1" or die $!;
         while (<FH>) {
            if (index($_,"commit")>-1) {
               push @arr,$_;
            }
         }
         close FH;
      },

      'LinuxGrep'=> sub {
         my @gr= `grep commit dcp-wce-error.log.1`;
      },
   }
);

The file is about 60,000 rows.

Corwin
 
Ouch. Not quite as devastating but certainly verified:
Code:
           Rate  PerlGrep     Index   RegExpr LinuxGrep
PerlGrep  104/s        --      -46%      -48%      -79%
Index     193/s       85%        --       -4%      -62%
RegExpr   200/s       92%        4%        --      -60%
LinuxGrep 503/s      381%      161%      151%        --

It's not just the speed of the grep binary that impresses me - it's the speed with which the kernel forks the subprocess, loads the binary and dynamically links the necessary libraries.
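Out of curiosity, you can put a rough number on just that spawn cost by shelling out to a do-nothing command (a quick sketch - assumes /bin/true is where it usually is):
Code:
use Benchmark qw(:all);

# times nothing but fork + exec + dynamic linking
timethese( 1000, {
        'SpawnTrue' => sub { system('/bin/true') },
} );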

f

"As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs."
--Maurice Wilkes
 
Can we try the same thing with the data in memory?
Obviously LinuxGrep will be at a disadvantage there, but I'm concerned about the file-handling overhead in these figures.


Trojan.
 
Code:
#!/usr/bin/perl -w

use strict;
use Benchmark qw(:all);

open( FH, '/var/log/syslog.0' ) or die $!;
my @data = <FH>;
close FH;

cmpthese( 1000, {
        'RegExpr'  => sub {
                my @arr;
                foreach (@data) {
                        push @arr, $_ if /started/;
                }
        },
        'PerlGrep' => sub {
                my @gr = grep(/started/, @data );
        },

        'Index'    => sub {
                my @arr;
                foreach (@data) {
                        # index() returns -1 on a miss (which is true!),
                        # so the result must be compared against -1
                        push @arr, $_ if index( $_, "started" ) > -1;
                }
        },

        'LinuxGrep'=> sub {
                my @gr= `grep started /var/log/syslog.0`;
        },
} );
Code:
           Rate     Index   RegExpr  PerlGrep LinuxGrep
Index     168/s        --      -61%      -63%      -77%
RegExpr   433/s      158%        --       -6%      -42%
PerlGrep  459/s      174%        6%        --      -38%
LinuxGrep 741/s      342%       71%       61%        --

I have to say that I'm finding this all counter-intuitive.

f

"As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs."
--Maurice Wilkes
 
I agree, but the findings are impressive.
We must remember, though, that we are doing a dumb literal search, and that's not what regexes are all about. Grep has an algorithm that spots literal strings and treats them differently (hence fgrep), so I suggest we also need to test more relevant, real-world regexes to get a good picture of what is happening here. Obviously index cannot do that, so we would have to leave it out of that test, but it would be interesting nonetheless - something along the lines of the sketch below, perhaps.
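Just a sketch (the pattern and file name are invented; grep -E gets the equivalent extended regex on the shell side):
Code:
#!/usr/bin/perl -w
use strict;
use Benchmark qw(:all);

# a real pattern: alternation, an optional group, a character class
my $re = qr/(warn|err(or)?): [0-9]+/;

cmpthese( 100, {
        'PerlRegex' => sub {
                open FH, '< test.log' or die $!;
                my @hits = grep { /$re/ } <FH>;
                close FH;
        },
        'LinuxGrep' => sub {
                my @hits = `grep -E '(warn|err(or)?): [0-9]+' test.log`;
        },
} );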


Trojan.
 
Also, it's worth remembering that grep is an optimised, compiled executable, whereas any Perl code we write has to be interpreted in its bytecode form, so it is never likely to match the speed of dedicated code like grep.


Trojan.
 
I tried a version where the regex was precompiled and that made it even worse.
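For the record, "precompiled" here means the qr// form - roughly this shape, run against the same @data as before:
Code:
# compile the pattern once, outside the loop
my $re = qr/started/;

my @arr;
foreach (@data) {
        push @arr, $_ if /$re/;
}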

I don't know: this all makes me feel uncomfortable. I'm happy that I often write sub-optimally efficient code for the sake of clarity (and thus, hopefully, reliability and maintainability), and I think I've got a reasonable nose for knowing when efficiency really matters.

That, however, is no argument for not writing
Code:
my @gr= `grep started /var/log/syslog.0`;
instead of
Code:
my @gr = grep(/started/, @data );
- even if I have to populate @data first - but I have a powerful feeling that I don't want to. It simply doesn't seem right or clean and I can't explain it any more logically than that.

It's good to have another tool in the kit, but I feel somehow let down by Perl.

I probably need to write a JAPH or something to cheer myself up and rekindle the fire.

f

[&quot;]As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs.[&quot;]
--Maur
 
Fish,
Perl is a scripting language, and a very fine one at that.
Don't be disappointed: it can never compete with tools like grep, because grep is a specialist - it does nothing but filter, and cannot perform any operations on the data it matches - while Perl is a language that lets you do anything you like. You pay for that versatility with performance. If performance is your number 1 priority, then go and learn C/C++ or, better still, assembler. You will get much better performance there, but it will take you much longer to develop your solutions.
Perl is part compiled and part interpreted (bytecode). If you want to compare it to anything, compare it to Java, which is compiled and interpreted in a similar way; the obvious difference is that Java's compilation phase is completely separate.
I think you'll find, though, that Perl compares very favourably with Java in terms of performance.


Trojan.
 