
grep vs regular expr


Corwinsw
Programmer
Sep 28, 2005
Hi guys.
Directly to the point: I have about 1 GB of log files containing, among other things, dates, keys, and some attributes. On the rows where a given key and date match, I need to check for an attribute and, if it's there, extract its value.
I've done it with regular expressions that check every row for the date, key, and attribute, and extract the value on a match.
The problem is that it's too slow. For 50 MB it takes over 15 minutes, which is totally unacceptable.
So I'm thinking of using grep first for the date, then for the keys on the matched dates, and only after that using regexes for the attributes.
So my question is: which is faster, a regex or grep? Somebody also told me that the shell's grep is faster and that I should use it. I'll benchmark all the variants, but I want to know your opinions, and whether you have any other ideas for optimizing the search.
PS. I'm a real newbie, so I don't know if the question is stupid, and I'm not even sure about my English, but I would appreciate any help.

Corwin
 
It might also be worth pointing out that you can take advantage of "grep" and get (to some extent) the best of both worlds by using a pipeline and only processing (in perl) the lines from the file that actually need processing:
Code:
open PIPE, "grep commit logfile |" or die "Can't run grep: $!";
while (<PIPE>) {
  # Process only the records that grep has already selected
}
close PIPE;
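
If you'd rather not pass the command through the shell, a list-form pipe open does the same job. This is just a sketch, not from the original post; "commit" and "logfile" are the same placeholders as above:
Code:
# Hypothetical variant: list-form pipe open, bypassing the shell.
open my $pipe, '-|', 'grep', 'commit', 'logfile'
    or die "Can't run grep: $!";
while (<$pipe>) {
    # Process records selected by grep here
}
close $pipe;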


Trojan.
 
While I pretty much feel like fish, I guess I can see why. Linux grep is a precompiled, heavily optimized C program whose only job is to look for patterns in a file, so it should be very good at it. If it were slower than a general-purpose scripting language's implementation, I'd want my money back.

On the flip side, if you were looking for the existence of several patterns in a single larger string, instead of line-by-line searching, using Perl's study would likely yield much better results than even fgrep.
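
A minimal sketch of that idea (the filename and patterns here are invented): slurp the file into one string, study() it once, then run several matches against it:
Code:
# Slurp the whole file into one scalar, then let study() analyse it
# before running several patterns against the same string.
open my $fh, '<', 'logfile' or die "Can't open logfile: $!";
my $text = do { local $/; <$fh> };
close $fh;

study $text;    # one-off analysis that can speed up repeated matches

for my $pat (qr/commit/, qr/rollback/, qr/deadlock/) {
    print "matched $pat\n" if $text =~ $pat;
}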

Then again, specialization's for insects. :)

- Andrew
Text::Highlight - A language-neutral syntax highlighting module in Perl
also on SourceForge including demo
 
Ok, so I have another question: why is index so slow, when it's supposed to be much faster than grep and regexes?

Corwin
 
I started a thread (thread219-1122893) on this very subject some time ago, since I don't believe that index and substr are always such a wise choice. People often suggest them for performance reasons, and they can sometimes offer an advantage, but I find they can cause problems too.


Trojan.
 
Retry the index without the explicit check:

change this:

if (index($_,"commit")>-1) {

to:

if (index($_,"commit")) {

and see how much speed improvement you get. If you are just checking to see if the substring exists you don't have to actually use a comparison operator. You would only use the comparison if you wanted to know where the substring was.
 
????
index returns -1 on failure (which is considered TRUE), or the position of the substring: zero (which is FALSE) or greater (which is TRUE). So your suggestion doesn't make sense at all: you would get TRUE unless the substring started exactly at the beginning of the line.
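
A self-contained sketch of the difference (the sample data is made up):
Code:
my @lines = ("commit at start", "a commit inside", "no match here");

for (@lines) {
    # Correct: index() returns -1 when "commit" is absent,
    # so compare explicitly against -1.
    print "hit: $_\n" if index($_, "commit") > -1;
}

# Wrong: testing index($_, "commit") for plain truth is false only
# when the match is at position 0, and true otherwise - including
# on failure, when it returns -1.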


Trojan.
 
Ah, I looked at it.

Another question. In Mastering Algorithms with Perl I read something like:

print $var,"\n" is 5% faster than
print $var."\n" and 30% faster than
print "$var\n"

My question is:
Why does that happen?

P.S. I hope I'm not becoming tiresome with these types of questions.

Corwin
 
Your third example is the slowest because it uses interpolation. Interpolation can be costly: perl has to parse the string for variables, build a new string, and shuffle its contents around to make space for the value of the interpolated variable(s).
In the first two cases no string needs to be parsed: print is given either a plain list of values or a single concatenation, which is quicker.
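
If you want to verify the figures on your own machine, here is a rough sketch with the core Benchmark module (writing to /dev/null so terminal output doesn't dominate the timing):
Code:
use Benchmark qw(cmpthese);

open my $null, '>', '/dev/null' or die "Can't open /dev/null: $!";
my $var = "hello world";

# Run each style for at least 2 CPU seconds and compare the rates.
cmpthese(-2, {
    list   => sub { print {$null} $var, "\n" },
    concat => sub { print {$null} $var . "\n" },
    interp => sub { print {$null} "$var\n" },
});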


Trojan.
 
Quoting a colleague of mine:

"The Benchmark module measures performance by running the code provided N times and then taking the user and system times of the process. In the case of LinuxGrep, the backticks launch another process, whose time is not taken into account (and that is where the real execution time actually goes). So what you tested is how fast perl can run an empty loop. And the result is ~1K/sec and no more because of the measurement error you get every time (times are rounded to 10 ms in the Linux kernel)."
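
For reference, the pattern being criticised looks something like this (using the same file and search string that appear later in the thread); per the explanation above, the CPU time grep itself burns is charged to the child process, not to the loop Benchmark is timing:
Code:
use Benchmark qw(timethese);

# Each iteration forks a shell plus a grep; Benchmark only sees this
# process's own (nearly idle) user/system time, not the child's.
timethese(100, {
    LinuxGrep => sub { my @hits = `grep started /var/log/syslog.0` },
});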


Corwin
 
The figures in fish's last benchmark are a little misleading. As Trojan says above, using the return value of index() to test for truth results in a match for every case other than when the line starts with "started". That means that the subroutine with `index' in it is causing far more calls to the `push' function than the rest, which will negatively impact its performance.
 
Also, remember that perl's grep is not a search operator, it's a list operator. The difference above between PerlGrep and RegExpr is really a benchmark between a foreach block and a grep call, since both are using a regex for the test. By the same token, you might shave a little more off index's time by using grep as well. Every time you have a set of { } in perl, it has to create scope and such, which takes a little time.
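
A self-contained sketch of the distinction being drawn (the data is invented):
Code:
my @lines = ("service started", "service stopped", "job restarted");

# A foreach block with a regex test, pushing matches by hand...
my @hits_loop;
foreach my $line (@lines) {
    push @hits_loop, $line if $line =~ /started/;
}

# ...versus grep, a list operator that filters in one expression.
my @hits_grep = grep { /started/ } @lines;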

Corwin, is your colleague suggesting that LinuxGrep would be even faster if there were more timing resolution?

- Andrew
Text::Highlight - A language-neutral syntax highlighting module in Perl
also on SourceForge including demo
 
No, he's saying that what Benchmark actually timed was an empty loop, and that Linux grep is really only about 10% faster than Perl's grep.

Corwin
 
I still think that people look at performance from the wrong point of view.
You'll probably find that, in general, perl is plenty fast enough for most things you're ever likely to do, and on the odd occasion where it appears too slow, you may well find it's poorly written code causing the performance issue. You can usually resolve those kinds of issues by profiling your code. I gave an example above of how you can harness any performance advantage Linux grep might have from within your perl scripts.


Trojan.
 
I'm very glad that we now understand why the LinuxGrep figures are so misleading!

I recoded using the return value of index correctly ([tt]>-1[/tt]) and re-ran, getting
Code:
            Rate     Index   RegExpr  PerlGrep LinuxGrep
Index      962/s        --        0%      -11%      -44%
RegExpr    962/s        0%        --      -11%      -44%
PerlGrep  1075/s       12%       12%        --      -38%
LinuxGrep 1724/s       79%       79%       60%        --
which is not too far from what I would expect, given our new knowledge of Benchmark's limitations.

So how can we get a feel for the performance of command-line grep, taking into account the time to load the binary and libraries?

I made a shell script with 1024 identical lines (after the shebang) like this
Code:
#!/bin/sh
grep started /var/log/syslog.0
grep started /var/log/syslog.0
grep started /var/log/syslog.0
grep started /var/log/syslog.0
grep started /var/log/syslog.0
...
and timed it:
Code:
fish@spider:~# time ./t.sh >/dev/null

real    0m3.218s
user    0m1.230s
sys     0m1.390s
so, crudely, adding tUser and tSys (1.230 + 1.390 = 2.62 s) and dividing that into 1024 gives a headline rate of about 390 per second.

I don't expect that the shell added significant overhead, so this figure is the closest we've got in this thread to a real benchmark of this technique (spawning a command-line grep rather than using a perl built-in), as it times the invocation as well as the execution.

We've covered a lot of ground in this thread, so I'm going to attempt a potted summary:

1. TMTOWTDI
2. When choosing a technique, performance is rarely the most important issue. Don't be needlessly wasteful and you'll be fine in 99.62% of cases.
3. There are many cases where [tt]substr[/tt], [tt]index[/tt] &c are arguably "better tools for the job" than regexes but it's not hugely significant in performance terms so, IMHO, use the syntax that most clearly conveys your intention. In the example we have been analysing, my feeling is that the "perl grep" syntax is the cleanest and the most easy to read.
4. It is not always worthwhile to reach outside of perl for command-line utilities - the overhead of the context switches is relatively high - but if you are doing a huge amount of, say, grepping, then Trojan's pipelining makes a lot of sense and is very easy to read.

Yours,

fish

"As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs."
--Maurice Wilkes
 
Quoting my colleague again:
[cite]
So another conclusion you should add to the list:

"When a test gives you really strange results, don't believe it blindly, but try to understand why. It could show that your testbed is not correct."
[/cite]

Corwin
 
and a deserved star for that, Corwin.

f

 
OK - he can share it [2thumbsup]

 