
grep vs regular expr


Corwinsw
Programmer
Sep 28, 2005
Hi guys.
Directly to the point: I have about 1 GB of log files containing, among other things, dates, keys, and some attributes. On the rows where a given key and date match, I need to check for an attribute and, if it's there, extract its value.
I've done it with regular expressions that check every row for the date, key, and attribute, and extract the value on a match.
The problem is that it's too slow. For 50 MB it takes over 15 minutes, which is totally unacceptable.
So I'm thinking of using grep first for the date, then for the keys on the matched dates, and only after that using regexes for the attributes.
So my question is: which is faster, a regex or grep? Somebody also told me that the shell's grep is faster and that I should use it. I'll benchmark all the variants, but I want to know your opinions, and whether you have any other ideas for optimizing the search.
PS. I'm a real newbie, so I don't know if the question is stupid, and I'm not even sure about my English, but I would appreciate any help.

Corwin
 
It might also be worth pointing out that you can take advantage of "grep" and get (to some extent) the best of both worlds by using a pipeline and only processing (in perl) the lines from the file that actually need processing:
Code:
open PIPE, "grep commit logfile |" or die "Can't run grep: $!";
while (<PIPE>) {
  # Process only the records that grep has already selected
}
close PIPE;
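
If you'd rather not pass the command through the shell, a list-form pipe open does the same job. This is just a sketch, not from the original post; "commit" and "logfile" are the same placeholders as above:
Code:
# Hypothetical variant: list-form pipe open, bypassing the shell.
open my $pipe, '-|', 'grep', 'commit', 'logfile'
    or die "Can't run grep: $!";
while (<$pipe>) {
    # Process records selected by grep here
}
close $pipe;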


Trojan.
 
While I pretty much feel like fish, I guess I can see why. Linux grep is a precompiled, heavily optimized C program whose only job is to look for patterns in a file, so it should be very good at it. If it were slower than a general-purpose scripting language's implementation, I'd want my money back.

On the flip side, if you were looking for the existence of several patterns in a single larger string, instead of line-by-line searching, using Perl's study would likely yield much better results than even fgrep.
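
A minimal sketch of that idea (the filename and patterns here are invented): slurp the file into one string, study() it once, then run several matches against it:
Code:
# Slurp the whole file into one scalar, then let study() analyse it
# before running several patterns against the same string.
open my $fh, '<', 'logfile' or die "Can't open logfile: $!";
my $text = do { local $/; <$fh> };
close $fh;

study $text;    # one-off analysis that can speed up repeated matches

for my $pat (qr/commit/, qr/rollback/, qr/deadlock/) {
    print "matched $pat\n" if $text =~ $pat;
}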

Then again, specialization's for insects. :)

- Andrew
Text::Highlight - A language-neutral syntax highlighting module in Perl
also on SourceForge including demo
 
Ok, so I have another question: why is index so slow, when it's supposed to be much faster than grep and regexes?

Corwin
 
I started a thread (thread219-1122893) on this very subject some time ago, since I don't believe that index and substr are always such a wise choice. People often suggest them for performance reasons, and they can sometimes offer an advantage, but I find they can cause problems too.


Trojan.
 
Retry the index without the explicit check:

change this:

if (index($_,"commit")>-1) {

to:

if (index($_,"commit")) {

and see how much speed improvement you get. If you are just checking to see if the substring exists you don't have to actually use a comparison operator. You would only use the comparison if you wanted to know where the substring was.
 
????
index returns -1 on failure (which is considered TRUE), or the position of the substring: zero (which is FALSE) or greater (which is TRUE). So your suggestion doesn't make sense at all: you would get TRUE unless the substring started exactly at the beginning of the line.
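
A self-contained sketch of the difference (the sample data is made up):
Code:
my @lines = ("commit at start", "a commit inside", "no match here");

for (@lines) {
    # Correct: index() returns -1 when "commit" is absent,
    # so compare explicitly against -1.
    print "hit: $_\n" if index($_, "commit") > -1;
}

# Wrong: testing index($_, "commit") for plain truth is false only
# when the match is at position 0, and true otherwise - including
# on failure, when it returns -1.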


Trojan.
 
Ah, I looked at it.

Another question. In Mastering Algorithms with Perl I read something like:

print $var,"\n" is 5% faster than
print $var."\n" and 30% faster than
print "$var\n"

My question is:
Why does that happen?

P.S. I hope I'm not becoming tiresome with these types of questions.

Corwin
 
Your third example is the slowest because it uses interpolation. Interpolation can be costly: perl has to parse the string for variables, build a new string, and shuffle its contents around to make space for the value of the interpolated variable(s).
In the first two cases no string needs to be parsed: print is given either a plain list of values or a single concatenation, which is quicker.
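
If you want to verify the figures on your own machine, here is a rough sketch with the core Benchmark module (writing to /dev/null so terminal output doesn't dominate the timing):
Code:
use Benchmark qw(cmpthese);

open my $null, '>', '/dev/null' or die "Can't open /dev/null: $!";
my $var = "hello world";

# Run each style for at least 2 CPU seconds and compare the rates.
cmpthese(-2, {
    list   => sub { print {$null} $var, "\n" },
    concat => sub { print {$null} $var . "\n" },
    interp => sub { print {$null} "$var\n" },
});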


Trojan.
 
Quoting a colleague of mine:

"The Benchmark module measures performance by running the code provided N times and then taking the user and system times of the process. In the case of LinuxGrep, the backticks launch another process, whose time is not taken into account (and that is where the real execution time actually goes). So what you tested is how fast perl can run an empty loop. And the result is ~1K/sec and no more because of the measurement error you get every time (times are rounded to 10 ms in the Linux kernel)."
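
For reference, the pattern being criticised looks something like this (using the same file and search string that appear later in the thread); per the explanation above, the CPU time grep itself burns is charged to the child process, not to the loop Benchmark is timing:
Code:
use Benchmark qw(timethese);

# Each iteration forks a shell plus a grep; Benchmark only sees this
# process's own (nearly idle) user/system time, not the child's.
timethese(100, {
    LinuxGrep => sub { my @hits = `grep started /var/log/syslog.0` },
});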


Corwin
 
The figures in fish's last benchmark are a little misleading. As Trojan says above, using the return value of index() to test for truth results in a match for every case other than when the line starts with "started". That means that the subroutine with `index' in it is causing far more calls to the `push' function than the rest, which will negatively impact its performance.
 
Also, remember that perl's grep is not a search operator, it's a list operator. The difference above between PerlGrep and RegExpr is really a benchmark between a foreach block and a grep call, since both are using a regex for the test. By the same token, you might shave a little more off index's time by using grep as well. Every time you have a set of { } in perl, it has to create scope and such, which takes a little time.
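
A self-contained sketch of the distinction being drawn (the data is invented):
Code:
my @lines = ("service started", "service stopped", "job restarted");

# A foreach block with a regex test, pushing matches by hand...
my @hits_loop;
foreach my $line (@lines) {
    push @hits_loop, $line if $line =~ /started/;
}

# ...versus grep, a list operator that filters in one expression.
my @hits_grep = grep { /started/ } @lines;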

Corwin, is your colleague suggesting that LinuxGrep would be even faster if there were more timing resolution?

- Andrew
Text::Highlight - A language-neutral syntax highlighting module in Perl
also on SourceForge including demo
 
No, he's saying that what Benchmark actually timed was an empty loop, and that Linux grep is really only about 10% faster than Perl's grep.

Corwin
 
I still think that people look at performance from the wrong point of view.
You'll probably find that, in general, perl is plenty fast enough for most things you're ever likely to do, and on the odd occasion where it appears too slow, you may well find it's poorly written code causing the performance issue. You can usually resolve those kinds of issues by profiling your code. I gave an example above of how you can harness any performance advantage Linux grep might have from within your perl scripts.


Trojan.
 
I'm very glad that we now understand why the LinuxGrep figures are so misleading!

I recoded using the return value of index correctly ([tt]>-1[/tt]) and re-ran, getting
Code:
            Rate     Index   RegExpr  PerlGrep LinuxGrep
Index      962/s        --        0%      -11%      -44%
RegExpr    962/s        0%        --      -11%      -44%
PerlGrep  1075/s       12%       12%        --      -38%
LinuxGrep 1724/s       79%       79%       60%        --
which is not too far from what I would expect, given our new knowledge of Benchmark's limitations.

So how can we get a feel for the performance of command-line grep, taking into account the time to load the binary and libraries?

I made a shell script with 1024 identical lines (after the shebang) like this
Code:
#!/bin/sh
grep started /var/log/syslog.0
grep started /var/log/syslog.0
grep started /var/log/syslog.0
grep started /var/log/syslog.0
grep started /var/log/syslog.0
...
and timed it:
Code:
fish@spider:~# time ./t.sh >/dev/null

real    0m3.218s
user    0m1.230s
sys     0m1.390s
so, crudely, adding tUser and tSys (1.230 + 1.390 = 2.62 s) and dividing that into 1024 gives a headline rate of about 390 per second.

I don't expect that the shell added significant overhead, so this figure is the closest we've got in this thread to a real benchmark of this technique (spawning a command-line grep rather than using a perl built-in), as it times the invocation as well as the execution.

We've covered a lot of ground in this thread, so I'm going to attempt a potted summary:

1. TMTOWTDI
2. When choosing a technique, performance is rarely the most important issue. Don't be needlessly wasteful and you'll be fine in 99.62% of cases.
3. There are many cases where [tt]substr[/tt], [tt]index[/tt] &c are arguably "better tools for the job" than regexes but it's not hugely significant in performance terms so, IMHO, use the syntax that most clearly conveys your intention. In the example we have been analysing, my feeling is that the "perl grep" syntax is the cleanest and the most easy to read.
4. It is not always worthwhile to reach outside of perl for command-line utilities - the overhead of the context switches is relatively high - but if you are doing a huge amount of, say, grepping, then Trojan's pipelining makes a lot of sense and is very easy to read.

Yours,

fish

"As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs."
--Maurice Wilkes
 
Quoting my colleague again:
[cite]
So another conclusion you should add to the list:

"When a test gives you really strange results, don't believe it blindly, but try to understand why. It could show that your testbed is not correct."
[/cite]

Corwin
 
and a deserved star for that, Corwin.

f

 
OK - he can share it [2thumbsup]

 