EGREP: slow when using separating choices "|"

Status
Not open for further replies.

cptk

Technical User
Mar 18, 2003
305
US
If I have ...

egrep -i "[[:space:]]bird" somefile.txt
the processing time is relatively quick.

However, once I introduce a "choice" in my regexp, the processing time is very slow ...

egrep -i "(^|[[:space:]])bird" somefile.txt
it works as expected, but is very slow, especially for large files!

Using "sed" and/or "awk" to achieve nearly the same result works a lot faster, BUT neither one can handle case-insensitive matching!!

Is there a trick to speeding-up this form of egrep?
[note: I'm on Solaris 9 using the ksh shell]
 
Hi,

I'm sorry, this doesn't answer the question you actually asked, but [tt]sed[/tt] can do case-insensitive matching as well as case-sensitive matching.

So, if I got the correct meaning of your command line, this will do the same:

Code:
sed -n '/^bird/I{p;d;}; /[[:space:]]bird/Ip' somefile.txt

Feherke.
 
Your cmd doesn't work (for me) ...
I'm not familiar with the "I" cmd attribute and can't find any mention of the "I" ... is this a Linux thing?

I'm on Solaris 5.9 ...
 
Have you tried this...
Code:
egrep -i "^bird|[[:space:]]bird" somefile.txt
[tt]Time[/tt] each command to see if there's a difference...
Code:
time egrep -i "(^|[[:space:]])bird" somefile.txt
time egrep -i "^bird|[[:space:]]bird" somefile.txt
Sometimes little changes like that make a big difference, especially on bigger files.
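
Before comparing timings, it may be worth confirming that both forms select the same lines. The following is only a sketch: the sample lines are made up, and [tt]grep -E[/tt] is used as the POSIX spelling of [tt]egrep[/tt] (on the Solaris setup discussed here, [tt]egrep[/tt] itself would be the command to run).

```shell
# Hypothetical sample data standing in for somefile.txt.
sample=$(mktemp)
printf '%s\n' 'bird at line start' 'a Bird after a space' \
    'blackbird' 'no match here' > "$sample"

# Grouped alternation, as in the original question.
grouped=$(grep -E -i '(^|[[:space:]])bird' "$sample")

# Alternation split into two complete branches.
split=$(grep -E -i '^bird|[[:space:]]bird' "$sample")

# Timings are only comparable if both forms agree on the matches.
if [ "$grouped" = "$split" ]; then
    echo "same matches"
else
    echo "outputs differ"
fi
rm -f "$sample"
```

Once the outputs agree, each command can be prefixed with [tt]time[/tt] against the real file.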

Hope this helps.
 
Sambones -

On a relatively small size file (approx. 16,000 lines), the approx. times are:

#1 = 1 min, 45 seconds
#2 = 48 seconds
A difference of nearly a minute ... interesting !!!

Just for completion & closure of my original problem ...
What I decided to do was run the egrep cmd twice:

1st egrep on just "bird" and save results off to a variable,
set my IFS variable to newline only
IFS='
'
then run the 2nd egrep with (^|[[:space:]])bird on each line value of the saved variable within a for loop.
Time for this: 8 seconds

The key to all this was setting my IFS variable so that each line is recognized individually in the for loop.
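
A rough sketch of the two-step approach described above, assuming made-up sample data and using [tt]grep -E[/tt] in place of [tt]egrep[/tt]. The variable names ([tt]candidates[/tt], [tt]matches[/tt], [tt]nl[/tt]) are illustrative, not taken from the real script.

```shell
# Hypothetical sample data standing in for somefile.txt.
sample=$(mktemp)
printf '%s\n' 'bird first' 'see the Bird fly' 'blackbird sings' 'nothing' > "$sample"

nl='
'

# Step 1: cheap prefilter on the bare word, saved to a variable.
candidates=$(grep -E -i 'bird' "$sample")

# Step 2: split the saved text on newlines only, one loop pass per line.
old_ifs=$IFS
IFS=$nl
set -f                      # avoid filename expansion on unquoted $candidates
matches=''
for line in $candidates; do
    # Apply the expensive pattern to this one line.
    if printf '%s\n' "$line" | grep -E -i '(^|[[:space:]])bird' >/dev/null; then
        matches="${matches}${line}${nl}"
    fi
done
set +f
IFS=$old_ifs

printf '%s' "$matches"
rm -f "$sample"
```

Restoring IFS afterwards keeps the rest of a script from splitting on newlines by accident.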

 
And this ?
(egrep -i "[[:space:]]bird" somefile.txt; egrep -i "^bird" somefile.txt) > output



Hope This Helps, PH.
 
PHV -

Yes, the piecewise method (4.5 seconds) saves almost 4 seconds over my two-step egrep method.

...thanks for the input !!!
 
How about this...
Code:
fgrep -i bird somefile.txt | egrep -i "^bird|[[:space:]]bird"
The [tt]fgrep[/tt] should be the fastest for fixed strings in large files. Plus piping the output eliminates the variable and messing with IFS which could be a problem with large files.

Hope this helps.
 
Yes, one would "think" the fgrep would be faster than the egrep, but I've found that not to be the case. Running SamBones' example with the fgrep vs. running with an egrep in place of the fgrep, the dual egrep turned out to be slightly faster.

But I do agree with ya, piping the 2nd grep rather than playing with the IFS variable is better.

...thanks!
 
Oh, one more thing ...

The reason I have to stick with using the IFS variable method is because I need to retain the original line #'s returned from the 1st grep (i.e. - using the -n option in the 1st grep).

When I try the pipe method using the -n option, the 2nd grep won't find ^bird because of the line-number prefix in the output from the first egrep.
example:
3455:bird

Thus I have to use my 2 grep approach with the IFS variable. After my 1st grep, I set the IFS variable, then "sed" it with a replacement of the ":" with a space, then run my 2nd grep through the new revised line, all within a for loop ...

...it's working perfect!
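
The colon-to-space idea can be sketched as a plain pipeline. Note this swaps the for loop for a pipe, so it is not the script described above, and the sample data is made up. Because [tt]sed 's/:/ /'[/tt] replaces only the first colon (the one after the line number), a "bird" at the start of a line ends up preceded by a space, and the single pattern [tt][[:space:]]bird[/tt] then covers both cases.

```shell
# Hypothetical sample data standing in for somefile.txt.
sample=$(mktemp)
printf '%s\n' 'nothing' 'bird first' 'blackbird' 'a Bird' > "$sample"

# -n prefixes each hit with "N:"; sed turns that first ":" into a space.
numbered=$(grep -n -i 'bird' "$sample" \
    | sed 's/:/ /' \
    | grep -E -i '[[:space:]]bird')

printf '%s\n' "$numbered"
rm -f "$sample"
```

The original line numbers survive as the first field of each output line.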




If I supply
 
Wouldn't that just mean your second [tt]egrep[/tt] should look for [tt]'^[0-9]*:bird'[/tt] instead of [tt]'^bird'[/tt]? Something like...
Code:
egrep -in bird somefile.txt | egrep -i "^[0-9]*:bird|[[:space:]]bird"
I just don't like all that "sed" and IFS tweaking and looping with the variables and the greps and so on and so forth. Too many contortions for such a simple task. Just a simple little pipe with two egreps, one at each end.

Hope this helps.
 
sorry to butt in to this post - but what exactly is the regex trying to match?

is it either:-
1. ^ beginning of line - then bird
OR (alternation)
2. [[:space:]] a single space? - then bird

is this correct?

sorry - just not seen the [[:space:]] syntax used before


Kind Regards
Duncan
 
Very good point ... at first I too "cringed" at the fact that I chose to use the IFS variable, but I got it working.

Duh, yeah I didn't even think of prefixing my 2nd egrep with [0-9]*:, so naturally I will see if this can truly be applied to my "real" script.

That's why I like this site - great advice, good people ... thanks Sammy!!!
 
duncdude,

The regex [tt]^[0-9]*:bird[/tt] is matching; from the beginning of the line ([tt]^[/tt]), one or more numeric digits ([tt][0-9]*[/tt]), then a colon ([tt]:[/tt]), and then the word "[tt]bird[/tt]". (or is that zero or more numeric digits, I keep forgetting).

The regex [tt][[:space:]]bird[/tt] is matching; any single "whitespace" character (a space, a tab, a newline, a carriage return, a vertical tab, or a form feed), and then the word "[tt]bird[/tt]".
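
A small sketch of the difference between [tt][[:space:]][/tt] and a literal space, using made-up sample lines (one with a space, one with a tab, one with nothing before "bird"):

```shell
# Hypothetical sample: space-separated, tab-separated, and run-together.
sample=$(mktemp)
printf 'a bird\na\tbird\nabird\n' > "$sample"

# The POSIX class matches the space and the tab.
class_hits=$(grep -c '[[:space:]]bird' "$sample")

# A literal space matches only the first line.
space_hits=$(grep -c ' bird' "$sample")

echo "[[:space:]]: $class_hits, literal space: $space_hits"
rm -f "$sample"
```

With this sample the class finds two lines and the literal space finds one.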

And cptk, I agree, sometimes "it ain't pretty but it works" IS a good enough reason!

Hope this helps.
 
Oh, and the pipe character ([tt]|[/tt]) in the regex is an OR. Egrep allows you to look for multiple patterns. It's matching on the first pattern OR the second pattern.

Hope this helps.
 
duncdude,

The [[:space:]] is one of the many POSIX character classes. Note though that you most likely need to explicitly give the path of the unix cmd that supports POSIX (unless that directory is already in your PATH). When reading the man page for a particular unix cmd, take note of the SYNOPSIS to see if more than one "flavor" of the cmd exists. For Solaris at least, the POSIX cmds are usually in the /usr/xpg4/bin directory.


...And Sammy, the "*" matches any number (or none) of the preceding regexp or character.
 
Thanks! So for one or more numeric characters the expression would be "[tt]^[0-9][0-9]*:bird[/tt]" or "[tt]^[0-9]+:bird[/tt]". Right?
 
the breakdown is this:

"*" - matches zero or more
"+" - matches one or more
"?" - matches zero or one

Another useful one is the range occurrence:
{x,y} - matches any number of occurrences between x & y
{x} - matches exactly x occurrences
{x,} - matches at least x occurrences
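
To see the quantifiers side by side, here is a sketch that counts how many of four made-up line-number prefixes each pattern accepts. The [tt]count[/tt] helper is hypothetical, and [tt]grep -E[/tt] stands in for [tt]egrep[/tt]; interval expressions like [tt]{2,3}[/tt] need an egrep that supports them.

```shell
# Four sample lines with 0, 1, 2 and 3 leading digits.
samples=':bird
7:bird
42:bird
123:bird'

# count PATTERN: print how many sample lines match the extended regex.
count() {
    printf '%s\n' "$samples" | grep -E -c "$1"
}

star=$(count '^[0-9]*:bird')        # zero or more digits
plus=$(count '^[0-9]+:bird')        # one or more digits
opt=$(count '^[0-9]?:bird')         # zero or one digit
range=$(count '^[0-9]{2,3}:bird')   # two or three digits
exact=$(count '^[0-9]{3}:bird')     # exactly three digits
atleast=$(count '^[0-9]{2,}:bird')  # at least two digits

echo "$star $plus $opt $range $exact $atleast"
```

Only "*" and "?" accept the line with no digits at all, which is why [tt]^[0-9]+:bird[/tt] is the stricter choice for grep -n output.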
 
Looks to me like the regex could be re-written without alternation:-

\bbird\b

i.e. word boundary, then bird, then word boundary


Kind Regards
Duncan
 
to elaborate...

this post seems to initiate from the alternation being too slow - if the word bird is required to be found in any file either at the beginning of a line or with a space before it (of some sort) then the word boundary trick should suffice

i hope i haven't made a fool of myself!?


Kind Regards
Duncan
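
A quick sketch for checking whether the word-boundary form selects the same lines as the original pattern. It assumes GNU grep, where [tt]\b[/tt] is available as an extension (it is not part of POSIX extended regular expressions), and the sample words are made up.

```shell
samples='bird
a bird
birdhouse
black-bird
blackbird'

# Pattern from the original question: line start or whitespace, then "bird".
anchored=$(printf '%s\n' "$samples" | grep -E -i '(^|[[:space:]])bird')

# Word-boundary form; \b is a GNU extension.
bounded=$(printf '%s\n' "$samples" | grep -E -i '\bbird\b')

printf 'anchored:\n%s\n\nbounded:\n%s\n' "$anchored" "$bounded"
```

On this sample the two differ: the original pattern accepts "birdhouse" because nothing constrains what follows "bird", while the word-boundary form rejects it but accepts "black-bird", since a hyphen also counts as a boundary. Whether that matters depends on the data being searched.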
 