EGREP: slow when using separating choices "|"

cptk · Jul 5, 2005

If I have ...

egrep -i "[[:space:]]bird" somefile.txt
the processing time is relatively quick.

However, once I introduce a "choice" in my regexp, the processing time is very slow ...

egrep -i "(^|[[:space:]])bird" somefile.txt
it works as expected, but is very slow, especially for large files!

Using "sed" and/or "awk" to achieve nearly the same result works a lot faster, BUT neither one can handle case-insensitive matching!!

Is there a trick to speeding-up this form of egrep?
[note: I'm on Solaris 9 using the ksh shell]

feherke · Jul 8, 2005

Hi,

I'm sorry, I'm not answering the actual question. You see, [tt]sed[/tt] can do case-sensitive and case-insensitive matching as well.

So, if I got the correct meaning of your command line, this will do the same:

Code:

sed -n '/^bird/Ip; /[[:space:]]bird/Ip' somefile.txt

Feherke.

http://rootshell.be/~feherke/

cptk · Jul 8, 2005

Your cmd doesn't work (for me) ...
I'm not familiar with the "I" cmd attribute and can't find any mention of the "I" ... is this a Linux thing?

I'm on Solaris 5.9 ...
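For what it's worth, the [tt]I[/tt] modifier is a GNU sed extension, which would explain why it is not found on Solaris. A minimal sketch of a variant that avoids it by spelling out both cases in bracket expressions follows. The function name is made up for illustration, and it assumes a sed that understands POSIX character classes (on Solaris that is probably the one in /usr/xpg4/bin). The [tt]d[/tt] after the first [tt]p[/tt] keeps a line that matches both patterns from being printed twice.

```shell
# Hypothetical helper; reads text on stdin.
# Case-insensitivity is written out as [Bb][Ii][Rr][Dd] instead of GNU sed's "I" flag.
match_bird_sed() {
    sed -n \
        -e '/^[Bb][Ii][Rr][Dd]/{p;d;}' \
        -e '/[[:space:]][Bb][Ii][Rr][Dd]/p'
}

# Made-up sample input
printf '%s\n' 'bird one' 'Blackbird two' 'a Bird three' 'BIRD and a bird' 'no match' |
    match_bird_sed
```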

SamBones · Jul 8, 2005

Have you tried this...

Code:

egrep -i "^bird|[[:space:]]bird" somefile.txt

[tt]time[/tt] each command to see if there's a difference...

Code:

time egrep -i "(^|[[:space:]])bird" somefile.txt
time egrep -i "^bird|[[:space:]]bird" somefile.txt

Sometimes little changes like that make a big difference, especially on bigger files.

Hope this helps.
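Before comparing timings, it may be worth confirming that the two spellings really select the same lines. A rough sketch, assuming [tt]grep -E[/tt] (the POSIX spelling of egrep) and [tt]mktemp[/tt] are available; the function names and sample lines are invented for illustration:

```shell
# Hypothetical wrappers around the two patterns being compared
grouped()   { grep -E -i '(^|[[:space:]])bird' "$1"; }
split_alt() { grep -E -i '^bird|[[:space:]]bird' "$1"; }

# Made-up sample data
sample=$(mktemp)
printf '%s\n' 'bird one' 'Blackbird two' 'a Bird three' 'no match' > "$sample"

if [ "$(grouped "$sample")" = "$(split_alt "$sample")" ]; then
    echo 'same lines selected'
else
    echo 'outputs differ'
fi
rm -f "$sample"
```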

cptk · Jul 8, 2005

Sambones -

On a relatively small size file (approx. 16,000 lines), the approx. times are:

#1 = 1 min, 45 seconds
#2 = 48 seconds
A difference of nearly a minute ... interesting !!!

Just for completion & closure of my original problem ...
What I decided to do was run the egrep cmd twice:

1st egrep on just "bird" and save results off to a variable,
set my IFS variable to newline only
IFS='
'
then run the 2nd egrep with (^|[[:space:]])bird on each line value of the saved variable within a for loop.
Time for this: 8 seconds

The key to all this was setting my IFS variable so that each line is recognized individually in the for loop.
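A minimal sketch of what that two-step approach might look like. This is a reconstruction from the description, not the actual script; the function name, variable names and sample data are assumptions.

```shell
# Hypothetical reconstruction of the two-step filter; $1 is the file to search.
two_step() {
    # Step 1: cheap prefilter on the bare word, result saved in a variable
    hits=$(grep -i 'bird' "$1")

    # Step 2: split the saved text on newlines only, then re-check each line
    old_ifs=$IFS
    IFS='
'
    set -f    # keep any * or ? in the data from being glob-expanded
    for line in $hits; do
        # a non-matching line is not an error, hence the "|| :"
        printf '%s\n' "$line" | grep -E -i '(^|[[:space:]])bird' || :
    done
    set +f
    IFS=$old_ifs
}

# Made-up sample data
sample=$(mktemp)
printf '%s\n' 'bird one' 'Blackbird two' 'a Bird three' 'no match' > "$sample"
two_step "$sample"
rm -f "$sample"
```

Without the IFS change, the for loop would split the saved text on spaces and tabs as well, so each word rather than each line would be tested.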

PHV · Jul 8, 2005

And this ?
(egrep -i "[[:space:]]bird" somefile.txt; egrep -i "^bird" somefile.txt) > output

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886
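One thing to watch with the two-pass form: a line that both starts with "bird" and contains a whitespace-prefixed "bird" is written by both passes, and the output is no longer in file order. A small sketch that shows the effect, using invented sample lines and a made-up function name, and assuming [tt]grep -E[/tt] stands in for egrep:

```shell
# Hypothetical wrapper around the two-pass command; $1 is the file to search.
two_pass() {
    grep -E -i '[[:space:]]bird' "$1"
    grep -E -i '^bird' "$1"
}

# Made-up sample data; the last line matches both patterns
sample=$(mktemp)
printf '%s\n' 'bird one' 'a Bird three' 'BIRD and a bird' > "$sample"

echo 'raw two-pass output:'
two_pass "$sample"

echo 'after sort -u:'
two_pass "$sample" | sort -u
rm -f "$sample"
```

Piping through [tt]sort -u[/tt] removes the duplicate, at the cost of an extra sort and of losing the original line order, so whether it is worth it depends on what the output is used for.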

cptk · Jul 8, 2005

PHV -

Yes, the piece-wise method (4.5 seconds) saves almost 4 seconds compared with my two-step egrep method.

...thanks for the input !!!

SamBones · Jul 8, 2005

How about this...

Code:

fgrep -i bird somefile.txt | egrep -i "^bird|[[:space:]]bird"

The [tt]fgrep[/tt] should be the fastest for fixed strings in large files. Plus piping the output eliminates the variable and messing with IFS which could be a problem with large files.

Hope this helps.

cptk · Jul 8, 2005

Yes, one would "think" the fgrep would be faster than the egrep, but I've found that not to be the case. Running SamBones' example with the fgrep vs. running with an egrep in place of the fgrep, the dual egrep turned out to be slightly faster.

But I do agree with ya, piping into the 2nd grep rather than playing with the IFS variable is better.

...thanks!

cptk · Jul 8, 2005

Oh, one more thing ...

The reason I have to stick with using the IFS variable method is because I need to retain the original line #'s returned from the 1st grep (i.e. - using the -n option in the 1st grep).

When I try the pipe method using the -n option, the 2nd grep won't find ^bird because of the line-number prefix in the output from the first egrep.
example:
3455:bird

Thus I have to use my 2 grep approach with the IFS variable. After my 1st grep, I set the IFS variable, then "sed" it to replace the ":" with a space, then run my 2nd grep through the newly revised line, all within a for loop ...

...it's working perfect!

If I supply

SamBones · Jul 8, 2005

Wouldn't that just mean your second [tt]egrep[/tt] should look for [tt]'^[0-9]*:bird'[/tt] instead of [tt]'^bird'[/tt]? Something like...

Code:

egrep -in bird somefile.txt | egrep -i "^[0-9]*:bird|[[:space:]]bird"

I just don't like all that "sed" and IFS tweaking and looping with the variables and the greps and so on and so forth. Too many contortions for such a simple task. Just a simple little pipe with two egreps, one at each end. [pipe]

Hope this helps.
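A small sketch of that pipe on made-up data, to show that the line numbers from the first grep survive the second one. It assumes [tt]grep -E[/tt] in place of egrep and uses [tt]+[/tt] rather than [tt]*[/tt] for the digits, since [tt]-n[/tt] always emits at least one; the function name is invented.

```shell
# Hypothetical wrapper; $1 is the file to search.
numbered_matches() {
    grep -n -i 'bird' "$1" |
        grep -E -i '^[0-9]+:bird|[[:space:]]bird'
}

# Made-up sample data
sample=$(mktemp)
printf '%s\n' 'no match' 'bird one' 'Blackbird two' 'a Bird three' > "$sample"
numbered_matches "$sample"
rm -f "$sample"
```

On this sample it prints lines 2 and 4 with their numbers, and drops "Blackbird two" because the first pattern only accepts "bird" immediately after the number and colon.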

duncdude · Jul 9, 2005

sorry to butt in to this post - but what exactly is the regex trying to match?

is it either:-
1. ^ beginning of line - then bird
OR (alternation)
2. [[:space:]] a single space? - then bird

is this correct?

sorry - just not seen the [[:space:]] syntax used before

Kind Regards
Duncan

cptk · Jul 9, 2005

Very good point ... at first I too "cringed" at the fact that I chose to use the IFS variable, but I got it working.

Duh, yeah I didn't even think of prefixing my 2nd egrep with [0-9]*:, so naturally I will see if this can truly be applied to my "real" script.

That's why I like this site - great advice, good people ... thanks Sammy!!!

SamBones · Jul 11, 2005

duncdude,

The regex [tt]^[0-9]*:bird[/tt] is matching; from the beginning of the line ([tt]^[/tt]), one or more numeric digits ([tt][0-9]*[/tt]), then a colon ([tt]:[/tt]), and then the word "[tt]bird[/tt]". (or is that zero or more numeric digits, I keep forgetting).

The regex [tt][[:space:]]bird[/tt] is matching; any "whitespace" (a space, a tab, a ???), and then the word "[tt]bird[/tt]".

And cptk, I agree, sometimes "it ain't pretty but it works" IS a good enough reason!

Hope this helps.

SamBones · Jul 11, 2005

Oh, and the pipe character ([tt]|[/tt]) in the regex is an OR. Egrep allows you to look for multiple patterns. It's matching on the first pattern OR the second pattern.

Hope this helps.

cptk · Jul 11, 2005

duncdude,

The [[:space:]] is one of the many POSIX character classes. Note, though, that you will most likely need to give the explicit path of the unix cmd that supports POSIX (unless it is already first in your PATH). When reading the man page for a particular unix cmd, take note of the SYNOPSIS to see if more than one "flavor" of the cmd exists. For Solaris at least, the POSIX cmds are usually in the /usr/xpg4/bin directory.

...And Sammy, the "*" matches any number (or none) of the preceding regexp or character.

SamBones · Jul 11, 2005

Thanks! So for one or more numeric characters the expression would be "[tt]^[0-9][0-9]*:bird[/tt]" or "[tt]^[0-9]+:bird[/tt]". Right?

cptk · Jul 11, 2005

the breakdown is this:

"*" - matches zero or more
"+" - matches one or more
"?" - matches zero or one

Another useful one is the range occurrence:
{x,y} - matches any number of occurrences between x & y
{x} - matches exactly x occurrences
{x,} - matches at least x occurrences
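A quick way to see the difference between these is to run each one against the same handful of strings. The sketch below assumes a grep that accepts [tt]-E[/tt] and interval expressions (an older egrep may not handle the braces); the helper names and test strings are made up.

```shell
# Made-up candidate strings with 0, 1, 2 and 3 leading digits
candidates() { printf '%s\n' ':bird' '7:bird' '42:bird' '123:bird'; }

# Hypothetical helper: print the regex, then the candidates it selects
show() {
    printf '%-22s ' "$1"
    candidates | grep -E -- "$1" | tr '\n' ' '
    echo
}

show '^[0-9]*:bird'       # zero or more digits
show '^[0-9]+:bird'       # one or more digits
show '^[0-9]?:bird'       # zero or one digit
show '^[0-9]{2,3}:bird'   # two to three digits
show '^[0-9]{2}:bird'     # exactly two digits
show '^[0-9]{2,}:bird'    # at least two digits
```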

duncdude · Jul 12, 2005

Looks to me like the regex could be re-written without alternation:-

\bbird\b

i.e. word boundary, bird, word boundary

Kind Regards
Duncan

duncdude · Jul 12, 2005

to elaborate...

this post seems to stem from the alternation being too slow - if the word bird needs to be found either at the beginning of a line or with whitespace (of some sort) before it, then the word boundary trick should suffice

i hope i haven't made a fool of myself!?

Kind Regards
Duncan
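A hedged side-by-side of the two patterns on made-up lines, in case it helps decide whether they are interchangeable. It assumes GNU grep, where [tt]\b[/tt] is supported as an extension; a stock Solaris egrep probably will not accept it. The function names are invented.

```shell
# Made-up sample lines
sample_lines() {
    printf '%s\n' 'bird' 'a bird here' 'birds fly' 'black-bird' 'blackbird'
}

# Hypothetical wrappers around the two patterns; both read stdin
by_alternation() { grep -E -i '(^|[[:space:]])bird'; }
by_boundary()    { grep -E -i '\bbird\b'; }

echo 'alternation:'
sample_lines | by_alternation

echo 'word boundary:'
sample_lines | by_boundary
```

On this sample the two are close but not identical: the alternation keeps "birds fly" (nothing is required after "bird"), while the word-boundary form drops it and instead picks up "black-bird", because a hyphen counts as a boundary but is not whitespace.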
