EGREP: slow when using separating choices "|"

Status
Not open for further replies.

cptk

Technical User
Mar 18, 2003
305
US
If I have ...

egrep -i "[[:space:]]bird" somefile.txt
the processing time is relatively quick.

However, once I introduce a "choice" in my regexp, the processing time is very slow ...

egrep -i "(^|[[:space:]])bird" somefile.txt
it works as expected, but it is very slow, especially for large files!

Using "sed" and/or "awk" to achieve nearly the same result works a lot faster, BUT neither one can handle case sensitivity!!

Is there a trick to speeding-up this form of egrep?
[Note: I'm on Solaris 9 using the ksh shell.]
 
Hmmm, my egrep doesn't seem to have the [tt]\b[/tt]. What OS is that on?
 
Hi Sam

My experience is only really within Perl, but I'll have a look to see if I can find anything. I am fairly adept with regular expressions within Perl and just felt that if this alternation problem was really painful on large files, the regex could be altered to remove the alternation altogether. Hope I have not misunderstood.


Kind Regards
Duncan
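
For what it's worth, one way to remove the alternation is to hand grep two separate patterns, which it ORs together. This is only a minimal sketch: it assumes a grep that accepts multiple -e options (on Solaris that may mean /usr/xpg4/bin/grep rather than /usr/bin/grep, which I have not verified), match_bird is a made-up helper name, and the sample data is invented. Whether it is actually faster on Solaris 9 is untested.

```shell
#!/bin/sh
# Hypothetical helper: match "bird" at line start or after whitespace
# without "|" alternation. Multiple -e patterns are OR-ed by grep.
match_bird() {
    grep -i -e '^bird' -e '[[:space:]]bird' "$1"
}

# Small illustrative sample file (not data from this thread).
sample=$(mktemp)
printf '%s\n' 'Bird at start' 'a bird here' 'blackbird only' > "$sample"

match_bird "$sample"
rm -f "$sample"
```

Running each variant under "time" against the real file would show whether dropping the "|" makes any difference on a given system.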
 
cptk said:
Using "sed" and/or "awk" to achieve nearly the same result works a lot faster, BUT neither one can handle case sensitivity!!

In Awk, use tolower():

Code:
{ s = tolower($0) }
s ~ /(^|[ \t])bird/ { print NR ":" $0 }
 
Sed can too, but it's kind of cumbersome...
Code:
sed -n '/^[Bb][Ii][Rr][Dd]/p;/[[:space:]][Bb][Ii][Rr][Dd]/p' somefile.txt
Hope this helps.
 
Hi

Excuse me, but I would like to insist on [tt]sed[/tt]'s case insensitivity. [tt]sed[/tt] has an [tt]I[/tt] modifier for this purpose, for both matching ( [tt]//[/tt] ) and substituting ( [tt]s///[/tt] ). I have Linux on my computer with [tt]sed[/tt] versions 3.02.80 and 4.1.2, and the modifier works with both.

As I have no chance to try it on other systems, can you tell me briefly on which systems and versions it does not work?

Thanks,
Feherke.
 
On Solaris 8 it gives the following error message...
Code:
Unrecognized command: /^bird/Ip; /[[:space:]]bird/Ip
It does work on a Red Hat Linux ES system I have.

Hope this helps.
 
Hi

Thanks. Sad. That [tt]I[/tt] is pretty useful.

Then something constructive: let's combine SamBones' [tt]sed[/tt] script with futurelet's [tt]awk[/tt] solution of using [tt]tolower()[/tt]. I think transliteration should be faster than matching four character classes:

Code:
sed -n 'h;y/BIRD/bird/;/^bird/{g;p;d;};/[[:space:]]bird/{g;p;}' somefile.txt

Feherke.
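
The same hold-space idea can be stretched to any word by transliterating the whole alphabet instead of just B, I, R and D. What follows is only a sketch under assumptions: sed_imatch is a made-up name, the word passed in is assumed to be lowercase and free of regex metacharacters and slashes, and it has not been tried on Solaris.

```shell
#!/bin/sh
# Hypothetical helper: case-insensitive "word at line start or after
# whitespace" match using only h, y, b, d, g and p (no GNU "I" modifier).
# $1 = lowercase word (assumed regex-safe), $2 = file
sed_imatch() {
    sed -n \
        -e 'h' \
        -e 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/' \
        -e "/^$1/b hit" \
        -e "/[[:space:]]$1/b hit" \
        -e 'd' \
        -e ':hit' \
        -e 'g' \
        -e 'p' "$2"
}

# Small illustrative sample file (not data from this thread).
sample=$(mktemp)
printf '%s\n' 'Bird at start' 'a BIRD here' 'blackbird only' > "$sample"
sed_imatch bird "$sample"
rm -f "$sample"
```

The copy saved with h is what gets printed, so the output keeps its original case. Branching to a single label means a line that matches both patterns is printed only once.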
 
1.) I agree with Sammy, Solaris 9 as well doesn't support sed's "I" cmd ... I get the same error.

2.) When I said "...Sed and/or awk couldn't handle case sensitivity", what I really wanted to say is that there's not a switch to "toggle" it on/off like the grep family has. I'm aware of awk's "tolower/toupper" cmds, but if I was to use awk for case sensitivity and have the ability to toggle it off/on [i.e. - user choice in script] then I would need to build-in the ability to decide to use the toupper/lower cmd(s) and thus feed the awk an external variable (set by the user) ... it's doable with awk, but it gets a little messy for my requirements. Sed you can forget about it if you have to explicitly define each character!!


 
Code:
# $insensitive is assumed to hold "true" or "false" (set by the user)
if [ "$insensitive" = true ]; then
    awk 'tolower($0)~/(^|[ \t])bird/{print NR":"$0}' foo.txt
else
    awk '$0~/(^|[ \t])bird/{print NR":"$0}' foo.txt
fi
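
The toggle can also be passed into a single awk program with -v, which avoids duplicating the pattern. This is a rough sketch only: find_bird and the ic variable are made-up names, the flag is assumed to be 1 or 0, and it needs an awk that has tolower() and -v (on Solaris that is probably nawk rather than the old /usr/bin/awk, but I have not checked).

```shell
#!/bin/sh
# Hypothetical wrapper: $1 = 1 for case-insensitive, 0 for case-sensitive
#                       $2 = file to search
find_bird() {
    awk -v ic="$1" '
        { s = (ic == 1) ? tolower($0) : $0 }   # fold case only when requested
        s ~ /(^|[ \t])bird/ { print NR ":" $0 }
    ' "$2"
}

# Small illustrative sample file (not data from this thread).
sample=$(mktemp)
printf '%s\n' 'Bird at start' 'a bird here' 'blackbird only' > "$sample"
find_bird 1 "$sample"
find_bird 0 "$sample"
rm -f "$sample"
```

The matching is done against the folded copy s, while the original $0 is what gets printed.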
 
-w for word-matching in grep (without reading the whole thread):
Code:
cat test.txt

Birdingham is not a valid cityname.
A bird is singing in the tree.
Two birds fly through the dark.
Here is a bird. There is a cat.
"Bird, I'll get ya!" the cat thought.
Bird was cought by the tiger of the moon.

grep -wi bird test.txt

A bird is singing in the tree.
Here is a bird. There is a cat.
"Bird, I'll get ya!" the cat thought.
Bird was cought by the tiger of the moon.
grep (GNU grep) 2.5.1
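
Note that -w is not quite the same test as the pattern in the original question: -w wants a word boundary on both sides of "bird", while (^|[[:space:]])bird only constrains what comes before it. A small sketch to compare the two, assuming a grep that supports -w, -E and -i; word_match and prefix_match are made-up names, and the sample lines are taken from test.txt above.

```shell
#!/bin/sh
# Hypothetical helpers comparing the two notions of "match".
word_match()   { grep -wi bird "$1"; }                     # whole word only
prefix_match() { grep -Ei '(^|[[:space:]])bird' "$1"; }    # start of line or after whitespace

sample=$(mktemp)
cat > "$sample" <<'EOF'
A bird is singing in the tree.
Two birds fly through the dark.
"Bird, I'll get ya!" the cat thought.
EOF

echo '--- grep -wi'
word_match "$sample"
echo '--- prefix pattern'
prefix_match "$sample"
rm -f "$sample"
```

With -w the "birds" line is skipped and the quoted "Bird line is found; with the prefix pattern it is the other way round, because the quote character is neither whitespace nor the start of the line.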


 