
awk with external command

Status
Not open for further replies.

FlorianAwk

Programmer
Mar 8, 2012
44
FR
Hello ! :)

I want to filter a text file. The fields $1 and $2 are useless. The fields $3 and more must be searched in another text file (a kind of dictionary). The result seems to be good but it is not. I need your help.

./myProgram.bin

it gives:

04 MOCOS EMREA ROE
04 SOMONI MOTO
04 SOMONI MOTO
05 CHERIF
05 CHERIF
05 CHERIF MC UHA SRRE TIO EFREA
06 CHAMOIS


./myProgram.bin | awk ' NR > 1 {v="true"; mem=$0; for (i=3;i<=NF;i++){ mot=$i; "grep -c "mot" dictionary.txt"|getline cmpt; if ( cmpt == "0"){ v="false";};}; if (v == "true") {print "good " mem;} else print "bad "mem;}'

It gives:

good 04 MOCOS EMREA ROE
good 04 SOMONI MOTO
good 04 SOMONI MOTO
good 05 CHERIF
good 05 CHERIF
bad 05 CHERIF MC UHA SRRE TIO EFREA
good 06 CHAMOIS

Ok for most of them but EMREA is not in my dictionary.

What am I doing wrong ?
 
Your code seems to work for me... there must be a problem in your data or (more likely) the dictionary.

I'd recommend not using grep that way though as it's inefficient. You could load the dictionary up into an awk array first, and then just check whether the words are in the array.

Code:
./myProgram.bin | awk '
        # load every dictionary line as an array key; "> 0" stops the loop on EOF or a read error
        BEGIN { while ((getline < "dictionary.txt") > 0) mots[$0]; close("dictionary.txt") }
        NR > 1 {
                v=1
                # a line is good only if every field from $3 onwards is a dictionary entry
                for (i=3;i<=NF;i++) if (!($i in mots)) v=0
                if (v) { print "good " $0 } else print "bad " $0
        }
'


Annihilannic
[small]tgmlify - code syntax highlighting for your tek-tips posts[/small]
 
Thank you for your answer.

I should have said that the letters only need to appear inside a dictionary word; they do not have to be a whole word. ROE, for example, is in my dictionary only as part of some words, not as a word of its own. That's why I used "grep".

Is there a method other than "getline" to capture the output of an external command?
 
Well, you can use match() on the items in the array instead.

You need to use getline if you want to capture the output of the external command. You may be thinking of system() which does not capture output?

Incidentally, when I ran your code I had to modify it slightly, adding the brackets:

Code:
("grep -c "mot" dictionary.txt")|getline cmpt

Possibly specific to the flavour of awk I'm using, but may help you too? Another consideration is that you are not closing the "file" (in this case a command), so if you have a lot of data you may run out of file handles. I would normally do something like this:

Code:
cmd="grep -c "mot" dictionary.txt"
cmd | getline cmpt
close(cmd)
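
To show the pattern above in a complete pipeline, here is a minimal sketch. The temporary dictionary and the sample lines are invented for illustration, and it assumes the words contain no shell metacharacters and the dictionary path has no spaces. Resetting cmpt before each read matters because a failed getline leaves the variable holding the previous count, which can make a missing word look like a match.

```shell
# Hypothetical sample data, invented for this sketch
dict=$(mktemp)
printf '%s\n' MOCOSSE SOMONI MOTORS CHERIF > "$dict"

result=$(printf '%s\n' 'header' 'a 04 MOCOS EMREA' 'b 04 SOMONI MOTO' |
awk -v dict="$dict" '
NR > 1 {
    ok = 1
    for (i = 3; i <= NF; i++) {
        cmd = "grep -c -- " $i " " dict   # build the command string once
        cmpt = 0                           # reset so a failed getline cannot reuse an old count
        cmd | getline cmpt                 # read the single line that grep -c prints
        close(cmd)                         # release the pipe so the same command can run again
        if (cmpt == 0) ok = 0
    }
    prefix = ok ? "good " : "bad "
    print prefix $0
}')
rm -f "$dict"
printf '%s\n' "$result"
```

With this sample data, EMREA is in no dictionary line, so the first data line is reported as bad and the second as good.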

Annihilannic
 

Thank you very much. :)

I have integrated your last two tips and it works so well that I no longer see any mistakes.
So I have removed "mot", "mem", "good" and the "else" statement. It still works.
Then I put the code into an older bash script. I had to add backslashes, but it finally works.

;-)
 
I tried match() to see the difference. It is clearly faster with "grep". I don't know how the algorithms differ, but calling grep each time is faster than loading the dictionary into awk first and then looking for matches.

./myProgram.bin |awk 'BEGIN{ while(getline < "dictionary.txt") mots[$0]; close("dictionary.txt");} NR>1{v=1;for (i=3;i<=NF;i++) for (m in mots) if (match(m,$i)==0) v=0; if (v) {print "good" $0;}}'

It takes too long.
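
A side note on the one-liner above: as written, it sets v=0 as soon as any single dictionary word fails to contain the field, so it does not test "the field occurs in at least one word". A hedged sketch of that test follows, using index() for a plain substring check instead of match(), which treats the field as a regular expression. The sample dictionary and input are invented, and this is not expected to fix the speed problem, since awk still scans the array for every field.

```shell
# Hypothetical sample data, invented for this sketch
dict=$(mktemp)
printf '%s\n' MOCOSSE SOMONI MOTORS > "$dict"

result=$(printf '%s\n' 'header' 'a 04 MOCOS EMREA' 'b 04 SOMONI MOTO' |
awk -v dict="$dict" '
BEGIN { while ((getline line < dict) > 0) mots[line]; close(dict) }
NR > 1 {
    ok = 1
    for (i = 3; i <= NF && ok; i++) {
        found = 0
        for (m in mots)
            if (index(m, $i) > 0) { found = 1; break }   # field is a substring of this word
        if (!found) ok = 0                                # no dictionary word contains the field
    }
    if (ok) print "good " $0
}')
rm -f "$dict"
printf '%s\n' "$result"
```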
 
Interesting... is it a large dictionary? grep is one of the most efficient programmes written... but I would have expected the cost of executing it many times to be higher.

I'm glad it's working.

Annihilannic
 
The dictionary is a text file of 378000 lines written in capitals and sorted alphabetically. (1 line = 1 word)
Awk is not slow to load the dictionary, but it is slow to match. There is a noticeable delay between each word printed on screen.
 
Wow, that's a big dictionary; no wonder it's slow.

One other suggestion I'd have is to use found=system("fgrep -q "$1" dictionary.txt") rather than reading in a count value with getline. It should return 1 when a match is found, 0 otherwise.

This may also save fgrep searching the entire dictionary each time, because it can stop searching as soon as a match is found.

fgrep (or grep -F) is better for this task because you are searching for a simple substring rather than a regular expression.

Annihilannic
 
Correction:

It should return 0 when a match is found, 1 otherwise.
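
A small sketch of that exit-status convention, assuming a throwaway dictionary and two made-up test words (one present as a substring, one absent). system() returns the exit status of grep, and with -q grep prints nothing and can stop at the first match.

```shell
# Hypothetical sample data, invented for this sketch
dict=$(mktemp)
printf '%s\n' MOCOSSE SOMONI MOTORS > "$dict"

result=$(awk -v dict="$dict" 'BEGIN {
    n = split("MOCOS EMREA", words, " ")
    for (k = 1; k <= n; k++) {
        # grep -F: fixed-string search; -q: quiet, exit 0 on the first match, 1 if none
        status = system("grep -Fq -- " words[k] " " dict)
        print words[k], (status == 0 ? "found" : "missing")
    }
}')
rm -f "$dict"
printf '%s\n' "$result"
```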

Annihilannic
 
First, I replaced the "for" loop with a "while" loop that stops as soon as one word is not found.

Second, using "-q" (with "system()" alone, no "getline") was a very good idea.
With "-q"
real 0m8.246s
user 0m2.262s
sys 0m4.277s
Without "-q"
real 0m43.304s
user 0m24.802s
sys 0m14.385s

In all my tests, fgrep and grep take about the same time, although grep is always slightly faster!

Thank you.
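
For readers who want to see the final shape, here is a rough sketch of the "while" loop with early stop combined with system() and -q. It is a reconstruction under assumptions, not the exact script used for the timings above: the dictionary, the input lines and the variable names are all invented.

```shell
# Hypothetical sample data, invented for this sketch
dict=$(mktemp)
printf '%s\n' MOCOSSE SOMONI MOTORS > "$dict"

result=$(printf '%s\n' 'header' 'a 04 EMREA MOCOS' 'b 04 SOMONI MOTO' |
awk -v dict="$dict" '
NR > 1 {
    ok = 1
    i = 3
    while (ok && i <= NF) {                              # leave the loop at the first miss
        if (system("grep -Fq -- " $i " " dict) != 0) ok = 0
        i++
    }
    if (ok) print "good " $0
}')
rm -f "$dict"
printf '%s\n' "$result"
```

In the first data line the loop stops after EMREA, so grep is never started for MOCOS.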
 