awk with external command

FlorianAwk · Mar 8, 2012

Hello!

I want to filter a text file. Fields $1 and $2 can be ignored. Every field from $3 onward must be looked up in another text file (a kind of dictionary). The output looks right at first glance, but it is not. I need your help.

./myProgram.bin

It gives:

04 MOCOS EMREA ROE
04 SOMONI MOTO
04 SOMONI MOTO
05 CHERIF
05 CHERIF
05 CHERIF MC UHA SRRE TIO EFREA
06 CHAMOIS

./myProgram.bin | awk ' NR > 1 {v="true"; mem=$0; for (i=3;i<=NF;i++){ mot=$i; "grep -c "mot" dictionary.txt"|getline cmpt; if ( cmpt == "0"){ v="false";};}; if (v == "true") {print "good " mem;} else print "bad "mem;}'

It gives:

good 04 MOCOS EMREA ROE
good 04 SOMONI MOTO
good 04 SOMONI MOTO
good 05 CHERIF
good 05 CHERIF
bad 05 CHERIF MC UHA SRRE TIO EFREA
good 06 CHAMOIS

Most of the lines are fine, but EMREA is not in my dictionary, so the first line should not be reported as good.

What am I doing wrong?

Annihilannic · Mar 8, 2012

Your code seems to work for me... there must be a problem in your data or (more likely) the dictionary.

I'd recommend not using grep that way though as it's inefficient. You could load the dictionary up into an awk array first, and then just check whether the words are in the array.

Code:

./myProgram.bin | awk '
        # Load each dictionary line as an array key; "> 0" ends the loop on EOF or on a read error
        BEGIN { while ((getline < "dictionary.txt") > 0) mots[$0]; close("dictionary.txt") }
        NR > 1 {
                v=1
                for (i=3;i<=NF;i++) if (!($i in mots)) v=0
                if (v) { print "good " $0 } else print "bad "$0
        }
'

Annihilannic
[small]tgmlify - code syntax highlighting for your tek-tips posts[/small]

FlorianAwk · Mar 8, 2012

Thank you for your answer.

I should have said that the letters may match only part of a dictionary word, not necessarily a whole word. For example, ROE counts as being in my dictionary because it occurs inside some words, even though it is not a word on its own. That is why I used "grep".

Is there a way other than "getline" to get the result of an external command?

Annihilannic · Mar 8, 2012

Well, you can use match() on the items in the array instead.

You need to use getline if you want to capture the output of the external command. You may be thinking of system(), which returns the command's exit status but does not capture its output?

Incidentally, when I ran your code I had to modify it slightly, adding the brackets:

Code:

("grep -c "mot" dictionary.txt")|getline cmpt

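This may be specific to the flavour of awk I'm using, but it might help you too: an unparenthesised string concatenation to the left of | getline is not parsed the same way by every awk implementation, and the brackets remove that ambiguity. Another consideration is that you are not closing the "file" (in this case a command), so if you have a lot of data you may run out of file handles. I would normally do something like this: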

Code:

cmd="grep -c "mot" dictionary.txt"
cmd | getline cmpt
close(cmd)
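
A minimal sketch of a second effect of the missing close(), which may be relevant because the sample data repeats the same word on consecutive lines. awk identifies an open command by its exact string, so running an identical command again without close() reads from the already exhausted pipe and leaves the variable untouched. The temporary dictionary, its three words and the variable names below are made up for the demonstration; the extra -e in the second command only serves to make its string differ from the first.

```shell
#!/bin/sh
# Hypothetical three-word dictionary, created only for this demonstration.
dict=$(mktemp)
printf 'CHERIF\nCHAMOIS\nMOTORISE\n' > "$dict"

# Two input lines carrying the same word, so each command string is built twice.
result=$(printf 'CHERIF\nCHERIF\n' | awk -v dict="$dict" '
{
    # Variant 1: the command is never closed.
    open_cmd = "grep -c " $1 " " dict
    n1 = "stale"
    open_cmd | getline n1      # on line 2 the pipe is at EOF, so n1 stays "stale"

    # Variant 2: the command is closed after every use, so it is re-run each time.
    closed_cmd = "grep -c -e " $1 " " dict
    n2 = "stale"
    closed_cmd | getline n2
    close(closed_cmd)

    print "line " NR ": no close -> " n1 ", with close -> " n2
}')

printf '%s\n' "$result"
rm -f "$dict"
```

In the original one-liner cmpt is never reset between fields, so a read that returns nothing would silently reuse the previous count.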


FlorianAwk · Mar 8, 2012

Thank you very much.

I have applied your last two tips and it now works so well that I no longer see any mistakes.
So I removed "mot", "mem", "good" and the "else" statement. It still works.
Then I put the code into an older bash script. I had to add backslashes, but in the end it works.

;-)

FlorianAwk · Mar 9, 2012

I tried match() to see the difference. It is clearly faster with "grep". I don't know how the algorithms differ, but calling grep each time is faster than loading the dictionary into awk first and then looking for matches.

./myProgram.bin |awk 'BEGIN{ while(getline < "dictionary.txt") mots[$0]; close("dictionary.txt");} NR>1{v=1;for (i=3;i<=NF;i++) for (m in mots) if (match(m,$i)==0) v=0; if (v) {print "good" $0;}}'

It takes too long.
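
As written, the inner loop in the one-liner above clears v as soon as any single dictionary entry fails to contain the field, which is true for almost every field. A minimal sketch of the substring test with the logic the other way round (a field passes if at least one entry contains it) could look like this. It uses index() for a plain substring search instead of match(), and the function name in_dict, the sample dictionary and the sample input are made up for the illustration. It still scans the array once per field, so it does not address the speed concern.

```shell
#!/bin/sh
# Hypothetical dictionary and input, created only for this demonstration.
dict=$(mktemp)
printf 'HEROES\nMOTORISE\nCHERIF\n' > "$dict"

result=$(printf '04 XX ROE MOTO\n05 XX CHERIF EMREA\n' | awk -v dict="$dict" '
BEGIN {
    # Load every dictionary line as an array key.
    while ((getline word < dict) > 0) mots[word]
    close(dict)
}

# Return 1 if s occurs inside at least one dictionary entry, 0 otherwise.
function in_dict(s,    w) {
    for (w in mots) if (index(w, s) > 0) return 1
    return 0
}

{
    ok = 1
    # Fields 1 and 2 are ignored; stop checking at the first field that fails.
    for (i = 3; i <= NF && ok; i++) ok = in_dict($i)
    print (ok ? "good " : "bad ") $0
}')

printf '%s\n' "$result"
rm -f "$dict"
```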

Annihilannic · Mar 11, 2012

Interesting... is it a large dictionary? grep is one of the most efficient programmes written... but I would have expected the cost of executing it many times to be higher.

I'm glad it's working.


FlorianAwk · Mar 13, 2012

The dictionary is a text file of 378000 lines, written in capitals and sorted alphabetically (1 line = 1 word).
awk is not slow to load the dictionary, but it is slow to match. There is a noticeable delay between each word printed on screen.

Annihilannic · Mar 13, 2012

Wow, that's a big dictionary; no wonder it's slow.

One other suggestion I'd have is to use found=system("fgrep -q "$1" dictionary.txt") rather than reading in a count value with getline. It should return 1 when a match is found, 0 otherwise.

This may also save fgrep searching the entire dictionary each time, because it can stop searching as soon as a match is found.

fgrep (or grep -F) is better for this task because you are searching for a simple substring rather than a regular expression.


Annihilannic · Mar 13, 2012

Correction:

It should return 0 when a match is found, 1 otherwise.
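
A minimal sketch of the corrected behaviour: awk's system() returns the exit status of the command, and grep exits with 0 when it finds a match and 1 when it finds none. The temporary dictionary and the two test words are made up for the demonstration, and the words are assumed to consist of letters only, since they are passed to the shell unquoted.

```shell
#!/bin/sh
# Hypothetical two-word dictionary, created only for this demonstration.
dict=$(mktemp)
printf 'HEROES\nMOTORISE\n' > "$dict"

result=$(printf 'ROE\nEMREA\n' | awk -v dict="$dict" '
{
    # -F: fixed-string search, -q: print nothing and stop at the first match.
    # Assumption: $1 contains only letters, so it is safe to leave unquoted.
    status = system("grep -F -q -- " $1 " " dict)
    print $1 ": exit status " status
}')

printf '%s\n' "$result"
rm -f "$dict"
```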


FlorianAwk · Mar 21, 2012

First, I replaced the "for" loop with a "while" loop so that it stops as soon as possible.

Then, using "-q" (with "system()" alone) was a very good idea.
With "-q":
real 0m8.246s
user 0m2.262s
sys 0m4.277s
Without "-q":
real 0m43.304s
user 0m24.802s
sys 0m14.385s

In all my tests, fgrep and grep take about the same time, although grep is always slightly faster!

Thank you.
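
The final script is not shown in the thread. A minimal sketch of what such an early-stopping loop might look like, assuming the same input layout (a first line to skip, fields 1 and 2 ignored) and printing only the lines whose remaining fields are all found, is given below. The dictionary contents, the sample input and the variable names are made up for the illustration.

```shell
#!/bin/sh
# Hypothetical dictionary and input, created only for this demonstration.
dict=$(mktemp)
printf 'HEROES\nMOTORISE\nCHERIF\n' > "$dict"

result=$(printf 'header line\n04 XX ROE MOTO\n05 XX EMREA CHERIF\n' | awk -v dict="$dict" '
NR > 1 {
    ok = 1
    i = 3
    # Leave the loop at the first field that grep cannot find,
    # so no further grep processes are started for this line.
    while (ok && i <= NF) {
        # grep -q exits with 0 on a match; fields are assumed to be plain letters.
        ok = (system("grep -q " $i " " dict) == 0)
        i++
    }
    if (ok) print $0
}')

printf '%s\n' "$result"
rm -f "$dict"
```

With this input, the second data line is rejected at EMREA and CHERIF is never looked up.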
