New to using AWK - query regarding reading in from file 1

jamie999 · Sep 28, 2011

Hi there,

Firstly, apologies: I am completely new to AWK, but I am a reasonable Perl programmer.
I have already completed the following task with a Perl script, but I feel that AWK *may* be faster. In all likelihood I will need to repeat the task, possibly on larger files, so it would be good to know whether it can be done in AWK.

The task is quite simple:
A file contains a list of names. These also exist as column headers in several other files.
So for example:

name_file has:
red
blue
purple

other_file1 has:
green blue yellow red
0.4 0.3 0.2 0.7
0.1 0.5 0.9 0.2
etc...

What I need to do is extract the full column where the header matches a name from the name_file, so in this case it would be:

blue red
0.3 0.7
0.5 0.2

and then send those selected columns to a new file.

I know that you can use something like

Code:

awk -f2,4 other_file1 > new_file1

to achieve this from the command line, but this isn't really appropriate for this case.
Is it possible to do all the above with awk? I also toyed with the idea of doing the original matching in perl and then passing a string with all the positions, a bit like this:

Code:

system(awk -f$stringpos $file > $fileout)

But this had all sorts of errors, mainly because it won't accept the string being passed with the -f.

Any pointers gladly accepted!

feherke · Sep 28, 2011

Hi

jamie999 said:
I know that you can use something like

Code:

awk -f2,4 other_file1 > new_file1

Not really. That looks like [tt]cut[/tt] syntax.

Code:

awk 'FNR==NR{r[$1]=1;next}FNR==1{n=0;for(i=1;i<=NF;i++)if($i in r)c[++n]=i}{for(i=1;i<=n;i++)printf"%s%s",$c[i],i<n?OFS:ORS}' name_file other_file > new_file

Tested with [tt]gawk[/tt] and [tt]mawk[/tt].
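
For readers new to AWK, the one-liner above may be easier to follow when spread over several lines. The following is only a sketch of the same logic with longer variable names (wanted and col are illustrative names, not from the original post), run against the sample data from the question:

```shell
# Sample data from the question (file names as used in the thread).
printf '%s\n' red blue purple > name_file
printf '%s\n' 'green blue yellow red' '0.4 0.3 0.2 0.7' '0.1 0.5 0.9 0.2' > other_file

awk '
# First file (name_file): remember every wanted name, then skip to the next line.
FNR == NR { wanted[$1] = 1; next }

# Header line of the second file: record the positions of the wanted columns.
FNR == 1 {
    n = 0
    for (i = 1; i <= NF; i++)
        if ($i in wanted)
            col[++n] = i
}

# Every line of the second file, header included: print the recorded columns.
{
    for (i = 1; i <= n; i++)
        printf "%s%s", $col[i], (i < n ? OFS : ORS)
}
' name_file other_file > new_file

cat new_file
```

With the sample data this should select the blue and red columns in the order they appear in other_file, not in the order of name_file.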

Feherke.

http://free.rootshell.be/~feherke/

feherke · Sep 28, 2011

Hi

As you mentioned large files, metaprogramming could make it significantly faster: the first awk only reads the names and the header line, then generates a fixed command that does the minimum work per line of other_file:

Code:

awk 'FNR==NR{r[$1]=1;next}FNR==1{s="";for(i=1;i<=NF;i++)if($i in r)s=s (s?",":"")"\\$"i;printf"awk \"{print%s}\" \"%s\"\n",s,FILENAME;exit}' name_file other_file | sh > new_file

Or the same as above generating a [tt]cut[/tt] command :

Code:

awk 'FNR==NR{r[$1]=1;next}FNR==1{s="";for(i=1;i<=NF;i++)if($i in r)s=s (s?",":"")i;printf"cut -d\" \" -f%s \"%s\"\n",s,FILENAME;exit}' name_file other_file | sh > new_file

Tested with [tt]gawk[/tt] and [tt]mawk[/tt].
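
To see what the generator actually produces, it can help to drop the "| sh" part and look at the generated command before running it. A small sketch using the cut variant and the sample data from the question (generated_cmd is just a scratch file name chosen for this illustration):

```shell
# Sample data from the question.
printf '%s\n' red blue purple > name_file
printf '%s\n' 'green blue yellow red' '0.4 0.3 0.2 0.7' '0.1 0.5 0.9 0.2' > other_file

# Same generator as above, but the command is saved instead of being piped to sh.
awk 'FNR==NR{r[$1]=1;next}FNR==1{s="";for(i=1;i<=NF;i++)if($i in r)s=s (s?",":"")i;printf"cut -d\" \" -f%s \"%s\"\n",s,FILENAME;exit}' name_file other_file > generated_cmd

cat generated_cmd            # prints: cut -d" " -f2,4 "other_file"

# Run the generated command, which is what "| sh" does in one step.
sh generated_cmd > new_file
cat new_file
```

The awk-generating variant works the same way, except that the printed command is an awk program with the column numbers written as escaped field references.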

Feherke.


jamie999 · Sep 29, 2011

Thank you feherke - I'll have a play about with both options and see which best suits.

Yes sorry, that was the code for cut, I'd been comparing cut, awk and perl performing the same task and found that awk and cut took approximately the same time but perl was slower. I obviously mixed up my cut and awk syntax when I posted - apologies!

feherke · Sep 29, 2011

Hi

jamie999 said:
found that awk and cut took approximately the same time

That may depend on your AWK implementation too. For example, here 10 million rows took 2 seconds with [tt]cut[/tt], 5.6 seconds with [tt]mawk[/tt] and 8.6 seconds with [tt]gawk[/tt].

Feherke.


jamie999 · Sep 29, 2011

Hmmmm, in that case I may go with the cut option. Just one question: the code you've given won't work as it stands. That's not your fault; I gave a slightly simplified version of the task for ease of explanation. The names in name_file are not totally identical to the headers in other_file. In Perl I just used a match - I understand that awk has similar syntax?

feherke · Sep 29, 2011

Hi

You mean like :

Code:

red
blue
purple

vs.

Code:

green darkblue yellow reddish
0.4 0.3 0.2 0.7
0.1 0.5 0.9 0.2

Then these modifications will work :

Code:

awk 'FNR==NR{r[NR]=$1;next}FNR==1{n=0;for(i=1;i<=NF;i++)for(j=1;j in r;j++)if($i~r[j])c[++n]=i}{for(i=1;i<=n;i++)printf"%s%s",$c[i],i<n?OFS:ORS}' name_file other_file > new_file

# or

awk 'FNR==NR{r[NR]=$1;next}FNR==1{s="";for(i=1;i<=NF;i++)for(j=1;j in r;j++)if($i~r[j])s=s (s?",":"")"\\$"i;printf"awk \"{print%s}\" \"%s\"\n",s,FILENAME;exit}' name_file other_file | sh > new_file

# or

awk 'FNR==NR{r[NR]=$1;next}FNR==1{s="";for(i=1;i<=NF;i++)for(j=1;j in r;j++)if($i~r[j])s=s (s?",":"")i;printf"cut -d\" \" -f%s \"%s\"\n",s,FILENAME;exit}' name_file other_file | sh > new_file
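
Two things may be worth keeping in mind with the $i~r[j] test above: each name is treated as a regular expression, so a name containing characters such as . or + acts as a pattern, and a header that matches two different names is selected twice. The following is a possible variation, not the code from this post, that uses index() for a literal substring test and stops at the first hit per column (name, count and col are illustrative names):

```shell
# Sample data with headers that only contain the names.
printf '%s\n' red blue purple > name_file
printf '%s\n' 'green darkblue yellow reddish' '0.4 0.3 0.2 0.7' '0.1 0.5 0.9 0.2' > other_file

awk '
# First file: keep the names in order, indexed 1..count.
FNR == NR { name[++count] = $1; next }

# Header line: select a column when any name occurs inside its header.
FNR == 1 {
    n = 0
    for (i = 1; i <= NF; i++)
        for (j = 1; j <= count; j++)
            if (index($i, name[j]) > 0) {   # literal substring test, no regex
                col[++n] = i
                break                        # one hit per column is enough
            }
}

# Every line of the second file, header included: print the selected columns.
{
    for (i = 1; i <= n; i++)
        printf "%s%s", $col[i], (i < n ? OFS : ORS)
}
' name_file other_file > new_file

cat new_file
```

If the names really are meant to be patterns, the regex test from the post is the better fit; the break alone could still be added there to avoid duplicate columns.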

Feherke.


jamie999 · Sep 29, 2011

Yes, that sort of thing exactly - thank you Feherke. I'll sit down and try to learn awk properly when I have more time; it seems to be significantly quicker than Perl for these relatively simple tasks.

Thanks,
Jamie.
