
using keys to extract: extracted file has more keys than were used to extract


will27 (Technical User)
Jun 13, 2007
Hi, all
I have many keys with which to extract records from other files. However, the following script does not spell out what I expected.
Both the key file and the files to be extracted from are tab-delimited.


# keys are in three columns
for keycol in 1 2 3
do
    # create a list containing the keys from this column
    for key in `awk -F"\t" '{print $x}' x=$keycol keyfile | uniq`
    do
        # extract records from the other files when the key matches
        awk -F"\t" -v y=$key '$5 >= 19950101 && $10 == y' MF8098_dfc MF9906_dfc >> "$keycol"_MF_extraction
    done
done




The result is three extracted files, each produced by matching the keys in the corresponding column of the key file.

Problems:
1. uniq(keys in an extracted file) does not match uniq(keys in the corresponding column of the key file),

2. the number of records in an extracted file evidently surpasses its theoretical limit,

3. and the script runs really slowly. Is there a faster way to do this?

Thank you for any suggestions

Will
 
Hi

Well, your code really looks bad. Let us review the requirements.

From file A you extract records and put them into 3 files. The values to search for are obtained from file B. The file into which a record of file A will be put depends on the column of file B from which the value was extracted.

Well, my code does not look too nice either; [tt]awk[/tt] misses real multidimensional arrays...
Code:
awk -F"\t" 'FNR==NR{v1[$1]=1;v2[$2]=1;v3[$3]=1;next}$5>=19950101{if(v1[$10])print>"ext1";if(v2[$10])print>"ext2";if(v3[$10])print>"ext3"}' fileB fileA
Tested with [tt]gawk[/tt] and [tt]awk95[/tt].

Note that you can add more fileAs at the end of the command line.
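
For readability, here is the same program spread over multiple lines with comments; the logic is unchanged:
Code:
awk -F"\t" '
# first file (fileB): remember the keys seen in columns 1-3
FNR==NR { v1[$1]=1; v2[$2]=1; v3[$3]=1; next }
# remaining files (fileA, ...): keep records from 19950101 on
$5 >= 19950101 {
    if (v1[$10]) print > "ext1"
    if (v2[$10]) print > "ext2"
    if (v3[$10]) print > "ext3"
}' fileB fileA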

Feherke.
 
Dear feherke

I really thank you so much, and sorry for this late gratitude.

Your code is several orders of magnitude faster than mine and, most importantly, it spells out the results I want.

However, I am still wondering where the problem in my original code lies (I know it's ugly, and I will stick to the style of doing things with one tool as much as I can); if I don't know where the error originates, I might fall victim to it again sometime.

Thank you again
will

 
Hi

Will said:
However, I am still wondering where the problem in my original code lies
Your problem with [tt]uniq[/tt] is probably that your data is not sorted: [tt]uniq[/tt] only removes adjacent duplicates, so it expects sorted input. With unsorted input, duplicate keys survive, and each duplicate runs another full extraction pass that appends the same records again. That explains both the extra keys and the extra records. You should probably use [tt]sort -u[/tt] instead.
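
A minimal demonstration with made-up keys:
Bash:
[blue]master #[/blue] printf 'a\nb\na\n' | uniq
a
b
a

[blue]master #[/blue] printf 'a\nb\na\n' | sort -u
a
b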

Regarding the speed :
[ul]
[li]you start several [tt]awk[/tt] processes[/li]
[li]each [tt]awk[/tt] process in the inner loop takes a full pass through the input files (see the sketch below for a single-pass-per-column variant)[/li]
[li]you pass data through a pipe to a second process
Bash:
[blue]master #[/blue] time awk '{print $x}' x=1 fileA | sort -u > output1
real    0m0.170s
user    0m0.138s
sys     0m0.154s

[blue]master #[/blue] time awk '!u[$x]{print $x;u[$x]=1}' x=1 fileA > output2
real    0m0.118s
user    0m0.124s
sys     0m0.016s
[/li]
[li]you append to the output file line by line, reopening it each time
Bash:
[blue]master #[/blue] time for ((i=0;i<10000;i++)); do echo $i >> outfile1; done
real    0m7.419s
user    0m1.500s
sys     0m4.219s

[blue]master #[/blue] time for ((i=0;i<10000;i++)); do echo $i; done > outfile2
real    0m1.568s
user    0m0.844s
sys     0m0.672s
[/li]
[/ul]
This is why I like shell scripting: it immediately shows the inefficiencies of your code. :)
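
For comparison, a hypothetical middle ground that keeps your one-file-per-column layout but makes only one pass per key column (file names taken from your original post):
Code:
for keycol in 1 2 3
do
    # load the whole key column into an array once,
    # then match the data files in a single pass
    awk -F"\t" -v c=$keycol '
        FNR==NR { k[$c]=1; next }
        $5 >= 19950101 && k[$10]
    ' keyfile MF8098_dfc MF9906_dfc > "$keycol"_MF_extraction
done
This also makes duplicate keys harmless, since each key is stored only once in the array.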

Feherke.
 
Thanks, feherke
that helps a lot

will
 