use key to extract, extracted file has more keys than used to extract


will27

Technical User
Jun 13, 2007
Hi, all
I have many keys with which to extract records from other files, but the following script doesn't produce what I expected.
Both the key file and the files to be extracted from are tab-delimited.
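
For illustration, here is the kind of layout the script assumes (made-up sample values; the real contents are not shown in the thread). The key file has three tab-separated key columns, and in the data files field 5 is a date and field 10 holds the key:
Code:
# keyfile -- hypothetical sample, three key columns
K001	A17	X9
K002	B22	X9

# MF8098_dfc -- hypothetical sample, $5 = date, $10 = key
f1	f2	f3	f4	19960315	f6	f7	f8	f9	K001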


Code:
# keys are in three columns
for keycol in 1 2 3
do
    # build the list of keys taken from this column
    for key in `awk -F"\t" '{print $x}' x=$keycol keyfile | uniq`
    do
        # extract records from the other files when the key matches
        awk -F"\t" -v y=$key '$5 >= 19950101 && $10 == y' MF8098_dfc MF9906_dfc >> "$keycol"_MF_extraction
    done
done


The result is three extracted files, each produced by matching the keys from the corresponding column of the key file.

Problems:
1. The set of unique keys in an extracted file does not match the set of unique keys in the corresponding column of the key file.

2. The number of records in an extracted file evidently surpasses its theoretical limit.

3. The script runs really slowly; is there a faster way to do this?

Thank you for any suggestions.

Will
 
Hi

Well, your code really looks bad. Let us review the requirements.

From file A you extract records and put them into 3 files. The values to search for are obtained from file B. The file into which a record of file A is put depends on the column of file B from which the value was extracted.

Well, my code doesn't look too nice either; awk lacks real multidimensional arrays...
Code:
awk -F"\t" 'FNR==NR{v1[$1]=1;v2[$2]=1;v3[$3]=1;next}$5>=19950101{if(v1[$10])print>"ext1";if(v2[$10])print>"ext2";if(v3[$10])print>"ext3"}' fileB fileA
Tested with [tt]gawk[/tt] and [tt]awk95[/tt].

Note that you can add more fileAs at the end of the command line.
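
For readability, here is the same program spelled out as a multi-line [tt]awk[/tt] script with comments (a mechanical expansion of the one-liner above, not retested in this form):
Code:
# First pass (FNR==NR is true only while reading the first file, fileB):
# remember every value seen in columns 1, 2 and 3.
FNR == NR { v1[$1] = 1; v2[$2] = 1; v3[$3] = 1; next }

# Second pass (fileA): for records dated 19950101 or later, write the
# record to ext1/ext2/ext3 for each column set whose values include field 10.
$5 >= 19950101 {
    if (v1[$10]) print > "ext1"
    if (v2[$10]) print > "ext2"
    if (v3[$10]) print > "ext3"
}
Saved as, say, extract.awk (the file name is just an example), it runs as [tt]awk -F"\t" -f extract.awk fileB fileA[/tt].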

Feherke.
 
Dear feherke

Thank you so much, and sorry for this late reply.

Your code is several orders of magnitude faster than mine and, most importantly, it produces the results I want.

However, I am still wondering where the problem in my original code is (I know it's ugly, and I will keep trying to do things with one tool as much as I can); if I don't know where the error originates, I might fall victim to it again some other time.

Thank you again
will

 
Hi

Will said:
However, I am still wondering where the problem in my original code is
Your problem with [tt]uniq[/tt] is probably that your data is not sorted: [tt]uniq[/tt] only removes adjacent duplicates, so it expects sorted input. You should probably use [tt]sort -u[/tt] instead. And because duplicate keys survive, the inner [tt]awk[/tt] runs more than once for the same key and appends the same matching records repeatedly, which would also explain the inflated record counts.
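
For example, with made-up data:
Bash:
[blue]master #[/blue] printf 'a\nb\na\n' | uniq
a
b
a

[blue]master #[/blue] printf 'a\nb\na\n' | sort -u
a
b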

Regarding the speed:
[ul]
[li]you start several [tt]awk[/tt] processes, one per unique key per column (a single-pass alternative for collecting the keys is sketched after this list)[/li]
[li]each [tt]awk[/tt] process in the inner loop takes a full pass through the input files[/li]
[li]you pass the data through a pipe to a second process
Bash:
[blue]master #[/blue] time awk '{print $x}' x=1 fileA | sort -u > output1
real    0m0.170s
user    0m0.138s
sys     0m0.154s

[blue]master #[/blue] time awk '!u[$x]{print $x;u[$x]=1}' x=1 fileA > output2
real    0m0.118s
user    0m0.124s
sys     0m0.016s
[/li]
[li]you write the output sequentially, line by line
Bash:
[blue]master #[/blue] time for ((i=0;i<10000;i++)); do echo $i >> outfile1; done
real    0m7.419s
user    0m1.500s
sys     0m4.219s

[blue]master #[/blue] time for ((i=0;i<10000;i++)); do echo $i; done > outfile2
real    0m1.568s
user    0m0.844s
sys     0m0.672s
[/li]
[/ul]
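As mentioned above, the keys of all three columns can also be collected in a single [tt]awk[/tt] pass over the key file instead of one pass per column; a minimal sketch (untested, and assuming the keys sit in columns 1 to 3 as in your loop):
Bash:
[blue]master #[/blue] awk -F"\t" '{for(c=1;c<=3;c++)if(!seen[c,$c]++)print c "\t" $c}' keyfile
This prints each distinct column/key pair once, in input order, with no need for [tt]sort[/tt] or [tt]uniq[/tt].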
This is why I like shell scripting: it immediately shows the inefficiencies of your code. :)

Feherke.
 
Thanks, feherke
that helps a lot

will
 
