search from one file based on contents of another

Status
Not open for further replies.

tafkid

Programmer
Dec 14, 2008
9
US
I have two related sets of files that I need to search. Please note that the dashes are not part of the files; they just separate the file names from the contents.
Set 1 has the following fields: date, record number, name

set1_2009-05-20.txt << file name
-------------------
2009-05-20~1~Name1
2009-05-20~2~Name2
2009-05-20~3~Name1
2009-05-20~4~Name3
2009-05-20~17~Name3

set1_2009-05-14.txt
--------------------
2009-05-21~14~Name4
2009-05-21~15~Name1
2009-05-21~17~Name1
2009-05-21~18~Name4
2009-05-21~21~Name4

Set 2 has the following fields: record number, value

set2_2009-05-20.txt << file name
-------------------
1~der
1~tyu
2~rxs
2~43w
3~5sd
4~rte
17~56g

set2_2009-05-14.txt
--------------------
14~res
15~oph
15~fgh
17~def
18~aba
21~ghc

At the end of my search I am supposed to come up with two files, set1_out.txt and set2_out.txt. From set 1 (the easy part) I need all records with Name1,
which I get by running this simple awk command:
Code:
awk -F~ '{if (tolower($3)=="name1") print $0}' set1*.txt >set1_out.txt

this gives me
set1_out.txt
-----------------------------
2009-05-20~1~Name1
2009-05-20~3~Name1
2009-05-21~15~Name1
2009-05-21~17~Name1

Getting set2_out.txt is a little trickier. For every (record number, date) pair in set1_out.txt I have to find the matching records in the set 2 file for that date,
and end up with a file with the structure "record number~value~date", where date is the date stamp in the name of the set 2 file being processed (which is also the date in the matching line of set1_out.txt). E.g.

set2_out.txt
----------------
1~der~2009-05-20
1~tyu~2009-05-20
3~5sd~2009-05-20
15~oph~2009-05-21
15~fgh~2009-05-21
17~def~2009-05-21

With my limited shell scripting skills I have no idea where to begin on the second part. I hope this is the right forum for this.

Thanks in advance for your help
 
Hi

As far as I understand, there is no way to accomplish that. The record numbers are not unique, so date+record number pairs should be used. But the set2* files have no date.

Ideally the date in the file name could be used, but as your set1* sample shows, there is no relation between the date in the file name and the dates in the file contents. So we can not use this :
Code:
awk -F~ -vOFS=~ 'FNR==NR{p[$1,$2]=1;next}FNR==1{f=substr(FILENAME,6,10)}p[f,$1]{$3=f;print}' set1_out.txt set2*.txt > set2_out.txt
By the way, your code for set1_out can be reduced to this :
Code:
awk -F~ 'tolower($3)=="name1"' set1*.txt > set1_out.txt


Feherke.
 
Sorry about that. There was a typo in the names of the second file in set 1 and set 2. The file names should have been set1_2009-05-21.txt and set2_2009-05-21.txt respectively, not setX_2009-05-14.txt as shown, so the date in the contents is related to the file name. Simply put, I'm trying to match the record number in a set1 file with a record number in the set2 file that has the same date stamp as the set1 file.
Thanks for the pointer on shrinking my code.
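
For anyone following along, here is a minimal sketch that recreates the sample data from the first post under the corrected file names and runs both steps. The scratch directory is an assumption, and two details differ from the one-liner posted above: the separator is quoted (an unquoted `OFS=~` passed as its own word can be tilde-expanded by bash), and the lookup uses `(f,$1) in p` instead of `p[f,$1]` so that misses do not add entries to the array.

```shell
# Recreate the thread's sample data in a scratch directory (assumed location).
work=$(mktemp -d) && cd "$work" || exit 1

printf '%s\n' '2009-05-20~1~Name1' '2009-05-20~2~Name2' '2009-05-20~3~Name1' \
    '2009-05-20~4~Name3' '2009-05-20~17~Name3' > set1_2009-05-20.txt
printf '%s\n' '2009-05-21~14~Name4' '2009-05-21~15~Name1' '2009-05-21~17~Name1' \
    '2009-05-21~18~Name4' '2009-05-21~21~Name4' > set1_2009-05-21.txt
printf '%s\n' '1~der' '1~tyu' '2~rxs' '2~43w' '3~5sd' '4~rte' '17~56g' > set2_2009-05-20.txt
printf '%s\n' '14~res' '15~oph' '15~fgh' '17~def' '18~aba' '21~ghc' > set2_2009-05-21.txt

# Step 1: keep the set1 records whose name is Name1 (case-insensitive).
awk -F'~' 'tolower($3)=="name1"' set1_????*.txt > set1_out.txt

# Step 2: remember every (date, record number) pair from set1_out.txt, then
# print the set2 lines whose (file-name date, record number) pair was seen.
awk -F'~' -v OFS='~' '
    FNR==NR { p[$1,$2]=1; next }           # first file: set1_out.txt
    FNR==1  { f=substr(FILENAME,6,10) }    # date part of set2_YYYY-MM-DD.txt
    (f,$1) in p { $3=f; print }            # append the date as a third field
' set1_out.txt set2_????*.txt > set2_out.txt

cat set2_out.txt
```

On this data the result should be the six lines listed as set2_out.txt in the first post; note that 17~56g is skipped because record 17 only belongs to Name1 on 2009-05-21.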
 
Hi feherke

I have tested your code but it doesn't seem to be working; it produces an empty set2_out.txt. I can't figure out what is wrong.

Also, is it possible to use only the dates in the file names instead of the date in the file contents from set1?

thanks
 
Hi

It works for me with gawk and mawk. Which awk implementation are you using?

tafkid said:
Also is it possible to use the dates on the file names only instead of using the date in the file contents from set1?
I see no reason for that, but you can do it by reusing the first block of my code to place the date in variable f :
Code:
awk -F~ -vOFS=~ 'FNR==1{f=substr(FILENAME,6,10)}tolower($3)=="name1"{$1=f;print}' set1*.txt > set1_out.txt
By the way, you know that once set1_out.txt has been created, subsequent runs of that command will pick it up as an input file too, because it matches the set1*.txt wildcard? I suggest specifying the input files as set1_????*.txt to exclude the set1_out.txt file.

Feherke.
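
To illustrate the wildcard point, here is a small sketch using a made-up scratch directory and a two-line input file. It runs the file-name-date variant once and then counts how many files each pattern would match on a later run.

```shell
# Scratch directory and sample file are assumptions for illustration only.
work=$(mktemp -d) && cd "$work" || exit 1
printf '%s\n' '2009-05-20~1~Name1' '2009-05-20~2~Name2' > set1_2009-05-20.txt

# First run: take the date from the file name rather than from field 1.
awk -F'~' -v OFS='~' '
    FNR==1 { f=substr(FILENAME,6,10) }
    tolower($3)=="name1" { $1=f; print }
' set1_????*.txt > set1_out.txt

# On a later run the broad wildcard also matches the old set1_out.txt ...
set -- set1*.txt;      broad=$#
# ... while the narrower pattern only matches the dated input files.
set -- set1_????*.txt; narrow=$#

echo "set1*.txt matches $broad file(s), set1_????*.txt matches $narrow"
```

The narrower pattern works because the four `?` characters must be followed by a literal `.txt`, which "out.txt" is too short to satisfy.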
 
I have re-tested and both variations work. Thanks a lot for the help.
 
So far I have been able to use the awk commands with good results; however, I have noticed that as my set of files grows, it takes longer to get the output files. In time I may have memory issues. I'm thinking one way to cut down on the time and memory used would be to search only those set2 files that are relevant.
For example, given the files:

set1                  set2
set1_2009-05-14.txt   set2_2009-05-14.txt
set1_2009-05-15.txt   set2_2009-05-15.txt
set1_2009-05-16.txt   set2_2009-05-16.txt

If, using the command below, I only find records in set1_2009-05-15.txt, then I would only parse set2_2009-05-15.txt and ignore all other set2 files.

Code:
awk -F~ 'tolower($3)=="name1"' set1*.txt > setA_out.txt

My thought was to use the first column of setA_out.txt, which is the date, and pass that to the command below so that only the corresponding set2 files are parsed.

Code:
awk -F~ -vOFS=~ 'FNR==NR{p[$1,$2]=1;next}FNR==1{f=substr(FILENAME,6,10)}p[f,$1]{$3=f;print}' setA_out.txt set2*.txt > setB_out.txt

I'm stuck because I don't know how to take the date that was read and build the set2 file name from it. Is that even possible?

Thanks
 
You could use a trick like this to append the input file names to the command-line:

Code:
awk -F~ -vOFS=~ '
    FNR==NR{
        p[$1,$2]=1
        if (!($1 in dates)) {
             dates[$1]=1
             ARGV[ARGC++]="set2_"$1".txt"
        }
        next
    }
    FNR==1{f=substr(FILENAME,6,10)}p[f,$1]{$3=f;print}
' setA_out.txt > setB_out.txt

Annihilannic.
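
A self-contained sketch of the same idea on made-up scratch data, for anyone who wants to try it. Two things here are additions rather than part of the post above: the `getline` check that skips a date with no set2 file (otherwise awk would complain about the missing input), and the `(f,$1) in p` membership test in place of `p[f,$1]`.

```shell
# Scratch directory and file contents are assumptions for illustration only.
work=$(mktemp -d) && cd "$work" || exit 1
printf '%s\n' '2009-05-20~1~Name1' '2009-05-21~15~Name1' '2009-05-23~9~Name1' > setA_out.txt
printf '%s\n' '1~der' '2~rxs'   > set2_2009-05-20.txt
printf '%s\n' '15~oph' '16~zzz' > set2_2009-05-21.txt
printf '%s\n' '1~nope'          > set2_2009-05-22.txt   # never queued: no hit for this date
# There is deliberately no set2_2009-05-23.txt.

awk -F'~' -v OFS='~' '
    FNR==NR {
        p[$1,$2]=1
        if (!($1 in dates)) {
            dates[$1]=1
            file="set2_" $1 ".txt"
            # Queue the file only if it can be opened (assumed safeguard).
            if ((getline line < file) >= 0) ARGV[ARGC++]=file
            close(file)
        }
        next
    }
    FNR==1 { f=substr(FILENAME,6,10) }     # date part of the set2 file name
    (f,$1) in p { $3=f; print }
' setA_out.txt > setB_out.txt

cat setB_out.txt
```

Only the set2 files for dates that appear in setA_out.txt are ever opened, so the cost grows with the number of matching dates rather than with the total number of set2 files.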
 
Didn't get round to using your example until this morning but it works. Thanks Annihilannic. It easily scales to hundreds of files.
 
Hi

Here on Tek-Tips we usually say thanks for the help received by giving stars. Please click the "Thank Annihilannic for this valuable post!" link at the bottom of Annihilannic's post. That way you both show your gratitude and mark this thread as helpful.

Feherke.
 