match field in 2 files and output to a 3rd file 2

lpostell · May 16, 2005

I'm trying to create a shell script that performs a nawk command to do the following:

I have 2 files. File 1 contains only serial numbers. File 2 contains the serial number plus detailed information. I need to read the serial number file and find matches on the first 9 characters of each record in file 2. If a match is found, I need to print the detail record to a 3rd file.

File 1 - serial number file
ZZZ222222
TTT888888
FFF333333
AAA777777

File 2 - detail file
CCC444444xxx888 xx3
AAA777777zzz999 ii4
OOO333333iii000 nn9
TTT888888vvv777 ee8
SSS999999hhh444 xx9

Output file 3 should look like this:

TTT888888vvv777 ee8
AAA777777zzz999 ii4

thank you for any help provided,
Lorraine

PHV · May 17, 2005

Something like this ?
nawk 'NR==FNR{sn[NR]=$1;next}substr($1,1,9) in sn' file1 file2 > file3

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886

futurelet · May 17, 2005

Code:

# If we're reading the first file ...
NR==FNR {
  # Add serial number to list.
  list[ $0 ]
  # Skip rest of program and read next line.
  next
}

# We're reading the second file.
# A pattern with no action prints the line when the pattern is true.
length($0)>8 && substr($0,1,9) in list

Save this code in [tt]match-ser-num.awk[/tt] and run with
[tt]awk -f match-ser-num.awk file1 file2 >outfile[/tt]

Let me know whether or not this helps.

If you have nawk, use it instead of awk because on some systems awk is very old and lacks many useful features. Under Solaris, use /usr/xpg4/bin/awk.

For an introduction to Awk, see faq271-5564.

futurelet · May 17, 2005

PHV, your code is wrong. May I suggest again that you ought not to post untested code?

PHV · May 17, 2005

Sorry for the typo:
nawk 'NR==FNR{++sn[$1];next}substr($1,1,9) in sn' file1 file2 > file3

Hope This Helps, PH.
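
The difference between the two one-liners comes down to what the in operator tests: it checks array subscripts, not stored values. A minimal sketch, assuming the sample data from the first post, plain awk in place of nawk, and an empty scratch directory (bynr and bykey are output names made up for illustration):

```shell
# Sample data from the first post.
printf '%s\n' ZZZ222222 TTT888888 FFF333333 AAA777777 > file1
printf '%s\n' 'CCC444444xxx888 xx3' 'AAA777777zzz999 ii4' 'OOO333333iii000 nn9' 'TTT888888vvv777 ee8' 'SSS999999hhh444 xx9' > file2

# First attempt: the subscripts are the line numbers 1..4 and the serial
# numbers are only the stored values, so the "in" test never succeeds.
awk 'NR==FNR{sn[NR]=$1;next}substr($1,1,9) in sn' file1 file2 > bynr

# Corrected version: the serial number itself is the subscript.
awk 'NR==FNR{++sn[$1];next}substr($1,1,9) in sn' file1 file2 > bykey

cat bynr
echo ---
cat bykey
```

Here bynr ends up empty, while bykey holds the AAA777777 and TTT888888 detail records, in the order they appear in file2.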

lpostell · May 17, 2005

To PHV

God Bless you. This solution worked perfectly:

nawk 'NR==FNR{++sn[$1];next}substr($1,1,9) in sn' file1 file2 > file3

Could I trouble you to help me understand what this is doing? From my understanding:
NR - the line number of the current line (but how do I know which of the 2 input files it is? is it the first one listed?)
FNR - same thing?
(this I understand > file3 - place all output in file3)

what does this mean NR==FNR? How does that work?

I think my main question is, how do I know which commands are referring to file1 and which are referring to file2?

And how is this cycling through both files? Is it reading one record at a time from file1 and looking through all of file2 for a match each time?

On a different note - I will be running this from a unix c-shell script because there are a few things happening before and after this. What is the proper syntax to execute this from within a shell script?
Thank you so much for your time,
Lorraine
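
On the shell-script question: the command is an ordinary command line, so it can go into the script unchanged. A minimal sketch, assuming the script runs in the directory that holds file1 and file2, and using awk where the thread uses nawk. The lines below use only syntax that csh and sh share; a csh script would normally also begin with a #!/bin/csh -f line.

```shell
# Steps that run before the match would go here.

# Sample data from the first post, so the sketch can be run on its own.
printf '%s\n' ZZZ222222 TTT888888 FFF333333 AAA777777 > file1
printf '%s\n' 'CCC444444xxx888 xx3' 'AAA777777zzz999 ii4' 'OOO333333iii000 nn9' 'TTT888888vvv777 ee8' 'SSS999999hhh444 xx9' > file2

# Keep the awk program inside single quotes so the shell leaves $1 alone.
awk 'NR==FNR{++sn[$1];next}substr($1,1,9) in sn' file1 file2 > file3

# Steps that run after the match would go here.
cat file3
```

Keeping the whole awk program on one line avoids csh's rule that a quoted string cannot span lines without a backslash.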

lpostell · May 17, 2005

to futurelet

Thank you so much for your time and your solution.

awk -f match-ser-num.awk file1 file2 >outfile

Your solution would also work perfectly if my data were clean. When I ran it, I did not get all the matches that I expected, but that was because some of the serial numbers had extra spaces at the end.
So I changed a few things, and this is what ended up working for my data.

# We're reading the first file ...
NR==FNR {
  # Add serial number to list.
  list[substr($1,1,9)]
  # Skip rest of program and read next line.
  next
}

# We're reading the second file.
substr($1,1,9) in list

I like your solution because it is a little easier to understand. The evidence of that is I don't know awk, but I was able to modify it to work for my set of data.

However, I have the same question I had for PHV. I appreciate your comments in the code, that is very helpful. But, how can I understand the mind of awk? How do I know when I am reading from the first file and when I am reading from the second file? Or how do I know which file the commands I'm executing are referring to?

Does it always read through the whole first file first?

Also, do you think this will work on a full file of one million records in each file?

God bless you,
Lorraine
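
The trailing-space problem described in this post can be reproduced with a small sketch. It assumes a serial-number line that ends in one stray space (data made up for illustration, as are the output names whole_line and first_field). With $0 as the subscript the space becomes part of the key, while $1 is the first whitespace-delimited field and so excludes it.

```shell
# One serial number with a trailing space, one clean (hypothetical data).
printf '%s\n' 'TTT888888 ' 'AAA777777' > file1
printf '%s\n' 'AAA777777zzz999 ii4' 'TTT888888vvv777 ee8' > file2

# Keyed on the whole line: "TTT888888 " is 10 characters long, so no
# 9-character prefix from file2 can ever equal it.
awk 'NR==FNR{list[$0];next} substr($0,1,9) in list' file1 file2 > whole_line

# Keyed on the first field, cut to 9 characters: both serial numbers match.
awk 'NR==FNR{list[substr($1,1,9)];next} substr($1,1,9) in list' file1 file2 > first_field

cat whole_line
echo ---
cat first_field
```

Only the AAA777777 record reaches whole_line; first_field receives both records.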

futurelet · May 17, 2005

[tt]NR[/tt] is the total number of lines read so far from all the data files. [tt]FNR[/tt] is the number of lines read so far from the current data file. The two are equal only while the first file is being read (assuming it is not empty), so [tt]NR==FNR[/tt] selects the lines of the first file.

If you want to start understanding Awk, see FAQ271-5564.

Does it always read through the whole first file first?

Yes, unless you prevent it.

Also, do you think this will work on a full file of one million records in each file?

Probably. The first file is the one that has to be held in memory. Modern computers have lots of RAM and if memory is exhausted, the hard-drive is used as virtual memory.
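
To watch the two counters directly, a small sketch can print them for every line. The file names first and second and their contents are made up for illustration.

```shell
# Two tiny input files (hypothetical data).
printf '%s\n' a b c > first
printf '%s\n' x y > second

# Print the current file name and both counters for every input line.
awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' first second
```

This prints:

```
first NR=1 FNR=1
first NR=2 FNR=2
first NR=3 FNR=3
second NR=4 FNR=1
second NR=5 FNR=2
```

Awk reads the files in the order they are listed on the command line, each exactly once. FNR restarts at 1 when the second file begins while NR keeps counting, so the matching program makes a single pass over each file rather than rescanning file2 for every serial number.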
