Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Merge two files in awk more efficiently 1

Status
Not open for further replies.

mmmdoughnuts

Technical User
Nov 1, 2007
4
US
Hi!
I'm not very experienced at programming but I wrote a script to manipulate two sets of data into a third file, but it's ugly and probably not very efficient. Here is the problem: Suppose I have two files, a.txt and b.txt; a.txt contains five columns (separated by spaces) and b.txt contains 13 columns (separated by commas). a.txt will always have fewer lines than b.txt and the values in column 1 of a.txt can be found in column 1 of b.txt. For example:

a.txt:
1111 asdf asdf asdf asdf
3333 asdf asdf asdf afsf
7777 assa asdf asdf skfa

b.txt
1111,2,3,4,5,6,7,8,9,10,11,12,1111
2222,2,3,4,5,6,7,8,9,10,11,12,1234
3333,2,3,4,5,6,7,8,9,10,11,12,3333
4444,2,3,4,5,6,7,8,9,10,11,12,4444
5555,2,3,4,5,6,7,8,9,10,11,12,5555
6666,2,3,4,5,6,7,8,9,10,11,12,6666
7777,2,3,4,5,6,7,8,9,10,11,12,5678
8888,2,3,4,5,6,7,8,9,10,11,12,8888

The important thing to note is that 1111 and 3333 are the same between column 1 in a.txt and column 1 and 13 in b.txt, but 7777 has a different value for column 13 in b.txt (the rest of the data in both files is irrelevant to the manipulation I need to do). I need to merge these two files such that the value from column 13 in b.txt replaces the value in column 1 in a.txt, like so:

1111 asdf asdf asdf asdf
3333 asdf asdf asdf afsf
5678 assa asdf asdf skfa

So, I wrote the following script:

awk '{printf "%s\n",$1}' a.txt > a.column1.txt
for X in `cat a.column1.txt`
do
grep $X b.txt | /bin/awk -F, '{printf "%s\n",$13}' | sed 's/"//g' >> b.column13.txt
done
awk '{printf "%s %s %s %s\n",$2,$3,$4,$5}' a.txt > a.cols2to5.txt
sdiff b.column13.txt a.cols2to5.txt | sed 's/|//g' > ab.merged

As you can tell from the way I explained the problem and the way I wrote the script, I'm new to scripting :)

However, I don't have to be an expert to see that there's probably a much better way to do this. Any feedback would be greatly appreciated.
 
Thanks Feherke for your response. If you don't mind, I'd like to learn this a bit better and some stuff is in the man (like FNR, NR, etc.), but I think I'm missing the logic. Is it possible for you to break down your post so that I can understand what the line is doing? Also, part of the reason I'm asking is because your solution produces this:

asdf asdf asdf asdf
asdf asdf asdf afsf
5678 assa asdf asdf skfa

whereas I need this:

1111 asdf asdf asdf asdf
3333 asdf asdf asdf afsf
5678 assa asdf asdf skfa

So, if you explain to me what your statement does, then I can use it to learn awk and perhaps fiddle with it a bit myself.

I guess the simplest way to look at my problem is I'd like to substitute column1 in a.txt with column 13 in b.txt (with the correct corresponding values of course).

Once again, thank you for your help :)
 
Hi

Code:
awk -F, '      [gray]# will read b.txt first, so set the field separator to ","[/gray]
FNR==NR {      [gray]# if current file record number == overall record number[/gray]
               [gray]# ( while processing the first input file )[/gray]
  b[$1]=$NF    [gray]# we store the last field in an array[/gray]
               [gray]# ( b["1111"]="1111", b["3333"]="3333", b["7777"]="5678" )[/gray]
  [b]next[/b]         [gray]# skip the next code blocks[/gray]
}
[red]FNR==[/red]1 {       [gray]# if current file record number == 1[/gray] [red](*)[/red]
               [gray]# ( first line of the second file )[/gray]
  FS=[i]" "[/i]       [gray]# set the field separator to " "[/gray]
  $0=$0        [gray]# force re-parsing the current record with the new separator[/gray]
}
{              [gray]# always execute this block ( when reached... )[/gray]
  $1=b[$1]     [gray]# replace the first field with matching last field of b.txt[/gray]
}
1              [gray]# always true, always do the default action ( print $0 )[/gray]
' b.txt a.txt  [gray]# first will read b.txt to extract the replacement values[/gray]
[red](*)[/red] There was a minor bug there. My previous code executed that block for every record of the second file. But was just matter of efficiency.

I made my two test files from the sample data you posted and I get this output :
Code:
[blue]master #[/blue] awk -F, 'FNR==NR{b[$1]=$NF;next}FNR==1{FS=" ";$0=$0}{$1=b[$1]}1' b.txt a.txt
1111 asdf asdf asdf asdf
3333 asdf asdf asdf afsf
5678 assa asdf asdf skfa
By the way, I tested it with [tt]gawk[/tt] and [tt]mawk[/tt].


Feherke.
 
Hi Feherke.

Okay, I think I fixed the problem. There were actually two. The first had to do with inconsistent formatting of the data files which I couldn't see unless I used vi. Some lines had spaces, others had tabs, and some lines in b.txt had both double quotes (") and commas (,) within the fields that were separated by commas (,). Thanks to your breakdown, I solved this by filtering out the " with sed and removing the FS==" " section.

The second problem had to do with the fact that some data in column 1 in a.txt had a leading zero whereas the corresponding row in b.txt didn't so $1=b[$1] had nothing to match with. However, since I created a.txt from another script, I edited that script to set column1 in a.txt to %d and not %s and that fixed the problem for the subsequent parse.

Hope all that makes sense. In any case, thanks a bunch for all your help :)
 
Hi

Thanks for the follow up.

One, maybe interesting, thing I found out in meantime. That one-liner does not work correctly in [tt]nawk[/tt]. It skips the a.txt's first line. :-( Quite surprising for me.

Feherke.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top