AWK getline and join by step

judkil · Mar 23, 2020

Hello

I want to perform a join in which awk takes each name from a first file and places it in field 1 of the output, then looks in a second file for the matching extended names and places them in field 2.

file_1:
AB-00050832
AB-00058394
AB-00050862
AB-00004123

file_2:
AB-00050832-18.....1....-8.900758
AB-00058394-10.....2....-7.981418
AB-00050832-24.....3....-7.634420
AB-00050862-10.....4....-7.621671
AB-00004123-1......5....-7.386272
AB-00058394-6.......6....-7.383604
AB-00050832-12....14....-7.038594
AB-00050862-6.....50....-6.701126

output:
AB-00050832.....AB-00050832-18.....1....-8.900758
........................AB-00050832-24.....3....-7.634420
........................AB-00050832-12....14....-7.038594
AB-00058394.....AB-00058394-10.....2....-7.981418
........................AB-00058394-6.......6....-7.383604
AB-00050862.....AB-00050862-10.....4....-7.621671
........................AB-00050862-6.....50....-6.701126
AB-00004123.....AB-00004123-1......5....-7.386272

I added the dots only to make the layout easier to read; they have no meaning.
file_2 is:
- joined to file_1
- grouped following the order of file_1. The only important thing is to keep the order in which the names appear in file_1.

I tried with getline, but I feel like there is something I can't figure out.

If you have an idea, thank you very much!

mikrom · Mar 23, 2020

Maybe something similar to this:

https://www.tek-tips.com/viewthread.cfm?qid=1629230

judkil · Mar 23, 2020

It is not the same problem, because awk must join the short name "B" to the extended name "B_1". For example:
file_1:
A
B
C

file_2:
A_1
C_2
C_1
B_4
A_2
A_3
B_1

output:
A....A_1
......A_2
......A_3
B.....B_4
.......B_1
C....C_2
......C_1
The order of appearance in file_1 is respected, and so is the order of appearance in file_2. It's not really a join in which two identical columns are merged.

mikrom · Mar 23, 2020

I modified the old script I mentioned above and got this result:

Code:

$ awk -f judkil_join.awk judkil_file1.txt judkil_file2.txt
AB-00050832 .. AB-00050832-18.....1....-8.900758
........... .. AB-00050832-24.....3....-7.634420
........... .. AB-00050832-12....14....-7.038594
AB-00058394 .. AB-00058394-10.....2....-7.981418
........... .. AB-00058394-6.......6....-7.383604
AB-00050862 .. AB-00050862-10.....4....-7.621671
........... .. AB-00050862-6.....50....-6.701126
AB-00004123 .. AB-00004123-1......5....-7.386272
Done.

Is this the result you need?
Note that I swapped the order of the files on the command line: my judkil_file1.txt is your file_2
and my judkil_file2.txt is your file_1.

mikrom · Mar 24, 2020

Hi judkil,
Is your problem already solved? Here is my script from yesterday:

Code:

# Run:
# awk -f judkil_join.awk judkil_file1.txt judkil_file2.txt
BEGIN {
}
{
  if (FILENAME == ARGV[1]) {
    # get key from line
    key=substr($0,1,11)
    # add line from 1.file into array
    if (line_array[key]) {
      line_array[key] = line_array[key] ";" $0
    }
    else {
      line_array[key] = $0
    }
  }
  if (FILENAME == ARGV[2]) {
    # print adequate lines from 2.file
    if (line_array[$1]) {
      #print line_array[$1]
      print_list_of_lines($1, line_array[$1])
    }
  }
}
END {
  print "Done."
}

function print_list_of_lines(key, my_list) {
  # prints list of lines separated by ;
  n=split(my_list,my_array,";")
  for(i=1; i <= n; i++) {
    line = my_array[i]
    if (i==1) {
      line_begin = key
    }
    else {
      line_begin = "..........."
    }
    printf("%11s .. %s\n", line_begin, line)
  }
}

judkil · Mar 24, 2020

Thank you very much !

However, I would like to be able to use a name of any length, not one limited to 11 characters. What should I put in the substr?

mikrom · Mar 24, 2020

For simplicity, I extracted the key with the substr() function:

Code:

  key=substr($0,1,11)

Another, more flexible option, if you have a line like this:

Code:

AB-00050832-18.....1....-8.900758

then you can split it into an array and take the first 2 elements, like this:

Code:

split($0,key_array,"-")
key = key_array[1] "-" key_array[2]

So in case of

Code:

AB-00050832-18.....1....-8.900758

you will get the key value

Code:

AB-00050832

and e.g. for the case of

Code:

AB-VERY_LONG_KEY-18.....1....-8.900758

you will get the key value

Code:

AB-VERY_LONG_KEY
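
To try this quickly without any files, here is a minimal sketch that feeds the two sample lines from above to awk through a pipe. The shell variable keys is just a name I picked for the example.

```shell
# Derive the key from each sample line by splitting on "-"
# and joining the first two pieces back together.
keys=$(printf '%s\n' \
  'AB-00050832-18.....1....-8.900758' \
  'AB-VERY_LONG_KEY-18.....1....-8.900758' |
  awk '{
    split($0, key_array, "-")
    print key_array[1] "-" key_array[2]
  }')
printf '%s\n' "$keys"
```

This should print AB-00050832 and AB-VERY_LONG_KEY, one per line. It only works as long as the short name itself contains exactly one "-".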

judkil · Mar 24, 2020

Thank you very much! However, I get a syntax error when I replace

Code:

 key = key_array[1] "-" key_array[2]

by

Code:

 key = 1[1] "-" 1[2]

I did this because the name is in field 1.

mikrom · Mar 24, 2020

There isn't a problem.

Here is the modified source:
judkil_join.awk

Code:

# Run:
# awk -f judkil_join.awk judkil_file1.txt judkil_file2.txt
BEGIN {
}
{
  if (FILENAME == ARGV[1]) {
    # get key from line
    split($0,key_array,"-")
    key = key_array[1] "-" key_array[2]
    # add line from 1.file into array
    if (line_array[key]) {
      line_array[key] = line_array[key] ";" $0
    }
    else {
      line_array[key] = $0
    }
  }
  if (FILENAME == ARGV[2]) {
    # print adequate lines from 2.file
    if (line_array[$1]) {
      print_list_of_lines($1, line_array[$1])
    }
  }
}
END {
  print "Done."
}

function print_list_of_lines(key, my_list) {
  # prints list of lines separated by ;
  n=split(my_list,my_array,";")
  for(i=1; i <= n; i++) {
    line = my_array[i]
    if (i==1) {
      line_begin = key
    }
    else {
      line_begin = "..........."
    }
    printf("%-20s .. %s\n", line_begin, line)
  }
}

I added some lines with a long key to both files:
judkil_file1.txt

Code:

AB-00050832-18.....1....-8.900758
AB-00058394-10.....2....-7.981418
AB-00050832-24.....3....-7.634420
AB-VERY_LONG_KEY-8......4....5.55
AB-00050862-10.....4....-7.621671
AB-00004123-1......5....-7.386272
AB-00058394-6.......6....-7.383604
AB-00050832-12....14....-7.038594
AB-VERY_LONG_KEY-7......3....4.44
AB-00050862-6.....50....-6.701126

judkil_file2.txt

Code:

AB-00050832
AB-VERY_LONG_KEY
AB-00058394
AB-00050862
AB-00004123

Now, when I run it I get this result:

Code:

$ awk -f judkil_join.awk judkil_file1.txt judkil_file2.txt
AB-00050832          .. AB-00050832-18.....1....-8.900758
...........          .. AB-00050832-24.....3....-7.634420
...........          .. AB-00050832-12....14....-7.038594
AB-VERY_LONG_KEY     .. AB-VERY_LONG_KEY-8......4....5.55
...........          .. AB-VERY_LONG_KEY-7......3....4.44
AB-00058394          .. AB-00058394-10.....2....-7.981418
...........          .. AB-00058394-6.......6....-7.383604
AB-00050862          .. AB-00050862-10.....4....-7.621671
...........          .. AB-00050862-6.....50....-6.701126
AB-00004123          .. AB-00004123-1......5....-7.386272
Done.
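
One thing to keep in mind: the script above glues the stored lines together with ";", so a data line that itself contains ";" would later be split in the wrong place. Here is a minimal sketch of a variant that avoids this by keeping a counter per key and storing each line under (key, counter). The array names count and lines, the NR == FNR test for "still reading the first file", and the temporary file names are my own choices for illustration. The NR == FNR test also assumes the first file is not empty.

```shell
# Build two small sample files in a temporary directory.
tmp=$(mktemp -d)
cat > "$tmp/data.txt" <<'EOF'
AB-00050832-18.....1....-8.900758
AB-00058394-10.....2....-7.981418
AB-00050832-24.....3....-7.634420
EOF
cat > "$tmp/keys.txt" <<'EOF'
AB-00050832
AB-00058394
EOF

out=$(awk '
NR == FNR {
  # First file: remember every extended line under (key, running number).
  split($0, key_array, "-")
  key = key_array[1] "-" key_array[2]
  count[key]++
  lines[key, count[key]] = $0
  next
}
{
  # Second file: print the stored lines in the order the keys appear here.
  n = count[$1]
  for (i = 1; i <= n; i++)
    printf("%-20s .. %s\n", (i == 1 ? $1 : "..........."), lines[$1, i])
}
' "$tmp/data.txt" "$tmp/keys.txt")

printf '%s\n' "$out"
rm -rf "$tmp"
```

The output has the same layout as the result above, but no separator character is needed, so the data lines can contain anything.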

judkil · Mar 24, 2020

But is key_array a field ($1) or just a number?

Code:

 split($0,$1,"-")
key = $1[1] "-" $1[2]

mikrom · Mar 24, 2020

What you are trying here is completely wrong:

Code:

split($0,$1,"-")
key = $1[1] "-" $1[2]

Maybe you should read in the manual how the split function works:

https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html

Look at the source above, where I'm using

Code:

split($0,key_array,"-")
key = key_array[1] "-" key_array[2]

and try it to see how it works.
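
If the point is that the name sits in field 1, then the field goes in as the first argument of split() (the string to be split), while the second argument stays an array name. A minimal sketch, assuming a line whose columns are separated by blanks; the sample line and the shell variable key are made up for the example:

```shell
key=$(printf '%s\n' 'AB-00050832-18 1 -8.900758' |
  awk '{
    # The second argument of split() is the name of an array to fill;
    # a field such as $1 can only be the string that gets split.
    split($1, key_array, "-")
    print key_array[1] "-" key_array[2]
  }')
printf '%s\n' "$key"
```

Here $1 is "AB-00050832-18", so this should print AB-00050832.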

judkil · Mar 24, 2020

I just understood that "key_array" is a variable name. I thought it was a field ($1) or a number.

Thank you very much !

mikrom · Mar 24, 2020

key_array is the name I gave to the array which will be created by the split() function.
For example, if we have this line i.e. this string in the field $0:
"AB-VERY_LONG_KEY-8......4....5.55"
then
split($0,key_array,"-")
creates this array of 3 strings
key_array = ("AB" | "VERY_LONG_KEY" | "8......4....5.55")
i.e.
key_array[1] = "AB"
key_array[2] = "VERY_LONG_KEY"
key_array[3] = "8......4....5.55"
and then we create the key with concatenation
key = key_array[1] "-" key_array[2]
with the result
key = "AB-VERY_LONG_KEY"
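
split() also returns the number of pieces it created, which is handy for checking what ended up in the array. A small sketch you can run as it is, using one of the lines from your file; the shell variable report is just a name for the example. Note that this line gives 4 pieces, not 3, because the "-" of the negative number at the end is a separator too:

```shell
report=$(awk 'BEGIN {
  # n receives the number of elements stored in key_array.
  n = split("AB-00050832-18.....1....-8.900758", key_array, "-")
  print "pieces: " n
  for (i = 1; i <= n; i++)
    print i ": " key_array[i]
}')
printf '%s\n' "$report"
```

This prints "pieces: 4" followed by the four elements AB, 00050832, 18.....1.... and 8.900758. The key is still built from the first two elements only, so the extra piece does no harm.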

I hope this helps you understand how to use awk to solve your problems.

judkil · Mar 24, 2020

I understand much better now, especially how to get past the limits of the substr function!

Thank you!
