Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Chris Miller on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

AWK getline and join by step

Status
Not open for further replies.

judkil

Technical User
Mar 23, 2020
9
FR
227/5000
Hello

I want to perform a join where AWK takes the name of a line in a first file, places it in fields 1 in the output file, then goes to a second file takes the extended names and places them in fields 2.

file_1:
AB-00050832
AB-00058394
AB-00050862
AB-00004123

file_2:
AB-00050832-18.....1....-8.900758
AB-00058394-10.....2....-7.981418
AB-00050832-24.....3....-7.634420
AB-00050862-10.....4....-7.621671
AB-00004123-1......5....-7.386272
AB-00058394-6.......6....-7.383604
AB-00050832-12....14....-7.038594
AB-00050862-6.....50....-6.701126

output:
AB-00050832.....AB-00050832-18.....1....-8.900758
........................AB-00050832-24.....3....-7.634420
........................AB-00050832-12....14....-7.038594
AB-00058394.....AB-00058394-10.....2....-7.981418
........................AB-00058394-6.......6....-7.383604
AB-00050862.....AB-00050862-10.....4....-7.621671
........................AB-00050862-6.....50....-6.701126
AB-00004123.....AB-00004123-1......5....-7.386272

I added points to better visualize, but they are not useful.
file_2 is:
- attached to file_1
- sorted according to file_1 following the order of file_1. It is just important to keep the order in which the name appears in file_1. I tried with a getline but I feel like there is something I can't unlock
He takes.

If you have an idea, thank you very much!
 
It is not the same problem because Awk must join the short name "B" to the extended name "B_1". We are
file_1:
A
B
C

file_2:
A_1
C_2
C_1
B_4
A_2
A_3
B_1

output:
A....A_1
......A_2
......A_3
B.....B_4
.......B_1
C....C_2
......C_1
The order of appearance in file_1 is respected and the order of appearance in file_2 is respected. It's not really a join of two identical columns are merged.


 
I modified the old script I mentioned above and got this result:
Code:
$ awk -f judkil_join.awk judkil_file1.txt judkil_file2.txt
AB-00050832 .. AB-00050832-18.....1....-8.900758
........... .. AB-00050832-24.....3....-7.634420
........... .. AB-00050832-12....14....-7.038594
AB-00058394 .. AB-00058394-10.....2....-7.981418
........... .. AB-00058394-6.......6....-7.383604
AB-00050862 .. AB-00050862-10.....4....-7.621671
........... .. AB-00050862-6.....50....-6.701126
AB-00004123 .. AB-00004123-1......5....-7.386272
Done.
Is this the result you need?
On the command line I changed the order of the files: so my judkil_file1.txt is your file_2
and my judkil_file2.txt is your file_1
 
Hi judkil,
Is your problem already solved ? Here is my script from yesterday:
Code:
[COLOR=#0000ff]# Run:[/color]
[COLOR=#0000ff]# awk -f judkil_join.awk judkil_file1.txt judkil_file2.txt[/color]
[COLOR=#6a5acd]BEGIN[/color] {
}
{ 
  [COLOR=#a52a2a][b]if[/b][/color] ([COLOR=#6a5acd]FILENAME[/color] [COLOR=#6a5acd]==[/color] [COLOR=#6a5acd]ARGV[/color][[COLOR=#6a5acd]1[/color]]) {
    [COLOR=#0000ff]# get key from line[/color]
    key[COLOR=#6a5acd]=[/color][COLOR=#008b8b]substr[/color]([COLOR=#6a5acd]$0[/color][COLOR=#6a5acd],[/color][COLOR=#ff00ff]1[/color][COLOR=#6a5acd],[/color][COLOR=#ff00ff]11[/color])
    [COLOR=#0000ff]# add line from 1.file into array[/color]
    [COLOR=#a52a2a][b]if[/b][/color] (line_array[[COLOR=#6a5acd]key[/color]]) {
      line_array[[COLOR=#6a5acd]key[/color]] [COLOR=#6a5acd]=[/color] line_array[[COLOR=#6a5acd]key[/color]] [COLOR=#ff00ff]";"[/color] [COLOR=#6a5acd]$0[/color]
    } 
    [COLOR=#a52a2a][b]else[/b][/color] {
      line_array[[COLOR=#6a5acd]key[/color]] [COLOR=#6a5acd]=[/color] [COLOR=#6a5acd]$0[/color]
    }
  }
  [COLOR=#a52a2a][b]if[/b][/color] ([COLOR=#6a5acd]FILENAME[/color] [COLOR=#6a5acd]==[/color] [COLOR=#6a5acd]ARGV[/color][[COLOR=#6a5acd]2[/color]]) {
    [COLOR=#0000ff]# print adequate lines from 2.file[/color]
    [COLOR=#a52a2a][b]if[/b][/color] (line_array[[COLOR=#6a5acd]$1[/color]]) {
      [COLOR=#0000ff]#print line_array[$1][/color]
      print_list_of_lines([COLOR=#6a5acd]$1[/color][COLOR=#6a5acd],[/color] line_array[[COLOR=#6a5acd]$1[/color]])
    }
  }
}
[COLOR=#6a5acd]END[/color] {
  [COLOR=#a52a2a][b]print[/b][/color] [COLOR=#ff00ff]"Done."[/color]
}

[COLOR=#a52a2a][b]function[/b][/color] print_list_of_lines(key[COLOR=#6a5acd],[/color] my_list) {
  [COLOR=#0000ff]# prints list of lines separated by ;[/color]
  n[COLOR=#6a5acd]=[/color][COLOR=#008b8b]split[/color](my_list[COLOR=#6a5acd],[/color]my_array[COLOR=#6a5acd],[/color][COLOR=#ff00ff]";"[/color])
  [COLOR=#a52a2a][b]for[/b][/color](i[COLOR=#6a5acd]=[/color][COLOR=#ff00ff]1[/color][COLOR=#6a5acd];[/color] i [COLOR=#6a5acd]<=[/color] n[COLOR=#6a5acd];[/color] i[COLOR=#6a5acd]++[/color]) {
    line [COLOR=#6a5acd]=[/color] my_array[[COLOR=#6a5acd]i[/color]]
    [COLOR=#a52a2a][b]if[/b][/color] (i[COLOR=#6a5acd]==[/color][COLOR=#ff00ff]1[/color]) {
      line_begin [COLOR=#6a5acd]=[/color] key
    }
    [COLOR=#a52a2a][b]else[/b][/color] {
     line_begin [COLOR=#6a5acd]=[/color] [COLOR=#ff00ff]"..........."[/color]
    }
    [COLOR=#a52a2a][b]printf[/b][/color]([COLOR=#ff00ff]"[/color][COLOR=#6a5acd]%11s[/color][COLOR=#ff00ff] .. [/color][COLOR=#6a5acd]%s[/color][COLOR=#6a5acd]\n[/color][COLOR=#ff00ff]"[/color][COLOR=#6a5acd],[/color] line_begin[COLOR=#6a5acd],[/color] line)
  }
}
 
Thank you very much !

However I would like to be able to put any size of pattern and not limited to a length of 11. What to put in the substr ?
 
For the simplicity I took the pattern using substr() function:
Code:
  key=substr($0,1,11)

Other more flexible option, if you have a line like this:
Code:
AB-00050832-18.....1....-8.900758
then you can split it into an array and take the first 2 elements, like this:
Code:
split($0,key_array,"-")
key = key_array[1] "-" key_array[2]
So in case of
Code:
AB-00050832-18.....1....-8.900758
you will get the key value
Code:
AB-00050832
and e.g. for the case of
Code:
AB-VERY_LONG_KEY-18.....1....-8.900758
you will get the key value
Code:
AB-VERY_LONG_KEY

 

Thank you very much ! However, I have a syntax problem when I replace
Code:
 key = key_array[1] "-" key_array[2]
by
Code:
 key = 1[1] "-" 1[2]
The name being in the fields 1.
 
There isn't a problem.

Here is the modified soure:
judkil_join.awk
Code:
[COLOR=#0000ff]# Run:[/color]
[COLOR=#0000ff]# awk -f judkil_join.awk judkil_file1.txt judkil_file2.txt[/color]
[COLOR=#6a5acd]BEGIN[/color] {
}
{ 
  [COLOR=#a52a2a][b]if[/b][/color] ([COLOR=#6a5acd]FILENAME[/color] [COLOR=#6a5acd]==[/color] [COLOR=#6a5acd]ARGV[/color][[COLOR=#6a5acd]1[/color]]) {
    [COLOR=#0000ff]# get key from line[/color]
    [COLOR=#008b8b]split[/color]([COLOR=#6a5acd]$0[/color][COLOR=#6a5acd],[/color]key_array[COLOR=#6a5acd],[/color][COLOR=#ff00ff]"-"[/color])
    key [COLOR=#6a5acd]=[/color] key_array[[COLOR=#6a5acd]1[/color]] [COLOR=#ff00ff]"-"[/color] key_array[[COLOR=#6a5acd]2[/color]] 
    [COLOR=#0000ff]# add line from 1.file into array[/color]
    [COLOR=#a52a2a][b]if[/b][/color] (line_array[[COLOR=#6a5acd]key[/color]]) {
      line_array[[COLOR=#6a5acd]key[/color]] [COLOR=#6a5acd]=[/color] line_array[[COLOR=#6a5acd]key[/color]] [COLOR=#ff00ff]";"[/color] [COLOR=#6a5acd]$0[/color]
    } 
    [COLOR=#a52a2a][b]else[/b][/color] {
      line_array[[COLOR=#6a5acd]key[/color]] [COLOR=#6a5acd]=[/color] [COLOR=#6a5acd]$0[/color]
    }
  }
  [COLOR=#a52a2a][b]if[/b][/color] ([COLOR=#6a5acd]FILENAME[/color] [COLOR=#6a5acd]==[/color] [COLOR=#6a5acd]ARGV[/color][[COLOR=#6a5acd]2[/color]]) {
    [COLOR=#0000ff]# print adequate lines from 2.file[/color]
    [COLOR=#a52a2a][b]if[/b][/color] (line_array[[COLOR=#6a5acd]$1[/color]]) {
      print_list_of_lines([COLOR=#6a5acd]$1[/color][COLOR=#6a5acd],[/color] line_array[[COLOR=#6a5acd]$1[/color]])
    }
  }
}
[COLOR=#6a5acd]END[/color] {
  [COLOR=#a52a2a][b]print[/b][/color] [COLOR=#ff00ff]"Done."[/color]
}

[COLOR=#a52a2a][b]function[/b][/color] print_list_of_lines(key[COLOR=#6a5acd],[/color] my_list) {
  [COLOR=#0000ff]# prints list of lines separated by ;[/color]
  n[COLOR=#6a5acd]=[/color][COLOR=#008b8b]split[/color](my_list[COLOR=#6a5acd],[/color]my_array[COLOR=#6a5acd],[/color][COLOR=#ff00ff]";"[/color])
  [COLOR=#a52a2a][b]for[/b][/color](i[COLOR=#6a5acd]=[/color][COLOR=#ff00ff]1[/color][COLOR=#6a5acd];[/color] i [COLOR=#6a5acd]<=[/color] n[COLOR=#6a5acd];[/color] i[COLOR=#6a5acd]++[/color]) {
    line [COLOR=#6a5acd]=[/color] my_array[[COLOR=#6a5acd]i[/color]]
    [COLOR=#a52a2a][b]if[/b][/color] (i[COLOR=#6a5acd]==[/color][COLOR=#ff00ff]1[/color]) {
      line_begin [COLOR=#6a5acd]=[/color] key
    }
    [COLOR=#a52a2a][b]else[/b][/color] {
     line_begin [COLOR=#6a5acd]=[/color] [COLOR=#ff00ff]"..........."[/color]
    }
    [COLOR=#a52a2a][b]printf[/b][/color]([COLOR=#ff00ff]"[/color][COLOR=#6a5acd]%-20s[/color][COLOR=#ff00ff] .. [/color][COLOR=#6a5acd]%s[/color][COLOR=#6a5acd]\n[/color][COLOR=#ff00ff]"[/color][COLOR=#6a5acd],[/color] line_begin[COLOR=#6a5acd],[/color] line)
  }
}

In the files I added some lines with long key:
judkil_file1.txt
Code:
AB-00050832-18.....1....-8.900758
AB-00058394-10.....2....-7.981418
AB-00050832-24.....3....-7.634420
AB-VERY_LONG_KEY-8......4....5.55
AB-00050862-10.....4....-7.621671
AB-00004123-1......5....-7.386272
AB-00058394-6.......6....-7.383604
AB-00050832-12....14....-7.038594
AB-VERY_LONG_KEY-7......3....4.44
AB-00050862-6.....50....-6.701126

judkil_file2.txt
Code:
AB-00050832
AB-VERY_LONG_KEY
AB-00058394
AB-00050862
AB-00004123

Now, when I run it I get this result:
Code:
$ awk -f judkil_join.awk judkil_file1.txt judkil_file2.txt
AB-00050832          .. AB-00050832-18.....1....-8.900758
...........          .. AB-00050832-24.....3....-7.634420
...........          .. AB-00050832-12....14....-7.038594
AB-VERY_LONG_KEY     .. AB-VERY_LONG_KEY-8......4....5.55
...........          .. AB-VERY_LONG_KEY-7......3....4.44
AB-00058394          .. AB-00058394-10.....2....-7.981418
...........          .. AB-00058394-6.......6....-7.383604
AB-00050862          .. AB-00050862-10.....4....-7.621671
...........          .. AB-00050862-6.....50....-6.701126
AB-00004123          .. AB-00004123-1......5....-7.386272
Done.
 

But key_array ,t's a field ($1) or just a number
Code:
 split($0,$1,"-")
key = $1[1] "-" $1[2]

 
This, what you try is complete wrong:
Code:
split($0,$1,"-")
key = $1[1] "-" $1[2]
Maybe you have to read in the manual how the the split function works:
Look at the source above, where I'm using
Code:
split($0,key_array,"-")
key = key_array[1] "-" key_array[2]
and try it to see how it works.
 
I just understood "key_array" a variable name. I thought it was a field ($1) or a number.

Thank you very much !
 
key_array is the name I gave to the array which will be created by the split() function.
For example, if we have this line i.e. this string in the field $0:
"AB-VERY_LONG_KEY-8......4....5.55"
then
split($0,key_array,"-")
creates this array of 3 strings
key_array = ("AB" | "VERY_LONG_KEY" | "8......4....5.55")
i.e.
key_array[1] = "AB"
key_array[2] = "VERY_LONG_KEY"
key_array[3] = "8......4....5.55"
and then we create the key with concatenation
key = key_array[1] "-" key_array[2]
with the result
key = "AB-VERY_LONG_KEY"

I hope that I helped you to understand, how to use awk to solve your problems :)

 
I understand much better and especially to overcome the substr function!

Thank you! ;)
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top