Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations IamaSherpa on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Append columns in order based on $1

Status
Not open for further replies.

fidge999

Technical User
Jul 15, 2013
8
GB
Hi forum,

I've been away from awk in the last few years but have recently come back to it and am struggling [dazed]. Can anyone help with the following?

Summary
I have a ascii file containing 20million+ lines.
The FS = <space>
Field 1 is always an accountid. There are multiple lines per accountid
Field 2 is always a type. (not sorted in alphabetical order)
Field 3 is the value (as per field 2 - not sorted)

Outcome
I want to generate an CSV output file that appends all details for each unique accountid onto 1 line. The 'values' need to be written in order even though the input may not be in order.

Sample Input File
accountid type value
12345 A 100
12345 B 200
12345 C 500
12345 D 900
12345 E 1000
23456 B 200
23456 D 300
23456 A 400
23456 E 500
23456 C 600

Expected Output
accountid,typeA,typeB,typeC,typeD,typeE
12345 100,200,500,900,1000
23456 400,200,600,300,500

Here's what I have currently;
Code:
nawk '{ if ( substr($1,1,1) => '1' ) {

                accountid=$1

                if ( $2 == "A" ) 
                        TA=$3
                else if ( $2 == "B" ) 
                        TB=$3
                else if ( $2 == "C" ) 
                        TC=$3
                else if ( $2 == "D" )
                        TD=$3
                else if ( $2 == "E" )
                        TE=$3

                if ((getline tmp) > 0) {
                        if (substr(tmp,1,6) != accountid ) {
                                print accountid ","TA","TB","TC","TD","TE 
                                close(tmp)

                        }
                }
        }
        
}' sample.file

The problem I'm having is that its skipping lines, so some of the 'values' aren't updated for the current accountid.
I suspect I'm making it way to complicated than it needs to be so if anyone can help it would be much appreciated.

thanks.
 
Maybe something like this could help you to start.
Code:
[COLOR=#0000ff]# Run: awk -f fidge999.awk fidge999.txt[/color]
[COLOR=#6a5acd]BEGIN[/color] {
  [COLOR=#6a5acd]FS[/color] = [COLOR=#ff00ff]" "[/color]
}

{ 
  [COLOR=#0000ff]# chomp current line[/color]
  chomp_line()
  [COLOR=#0000ff]#[/color]
  k = [COLOR=#6a5acd]$1[/color]
  v = [COLOR=#6a5acd]$3[/color]
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]"'"[/color] k [COLOR=#ff00ff]"' -> '"[/color] v [COLOR=#ff00ff]"'"[/color]
  [COLOR=#0000ff]#[/color]
  [COLOR=#804040][b]if[/b][/color] (values[[COLOR=#6a5acd]k[/color]] == [COLOR=#ff00ff]""[/color]) {
    values[[COLOR=#6a5acd]k[/color]] = v
  }
  [COLOR=#804040][b]else[/b][/color] {
    values[[COLOR=#6a5acd]k[/color]] = values[[COLOR=#6a5acd]k[/color]] [COLOR=#ff00ff]","[/color] v
  }

}

[COLOR=#6a5acd]END[/color] {
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]"Number of lines processed "[/color] [COLOR=#6a5acd]NR[/color]
  [COLOR=#0000ff]#[/color]
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]"All values unsorted:"[/color]
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]""[/color]

  [COLOR=#804040][b]for[/b][/color] (k in values) {
    [COLOR=#804040][b]print[/b][/color] k [COLOR=#ff00ff]"->"[/color] values[[COLOR=#6a5acd]k[/color]]
  }
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]""[/color]

  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]"All keys:"[/color]
  i=[COLOR=#ff00ff]0[/color]
  [COLOR=#804040][b]for[/b][/color] (k in values) {
    [COLOR=#804040][b]print[/b][/color] k
    i++
    keys[[COLOR=#6a5acd]i[/color]] = k
  }
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]""[/color]
  
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]"All sorted keys:"[/color]
  nr_keys = [COLOR=#008080]asort[/color](keys)
  [COLOR=#804040][b]for[/b][/color](i=[COLOR=#ff00ff]1[/color][COLOR=#6a5acd];[/color] i <= nr_keys[COLOR=#6a5acd];[/color] i++) {
    [COLOR=#804040][b]print[/b][/color] keys[[COLOR=#6a5acd]i[/color]]
  }
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]""[/color]

  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]"All values sorted:"[/color]
  [COLOR=#804040][b]for[/b][/color] (j=[COLOR=#ff00ff]1[/color][COLOR=#6a5acd];[/color] j <= nr_keys[COLOR=#6a5acd];[/color] j++) {
    [COLOR=#0000ff]# key sorted[/color]
    k = keys[[COLOR=#6a5acd]j[/color]]
    [COLOR=#0000ff]# split string into array[/color]
    [COLOR=#008080]split[/color](values[[COLOR=#6a5acd]k[/color]][COLOR=#6a5acd],[/color] values_lst[COLOR=#6a5acd],[/color] [COLOR=#ff00ff]","[/color])
    [COLOR=#0000ff]# sort the array[/color]
    n = [COLOR=#008080]asort[/color](values_lst)
    [COLOR=#0000ff]# join values_lst back to the values_str[/color]
    values_str = [COLOR=#ff00ff]""[/color]
    [COLOR=#804040][b]for[/b][/color](i=[COLOR=#ff00ff]1[/color][COLOR=#6a5acd];[/color] i <= n[COLOR=#6a5acd];[/color] i++) {
      [COLOR=#804040][b]if[/b][/color] (i==[COLOR=#ff00ff]1[/color]) {
        values_str = values_lst[[COLOR=#6a5acd]i[/color]]
      } 
      [COLOR=#804040][b]else[/b][/color] {
        values_str = values_str [COLOR=#ff00ff]","[/color] values_lst[[COLOR=#6a5acd]i[/color]]
      }
    }
    [COLOR=#0000ff]#[/color]
    values[[COLOR=#6a5acd]k[/color]] = values_str
    [COLOR=#804040][b]print[/b][/color] k [COLOR=#ff00ff]"->"[/color] values[[COLOR=#6a5acd]k[/color]]
  }
}

[COLOR=#0000ff]#[/color]
[COLOR=#804040][b]function[/b][/color] chomp_line() {
  [COLOR=#0000ff]# strip out the carriage return or line feed at the end of current line[/color]
  [COLOR=#0000ff]# the function modifies global variable $0 (current line)[/color]
  [COLOR=#008080]sub[/color]([COLOR=#ff00ff]/[/color][COLOR=#6a5acd]\r[/color][COLOR=#6a5acd]$[/color][COLOR=#ff00ff]/[/color][COLOR=#6a5acd],[/color] [COLOR=#ff00ff]""[/color][COLOR=#6a5acd],[/color] [COLOR=#6a5acd]$0[/color])
  [COLOR=#008080]sub[/color]([COLOR=#ff00ff]/[/color][COLOR=#6a5acd]\n[/color][COLOR=#6a5acd]$[/color][COLOR=#ff00ff]/[/color][COLOR=#6a5acd],[/color] [COLOR=#ff00ff]""[/color][COLOR=#6a5acd],[/color] [COLOR=#6a5acd]$0[/color])
}

Usage:
Code:
$ awk -f fidge999.awk fidge999.txt
'23456' -> '200'
'23456' -> '400'
'23456' -> '300'
'23456' -> '600'
'23456' -> '500'
'52312' -> '100'
'52312' -> '500'
'52312' -> '1000'
'52312' -> '200'
'52312' -> '900'
'12345' -> '100'
'12345' -> '500'
'12345' -> '1000'
'12345' -> '200'
'12345' -> '900'
Number of lines processed 15
All values unsorted:

52312->100,500,1000,200,900
12345->100,500,1000,200,900
23456->200,400,300,600,500

All keys:
52312
12345
23456

All sorted keys:
12345
23456
52312

All values sorted:
12345->100,200,500,900,1000
23456->200,300,400,500,600
52312->100,200,500,900,1000
 
I forgot to say that I used this input data
fidge999.txt
Code:
23456 B 200
23456 A 400
23456 D 300
23456 C 600
23456 E 500
52312 A 100
52312 C 500
52312 E 1000
52312 B 200
52312 D 900
12345 A 100
12345 C 500
12345 E 1000
12345 B 200
12345 D 900
 
Thanks Mikrom. i think that'll give me something to go on with. Many thanks.
 
Hi Mikrom, unfortunatly the server i'm using is running Solaris and there is no way to update awk with GNU, so the "asort" function does not work.

Do you know of another way to sort?

thanks.
 
Here is the alternative with sort statement

Code:
[COLOR=#0000ff]# Run: [/color]
[COLOR=#0000ff]# sort --key=1,3 fidge999.txt | awk -f fidge999.awk[/color]
[COLOR=#6a5acd]BEGIN[/color] {
  [COLOR=#6a5acd]FS[/color] = [COLOR=#ff00ff]" "[/color]
  nk=[COLOR=#ff00ff]0[/color]
}

{ 
  [COLOR=#0000ff]# chomp current line[/color]
  chomp_line()
  [COLOR=#0000ff]#[/color]
  k = [COLOR=#6a5acd]$1[/color]
  v = [COLOR=#6a5acd]$3[/color]
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]"key, val: '"[/color] k [COLOR=#ff00ff]"' -> '"[/color] v [COLOR=#ff00ff]"'"[/color]
  [COLOR=#0000ff]#[/color]
  [COLOR=#804040][b]if[/b][/color] (values[[COLOR=#6a5acd]k[/color]] == [COLOR=#ff00ff]""[/color]) {
    values[[COLOR=#6a5acd]k[/color]] = v
    nk++
    keys[[COLOR=#6a5acd]nk[/color]] = k
  }
  [COLOR=#804040][b]else[/b][/color] {
    values[[COLOR=#6a5acd]k[/color]] = values[[COLOR=#6a5acd]k[/color]] [COLOR=#ff00ff]","[/color] v
  }

}

[COLOR=#6a5acd]END[/color] {
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]"Number of lines processed "[/color] [COLOR=#6a5acd]NR[/color]
  [COLOR=#0000ff]#[/color]
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]"All keys:"[/color]
  [COLOR=#804040][b]for[/b][/color] (i=[COLOR=#ff00ff]1[/color][COLOR=#6a5acd];[/color] i <= nk[COLOR=#6a5acd];[/color] i++) {
    [COLOR=#804040][b]print[/b][/color] keys[[COLOR=#6a5acd]i[/color]]
  }
  
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]"All values:"[/color]
  [COLOR=#804040][b]for[/b][/color] (i=[COLOR=#ff00ff]1[/color][COLOR=#6a5acd];[/color]  i <= nk[COLOR=#6a5acd];[/color] i++) {
    k = keys[[COLOR=#6a5acd]i[/color]]
    [COLOR=#804040][b]print[/b][/color] k [COLOR=#ff00ff]"->"[/color] values[[COLOR=#6a5acd]k[/color]]
  }
}

[COLOR=#0000ff]#[/color]
[COLOR=#804040][b]function[/b][/color] chomp_line() {
  [COLOR=#0000ff]# strip out the carriage return or line feed at the end of current line[/color]
  [COLOR=#0000ff]# the function modifies global variable $0 (current line)[/color]
  [COLOR=#008080]sub[/color]([COLOR=#ff00ff]/[/color][COLOR=#6a5acd]\r[/color][COLOR=#6a5acd]$[/color][COLOR=#ff00ff]/[/color][COLOR=#6a5acd],[/color] [COLOR=#ff00ff]""[/color][COLOR=#6a5acd],[/color] [COLOR=#6a5acd]$0[/color])
  [COLOR=#008080]sub[/color]([COLOR=#ff00ff]/[/color][COLOR=#6a5acd]\n[/color][COLOR=#6a5acd]$[/color][COLOR=#ff00ff]/[/color][COLOR=#6a5acd],[/color] [COLOR=#ff00ff]""[/color][COLOR=#6a5acd],[/color] [COLOR=#6a5acd]$0[/color])
}

Usage
Code:
$ sort --key=1,3 fidge999.txt | awk -f fidge999.awk
key, val: '12345' -> '100'
key, val: '12345' -> '200'
key, val: '12345' -> '500'
key, val: '12345' -> '900'
key, val: '12345' -> '1000'
key, val: '23456' -> '400'
key, val: '23456' -> '200'
key, val: '23456' -> '600'
key, val: '23456' -> '300'
key, val: '23456' -> '500'
key, val: '52312' -> '100'
key, val: '52312' -> '200'
key, val: '52312' -> '500'
key, val: '52312' -> '900'
key, val: '52312' -> '1000'
Number of lines processed 15
All keys:
12345
23456
52312
All values:
12345->100,200,500,900,1000
23456->400,200,600,300,500
52312->100,200,500,900,1000
 
Now I see the values with key 23456 are not well sorted:
23456->400,200,600,300,500
I thought the sort function does it properly :-(
 
This seems to work
Code:
$ sort -n +0 +2 fidge999.txt | awk -f fidge999.awk
key, val: '12345' -> '100'
key, val: '12345' -> '200'
key, val: '12345' -> '500'
key, val: '12345' -> '900'
key, val: '12345' -> '1000'
key, val: '23456' -> '200'
key, val: '23456' -> '300'
key, val: '23456' -> '400'
key, val: '23456' -> '500'
key, val: '23456' -> '600'
key, val: '52312' -> '100'
key, val: '52312' -> '200'
key, val: '52312' -> '500'
key, val: '52312' -> '900'
key, val: '52312' -> '1000'
Number of lines processed 15
All keys:
12345
23456
52312
All values:
12345->100,200,500,900,1000
23456->200,300,400,500,600
52312->100,200,500,900,1000
 
Hi fidge999.
Just out of curiosity: does the approach with sort command work in Solaris or not?
 
hi Mikrom, I'll let you know later today. I'm not sure if the sort approach before awk will work given the size of my file (20m+ lines, 8GB size). I'll give it a try.

thanks.
 
Hi

fidge999 said:
20m+ lines, 8GB size
If you will have problem with the file size, maybe try this one. Is abit ugly and not flexible, but should work :
Code:
awk 'l!=$1&&l{print l,t["A"]","t["B"]","t["C"]","t["D"]","t["E"];delete t}{l=$1;t[$2]=$3}END{print l,t["A"]","t["B"]","t["C"]","t["D"]","t["E"]}' sample.file
Tested with [tt]gawk[/tt], [tt]mawk[/tt] and original [tt]awk[/tt].


Feherke.
feherke.ga
 

Perhaps there is a simpler way?
Code:
==> cat m1
awk 'BEGIN{PROCINFO["@ind_str_asc"]}
function out_line() {
  O[k0]=""
  for (i in L) {O[k0]=O[k0]","L[i];delete L[i];}
}
{k=$1;
if (k0!=""&&k!=k0) out_line();
k0=k;L[$2]=$3;
}
END {out_line();for(i in O)print i O[i]}
' input.txt

==> ./m1
12345,100,200,500,900,1000
23456,400,200,600,300,500
52312,100,200,500,900,1000

==>
[3eyes]

----------------------------------------------------------------------------
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
 
Hi

Nice one, LKBrwnDBA, but I afraid the OP's this sentence voids it :
fidge999 said:
unfortunatly the server i'm using is running Solaris and there is no way to update awk with GNU

man gawk said:
[tt]The following features of gawk are not available in POSIX awk.
· The PROCINFO array is not available.[/tt]
As far as I know, neither [tt]nawk[/tt] has it.


Feherke.
feherke.ga
 
Ooops, missed that.
Perhaps using the nawk (or gawk) "asort()" or "asorti()" functions could do it?
Been a long time since I worked on Solaris, my current OS's are 99% Aix and 1% Redhat Linux.
[thumbsup2]



----------------------------------------------------------------------------
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
 
thanks Feherke & LKBrwnDBA for your suggestions. I'll give it a try

unfortunately the unix sort option did not work.

cheers all
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top