Append columns in order based on $1

fidge999 · Jul 4, 2014

Hi forum,

I've been away from awk in the last few years but have recently come back to it and am struggling [dazed]

. Can anyone help with the following?

Summary
I have a ascii file containing 20million+ lines.
The FS = <space>
Field 1 is always an accountid. There are multiple lines per accountid
Field 2 is always a type. (not sorted in alphabetical order)
Field 3 is the value (as per field 2 - not sorted)

Outcome
I want to generate an CSV output file that appends all details for each unique accountid onto 1 line. The 'values' need to be written in order even though the input may not be in order.

Sample Input File
accountid type value
12345 A 100
12345 B 200
12345 C 500
12345 D 900
12345 E 1000
23456 B 200
23456 D 300
23456 A 400
23456 E 500
23456 C 600

Expected Output
accountid,typeA,typeB,typeC,typeD,typeE
12345 100,200,500,900,1000
23456 400,200,600,300,500

Here's what I have currently;

Code:

nawk '{ if ( substr($1,1,1) => '1' ) {

                accountid=$1

                if ( $2 == "A" ) 
                        TA=$3
                else if ( $2 == "B" ) 
                        TB=$3
                else if ( $2 == "C" ) 
                        TC=$3
                else if ( $2 == "D" )
                        TD=$3
                else if ( $2 == "E" )
                        TE=$3

                if ((getline tmp) > 0) {
                        if (substr(tmp,1,6) != accountid ) {
                                print accountid ","TA","TB","TC","TD","TE 
                                close(tmp)

                        }
                }
        }
        
}' sample.file

The problem I'm having is that its skipping lines, so some of the 'values' aren't updated for the current accountid.
I suspect I'm making it way to complicated than it needs to be so if anyone can help it would be much appreciated.

thanks.

mikrom · Jul 4, 2014

Maybe something like this could help you to start.

Code:

[COLOR=#0000ff]# Run: awk -f fidge999.awk fidge999.txt[/color]
[COLOR=#6a5acd]BEGIN[/color] {
  [COLOR=#6a5acd]FS[/color] = [COLOR=#ff00ff]" "[/color]
}

{ 
  [COLOR=#0000ff]# chomp current line[/color]
  chomp_line()
  [COLOR=#0000ff]#[/color]
  k = [COLOR=#6a5acd]$1[/color]
  v = [COLOR=#6a5acd]$3[/color]
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]"'"[/color] k [COLOR=#ff00ff]"' -> '"[/color] v [COLOR=#ff00ff]"'"[/color]
  [COLOR=#0000ff]#[/color]
  [COLOR=#804040][b]if[/b][/color] (values[[COLOR=#6a5acd]k[/color]] == [COLOR=#ff00ff]""[/color]) {
    values[[COLOR=#6a5acd]k[/color]] = v
  }
  [COLOR=#804040][b]else[/b][/color] {
    values[[COLOR=#6a5acd]k[/color]] = values[[COLOR=#6a5acd]k[/color]] [COLOR=#ff00ff]","[/color] v
  }

}

[COLOR=#6a5acd]END[/color] {
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]"Number of lines processed "[/color] [COLOR=#6a5acd]NR[/color]
  [COLOR=#0000ff]#[/color]
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]"All values unsorted:"[/color]
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]""[/color]

  [COLOR=#804040][b]for[/b][/color] (k in values) {
    [COLOR=#804040][b]print[/b][/color] k [COLOR=#ff00ff]"->"[/color] values[[COLOR=#6a5acd]k[/color]]
  }
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]""[/color]

  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]"All keys:"[/color]
  i=[COLOR=#ff00ff]0[/color]
  [COLOR=#804040][b]for[/b][/color] (k in values) {
    [COLOR=#804040][b]print[/b][/color] k
    i++
    keys[[COLOR=#6a5acd]i[/color]] = k
  }
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]""[/color]
  
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]"All sorted keys:"[/color]
  nr_keys = [COLOR=#008080]asort[/color](keys)
  [COLOR=#804040][b]for[/b][/color](i=[COLOR=#ff00ff]1[/color][COLOR=#6a5acd];[/color] i <= nr_keys[COLOR=#6a5acd];[/color] i++) {
    [COLOR=#804040][b]print[/b][/color] keys[[COLOR=#6a5acd]i[/color]]
  }
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]""[/color]

  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]"All values sorted:"[/color]
  [COLOR=#804040][b]for[/b][/color] (j=[COLOR=#ff00ff]1[/color][COLOR=#6a5acd];[/color] j <= nr_keys[COLOR=#6a5acd];[/color] j++) {
    [COLOR=#0000ff]# key sorted[/color]
    k = keys[[COLOR=#6a5acd]j[/color]]
    [COLOR=#0000ff]# split string into array[/color]
    [COLOR=#008080]split[/color](values[[COLOR=#6a5acd]k[/color]][COLOR=#6a5acd],[/color] values_lst[COLOR=#6a5acd],[/color] [COLOR=#ff00ff]","[/color])
    [COLOR=#0000ff]# sort the array[/color]
    n = [COLOR=#008080]asort[/color](values_lst)
    [COLOR=#0000ff]# join values_lst back to the values_str[/color]
    values_str = [COLOR=#ff00ff]""[/color]
    [COLOR=#804040][b]for[/b][/color](i=[COLOR=#ff00ff]1[/color][COLOR=#6a5acd];[/color] i <= n[COLOR=#6a5acd];[/color] i++) {
      [COLOR=#804040][b]if[/b][/color] (i==[COLOR=#ff00ff]1[/color]) {
        values_str = values_lst[[COLOR=#6a5acd]i[/color]]
      } 
      [COLOR=#804040][b]else[/b][/color] {
        values_str = values_str [COLOR=#ff00ff]","[/color] values_lst[[COLOR=#6a5acd]i[/color]]
      }
    }
    [COLOR=#0000ff]#[/color]
    values[[COLOR=#6a5acd]k[/color]] = values_str
    [COLOR=#804040][b]print[/b][/color] k [COLOR=#ff00ff]"->"[/color] values[[COLOR=#6a5acd]k[/color]]
  }
}

[COLOR=#0000ff]#[/color]
[COLOR=#804040][b]function[/b][/color] chomp_line() {
  [COLOR=#0000ff]# strip out the carriage return or line feed at the end of current line[/color]
  [COLOR=#0000ff]# the function modifies global variable $0 (current line)[/color]
  [COLOR=#008080]sub[/color]([COLOR=#ff00ff]/[/color][COLOR=#6a5acd]\r[/color][COLOR=#6a5acd]$[/color][COLOR=#ff00ff]/[/color][COLOR=#6a5acd],[/color] [COLOR=#ff00ff]""[/color][COLOR=#6a5acd],[/color] [COLOR=#6a5acd]$0[/color])
  [COLOR=#008080]sub[/color]([COLOR=#ff00ff]/[/color][COLOR=#6a5acd]\n[/color][COLOR=#6a5acd]$[/color][COLOR=#ff00ff]/[/color][COLOR=#6a5acd],[/color] [COLOR=#ff00ff]""[/color][COLOR=#6a5acd],[/color] [COLOR=#6a5acd]$0[/color])
}

Usage:

Code:

$ awk -f fidge999.awk fidge999.txt
'23456' -> '200'
'23456' -> '400'
'23456' -> '300'
'23456' -> '600'
'23456' -> '500'
'52312' -> '100'
'52312' -> '500'
'52312' -> '1000'
'52312' -> '200'
'52312' -> '900'
'12345' -> '100'
'12345' -> '500'
'12345' -> '1000'
'12345' -> '200'
'12345' -> '900'
Number of lines processed 15
All values unsorted:

52312->100,500,1000,200,900
12345->100,500,1000,200,900
23456->200,400,300,600,500

All keys:
52312
12345
23456

All sorted keys:
12345
23456
52312

All values sorted:
12345->100,200,500,900,1000
23456->200,300,400,500,600
52312->100,200,500,900,1000

mikrom · Jul 4, 2014

I forgot to say that I used this input data
fidge999.txt

Code:

23456 B 200
23456 A 400
23456 D 300
23456 C 600
23456 E 500
52312 A 100
52312 C 500
52312 E 1000
52312 B 200
52312 D 900
12345 A 100
12345 C 500
12345 E 1000
12345 B 200
12345 D 900

fidge999 · Jul 4, 2014

Thanks Mikrom. i think that'll give me something to go on with. Many thanks.

fidge999 · Jul 4, 2014

Hi Mikrom, unfortunatly the server i'm using is running Solaris and there is no way to update awk with GNU, so the "asort" function does not work.

Do you know of another way to sort?

thanks.

mikrom · Jul 4, 2014

then there are other possibilities:
1. look if there is a sort-statement in solaris' awk version
2. write your own sort function in awk (

http://stackoverflow.com/questions/5342782/sort-associative-array-with-awk)

3. try to preprocess your file using unix sort statement if available (

http://en.wikipedia.org/wiki/Sort_(Unix)

and then process the file with awk
...
etc

mikrom · Jul 4, 2014

Here is the alternative with sort statement

Code:

[COLOR=#0000ff]# Run: [/color]
[COLOR=#0000ff]# sort --key=1,3 fidge999.txt | awk -f fidge999.awk[/color]
[COLOR=#6a5acd]BEGIN[/color] {
  [COLOR=#6a5acd]FS[/color] = [COLOR=#ff00ff]" "[/color]
  nk=[COLOR=#ff00ff]0[/color]
}

{ 
  [COLOR=#0000ff]# chomp current line[/color]
  chomp_line()
  [COLOR=#0000ff]#[/color]
  k = [COLOR=#6a5acd]$1[/color]
  v = [COLOR=#6a5acd]$3[/color]
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]"key, val: '"[/color] k [COLOR=#ff00ff]"' -> '"[/color] v [COLOR=#ff00ff]"'"[/color]
  [COLOR=#0000ff]#[/color]
  [COLOR=#804040][b]if[/b][/color] (values[[COLOR=#6a5acd]k[/color]] == [COLOR=#ff00ff]""[/color]) {
    values[[COLOR=#6a5acd]k[/color]] = v
    nk++
    keys[[COLOR=#6a5acd]nk[/color]] = k
  }
  [COLOR=#804040][b]else[/b][/color] {
    values[[COLOR=#6a5acd]k[/color]] = values[[COLOR=#6a5acd]k[/color]] [COLOR=#ff00ff]","[/color] v
  }

}

[COLOR=#6a5acd]END[/color] {
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]"Number of lines processed "[/color] [COLOR=#6a5acd]NR[/color]
  [COLOR=#0000ff]#[/color]
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]"All keys:"[/color]
  [COLOR=#804040][b]for[/b][/color] (i=[COLOR=#ff00ff]1[/color][COLOR=#6a5acd];[/color] i <= nk[COLOR=#6a5acd];[/color] i++) {
    [COLOR=#804040][b]print[/b][/color] keys[[COLOR=#6a5acd]i[/color]]
  }
  
  [COLOR=#804040][b]print[/b][/color] [COLOR=#ff00ff]"All values:"[/color]
  [COLOR=#804040][b]for[/b][/color] (i=[COLOR=#ff00ff]1[/color][COLOR=#6a5acd];[/color]  i <= nk[COLOR=#6a5acd];[/color] i++) {
    k = keys[[COLOR=#6a5acd]i[/color]]
    [COLOR=#804040][b]print[/b][/color] k [COLOR=#ff00ff]"->"[/color] values[[COLOR=#6a5acd]k[/color]]
  }
}

[COLOR=#0000ff]#[/color]
[COLOR=#804040][b]function[/b][/color] chomp_line() {
  [COLOR=#0000ff]# strip out the carriage return or line feed at the end of current line[/color]
  [COLOR=#0000ff]# the function modifies global variable $0 (current line)[/color]
  [COLOR=#008080]sub[/color]([COLOR=#ff00ff]/[/color][COLOR=#6a5acd]\r[/color][COLOR=#6a5acd]$[/color][COLOR=#ff00ff]/[/color][COLOR=#6a5acd],[/color] [COLOR=#ff00ff]""[/color][COLOR=#6a5acd],[/color] [COLOR=#6a5acd]$0[/color])
  [COLOR=#008080]sub[/color]([COLOR=#ff00ff]/[/color][COLOR=#6a5acd]\n[/color][COLOR=#6a5acd]$[/color][COLOR=#ff00ff]/[/color][COLOR=#6a5acd],[/color] [COLOR=#ff00ff]""[/color][COLOR=#6a5acd],[/color] [COLOR=#6a5acd]$0[/color])
}

Usage

Code:

$ sort --key=1,3 fidge999.txt | awk -f fidge999.awk
key, val: '12345' -> '100'
key, val: '12345' -> '200'
key, val: '12345' -> '500'
key, val: '12345' -> '900'
key, val: '12345' -> '1000'
key, val: '23456' -> '400'
key, val: '23456' -> '200'
key, val: '23456' -> '600'
key, val: '23456' -> '300'
key, val: '23456' -> '500'
key, val: '52312' -> '100'
key, val: '52312' -> '200'
key, val: '52312' -> '500'
key, val: '52312' -> '900'
key, val: '52312' -> '1000'
Number of lines processed 15
All keys:
12345
23456
52312
All values:
12345->100,200,500,900,1000
23456->400,200,600,300,500
52312->100,200,500,900,1000

mikrom · Jul 4, 2014

Now I see the values with key 23456 are not well sorted:
23456->400,200,600,300,500
I thought the sort function does it properly :-(

mikrom · Jul 4, 2014

This seems to work

Code:

$ sort -n +0 +2 fidge999.txt | awk -f fidge999.awk
key, val: '12345' -> '100'
key, val: '12345' -> '200'
key, val: '12345' -> '500'
key, val: '12345' -> '900'
key, val: '12345' -> '1000'
key, val: '23456' -> '200'
key, val: '23456' -> '300'
key, val: '23456' -> '400'
key, val: '23456' -> '500'
key, val: '23456' -> '600'
key, val: '52312' -> '100'
key, val: '52312' -> '200'
key, val: '52312' -> '500'
key, val: '52312' -> '900'
key, val: '52312' -> '1000'
Number of lines processed 15
All keys:
12345
23456
52312
All values:
12345->100,200,500,900,1000
23456->200,300,400,500,600
52312->100,200,500,900,1000

mikrom · Jul 6, 2014

Hi fidge999.
Just out of curiosity: does the approach with sort command work in Solaris or not?

fidge999 · Jul 7, 2014

hi Mikrom, I'll let you know later today. I'm not sure if the sort approach before awk will work given the size of my file (20m+ lines, 8GB size). I'll give it a try.

thanks.

feherke · Jul 7, 2014

Hi

fidge999 said:
20m+ lines, 8GB size

If you will have problem with the file size, maybe try this one. Is abit ugly and not flexible, but should work :

Code:

awk 'l!=$1&&l{print l,t["A"]","t["B"]","t["C"]","t["D"]","t["E"];delete t}{l=$1;t[$2]=$3}END{print l,t["A"]","t["B"]","t["C"]","t["D"]","t["E"]}' sample.file

Tested with [tt]gawk[/tt], [tt]mawk[/tt] and original [tt]awk[/tt].

Feherke.
feherke.ga

LKBrwnDBA · Jul 7, 2014

Perhaps there is a simpler way?

Code:

==> cat m1
awk 'BEGIN{PROCINFO["@ind_str_asc"]}
function out_line() {
  O[k0]=""
  for (i in L) {O[k0]=O[k0]","L[i];delete L[i];}
}
{k=$1;
if (k0!=""&&k!=k0) out_line();
k0=k;L[$2]=$3;
}
END {out_line();for(i in O)print i O[i]}
' input.txt

==> ./m1
12345,100,200,500,900,1000
23456,400,200,600,300,500
52312,100,200,500,900,1000

==>

----------------------------------------------------------------------------
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb

feherke · Jul 7, 2014

Hi

Nice one, LKBrwnDBA, but I afraid the OP's this sentence voids it :

fidge999 said:
unfortunatly the server i'm using is running Solaris and there is no way to update awk with GNU

man gawk said:
[tt]The following features of gawk are not available in POSIX awk.
· The PROCINFO array is not available.[/tt]

As far as I know, neither [tt]nawk[/tt] has it.

Feherke.
feherke.ga

LKBrwnDBA · Jul 7, 2014

Ooops, missed that.
Perhaps using the nawk (or gawk) "asort()" or "asorti()" functions could do it?
Been a long time since I worked on Solaris, my current OS's are 99% Aix and 1% Redhat Linux.
[thumbsup2]

----------------------------------------------------------------------------
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb

fidge999 · Jul 7, 2014

thanks Feherke & LKBrwnDBA for your suggestions. I'll give it a try

unfortunately the unix sort option did not work.

cheers all

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Append columns in order based on $1

fidge999

Technical User

mikrom

Programmer

mikrom

Programmer

fidge999

Technical User

fidge999

Technical User

mikrom

Programmer

mikrom

Programmer

mikrom

Programmer

mikrom

Programmer

mikrom

Programmer

fidge999

Technical User

feherke

Programmer

LKBrwnDBA

MIS

feherke

Programmer

LKBrwnDBA

MIS

fidge999

Technical User

Similar threads

Part and Inventory Search

Sponsor