crosstab with AWK 3

mslider · Jun 24, 2005

Cross tabulations are statistical reports where you de-normalize your data and show results grouped by one field, having one column for each distinct value of a second field.
This cross tabulations are usualy made using SQL language into dababase engine, but I want to fix it using a simple awk script to perform this crosstab analysis.

The Basic problem definition is :

starting from a list of values, we want to group them by field A and create a column for each distinct value of field B:

tag;gsm;count
AAAAAAAAAA;GSM31945;6
AAAAAAAAAA;GSM3240;6
AAAAAACCCA;GSM3242;6
AAAAAAAGCA;GSM3243;6
AAAAAATACA;GSM41375;1
AAAAAATTTA;GSM41375;1
AAAAACACTC;GSM41378;1
AAAAACAGAA;GSM41375;1
AAAAAAAAGG;GSM41376;9
AAAAAAACAA;GSM41376;1
AAAAAAGAAC;GSM41376;1
AAAAAAGCAG;GSM41376;1
AAAAAATTAT;GSM41376;1
AAAAAATTTA;GSM676;6

the desired result is a table with one column for field A, several columns for each value of field B:

tag;GSM31945;GSM3240;GSM3242;GSM3243;GSM41375;GSM41376;GSM41378;GSM676
AAAAAAAAAA;6;6;0;0;0;0;0;0
AAAAAAAAGG;0;0;0;0;0;9;0;0
AAAAAAACAA;0;0;0;0;0;1;0;0
AAAAAAAGCA;0;0;0;6;0;0;0;0
AAAAAACCCA;0;0;6;0;0;0;0;0
AAAAAAGAAC;0;0;0;0;0;1;0;0
AAAAAAGCAG;0;0;0;0;0;1;0;0
AAAAAATACA;0;0;0;0;1;0;0;0
AAAAAATTAT;0;0;0;0;0;1;0;0
AAAAAATTTA;0;0;0;0;1;0;0;6
AAAAACACTC;0;0;0;0;0;0;1;0
AAAAACAGAA;0;0;0;0;1;0;0;0

I try to use statements for scanning a "multi-dimensional" array in awk and I have some troubles in performing that, so I need somebody help to. Maybe you have already coded a piece of code to resolve this problem or some ideas. Please, tell me.

Regards,
Laurent --

duncdude · Jun 24, 2005

Hi Laurent

Don't know why... but i find this bioninformatics stuff very interesting

I can't quite get my head around how these 2 tables marry-up!?

Can you please explain to me

Kind Regards
Duncan

mslider · Jun 24, 2005

Dear Duncan,

these 2 tables don't marry-up!

first you have a tab (or comma) tabulated text file as the
format below:

tag;gsm;count
AAAAAAAAAA;GSM31945;6
AAAAAAAAAA;GSM3240;6
AAAAAACCCA;GSM3242;6
AAAAAAAGCA;GSM3243;6
AAAAAATACA;GSM41375;1
AAAAAATTTA;GSM41375;1
AAAAACACTC;GSM41378;1
AAAAACAGAA;GSM41375;1
AAAAAAAAGG;GSM41376;9
AAAAAAACAA;GSM41376;1
AAAAAAGAAC;GSM41376;1
AAAAAAGCAG;GSM41376;1
AAAAAATTAT;GSM41376;1
AAAAAATTTA;GSM676;6

the awk script (i try to code) take this file as input
and transform it using a crosstab (or pivot) function into
this new file below:

tag;GSM31945;GSM3240;GSM3242;GSM3243;GSM41375;GSM41376;GSM41378;GSM676
AAAAAAAAAA;6;6;0;0;0;0;0;0
AAAAAAAAGG;0;0;0;0;0;9;0;0
AAAAAAACAA;0;0;0;0;0;1;0;0
AAAAAAAGCA;0;0;0;6;0;0;0;0
AAAAAACCCA;0;0;6;0;0;0;0;0
AAAAAAGAAC;0;0;0;0;0;1;0;0
AAAAAAGCAG;0;0;0;0;0;1;0;0
AAAAAATACA;0;0;0;0;1;0;0;0
AAAAAATTAT;0;0;0;0;0;1;0;0
AAAAAATTTA;0;0;0;0;1;0;0;6
AAAAACACTC;0;0;0;0;0;0;1;0
AAAAACAGAA;0;0;0;0;1;0;0;0

but it's not very easy to do because it's not a basic
matrix permutation, you see.

Regards,
Laurent.

vgersh99 · Jun 24, 2005

something like this should get you started. I don't fully understand the pattern, but I think it's close

Code:

BEGIN {
  FS=OFS=";"
}

{
  if ( !($2 in cols) ) cols[$2]=++colsN
  tag[$1 , $2] = $3
}

END {
   # inverting the assoviative array - make it an indexed array
   for ( i in cols)
     cols[cols[i]] = i

   for( i in tag) {
     split(i, iA, SUBSEP)
     printf("%s%s", iA[1], OFS)
     for(i=1; i <= colsN; i++) {
        idx=iA[1] SUBSEP cols[i]
        printf("%s%s", (idx in tag) ? tag[idx] : "0", (i<colsN) ? OFS : "\n")
     }
   }
}

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

vgersh99 · Jun 24, 2005

still a bit 'raw'... some problems here.

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

vgersh99 · Jun 24, 2005

a bit better....

Code:

BEGIN {
  FS=OFS=";"
}

{
  if ( !($2 in cols) ) cols[$2]=++colsN
  tag[$1 , $2] = $3
}

END {
   # inverting the assoviative array - make it an indexed array
   for ( i in cols)
     cols[cols[i]] = i

   printf("tag%s", OFS)
   for(i=1; i<= colsN; i++)
      printf("%s%s", cols[i], (i<colsN) ? OFS : "\n")

   for( i in tag) {
     split(i, iA, SUBSEP)
     printf("%s%s", iA[1], OFS)
     for(i=1; i <= colsN; i++) {
        idx=iA[1] SUBSEP cols[i]
        printf("%s%s", (idx in tag) ? tag[idx] : "0", (i<colsN) ? OFS : "\n")
        if ( idx in tag) delete tag[idx]
     }
   }
}

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

mslider · Jun 24, 2005

Congratulation vlad and thank you so much for this help,
your code work fine.
So this the entire code with comments and description:

#!/usr/bin/awk -f
################################################################################
# crosstab.awk perform cross table analysis
# need to use a crosstab function to perform cross table (matrix permutation).
# this script simulate a pivot SQL query like is used in database engine.
################################################################################
# The Basic problem definition is :
#
# starting from a list of values tab or comma delimited, we want to group them
# by first field (duplicates are removed) and create a column for each distinct
# value of second field. The program take a file an input formated as:
#
# string ;ID ;count (headers column)
# AAAAAAAAAA;GSM1 ;17
# AAAAAAATCA;GSM1 ;1
# AAAAAAATTT;GSM1 ;1
# AAAAACAAAA;GSM1 ;1
# AAAAAAAAAA;GSM10419;54
# AAAAAAAAAC;GSM10419;2
# AAAAAACTAC;GSM10419;1
# AAAAAACTCC;GSM10419;2
# AAAAAACTGA;GSM10419;1
# AAAAAAGAAA;GSM10419;3
# AAAAAAGGCA;GSM10419;14
# AAAAACATAG;GSM10424;1
# AAAAACATCC;GSM10424;1
# AAAAACCAAA;GSM10424;1
# AAAAACCTAC;GSM10424;1
# AAAAAAAAAA;GSM10425;15
# AAAAAAAAAG;GSM10425;2
# AAAAAAAAAT;GSM10425;1
# AAAAAAAGAA;GSM10425;1
# AAAAAAAGAG;GSM10425;4
#
# the desired result is a table stored as a tab or comma delimited text file
# with one column for first field, several columns for each value
# of second field as:
#
# string ;GSM1;GSM10419;GSM10424;GSM10425 (headers column)
# AAAAAAAAAA;17 ;54 ;0 ;15
# AAAAAAAAAC;0 ;2 ;0 ;0
# AAAAAAAAAG;0 ;0 ;0 ;2
# AAAAAAAAAT;0 ;0 ;0 ;1
# AAAAAAAGAA;0 ;0 ;0 ;1
# AAAAAAAGAG;0 ;0 ;0 ;4
# AAAAAAATCA;1 ;0 ;0 ;0
# AAAAAAATTT;1 ;0 ;0 ;0
# AAAAAACTAC;0 ;1 ;0 ;0
# AAAAAACTCC;0 ;2 ;0 ;0
# AAAAAACTGA;0 ;1 ;0 ;0
# AAAAAAGAAA;0 ;3 ;0 ;0
# AAAAAAGGCA;0 ;14 ;0 ;0
# AAAAACAAAA;1 ;0 ;0 ;0
# AAAAACATAG;0 ;0 ;1 ;0
# AAAAACATCC;0 ;0 ;1 ;0
# AAAAACCAAA;0 ;0 ;1 ;0
# AAAAACCTAC;0 ;0 ;1 ;0
################################################################################

BEGIN {
FS=OFS=";"
}

{
if ( !($2 in cols) ) cols[$2]=++colsN
tag[$1 , $2] = $3
}

END {
# inverting the assoviative array - make it an indexed array
for ( i in cols)
cols[cols] = i

printf("tag%s", OFS)
for(i=1; i<= colsN; i++){
printf("%s%s", cols, (i<colsN) ? OFS : "\n")
}
for( i in tag) {
split(i, iA, SUBSEP)
printf("%s%s", iA[1], OFS)
for(i=1; i <= colsN; i++) {
idx=iA[1] SUBSEP cols
printf("%s%s", (idx in tag) ? tag[idx] : "0", (i<colsN) ? OFS : "\n")
if ( idx in tag) delete tag[idx]
}
}
}

mslider · Jun 24, 2005

I think something can be added to speed-up the code, maybe by using a function involved in the comparison step into the loop... I work it on...If you have some ideas, please fun !

futurelet · Jun 24, 2005

Code:

BEGIN { FS = ";" ; ORS = "" }

{ if ( ! ($2 in gsms) )
  { gsms[ $2 ]
    gsm_seq[ ++gcount ] = $2
  }
  if ( ! ($1 in tags) )
  { tags[ $1 ]
    tag_seq[ ++tcount ] = $1
  }
  data[$1,$2] = $3
}

END {
  for (i=0; i <= tcount; i++)
  { tag = tag_seq[i]
    print i ? tag : "tag"
    for (j=1; j in gsm_seq; j++ )
    { gsm = gsm_seq[j]
      print ";" (i ? (0 + data[tag,gsm]) : gsm )
    }
    print "\n"
  }
}

mslider · Jun 28, 2005

thanx again to futurelet for this code.

this awk code is pretty reasonable but curiously it's not get a big amount speedups probably because elements in "data" array (inside the loop) are not deleted with delete statement with each turn of loop. Vlad do it in his code.

mslider · Jun 28, 2005

to Vlad -->
I dont why, but using gawk instead awk your code returns mismatch with duplicates lines !

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

crosstab with AWK 3

mslider

Programmer

duncdude

Programmer

mslider

Programmer

vgersh99

Programmer

vgersh99

Programmer

vgersh99

Programmer

mslider

Programmer

mslider

Programmer

futurelet

Programmer

mslider

Programmer

mslider

Programmer

Similar threads

Part and Inventory Search

Sponsor