Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

crosstab with AWK 3

Status
Not open for further replies.

mslider

Programmer
Jun 24, 2005
9
FR
Cross tabulations are statistical reports where you de-normalize your data and show results grouped by one field, having one column for each distinct value of a second field.
This cross tabulations are usualy made using SQL language into dababase engine, but I want to fix it using a simple awk script to perform this crosstab analysis.

The Basic problem definition is :

starting from a list of values, we want to group them by field A and create a column for each distinct value of field B:

tag;gsm;count
AAAAAAAAAA;GSM31945;6
AAAAAAAAAA;GSM3240;6
AAAAAACCCA;GSM3242;6
AAAAAAAGCA;GSM3243;6
AAAAAATACA;GSM41375;1
AAAAAATTTA;GSM41375;1
AAAAACACTC;GSM41378;1
AAAAACAGAA;GSM41375;1
AAAAAAAAGG;GSM41376;9
AAAAAAACAA;GSM41376;1
AAAAAAGAAC;GSM41376;1
AAAAAAGCAG;GSM41376;1
AAAAAATTAT;GSM41376;1
AAAAAATTTA;GSM676;6

the desired result is a table with one column for field A, several columns for each value of field B:

tag;GSM31945;GSM3240;GSM3242;GSM3243;GSM41375;GSM41376;GSM41378;GSM676
AAAAAAAAAA;6;6;0;0;0;0;0;0
AAAAAAAAGG;0;0;0;0;0;9;0;0
AAAAAAACAA;0;0;0;0;0;1;0;0
AAAAAAAGCA;0;0;0;6;0;0;0;0
AAAAAACCCA;0;0;6;0;0;0;0;0
AAAAAAGAAC;0;0;0;0;0;1;0;0
AAAAAAGCAG;0;0;0;0;0;1;0;0
AAAAAATACA;0;0;0;0;1;0;0;0
AAAAAATTAT;0;0;0;0;0;1;0;0
AAAAAATTTA;0;0;0;0;1;0;0;6
AAAAACACTC;0;0;0;0;0;0;1;0
AAAAACAGAA;0;0;0;0;1;0;0;0

I try to use statements for scanning a "multi-dimensional" array in awk and I have some troubles in performing that, so I need somebody help to. Maybe you have already coded a piece of code to resolve this problem or some ideas. Please, tell me.

Regards,
Laurent --
 
Hi Laurent

Don't know why... but i find this bioninformatics stuff very interesting

I can't quite get my head around how these 2 tables marry-up!?

Can you please explain to me


Kind Regards
Duncan
 
Dear Duncan,

these 2 tables don't marry-up!

first you have a tab (or comma) tabulated text file as the
format below:

tag;gsm;count
AAAAAAAAAA;GSM31945;6
AAAAAAAAAA;GSM3240;6
AAAAAACCCA;GSM3242;6
AAAAAAAGCA;GSM3243;6
AAAAAATACA;GSM41375;1
AAAAAATTTA;GSM41375;1
AAAAACACTC;GSM41378;1
AAAAACAGAA;GSM41375;1
AAAAAAAAGG;GSM41376;9
AAAAAAACAA;GSM41376;1
AAAAAAGAAC;GSM41376;1
AAAAAAGCAG;GSM41376;1
AAAAAATTAT;GSM41376;1
AAAAAATTTA;GSM676;6

the awk script (i try to code) take this file as input
and transform it using a crosstab (or pivot) function into
this new file below:

tag;GSM31945;GSM3240;GSM3242;GSM3243;GSM41375;GSM41376;GSM41378;GSM676
AAAAAAAAAA;6;6;0;0;0;0;0;0
AAAAAAAAGG;0;0;0;0;0;9;0;0
AAAAAAACAA;0;0;0;0;0;1;0;0
AAAAAAAGCA;0;0;0;6;0;0;0;0
AAAAAACCCA;0;0;6;0;0;0;0;0
AAAAAAGAAC;0;0;0;0;0;1;0;0
AAAAAAGCAG;0;0;0;0;0;1;0;0
AAAAAATACA;0;0;0;0;1;0;0;0
AAAAAATTAT;0;0;0;0;0;1;0;0
AAAAAATTTA;0;0;0;0;1;0;0;6
AAAAACACTC;0;0;0;0;0;0;1;0
AAAAACAGAA;0;0;0;0;1;0;0;0

but it's not very easy to do because it's not a basic
matrix permutation, you see.

Regards,
Laurent.

 
something like this should get you started. I don't fully understand the pattern, but I think it's close ;)
Code:
BEGIN {
  FS=OFS=";"
}

{
  if ( !($2 in cols) ) cols[$2]=++colsN
  tag[$1 , $2] = $3
}

END {
   # inverting the assoviative array - make it an indexed array
   for ( i in cols)
     cols[cols[i]] = i

   for( i in tag) {
     split(i, iA, SUBSEP)
     printf("%s%s", iA[1], OFS)
     for(i=1; i <= colsN; i++) {
        idx=iA[1] SUBSEP cols[i]
        printf("%s%s", (idx in tag) ? tag[idx] : "0", (i<colsN) ? OFS : "\n")
     }
   }
}

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
still a bit 'raw'... some problems here.

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
a bit better....
Code:
BEGIN {
  FS=OFS=";"
}

{
  if ( !($2 in cols) ) cols[$2]=++colsN
  tag[$1 , $2] = $3
}

END {
   # inverting the assoviative array - make it an indexed array
   for ( i in cols)
     cols[cols[i]] = i

   printf("tag%s", OFS)
   for(i=1; i<= colsN; i++)
      printf("%s%s", cols[i], (i<colsN) ? OFS : "\n")

   for( i in tag) {
     split(i, iA, SUBSEP)
     printf("%s%s", iA[1], OFS)
     for(i=1; i <= colsN; i++) {
        idx=iA[1] SUBSEP cols[i]
        printf("%s%s", (idx in tag) ? tag[idx] : "0", (i<colsN) ? OFS : "\n")
        if ( idx in tag) delete tag[idx]
     }
   }
}

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
Congratulation vlad and thank you so much for this help,
your code work fine.
So this the entire code with comments and description:

#!/usr/bin/awk -f
################################################################################
# crosstab.awk perform cross table analysis
# need to use a crosstab function to perform cross table (matrix permutation).
# this script simulate a pivot SQL query like is used in database engine.
################################################################################
# The Basic problem definition is :
#
# starting from a list of values tab or comma delimited, we want to group them
# by first field (duplicates are removed) and create a column for each distinct
# value of second field. The program take a file an input formated as:
#
# string ;ID ;count (headers column)
# AAAAAAAAAA;GSM1 ;17
# AAAAAAATCA;GSM1 ;1
# AAAAAAATTT;GSM1 ;1
# AAAAACAAAA;GSM1 ;1
# AAAAAAAAAA;GSM10419;54
# AAAAAAAAAC;GSM10419;2
# AAAAAACTAC;GSM10419;1
# AAAAAACTCC;GSM10419;2
# AAAAAACTGA;GSM10419;1
# AAAAAAGAAA;GSM10419;3
# AAAAAAGGCA;GSM10419;14
# AAAAACATAG;GSM10424;1
# AAAAACATCC;GSM10424;1
# AAAAACCAAA;GSM10424;1
# AAAAACCTAC;GSM10424;1
# AAAAAAAAAA;GSM10425;15
# AAAAAAAAAG;GSM10425;2
# AAAAAAAAAT;GSM10425;1
# AAAAAAAGAA;GSM10425;1
# AAAAAAAGAG;GSM10425;4
#
# the desired result is a table stored as a tab or comma delimited text file
# with one column for first field, several columns for each value
# of second field as:
#
# string ;GSM1;GSM10419;GSM10424;GSM10425 (headers column)
# AAAAAAAAAA;17 ;54 ;0 ;15
# AAAAAAAAAC;0 ;2 ;0 ;0
# AAAAAAAAAG;0 ;0 ;0 ;2
# AAAAAAAAAT;0 ;0 ;0 ;1
# AAAAAAAGAA;0 ;0 ;0 ;1
# AAAAAAAGAG;0 ;0 ;0 ;4
# AAAAAAATCA;1 ;0 ;0 ;0
# AAAAAAATTT;1 ;0 ;0 ;0
# AAAAAACTAC;0 ;1 ;0 ;0
# AAAAAACTCC;0 ;2 ;0 ;0
# AAAAAACTGA;0 ;1 ;0 ;0
# AAAAAAGAAA;0 ;3 ;0 ;0
# AAAAAAGGCA;0 ;14 ;0 ;0
# AAAAACAAAA;1 ;0 ;0 ;0
# AAAAACATAG;0 ;0 ;1 ;0
# AAAAACATCC;0 ;0 ;1 ;0
# AAAAACCAAA;0 ;0 ;1 ;0
# AAAAACCTAC;0 ;0 ;1 ;0
################################################################################

BEGIN {
FS=OFS=";"
}

{
if ( !($2 in cols) ) cols[$2]=++colsN
tag[$1 , $2] = $3
}

END {
# inverting the assoviative array - make it an indexed array
for ( i in cols)
cols[cols] = i

printf("tag%s", OFS)
for(i=1; i<= colsN; i++){
printf("%s%s", cols, (i<colsN) ? OFS : "\n")
}
for( i in tag) {
split(i, iA, SUBSEP)
printf("%s%s", iA[1], OFS)
for(i=1; i <= colsN; i++) {
idx=iA[1] SUBSEP cols
printf("%s%s", (idx in tag) ? tag[idx] : "0", (i<colsN) ? OFS : "\n")
if ( idx in tag) delete tag[idx]
}
}
}

 
I think something can be added to speed-up the code, maybe by using a function involved in the comparison step into the loop... I work it on...If you have some ideas, please fun !
 
Code:
BEGIN { FS = ";" ; ORS = "" }

{ if ( ! ($2 in gsms) )
  { gsms[ $2 ]
    gsm_seq[ ++gcount ] = $2
  }
  if ( ! ($1 in tags) )
  { tags[ $1 ]
    tag_seq[ ++tcount ] = $1
  }
  data[$1,$2] = $3
}

END {
  for (i=0; i <= tcount; i++)
  { tag = tag_seq[i]
    print i ? tag : "tag"
    for (j=1; j in gsm_seq; j++ )
    { gsm = gsm_seq[j]
      print ";" (i ? (0 + data[tag,gsm]) : gsm )
    }
    print "\n"
  }
}
 
thanx again to futurelet for this code.

this awk code is pretty reasonable but curiously it's not get a big amount speedups probably because elements in "data" array (inside the loop) are not deleted with delete statement with each turn of loop. Vlad do it in his code.

 
to Vlad -->
I dont why, but using gawk instead awk your code returns mismatch with duplicates lines !

 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top