Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations IamaSherpa on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

How to calculate standard deviation with AWK? 2

Status
Not open for further replies.

GusGrave

Programmer
Nov 17, 2010
41
SE
So, I've been working a bit more on my script from a previous post in order to extract even more data without going through excel. Now I've hit another bump and I'm hoping someone with more programming skills than me might help me solve this issue.

Below is the script in question:

#!/usr/bin/awk -f

BEGIN{t="/dev/tty";printf "Enter number of molecules to average: ">t;getline<t;inp_num=$1}
{
out1 = "hb_%_occ_" FILENAME;
out2 = "summary_" FILENAME;
gsub (/\(+|\)/," ");
tothb += $10;
tott += $15;
if (NF>=15) {++denom;printf "%10.2f %10.1f\n", $10, $15 > out1}
avgocc = tothb/inp_num;
}

END {
avglt = (tott/denom);
tottsq += (($15-avglt)**2);
print "Summary data for hbond analysis" > out2;
printf ("\n") > out2;
printf (" Sum of Occupancy: %10.2f\n", tothb) > out2;
printf (" Average Occypancy: %10.2f\n", avgocc) > out2;
printf ("\n") > out2;
printf (" Sum of lifetimes: %10.2f\n", tott) > out2;
printf (" Average lifetime: %10.2f\n", avglt) > out2;
printf (" SD lifetime: %10.2f\n", sqrt(tottsq/denom)) > out2;
}

The code not working is highlighted. As you might gather, the problem occurs with the sum of squared sums, I want to take each value in $15-avglt (the average value calculated), square these numbers and sum them up sum(($15-avglt)^2). This in order to divide this with the denom (number of rows containing numbers and not text) to get the STDEV.

I don't even know if this will work, as I mentioned in previous thread, the first 13 rows contains text in the input file, the last line is always just non-sense signs. I don't know if the script I have will ignore these lines when summing and calculating for $15, but this is my belief since I get an output number, just not the right value!

Any suggestions are more than welcome

Best regards
Gustaf
 
You have to first calculate the average and only then start the stddev calculation.
Say the awk program is called stddev.awk:
Code:
#!/usr/bin/awk -f
BEGIN{t="/dev/tty";printf "Enter number of molecules to average: ">t;getline<t;inp_num=$1}
NR==1{out1="hb_%_occ_"FILENAME;out2="summary_"FILENAME}
{gsub (/\(+|\)/," ")}
NF<15{next}
NR==FNR{tott+=$15;++denom;next}
FNR==1{avglt=tott/denom}
{
  printf "%10.2f %10.1f\n",$10,$15>out1
  tottsq+=(($15-avglt)**2);tothb+=$10
}
END{
  printf "Summary data for hbond analysis\n\n">out2
  printf "   Sum of Occupancy:      %10.2f\n",tothb>out2
  printf "   Average Occupancy:     %10.2f\n\n",tothb/inp_num>out2
  printf "   Sum of lifetimes:      %10.2f\n",tott>out2
  printf "   Average lifetime:      %10.2f\n",avglt>out2
  printf "   SD lifetime:           %10.2f\n",sqrt(tottsq/denom)>out2
}
And you'll call it like this:
stddev.awk /path/to/input /path/to/input

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
Another way, without reading the input 2 times:
Code:
#!/usr/bin/awk -f
BEGIN{t="/dev/tty";printf "Enter number of molecules to average: ">t;getline<t;inp_num=$1}
NR==1{out1="hb_%_occ_"FILENAME;out2="summary_"FILENAME}
{gsub (/\(+|\)/," ")}
NF>=15{
  tott+=$15;++denom;tothb+=$10
  printf "%10.2f %10.1f\n",$10,$15>out1
}
END{
  close(out1);avglt=tott/denom
  while((getline<out1)>0)tottsq+=(($2-avglt)**2)
  printf "Summary data for hbond analysis\n\n">out2
  printf "   Sum of Occupancy:      %10.2f\n",tothb>out2
  printf "   Average Occupancy:     %10.2f\n\n",tothb/inp_num>out2
  printf "   Sum of lifetimes:      %10.2f\n",tott>out2
  printf "   Average lifetime:      %10.2f\n",avglt>out2
  printf "   SD lifetime:           %10.2f\n",sqrt(tottsq/denom)>out2
}

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
I'll try both and see which works best. Thank you...

My brain is starting to get really slow but I managed to figure out that i could not define new "arguments" for the input file analysis after the END command, so I did some remodelling:


#!/usr/bin/awk -f

BEGIN{t="/dev/tty";printf "Enter number of molecules to average: ">t;getline<t;inp_num=$1}
{
out1 = "hb_%_occ_" FILENAME;
out2 = "summary_" FILENAME;
gsub (/\(+|\)/," ");
tothb += $10;
tott += $15;
if (NF>=15) {++denom;printf "%10.2f %10.1f\n", $10, $15 > out1}
avgocc = tothb/inp_num;
tsqr += ($15^2);
}

END {
avglt = (tott/denom);
tottsq = (tsqr-(avglt^2));
print "Summary data for hbond analysis" > out2;
printf (" Kolla tottsq: %10.2f\n", tottsq) > out2;
printf ("\n") > out2;
printf (" Sum of Occupancy: %10.2f\n", tothb) > out2;
printf (" Average Occypancy: %10.2f\n", avgocc) > out2;
printf ("\n") > out2;
printf (" Sum of lifetimes: %10.2f\n", tott) > out2;
printf (" Average lifetime: %10.2f\n", avglt) > out2;
printf (" SD lifetime: %10.2f\n", sqrt(tottsq/denom)) > out2;
}

This isn't mathematically correct when calculating SD in the end, but I saw that the sum of sqared values from $15 was incorrect, I don't really know why, but I lost my enthusiasm for a couple of minutes and gave it a rest.

But I will try out these new scripts!

Thank you again for your help!

Best regards
Gustaf
 
For the SD you need to have the AVG first...

So you need to read the data twice, either through reading the file twice as in PHV's hints, or by storing the data in an array and stepping through that array in the END section.

HTH,

p5wizard
 
So, after only a minor re-organization I have the script that I so desperately wanted, I also added standard error of the mean calculation and it works like a charm. We don't really have any programmers at our university so thank god for this forum and all helpful people here.

Thank you so much PHV for you input, the second one worked smoothest so I went with that alternative, with only minor changes to get population SD and SEM. I could not figure out how AWK handled the two files and how to direct the "attention" to the right one, so a massive thumbs up for your input.

Thank you also p5wizard, I tried working with array but since I needed to work with a undefined number of lines after line 13 but excluding the last line I could not manage to get array sorting working in my favor. But I'm thankful for you input and explanation about AWK reading.

I'll include the script if anyone is interested in looking at the result! And again thank you both for your invaluable input!

Best regards
Gustaf


#!/usr/bin/awk -f
BEGIN{t="/dev/tty";printf "Enter number of molecules to average: ">t;getline<t;inp_num=$1}
NR==1{out1="hb_%_occ_"FILENAME;out2="summary_"FILENAME}
{gsub (/\(+|\)/," ")}
NF>=15{
tott+=$15;++denom;tothb+=$10
printf "%10.2f %10.1f\n",$10,$15>out1
}
END{
close(out1);avglt=tott/denom
while((getline<out1)>0)tottsq+=(($2-avglt)^2)
avocc=tothb/inp_num
sd_lt=sqrt(tottsq/(denom-1))
semlt=(tottsq/(denom-1))/(sqrt(denom))
printf "Summary data for hbond analysis\n\n">out2
printf " Sum of Occupancy: %10.2f\n",tothb>out2
printf " Average Occupancy: %10.2f\n\n",avocc>out2
printf " Sum of lifetimes: %10.2f\n",tott>out2
printf " Average lifetime: %10.2f\n",avglt>out2
printf " SD lifetime: %10.2f\n",sd_lt>out2
printf " SEM lifetime: %10.2f\n",semlt>out2
}
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top