I've posted about this script before but haven't worked on it for a while. I started again today and realized that I need a new approach to my problem. For the script to be useful to anyone but me, I have to minimize the amount of "individual thinking" needed to use it. It has to handle two different kinds of input files and tell them apart (three, actually, since one kind of input has no data lines, only text lines). I already have two separate working scripts, one for each input type, but it has become apparent that it is hard to keep track of which input you are feeding to which script, and this causes faulty results and confusion.
Below is the script I'm working with:
-------------------------------------------------------------------------------
#!/usr/bin/awk -f
BEGIN{t="/dev/tty";printf "Enter number of molecules to average: ">t;getline<t;inp_num=$1}
NR==1{out1="hb_%_occ_"FILENAME;out2="summary_"FILENAME}
NR<=13{next}
{
gsub (/\(+|\)/," ")
}
{
if(NF>=15){ #1
tott+=$15;++denom;tothb+=$10
printf "%10.2f %10.1f\n",$10,$15>out1
}
else if(NF>=11){ #2
tothb+=$10;++denom
printf "%10.2f\n",$10>out1
}
}
END{
if(denom==0){
x="NO DATA POINTS IN INPUT => NO HYDROGEN BONDS DETECTED!"
print x>out1;print x>out2;exit
}
close(out1);
if(tott>0){
avglt=tott/denom
while((getline<out1)>0)tottsq+=(($2-avglt)^2)
avocc=tothb/inp_num
printf " Summary data for hbond analysis\n\n">out2
printf " Sum of Occupancy: %10.2f\n",tothb>out2
printf " Average Occupancy: %10.2f\n\n",avocc>out2
printf " Sum of lifetimes: %10.2f\n",tott>out2
printf " Average lifetime: %10.2f\n",avglt>out2
if(denom>1){
sd_lt=sqrt(tottsq/(denom-1));semlt=sd_lt/sqrt(denom) # SEM = SD/sqrt(n)
printf " SD lifetime: %10.2f\n",sd_lt>out2
printf " SEM lifetime: %10.2f\n",semlt>out2
} else print " Single HBOND event, no SD or SEM calculation possible!">out2
}
if (tott==0){
avocc=tothb/inp_num
printf " Summary data for hbond analysis\n\n">out2
printf " Sum of Occupancy: %10.2f\n",tothb>out2
printf " Average Occupancy: %10.2f\n\n",avocc>out2
if(denom<1){ print " Single HBOND event, no SD or SEM calculation possible!">out2 }
}
}
----------------------------------------------------------------------------------
I don't know why, but using "else if" instead of a plain "if" for the second branch (#2) makes the "math part" almost work. Two problems remain, though. The NR<=13{next} rule works for only one type of input (#1); with the other type (#2) it loses 4 data lines. If I decrease it to 9, an extra line of zeroes is included in the output for input type #1, which screws up the "Average lifetime" calculation because the zero line is counted as an event. This happens because one of the text lines before the data contains 15 or more fields (NF>=15). For #1 and #2 I want the code to recognize whether it should start reading data at row 14 (for #1) or at row 10 (for #2).
The way I figure it (which might be way off), I have three alternatives:
1) Applying NR<=13{next} after the "if(NF>=15)" and NR<=9{next} after the "if(NF>=11)", but then I get:
>>> NR<=13{ <<< next}
... illegal statement at source line ...
So it seems I cannot apply this filter after the "ifs", or I am not applying it correctly.
2) Search the second line (NR==2) (or the entire document) for the word "series" (which occurs only once, and only in input type #1), then use the result as a flag: if "series" is found, apply "if(NF>=15)..."; otherwise apply "if(NF>=11)...".
3) (which is just a modification/simplification of 2) If the second line of the input has more than 8 fields, apply "if(NF>=15)..."; otherwise apply "if(NF>=11)...". But I cannot figure out how to combine if(NR=2 and NF>=8){} into a single filter.
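For what it's worth, here is a minimal sketch of how options 2 and 3 might look. It relies on two points of standard awk. First, a pattern{action} rule such as NR<=13{next} is only legal at the top level of the program; inside an action block the same test has to be written as if(NR<=13) next, which is presumably why option 1 gives "illegal statement". Second, conditions are compared with == and combined with &&, so the filter in option 3 would be written NR==2 && NF>=8. Everything else is an assumption for illustration: the wrapper function skip_header and the variable skip are hypothetical names, "series" is assumed to sit on line 2 of input type #1, and the sketch only prints the surviving lines instead of doing the sums.

```sh
#!/bin/sh
# Hypothetical wrapper: print only the data lines of one input file.
skip_header() {
  awk '
    # Start of file: assume input type #2 (data begins at row 10).
    FNR == 1 { skip = 9 }

    # Option 2: "series" on line 2 marks input type #1 (data begins at row 14).
    # Assumption: the word really is on line 2 of every type #1 file.
    FNR == 2 && /series/ { skip = 13 }

    # Option 3 would use this rule instead of the one above:
    # FNR == 2 && NF >= 8 { skip = 13 }

    # Header lines are dropped for both input types.
    FNR <= skip { next }

    # Stand-in for the real NF >= 15 / NF >= 11 arithmetic.
    { print }
  ' "$1"
}
```

In the full script the two skip rules would take the place of NR<=13{next}, and the NF tests in the main block could stay as they are. FNR is used instead of NR so that the detection restarts for each file if several are given; with a single input file the two are identical.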
So the first issue is to get the code to start reading at different rows depending on the format of the input. The second problem is to get the code NOT to read the last line of the input. I guess this should be something like NR<=$NR, but I don't know how to include this in the code in a good way. Including the last line also adds an extra row of zeros to out1 and messes up the average calculation in out2, since these added lines are counted as events.
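On the last-line problem: awk does not know that a line is the last one while it is reading it, and NR<=$NR compares NR with the contents of field number NR, not with the total number of lines. A common workaround is to handle each line one step late, so that the final line is never handled at all. A minimal sketch, where the function name drop_last_line and the variable prev are made up for illustration and printing stands in for the real arithmetic:

```sh
#!/bin/sh
# Hypothetical wrapper: print every line of a file except the last one.
drop_last_line() {
  awk '
    # A newer line has arrived, so the buffered one cannot be the last line.
    NR > 1 { print prev }

    # Buffer the current line; the final line stays in prev and is never printed.
    { prev = $0 }
  ' "$1"
}
```

In the full script the sums would have to work on the buffered line rather than the current one, for example after split(prev, f) and then using f[10] and f[15] in place of $10 and $15. That part is untested against the real input files.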
As you have probably noticed, I am not a very experienced programmer, and this might cause some problems with my scripts and with how I formulate my questions. Still, I hope that someone can help me with some suggestions on how to work option 1, 2 or 3 into the script, so that it can differentiate between the different types of input and exclude the last line of input from the calculations, producing accurate results without anyone having to think about which input type they have.
Best regards
//Gustaf