Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Quick help evaluating awk script 2

Status
Not open for further replies.

GusGrave

Programmer
Nov 17, 2010
41
SE
Hello everyone!

Its been a few months since last time I posted. I created a script for AWK to aid in data analysis from some analyses we (me and co-workers) perform in connection to MD simulations. There was a minor problem that I solved using two different scripts, depending on which mode the analyses is performed you get different amounts of columns, which means that one script did not work on both analysis outputs. I did not find this to be a overwhelming problem, but it seems that some found it aggravating to keep track of which analysis method had been used to produce the AWK input file.

Today I finally had some time to spare, so I tried to solve this and create a "universal" AWK script (universal being a little over the top since there are only two different layouts of input files).

If any of you are at all interested and have a few minutes over I would really appreciate any feedback on this script. Does it look OK, is there a better way to solve the problem, anything that can be improved and so forth...

Best regards
//Gustaf


#!/usr/bin/awk -f
BEGIN{t="/dev/tty";printf "Enter number of molecules to average: ">t;getline<t;inp_num=$1}
NR==1{out1="hb_%_occ_"FILENAME;out2="summary_"FILENAME}
{gsub (/\(+|\)/," ")}
{if(NF>=15){
tott+=$15;++denom;tothb+=$10
printf "%10.2f %10.1f\n",$10,$15>out1
}
}
{if(NF>=11){
tothb+=$10;++denom
printf "%10.2f\n",$10>out1
}
}
END{
if(denom==0){
x="NO DATA POINTS IN INPUT => NO HYDROGEN BONDS DETECTED!"
print x>out1;print x>out2;exit
}
close(out1);
{if(tott>0){
avglt=tott/denom
while((getline<out1)>0)tottsq+=(($2-avglt)^2)
avocc=tothb/inp_num
printf " Summary data for hbond analysis\n\n">out2
printf " Sum of Occupancy: %10.2f\n",tothb>out2
printf " Average Occupancy: %10.2f\n\n",avocc>out2
printf " Sum of lifetimes: %10.2f\n",tott>out2
printf " Average lifetime: %10.2f\n",avglt>out2
if(denom>1){
sd_lt=sqrt(tottsq/(denom-1));semlt=(tottsq/(denom-1))/(sqrt(denom))
printf " SD lifetime: %10.2f\n",sd_lt>out2
printf " SEM lifetime: %10.2f\n",semlt>out2
} else print " Single HBOND event, no SD or SEM calculation possible!">out2
}
}
{if (tott==0){
avocc=tothb/inp_num
printf " Summary data for hbond analysis\n\n">out2
printf " Sum of Occupancy: %10.2f\n",tothb>out2
printf " Average Occupancy: %10.2f\n\n",avocc>out2
if(denom<1){ print " Single HBOND event, no SD or SEM calculation possible!">out2 }
}
}
}
 
Not knowing what data you're processing I can't really comment on the functionality, but from a coding style perspective some tidying up could be done, mostly by removing the unnecessary extra braces around "if" statements and tidier indentation, something like this:

Code:
#!/usr/bin/awk -f
BEGIN{t="/dev/tty";printf "Enter number of molecules to average: ">t;getline<t;inp_num=$1}
NR==1{out1="hb_%_occ_"FILENAME;out2="summary_"FILENAME}
{
        gsub (/\(+|\)/," ")
        if(NF>=15){
                tott+=$15;++denom;tothb+=$10
                printf "%10.2f %10.1f\n",$10,$15>out1
        }

        if(NF>=11){
                tothb+=$10;++denom
                printf "%10.2f\n",$10>out1
        }
}
END{
        if(denom==0){
                x="NO DATA POINTS IN INPUT => NO HYDROGEN BONDS DETECTED!"
                print x>out1;print x>out2;exit
        }
        close(out1);
        if(tott>0){
                avglt=tott/denom
                while((getline<out1)>0)tottsq+=(($2-avglt)^2)
                avocc=tothb/inp_num
                printf "   Summary data for hbond analysis\n\n">out2
                printf "   Sum of Occupancy:      %10.2f\n",tothb>out2
                printf "   Average Occupancy:     %10.2f\n\n",avocc>out2
                printf "   Sum of lifetimes:      %10.2f\n",tott>out2
                printf "   Average lifetime:      %10.2f\n",avglt>out2
                if(denom>1){
                        sd_lt=sqrt(tottsq/(denom-1));semlt=(tottsq/(denom-1))/(sqrt(denom))
                        printf "   SD lifetime:           %10.2f\n",sd_lt>out2
                        printf "   SEM lifetime:          %10.2f\n",semlt>out2
                } else print "   Single HBOND event, no SD or SEM calculation possible!">out2
        }

        if (tott==0){
                avocc=tothb/inp_num
                printf "   Summary data for hbond analysis\n\n">out2
                printf "   Sum of Occupancy:      %10.2f\n",tothb>out2
                printf "   Average Occupancy:     %10.2f\n\n",avocc>out2
                if(denom<1){ print "   Single HBOND event, no SD or SEM calculation possible!">out2 }
        }
}

Annihilannic.
 
Hi

I agree, the extra enclosing braces ( {} ) are better removed. But in addition I would also remove the explicit [tt]if[/tt] statements too, leaving the conditions as patterns ( or addresses in Sed terminology ) :
Code:
[teal]{[/teal]
        [b]gsub[/b] [teal]([/teal][fuchsia]/\(+|\)/[/fuchsia][teal],[/teal][green][i]" "[/i][/green][teal])[/teal]
[teal]}[/teal]
NF[teal]>=[/teal][purple]15[/purple] [teal]{[/teal]
        tott[teal]+=[/teal][navy]$1[/navy][purple]5[/purple][teal];++[/teal]denom[teal];[/teal]tothb[teal]+=[/teal][navy]$1[/navy][purple]0[/purple]
        [b]printf[/b] [green][i]"%10.2f %10.1f[/i][/green][lime][i]\n[/i][/lime][green][i]"[/i][/green][teal],[/teal][navy]$1[/navy][purple]0[/purple][teal],[/teal][navy]$1[/navy][purple]5[/purple][teal]>[/teal]out1
[teal]}[/teal]
NF[teal]>=[/teal][purple]11[/purple] [teal]{[/teal]
        tothb[teal]+=[/teal][navy]$1[/navy][purple]0[/purple][teal];++[/teal]denom
        [b]printf[/b] [green][i]"%10.2f[/i][/green][lime][i]\n[/i][/lime][green][i]"[/i][/green][teal],[/teal][navy]$1[/navy][purple]0[/purple][teal]>[/teal]out1
[teal]}[/teal]


Feherke.
 
Thank you both for you feedback!

I agree, it is sloppy code so I will definitely clean it up a bit!

Annihilannic: Is you are interested in checking functionality I can upload some input files somewhere, or send them by email (is probably easier). Ill just paste in the first few lines of an input file for the AWK script:

HBOND SUMMARY:
Data was saved to series 10, output to file hb_XLI_OAE_MAA_HAA.out, time interval is 0.50

data was sorted, intra-residue interactions are NOT included,
Distance cutoff is 3.00 angstroms, angle cutoff is 120.00 degrees
Hydrogen bond information dumped for occupancies > 0.00

Dumping schematic of time series after each h-bond, key follows:
| . - o x * @ |
0-5% 5-20% 20-40% 40-60% 60-80% 80-95% 95-100% occupancy

DONOR ACCEPTORH ACCEPTOR
atom# :res@atom atom# :res@atom atom# :res@atom %occupied distance angle lifetime maxocc
| 7262 :260@OAE | 11331 :411@HAA 11330 :411@OAC | 95.91 2.720 ( 0.11) 18.18 ( 9.60) 12.8 ( 25.8) 307 |@@@@*@@@@*@*@@@@@**@@@@@@@@@@@@@@@@@@@@@@@********|
...
...
...

---------------- --------------------------------- -------------------------------------

This contains only one data line (everything between | 7262 and ***| on one line), usually there are between 10 and a couple of hundred lines. There are three alternative outputs from these analyses. The one above is the most commonly used here which utilizes a "series" command to get a graphic representation of when interactions take place (the long line with @:s and *:s) I've made the columns of interest bold

The second type dos not have the last field and lifetime calculations, so there's only one coloumn of interrest:

DONOR ACCEPTORH ACCEPTOR
atom# :res@atom atom# :res@atom atom# :res@atom %occupied distance angle
| 7262 :260@OAE | 11331 :411@HAA 11330 :411@OAC | 95.91 2.720 ( 0.11) 18.18 ( 9.60)
---------------- --------------------------------- -------------------------------------

The third output is when no interaction takes place:

DONOR ACCEPTORH ACCEPTOR
atom# :res@atom atom# :res@atom atom# :res@atom %occupied distance angle lifetime maxocc
---------------- --------------------------------- -------------------------------------

Thank you both again for you help!

Best regards
// Gustaf
 
So, obviously I found some problems when doing some more testing. Some calculations regarding lifetime gets screwed-up when NF>=11. Also the alternative outputs "NO HYDROGEN BONDS DETECTED" and "ONLY ONE EVENT...."

Two of the initial text lines contains 11 fields which introduces two 0.00 lines that screws up the calculations. And since these two lines exist, there will always be an interpretation that there are more than 2 events registered.


I guess I have to find a better way to define which file type is involved.

Best regards
// Gustaf
 
I guess one option would which should be fairly crude but simple would be to delete/not read the first 13 lines, then the text lines fulfilling the NF criteria would be eliminated and the problem would be solved!

Do any of you have a appropriate line/command and a suitable placement to be able to do this?

Best regards
Gustaf
 
not read the first 13 lines
At the beginning of the program:
NR<=13{next}

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top