AWK and multiple inputs?

Status
Not open for further replies.

GusGrave

Programmer
Nov 17, 2010
41
SE
Back again! Finally the script seems to be working without a glitch and my coworkers seem pleased!

Thank you all for the help with this one! But, as always with efficiency, I want to be more efficient. I am also lazier than the average person ;). So my question is this: in the modified version of this script below, where manual input for each analysis has been removed, is there a way to get AWK to spit out one "out1" and one "out2" file for each input if I run the script as:

./script.awk hb_results*

Using a wildcard to input all files at once? As it is now, I get one out1 and one out2, named after the first file that the * matches, though these two outputs contain the results from all the input files. I would like one out1 and one out2 for each file found with hb_results* (up to 30 or 40 files), each named after its specific input as specified in the script!

Is this possible?

Best regards
Gustaf


BEGIN{inp_num=10}
NR==1{out1="hb_%_occ_"FILENAME;out2="summary_"FILENAME;next}
NR==2{nr=(NF>8?12:8)}
NR<=nr{next}
{
gsub (/\(+|\)+|\:/," ")
}
{
if(NF>=15){
tott+=$15;++denom;tothb+=$10
printf "%10.2f %10.1f\n",$10,$15>out1
}


else if(NF>=11){
tothb+=$10;++denom
printf "%10.2f\n",$10>out1
}
}
END{
if(denom==0){
x="NO DATA POINTS IN INPUT => NO HYDROGEN BONDS DETECTED!"
print x>out1;print x>out2;exit
}
close(out1);
if(tott>0){
avglt=tott/denom
while((getline<out1)>0)tottsq+=(($2-avglt)^2)
avocc=tothb/inp_num
printf " Summary data for hbond analysis\n\n">out2
printf " Sum of Occupancy: %10.3f\n",tothb>out2
printf " Average Occupancy: %10.3f\n\n",avocc>out2
printf " Sum of lifetimes: %10.3f\n",tott>out2
printf " Average lifetime: %10.3f\n",avglt>out2
if(denom>1){
sd_lt=sqrt(tottsq/(denom-1));semlt=(tottsq/(denom-1))/(sqrt(denom))
printf " SD lifetime: %10.3f\n",sd_lt>out2
printf " SEM lifetime: %10.3f\n",semlt>out2
} else print " Single HBOND event, no SD or SEM calculation possible!">out2
}

if (tott==0){
avocc=tothb/inp_num
printf " Summary data for hbond analysis\n\n">out2
printf " Sum of Occupancy: %10.3f\n",tothb>out2
printf " Average Occupancy: %10.3f\n\n",avocc>out2
if(denom<1){ print " Single HBOND event, no SD or SEM calculation possible!">out2 }
}
}
 
Thinking a bit about it, should it not be possible to do this with some form of "foreach" argument regarding the FILENAME?

Best regards
// Gustaf
 
Hi

Which AWK implementation can you use? GNU AWK has [tt]BEGINFILE[/tt] and [tt]ENDFILE[/tt] blocks since version 4.0. With that, you could replace [tt]NR[/tt] with [tt]FNR[/tt] and [tt]END[/tt] with [tt]ENDFILE[/tt]. Otherwise the best option is to move the body of [tt]END[/tt] into a function and call it both in [tt]END[/tt] and whenever [tt]FNR==1&&NR!=1[/tt].

Feherke.
 
Well, sadly there are multiple versions since I'm running it locally on my Mac (BSD version), on two different clusters/servers with different GAWK versions, one of which is only 3.1.5. So I guess BEGIN/ENDFILE is out of the question.

It is important that this script runs without interference/modifications on as wide an array of systems as possible, since I'm a novice programmer at best and the other people using this script know even less.

I will look into putting END into a function, see if I can figure it out. Otherwise I'll just have to do the work like everyone else, I guess two extra button pushes isn't really that much to complain about after already saving days of work ;)

Best regards
// Gustaf
 
Hi

And two more things: do not forget to reset the tott, tothb and denom counters, and it is better to explicitly [tt]close()[/tt] out2 too. So I would try it like this:
Code:
BEGIN { inp_num=10 }
FNR==1 && NR!=1 { endfile(); tott=tothb=denom=0 }
FNR==1 { out1="hb_%_occ_"FILENAME; out2="summary_"FILENAME; next }
FNR==2 { nr=(NF>8?12:8) }
FNR<=nr { next }
{
  gsub (/\(+|\)+|\:/," ")
}
{
  if (NF>=15) {
    tott+=$15; ++denom; tothb+=$10
    printf "%10.2f %10.1f\n",$10,$15 > out1
  } else if (NF>=11) {
    tothb+=$10; ++denom
    printf "%10.2f\n",$10 > out1
  }
}
END { endfile() }
function endfile()
{
  if (denom==0) {
    x="NO DATA POINTS IN INPUT => NO HYDROGEN BONDS DETECTED!"
    print x > out1; print x > out2; exit
  }
  close(out1)
  if (tott>0) {
    avglt=tott/denom
    while ((getline<out1)>0) tottsq+=(($2-avglt)^2)
    avocc=tothb/inp_num
    printf "   Summary data for hbond analysis\n\n" > out2
    printf "   Sum of Occupancy:      %10.3f\n",tothb > out2
    printf "   Average Occupancy:     %10.3f\n\n",avocc > out2
    printf "   Sum of lifetimes:      %10.3f\n",tott > out2
    printf "   Average lifetime:      %10.3f\n",avglt > out2
    if (denom>1) {
      sd_lt=sqrt(tottsq/(denom-1)); semlt=(tottsq/(denom-1))/(sqrt(denom))
      printf "   SD lifetime:           %10.3f\n",sd_lt > out2
      printf "   SEM lifetime:          %10.3f\n",semlt > out2
    } else print "   Single HBOND event, no SD or SEM calculation possible!" > out2
  }
  if (tott==0) {
    avocc=tothb/inp_num
    printf "   Summary data for hbond analysis\n\n" > out2
    printf "   Sum of Occupancy:      %10.3f\n",tothb > out2
    printf "   Average Occupancy:     %10.3f\n\n",avocc > out2
    if (denom<1) { print "   Single HBOND event, no SD or SEM calculation possible!" > out2 }
  }
  close(out2)
}
Vaguely tested with [tt]gawk[/tt] and [tt]mawk[/tt].

Feherke.
 
Thank you very much! I'll dive right into this!

I was just reading about the close function trying to figure out the output handling, so this was also what I was going to ask about! Though I have one more question: as you can see, I have a "getline" which I forgot to erase. The:

{inp_num=10}

used to be (and still is for my co-workers) the following, so that the value used for division was specified by the user:

{t="/dev/tty";printf "Enter number of molecules to average: ">t;getline<t;inp_num=$1}

I guess I can remove the "getline" part all the way through, but anyway: should I close the "inp_num" in the "automatic" version and the "t" ("getline") in the manual version, or does AWK overwrite these specified values on restart/re-run over a new set of files?

Best regards
// Gustaf
 
Feherke, I'd replace this:
print x > out1; print x > out2; exit
with this:
print x > out1; print x > out2; return

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
PHV -> I'll try the modifications and see

Feherke -> It works beautifully on both clusters, going to try on my local machine as well!

 
This is just my lacking programming skills shining through. In one script the inp_num was defined by the user upon execution of the script in the terminal, so that the user would not have to open and modify the script; now I just input the number directly in the script.

If I specify inp_num=10 and then change it to inp_num=40 and run on other files, is the value always replaced, or should I "close()" the value/variable at the end of the script to avoid problems? I haven't experienced any errors in the calculations yet, but I would like to be sure.

Sorry for my poor explanation!

// Gus
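
On the question above: awk variables exist only for a single run of the program, so nothing carries over from one invocation to the next, and close() applies to files and pipes rather than to variables. As a minimal sketch (the mini-script, its input numbers and the "default unless set" guard are made up for illustration and are not part of the thread's script), the standard -v option can also set inp_num per run without opening the script:

```sh
#!/bin/sh
# Keep the decimal point independent of the user's locale.
export LC_ALL=C

# Hypothetical mini-script: divides a column sum by inp_num.
# inp_num defaults to 10 unless the caller passes -v inp_num=...
prog='BEGIN { if (inp_num == "") inp_num = 10 }
      { sum += $1 }
      END { printf "%.3f\n", sum / inp_num }'

# Three independent runs over the same two input lines (sum = 50).
run_default=$(printf '20\n30\n' | awk "$prog")
run_forty=$(printf '20\n30\n' | awk -v inp_num=40 "$prog")
run_again=$(printf '20\n30\n' | awk "$prog")

echo "default:       $run_default"
echo "inp_num=40:    $run_forty"
echo "default again: $run_again"
```

The third run falls back to the default again, which suggests that the value from the second run was not kept anywhere. Note that a plain BEGIN{inp_num=10}, as in the thread's script, would overwrite a value given with -v, which is why the sketch only assigns the default when the variable is still empty.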
 
Hi

You mean to
Code:
{t="/dev/tty";printf "Enter number of molecules to average: ">t;getline<t;inp_num=$1;close(t)}
? Well, you can do that too, but leaving it out is not very harmful anyway. Unused files have to be closed to avoid the risk of running out of available file descriptors. For out2, a new file is opened for each processed input file, and closing it allows the file descriptor to be reused. As t's value does not change, [tt]getline < t[/tt] keeps only a single file open, so it is not a significant waste.

Feherke.
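
The point about file descriptors can be sketched as follows. This is only an illustration under assumptions: the file names, the count of 50 and the temporary directory are invented, and mktemp -d is assumed to be available (it is on common Linux and macOS systems). Each output file is closed as soon as it has been written, so at most one output descriptor is in use at any time, however many files are produced:

```sh
#!/bin/sh
# Scratch directory for the demo outputs.
tmp=$(mktemp -d) || exit 1

awk -v dir="$tmp" 'BEGIN {
  for (i = 1; i <= 50; i++) {
    out = dir "/summary_" i ".txt"   # made-up naming scheme
    print "file number " i > out
    close(out)                       # release the descriptor right away
  }
}'

# Count the files that were actually written.
count=$(ls "$tmp" | wc -l | tr -d ' ')
echo "files written: $count"
rm -rf "$tmp"
```

Without the close() call, an awk with a low limit on open files could stop partway through the loop with a "too many open files" style error.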
 
Thanks a bundle for clarifying!

I have my weekend cut out for me making sure that the code works on all different workstations and servers!

I love this forum, everyone is so helpful! It is amazing! Lots of thumbs up for everyone helping me get more work done in a steadily decreasing amount of time!!

Best regards
// Gustaf
 
Good morning!

Well, after digging around, trying different input files and so on, there are two things that remain to solve.

PHV -> return instead of exit gives me:


/usr/bin/awk: hb_%_occ_hb_bot_n2_pab_h5.out makes too many open files
input record number 1, file hb_bot_o1_pab_h1.out
source line number 24

So then there must be something else that also needs to be modified.

feherke -> I guess this is coupled with the above: using exit (as suggested above) causes files without data (only text) not to be written, which means that the "x="NO DATA POINTS..." message does not get printed into out1 and out2. It works for files containing >1 point, though.

Code:
FNR==1 && NR!=1 { endfile(); tott=tothb=denom=0 }
...
...
...
{
  if (denom==0) {
    x="NO DATA POINTS IN INPUT => NO HYDROGEN BONDS DETECTED!"
    print x > out1; print x > out2; exit
  }

In files containing no data points, tott and tothb will be =0 so I guess there will be a "premature" termination? I'm not very familiar with boolean operators so I cannot really decipher FNR==1 && NR!=1, except for that if tott and tothb =0 then endfile.

I won't have as much time this week to work on this, but as usual any feedback is more than welcome, and thank you for all your previous input, both of you!

best regards
// Gustaf
 
Hi

PHV's suggestion for replacing [tt]exit[/tt] is mandatory. Otherwise the first empty file will stop the entire processing.

What is missing is another set of [tt]close()[/tt] calls, as the existing two [tt]close()[/tt]s are not reached when an empty file is met :
Code:
  if (denom==0) {
    x="NO DATA POINTS IN INPUT => NO HYDROGEN BONDS DETECTED!"
    print x > out1; print x > out2
    close(out1); close(out2)
    return
  }

Feherke.
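
Why return matters here can be shown with a stripped-down sketch. Everything in it is hypothetical (two tiny input files, a one-line header, and a simplified endfile() that prints to standard output instead of out1/out2); it only mirrors the control flow discussed above:

```sh
#!/bin/sh
tmp=$(mktemp -d) || exit 1
printf 'header\n' > "$tmp/a.txt"          # header only, no data lines
printf 'header\n3\n5\n' > "$tmp/b.txt"    # header plus two data lines

prog='
function endfile() {
  # return leaves only this function; the main loop keeps going
  if (n == 0) { print name ": no data"; return }
  print name ": sum=" sum
}
FNR == 1 && NR != 1 { endfile(); sum = n = 0 }   # report previous file, reset
FNR == 1 { name = FILENAME; sub(/.*\//, "", name); next }  # skip header line
{ sum += $1; n++ }
END { endfile() }                                # report the last file
'

result=$(awk "$prog" "$tmp/a.txt" "$tmp/b.txt")
echo "$result"
rm -rf "$tmp"
```

With exit in place of return, awk would jump straight to END when it meets the empty a.txt, and b.txt would never be summarized.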
 
Hi

Forgot this one :
Gustaf said:
I cannot really decipher FNR==1 && NR!=1
That means the current file's current record number is 1 and the overall record number is not 1. In other words, it is the first line of an input file, except for the first file. That is when we need to run endfile() and reset the counters (approximately) between two input files.


Feherke.
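
To see where that condition fires, a small sketch can print NR and FNR for every line of two input files. The file names and contents are made up for the demonstration:

```sh
#!/bin/sh
tmp=$(mktemp -d) || exit 1
printf 'a1\na2\n' > "$tmp/one.txt"
printf 'b1\nb2\n' > "$tmp/two.txt"

# NR counts records over all files; FNR restarts at 1 for each file.
result=$(awk '{
  boundary = (FNR == 1 && NR != 1) ? "yes" : "no"
  print "NR=" NR, "FNR=" FNR, "boundary=" boundary
}' "$tmp/one.txt" "$tmp/two.txt")

echo "$result"
rm -rf "$tmp"
```

Only the first line of two.txt (NR=3, FNR=1) is reported as a boundary. The first line of one.txt has FNR=1 as well, but NR is also 1 there, so there is no previous file to finish.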
 
Got it, so current record nr is 1, overall is not 1 and both criteria must be met! I guess I'm not a complete fool since there was some truth to the "premature" termination idea.

I tell ya', sometimes I think that a couple of courses in programming would have been of more use than the biomedicinal chemistry program for my current employment. You can be the happy amateur, watching iTunesU, google, reading wiki and books as much as you have time for and come up with some brilliant idea for making things better. Though when it's actually crunch-time, you need to understand what you are doing! It has become painfully clear that this is where I'm falling short since this "easy" script has offered so much resistance!

I'll work this into the script and give it a go! The good news is that it seems to be working (except for this last problem) with the BSD version as well as awk and gawk.

I can't thank you enough for your input! Hopefully I can repay it to someone, someday.

Best regards
// Gustaf
 
Now the "x=" message gets printed, though the script stopped after 21 input files:

/usr/bin/awk: summary_hb_bot_o2_pab_h5.out makes too many open files
input record number 1, file hb_bot_o3_pab_h1.out
source line number 33

So, even I could figure out that I was not closing all the files I needed to close ;)

By adding a "close(out1)" at the end of the file in addition to the existing "close(out2)", it ran through all of the input files without complaining. So now I can only see two more things to do: find the "correct" placement for the close(out1) and check that the math is still working as it is supposed to!

So, the last two if blocks:

Code:
}
  close(out1)
  if (tott==0) {
    avocc=tothb/inp_num
    printf "   Summary data for hbond analysis\n\n" > out2
    printf "   Sum of Occupancy:      %10.3f\n",tothb > out2
    printf "   Average Occupancy:     %10.3f\n\n",avocc > out2
    if (denom<1) { print "   Single HBOND event, no SD or SEM calculation possible!" > out2 }
 }

needed a "close(out1)" as well; now it treats every input file. The out1 files look OK. Unfortunately, the calculations reported to out2 now get distorted really badly. But I'll have to do some more digging until I can put my finger on exactly what is going wrong there...

Thanks again!
 
I found the effect: the sd_lt and semlt and/or denom values get screwed up somehow.

Code:
    if (denom>1) {
      sd_lt=sqrt(tottsq/(denom-1)); semlt=(tottsq/(denom-1))/(sqrt(denom))
      printf "   SD lifetime:           %10.3f\n",sd_lt > out2
      printf "   SEM lifetime:          %10.3f\n",semlt > out2
    } else print "   Single HBOND event, no SD or SEM calculation possible!" > out2


Now I need to fix the "cause". I guess I have to "reset" these two values between different input files, because values are kept/carried over between different input files. Possibly (to be safe) all of the defined values (xxx=yyy(*/=)zzz) regarding the "math" part?

Is there a way to "reset" these values or do I have to change how these are specified all together?

Best regards

 
Hi

Gustaf said:
Is there a way to "reset" these values or do I have to change how these are specified all together?
sd_lt and semlt are recalculated anyway, so only tottsq needs an explicit reset. Just add it to the line:
Code:
FNR==1 && NR!=1 { endfile(); tott=tothb=tottsq=denom=0 }


Feherke.
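
The effect of a forgotten reset can be sketched with a reduced example. All of it is assumption for illustration: the input files hold one made-up value per line, the values are kept in an array instead of being re-read from out1 with getline, and every file is assumed to contain at least two values. The sketch also uses the textbook standard error, sd/sqrt(n), which is not the same expression as the semlt line in the thread's script:

```sh
#!/bin/sh
# Keep the decimal point independent of the user's locale.
export LC_ALL=C
tmp=$(mktemp -d) || exit 1
printf '2\n4\n6\n' > "$tmp/one.txt"
printf '10\n10\n' > "$tmp/two.txt"

prog='
# sq is deliberately global, like tottsq in the thread: unless it is
# reset between files, the squared deviations of the previous file
# leak into the next one. i, mean and sd are function locals.
function report(   i, mean, sd) {
  mean = sum / n
  for (i = 1; i <= n; i++) sq += (val[i] - mean) ^ 2
  sd = sqrt(sq / (n - 1))                 # assumes n >= 2
  printf "n=%d mean=%.3f sd=%.3f sem=%.3f\n", n, mean, sd, sd / sqrt(n)
}
FNR == 1 && NR != 1 { report(); sum = sq = n = 0 }  # reset ALL accumulators
{ sum += $1; val[++n] = $1 }   # val[] is overwritten from index 1 per file
END { report() }
'

result=$(awk "$prog" "$tmp/one.txt" "$tmp/two.txt")
echo "$result"
rm -rf "$tmp"
```

If sq were left out of the reset, the second file (two identical values) would no longer report a standard deviation of zero, which is the kind of distortion described above.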
 