
GNU Awk memory allocation


Romeu (MIS) · Jun 25, 2002 · PT
I'm having problems with a Gawk script that looks something like this:

-----------------------------------------
while ( (getline < input_file) > 0 ) {

    # skip the two header lines
    n_nr++
    if ( n_nr < 3 ) continue

    # build the key from fields 2 and 3, accumulate field 1
    summary_detail_array = $2 "|" $3

    summary_detail[summary_detail_array] += $1

}

for ( item in summary_detail ) print summary_detail[item], item > summary_output_file
-----------------------------------------
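(For reference, the same aggregation can be written with awk's normal record loop instead of an explicit getline loop. A minimal equivalent sketch, with input_file and summary_output_file standing in for the actual file names; the memory behaviour is the same either way, since the array contents are identical:

awk 'NR > 2 {
    summary_detail[$2 "|" $3] += $1
}
END {
    for (item in summary_detail)
        print summary_detail[item], item
}' input_file > summary_output_file
)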


The input file has over 20 million lines.
When I run the script, it aborts with "fatal: newnode: nextfree: can't allocate memory (Not enough space)". Although the array has only about 5000 elements at that point (after roughly 10 million input lines read), Gawk is using more than 2 GB of RAM at the time of the abort!
I've tried running the same script in Awk and the same thing happens (Awk, however, seems to use less memory).
My only explanation is that Gawk always allocates a new element on every array assignment, even when it is just updating an existing element; the old element is discarded, but its memory is never deallocated.
Does this make sense? If so, is there a workaround?

Thank you very much,
Romeu
 
The problem is the way awk works: it splits and remembers the whole file as it runs.
Try a cheaper tool like sed plus temp files (sed cannot compute, but it is very fast on strings).
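One way to make the temp-file idea concrete (a sketch, assuming the field layout from the original script: the amount in $1 and the key fields in $2 and $3, with input_file and summary_output_file as placeholders) is to sort the file by key first and then aggregate as a stream, so awk only ever holds one key in memory; sort itself spills to temp files on disk for input this large:

tail -n +3 input_file |   # drop the two header lines
sort -k2,2 -k3,3 |        # bring identical keys together
awk '{
    key = $2 "|" $3
    if (NR > 1 && key != prev) {   # key changed: emit the finished group
        print sum, prev
        sum = 0
    }
    sum += $1
    prev = key
}
END { if (NR) print sum, prev }' > summary_output_file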
 
I don't know of a lot of tools that are going to handle input files of that size for the kind of processing gawk does.
Remember that the awk array is a hash table, and deleting an awk element doesn't carry quite the same connotation as freeing memory from a conventional C array.

You could try setting a maximum array size (maxsz below), loading until you reach that threshold, then dumping the contents to a temp file, deleting the array, and continuing from that point.

Say:
maxsz = 150
while ((getline < fname) > 0) {
    array[$2 "|" $3] += $1
    p++
    if (p >= maxsz) {
        dumparray(array, tmpfile)
        p = destroy(array)
    }
}
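dumparray() and destroy() aren't defined above; a minimal sketch of what they might look like (the names, the "sum key" output format, and the append-to-tmpfile behaviour are assumptions, not part of the original post):

# hypothetical helpers for the sketch above
function dumparray(arr, file,    i) {
    # append each partial sum to the temp file
    for (i in arr)
        print arr[i], i >> file
    close(file)
}

function destroy(arr,    i) {
    # empty the array one element at a time
    for (i in arr)
        delete arr[i]
    return 0    # resets the element counter p
}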

You may also want to check with the experts at c.l.a on this and see if there is a workaround.
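One detail about the dump-and-continue approach worth spelling out: the temp file ends up holding several partial sums per key (one per dump), so a second pass is needed to merge them. A sketch, assuming the "sum key" dump format above and keys with no whitespace:

# merge the partial sums from the dump file; only the distinct
# keys (a few thousand here) need to fit in memory this time
awk '{ total[$2] += $1 }
END { for (k in total) print total[k], k }' tmpfile > summary_output_file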
 
Thank you both for your input.
Marsd, regarding your solution, I have a small concern: how would destroy(array) work? I've had some problems with the 'delete array' statement in gawk (and with the original awk method of deleting the array's elements one by one). Neither seems to deallocate the memory used by the array. Is there another method for destroying an array?
Finally, and this is probably a dumb question, but...what's c.l.a? :)


Again, thank you,
Romeu

 
c.l.a is comp.lang.awk. The maintainer and several other contributors answer questions there. You could use the Google portal to the newsgroups.


My method is the same one you describe having trouble with:

function freeArray(array,    i) {
    for (i in array) {
        delete array[i]
    }
    return
}

If this doesn't seem to work, you might be able to find a workaround at c.l.a.
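For what it's worth, there are two other ways to empty an array: gawk accepts a whole-array delete (an extension, not in every awk), and the portable idiom is an empty split(). Note that either form typically returns the freed nodes to awk's own internal free list rather than to the operating system (the "newnode: nextfree" in the error message hints at that free list), which may be why the process size never appears to shrink:

delete array        # whole-array delete: a gawk extension
split("", array)    # portable way to empty an array in any awk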
 