
manipulating huge files 2


eve25

Programmer
Feb 9, 2004
Hi guys!

I have to make an application for which the running time is really important. To summarize, I have 2 text files (one really huge, 80 MB, and the other 1.6 MB). Those 2 files contain information for more than 1500 locations, and I have to produce 1 file per location combining both files' data.
I have been using Perl for a little while now, and I usually copy the file into an array to manipulate it (though I am just getting out of my engineering school so I still have a lot to learn...). So that's what I am doing for the 2 input files (opening the file, reading the data into an array with one line per cell, closing the file).
Then I am creating an array representing one output file, and once the processing is complete (I need to make some changes to the data too once I get it) I open a file, print the array, and close the file... that makes a lot of I/O operations... Maybe I should mention that the access is not sequential for now (the way I designed the algorithm), and I am doing this kind of thing to reach the data I want in the huge array:
.....

$lineNum = $specStartL + $timeStepNum*($nLocs*($nFreq+2)+1) + ($pointN-1)*($nFreq+2); # to access the right line directly
$swanSpec[$lineNum] =~ /(\S+)/;

....
# each line has many numbers I need to manipulate, so I then create a temporary array with one number per cell for each line...

my @temp = split /\s+/, $swanSpec[$lineNum];
shift(@temp);   # drop the first field of the line
print "temp" if ($test==1);
#&printArray(\@temp) if ($test==1);
return \@temp;
.......

Though if you think it would be faster with sequential access, I can maybe do that, but then I would have a version of each output array in memory at the same time, which would use a big amount of memory too...
Also, those files are going to be twice as big in a few weeks... or maybe I could produce more input files of this type, each obviously smaller; would that help?
Would you have any idea on how to make it run faster?
Let me know if you need some more information about the code!
As I said, I am just getting out of school and I have been learning how to code in Perl on my own during my various internships... so I am not sure what is really good Perl code and what is not... would you have any website to recommend on programming for efficiency?
Finally, I hope I have been clear enough, and please forgive my language mistakes, I am French...
Thanks a lot!
Have a wonderful day!
 
Hey guys,
me again!..
I had made a stupid mistake: I had put in some 'sleep' commands to see what was going on while debugging, and I had forgotten to comment them out, so that was what was taking the time! Oops...
Still, I'd be glad to know more about efficiency, and I would appreciate your insight if you think I am not using the appropriate method.

Thanks and sorry about that!
Eve
 
Those sleep() commands do slow it down a bit ;-)

Mike

You cannot really appreciate Dilbert unless you've read it in the
original Klingon.

Want great answers to your Tek-Tips questions? Have a look at faq219-2884

 
Eve

On the subject of performance when using large files, normally it's all about I/O reduction. I/O is orders of magnitude slower than memory access - a 3GHz processor can process 21 million instructions in the time it takes a 7ms HDD to get a record[sup]1[/sup]. So reading it all into an array is a good idea, if it fits without causing paging (see I/O, q.v.).

I'm guessing that the input files grow over time, and the output files summarise the information in the input files by location.

Assuming you have about 1500 locations, one possibility is to create a hash to hold the locations. Each location in the hash holds an array reference, which in turn points to the array of data you want to collect for each location.

The summary data for the locations will easily fit in memory, even if the number of locations grows a bit over time.

Read the input files sequentially with a while (<INPUT>) statement, and use the information in each record to update the summary information in the hash of arrays. Although this may seem like a lot of I/O, in reality the operating system will read a block of data from the disk into a buffer, and subsequent 'I/Os' will actually be fetched from the memory buffer until another 'real' read is required. Then it won't matter even if you have 80GB of input data, as you are only processing one row at a time.
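
Something along these lines (I'm guessing at the file layout, so treat the field positions as placeholders):
Code:
my %locations;    # location code => reference to an array of rows for that location

open(INPUT, "<bigfile.txt") or die "can't open bigfile.txt: $!";
while (<INPUT>) {
    chomp;
    my @fields = split ' ';                  # split on whitespace; the ' ' form skips leading blanks
    my $loc    = shift @fields;              # assuming the first column is the location code
    push @{ $locations{$loc} }, [ @fields ]; # collect the remaining columns per location
}
close(INPUT);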

When you've finished, just add a loop
Code:
foreach my $location (sort keys %locations) {
   open(LOCATION, ">$location.loc") or die "etc.";
   # your code to print the contents of $locations{$location} goes here
   close(LOCATION);
}
And obviously, make sure there are no sleep statements in the code. Note: it may take some time to live that one down...

[sup]1[/sup]Before any pedants jump in with 'what about interrupts, memory latency, pipeline flushes, etc., etc.' it was just an illustrative example. OK? [smile]
 
Hi all!

Thanks guys for your replies... I had been wanting to try stevexff's implementation once my program was in place and I could spend time on it, but another issue came up which prevents me from using my first choice (the files I am processing come from a model, and when there is no data, instead of printing an empty matrix with the usual headers it just writes NODATA, which messes up my dynamic search...). As a consequence I was thinking of using hashes to represent my data, even hashes of hashes... Before implementing anything I wanted to check what the most efficient way to do it was, and the searching I did to figure that out made me more confused...

In one place I read that hashes are more efficient; in another that:
`Hashes are ....Almost as fast and compact as regular arrays (15 percent more time, 40 percent more space).`

somewhere else (O'Reilly books...)
I should maybe use PDL
or... `If the matrix is large and sparse (only a few elements have nonzero values), a hash of hashes is likely a more space-efficient representation.` (which is apparently my case, so I would be tempted to use that one... but do they mean not inserting the 0 values from the file? See my sketch further down.)
then `If the columns are sparse but the rows are well represented, you could choose an array of hashes structure.`
and finally `The choice of data structure depends on the size of the matrices, performance, and coding convenience.`...

What do they mean by 'large' and 'rows well represented'? And basically the last sentence tells you that you have to figure it out yourself... So I was wondering if one of you guys had a resource to recommend that would help me figure out what I should use. And that is not my only issue: for printing purposes, is it better to have one big array and print it in one shot, to have smaller arrays and print them one at a time, to use hashes...?
I have a book about efficient programming in Perl, but it doesn't deal with this type of issue...
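
If I understand the hash-of-hashes suggestion correctly, it means storing only the nonzero entries and letting a missing key stand for zero, something like this (indices and values invented just for the example):
Code:
my %spectra;    # row index => { column index => nonzero value }

# store only the nonzero values while reading the file
$spectra{3}{17} = 0.42;
$spectra{9}{2}  = 1.05;

# reading a value back: a missing entry is treated as 0
my $value = (exists $spectra{5} && exists $spectra{5}{5}) ? $spectra{5}{5} : 0;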

Thanks a lot!
Have a great day!
Eve
 
Hello again Eve,

There's only one way to be sure and that's to try it both (all?) ways with a repeatable and comparable test; I would suggest you time your tests with Time::HiRes.

Unfortunately, for each statement like "Hashes are more efficient than an array for finding a single element" there's usually a bunch of exceptions, and the *only* way of being sure is to try it. (An exception to that one, by the way, is when it's a "small" array which is just as quick to read through each time you want a value out of it. In this context "small" is a variable amount.)
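
For example, a bare-bones harness with Time::HiRes might look like this (the two subs are just placeholders for whichever implementations you want to compare):
Code:
use Time::HiRes qw(gettimeofday tv_interval);

sub lookup_with_hash  { }    # placeholder: your hash-based version goes here
sub lookup_with_array { }    # placeholder: your array-based version goes here

my $start = [gettimeofday];
lookup_with_hash() for 1 .. 100_000;     # repeat enough times to get a measurable interval
print "hash version:  ", tv_interval($start), " s\n";

$start = [gettimeofday];
lookup_with_array() for 1 .. 100_000;
print "array version: ", tv_interval($start), " s\n";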

Mike

You cannot really appreciate Dilbert unless you've read it in the
original Klingon.

Want great answers to your Tek-Tips questions? Have a look at faq219-2884

 
Hi Mike,

All right, I was afraid to hear (read) that, thanks for your answer though..

I'll try that (hopefully without sleep statement :))and if I have explicit results, I'll submit a post!

Thanks!
 
Hi Eve,

The general rule for arrays versus hashes is that if you have sequential data then arrays are generally faster / better whereas if you need random access, hashes are generally faster / better.

The fly in the ointment is usually that you need your data accessible in different ways at different points in your code.

This can sometimes be achieved by using a hash as a lookup for an array (or vice versa). It means you need twice as many keys but you can still keep just one copy of the data itself.
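
As a rough sketch of the idea (the field names here are invented, not taken from your data):
Code:
# the data lives once, in an array, in whatever order it arrived
my @records = (
    { code => 'LOC001', windSpeed => 12.3 },
    { code => 'LOC002', windSpeed =>  8.7 },
);

# a hash maps each location code to that record's index in the array
my %index_of;
$index_of{ $records[$_]{code} } = $_ for 0 .. $#records;

# sequential access uses the array; random access goes through the hash
my $rec = $records[ $index_of{'LOC002'} ];
print "$rec->{code}: $rec->{windSpeed}\n";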

There are numerous ways to attack the problem but it depends really on what exactly you are trying to achieve.


Trojan.


 
Hi guys,

Thanks a lot for all your answers....
So I finally got the following complex data structure:

%locations:
    $locCode => \%loc        (one entry per location)

%loc:
    coord     => 2-cell array
    windSpeed => 15-cell array
    windDir   => 15-cell array
    spectra   => reference to a 2D array (24 x 25)


(Before that structure I was using separate arrays for each part, with the index corresponding to a code referring to each location...)
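
Roughly, one location's entry is built like this (the numbers here are invented just to show the shape):
Code:
my %locations;

my $locCode = 'LOC001';
my @spectra = map { [ (0) x 25 ] } 1 .. 24;    # 24 x 25 matrix, zeroes as dummies

$locations{$locCode} = {
    coord     => [ 2.5, 48.9 ],                # [x, y]
    windSpeed => [ (0) x 15 ],                 # 15 values
    windDir   => [ (0) x 15 ],                 # 15 values
    spectra   => \@spectra,                    # reference to the 2D array
};

# reading a single spectra value back out of the nested structure
my $val = $locations{$locCode}{spectra}[3][7];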

While doing that, it uses 50%-55% of my free memory, so I wanted to free the memory after printing each location's part of the hash to its 'location file'...
I tried undef, then it appeared delete works better, so I tried:

delete $locations{$loc};
delete @{$locations{$loc}}{"coord","windSpeed","windDir","spectra"};
delete @{$locations{$loc}{spectra}}{$count};

...and not only did it take way longer to run, but it didn't free any memory...
Would you have any idea of how to do it, or would it take too much time anyway compared to the time we could gain by having more memory available?...

I also created different subroutines to test the best way of printing, and it seems to be:
dereferencing my 2D spectra array into another array (rather than grabbing it from the hash element by element), assigning the result to an array, and then printing the whole array in one shot (rather than printing directly from the hash element...). Does that seem reasonable to you?
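
In code, that printing approach looks roughly like this (using the %locations structure from the sketch above; the file name is just an example):
Code:
# copy the 2D spectra structure out of the hash once...
my @spectra = @{ $locations{$locCode}{spectra} };

# ...build all the output lines in one array...
my @lines = map { join(" ", @$_) . "\n" } @spectra;

# ...and print them in one shot
open(LOCATION, ">$locCode.loc") or die "can't write $locCode.loc: $!";
print LOCATION @lines;
close(LOCATION);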

Thanks a lot!
I really feel lucky to have you guys around! If I could, I would invite you all for a drink!... :)

have a great week!
Eve
 
Hi Eve,

I would just like to point out an issue that may be relevant.

If you are having memory problems and performance issues, you should try to avoid constructs like this: "foreach my $location (sort keys %locations) {"
The problem is that you are creating multiple copies of data when you process things that way. First, you are creating a temporary list of the keys of %locations, so the keys are duplicated; secondly, you should ask yourself whether you actually need the data sorted, since sorting can cost you in processing time.

Instead, you can use the "each" function, which will give you a key/value pair directly and sequentially from the hash without duplicating more than one pair at a time (whereas with "sort keys" the whole keys list stays in memory for the complete duration of the loop).

my ($key, $value);
while (($key, $value) = each %locations) {
    # process one location ($key) and its data ($value) here
}

Hope this helps a little



Trojan.

 