How to process files in parallel in a Perl script?

Status
Not open for further replies.

mackpei

Programmer
Jun 6, 2000
27
DE
Hi there,

I have a Perl script that processes a lot of files in a directory.

If the files are processed in sequence, it takes a long time. So one idea is to create several child processes and have each of them process part of the files.

My question is: is there an easy way to group the files (their sizes can differ a lot, and their format is binary) so that each child process gets a similar amount of data to process? That should maximize system utilization and data throughput.

Thanks!
Mack
 
Code:
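use strict;
use warnings;
use threads;
use threads::shared;

# Collect the file names, skipping anything that starts with a dot.
opendir (my $dh, "./directory") or die "Cannot open ./directory: $!";
my @files = sort(grep(!/^\./, readdir($dh)));
closedir ($dh);

# Split the list into sections of at most 100 files each.
my $sections = [];
my $i = 0;
my $c = 0;
for (my $j = 0; $j < scalar(@files); $j++) {
   if ($c >= 100) {
      $i++;
      $c = 0;
   }

   push (@{$sections->[$i]}, $files[$j]);
   $c++;
}

# Start one thread per section, then wait for all of them to finish.
my @workers;
foreach my $job (@{$sections}) {
   push (@workers, threads->create (sub {
      #...
   }));
}
$_->join() foreach @workers;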

The basic idea is to divide the array of files into smaller arrays, then create an individual thread to process each array.
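The split above goes by file count only, so sections can still end up very uneven in bytes. A minimal sketch of grouping by size instead, assuming a simple greedy "largest file first" heuristic: sort the files by size, then always hand the next file to the group with the smallest total so far. The helper name group_by_size and the sample sizes are made up for illustration; in a real script the sizes could come from -s on each path.

```perl
use strict;
use warnings;

# Hypothetical helper: split files into $n groups of roughly equal total size.
# Takes the group count followed by a (name => size) list.
sub group_by_size {
    my ($n, %size_of) = @_;
    my @groups = map { +{ total => 0, files => [] } } 1 .. $n;

    # Largest files first; ties are broken by name so the result is repeatable.
    for my $file (sort { $size_of{$b} <=> $size_of{$a} || $a cmp $b } keys %size_of) {
        # Find the group with the smallest total so far (first one wins on a tie).
        my $lightest = $groups[0];
        for my $g (@groups) {
            $lightest = $g if $g->{total} < $lightest->{total};
        }
        push @{ $lightest->{files} }, $file;
        $lightest->{total} += $size_of{$file};
    }
    return @groups;
}

# Made-up sizes for illustration only; for real files something like
#   my %size_of = map { $_ => -s "./directory/$_" } @files;
my %size_of = (a => 900, b => 500, c => 400, d => 300, e => 100);
my @groups  = group_by_size(2, %size_of);

printf "%d bytes: %s\n", $_->{total}, join(" ", @{ $_->{files} }) for @groups;
# prints:
# 1200 bytes: a d
# 1000 bytes: b c e
```

Each group's files array can then be handed to one thread or child process. The heuristic is not optimal, but it is usually close enough when there are many more files than workers.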

-------------
Kirsle.net | Kirsle's Programs and Projects
 
Kirsle,

Does each thread start a new copy of the Perl interpreter?

This looks like a very cool solution. I've looked at using POE for similar things, which is good because it uses a single time-sliced process and so only runs one instance of the Perl interpreter.

I've never used threads. I have tried fork a few times, but it eats memory because many instances of Perl start up.

Which would be the best approach in terms of memory utilization and speed?

Nigel Bowden
 
If you can get all the threads to run in the same Perl process, that'll be the best way to do it. With the threads module, each new thread gets its own copy of the Perl interpreter inside that process, and all the modules and variables loaded at the time the thread is created are copied into it. Threading doesn't take up a lot of memory if you use it wisely, i.e. don't load 50 modules at the start of your script and then spawn off 10 threads if only one thread is going to be using 45 of those modules.
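One way to follow that advice, as a rough sketch: instead of a top-level "use", the one worker that needs a module loads it with require inside its own thread, so the other threads never get a copy. This assumes a Perl built with thread support; the worker names are made up, and Digest::MD5 just stands in for whatever heavy module only one thread needs.

```perl
use strict;
use warnings;
use threads;

# Hypothetical worker: only this thread needs Digest::MD5, so it loads the
# module itself at run time instead of relying on a top-level "use".
sub checksum_worker {
    my @items = @_;
    require Digest::MD5;
    return map { Digest::MD5::md5_hex($_) } @items;
}

# Hypothetical worker that needs no extra modules at all.
sub length_worker {
    my @items = @_;
    return map { length } @items;
}

# Creating the threads in list context makes join() return the full list.
my ($t1) = threads->create(\&checksum_worker, "abc");
my ($t2) = threads->create(\&length_worker, "abc", "hello");

my @sums    = $t1->join();
my @lengths = $t2->join();

print "@sums\n";
print "@lengths\n";
```

The trade-off is that a module loaded this way is compiled when the thread first needs it rather than at startup, so a typo in the module name only shows up at run time.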

-------------
Kirsle.net | Kirsle's Programs and Projects
 