How to process files in parallel in a Perl script?

mackpei · Nov 5, 2006

Hi there,

I have a Perl script processing a lot of files in a directory.

Processing the files in sequence would take a long time. So one idea is to create several child processes, each of which processes a part of the files.

My question is: is there an easy way to group the files (their sizes differ widely, and their format is binary) so that each child process gets a similar amount of data to process? That should keep the system fully loaded and maximize data throughput.

Thanks!
Mack

Kirsle · Nov 6, 2006

Code:

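use strict;
use warnings;
use threads;
use threads::shared;

# Read the file names, skipping dotfiles such as "." and "..".
opendir (my $dh, "./directory") or die "Cannot open ./directory: $!";
my @files = sort(grep(!/^\./, readdir($dh)));
closedir ($dh);

# Split the list into sections of at most 100 files each.
my $sections = [];
my $i = 0;
my $c = 0;
for (my $j = 0; $j < scalar(@files); $j++) {
   if ($c >= 100) {
      $i++;
      $c = 0;
   }

   push (@{$sections->[$i]}, $files[$j]);
   $c++;
}

# Start one thread per section; $job is that section's array of file names.
my @workers;
foreach my $job (@{$sections}) {
   push (@workers, threads->create (sub {
      #...
   }));
}

# Wait for every thread to finish before the script exits.
$_->join foreach @workers;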

The basic idea is: divide the array of files into smaller arrays, then create an individual thread to process each array.
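The split above is by file count, while the original question asks for groups holding a similar amount of data. One possible variation, as a rough sketch only: sort the files by size, largest first, and always hand the next file to the group with the smallest running total. The helper name balance_by_size and the sample sizes are made up for illustration; in a real script the sizes would come from something like -s on each file.

```perl
use strict;
use warnings;

# Hypothetical helper: split files into $n groups of roughly equal total size.
# $size_of is a hash ref mapping file name => size in bytes.
sub balance_by_size {
    my ($n, $size_of) = @_;
    my @groups = map { +{ total => 0, files => [] } } 1 .. $n;

    # Largest files first, each one going to the currently lightest group.
    my @by_size = sort { $size_of->{$b} <=> $size_of->{$a} || $a cmp $b }
                  keys %$size_of;
    for my $file (@by_size) {
        my ($lightest) = sort { $a->{total} <=> $b->{total} } @groups;
        push @{ $lightest->{files} }, $file;
        $lightest->{total} += $size_of->{$file};
    }
    return @groups;
}

# Made-up sizes for illustration; a real script might build this hash with
#   map { $_ => -s "./directory/$_" } @files
my %size_of = (a => 900, b => 500, c => 400, d => 300, e => 100);
my @groups  = balance_by_size(2, \%size_of);

printf "%d bytes: %s\n", $_->{total}, join(', ', @{ $_->{files} })
    for @groups;
```

Each group's files array could then be passed to a thread in place of the fixed 100-file sections. This greedy approach does not guarantee a perfectly even split, but it keeps one huge file from landing in the same group as many other large ones.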

-------------
Kirsle.net | Kirsle's Programs and Projects

nbowden · Nov 6, 2006

Kirsle,

Does each instance of a thread start a new copy of the Perl interpreter?

This looks like a very cool solution. I've looked at using POE for similar things, which is good because it uses a time-sliced single process and so only runs one instance of the Perl interpreter.

I've never used threads. I have tried using fork a few times, but it eats memory as many instances of perl start up.

Which would be the best approach (in terms of memory utilization and speed)?

Nigel Bowden

http://www.bowden-software.com

Kirsle · Nov 6, 2006

If you can get all the threads to run in the same Perl process, that'll be the best way to do it. The threads module gives each thread its own copy of the Perl interpreter inside that process, cloning all the modules and variables that were loaded at the time the thread was created. Threading doesn't take up a lot of memory if you use it wisely, i.e. don't load 50 modules at the start of your script and then spawn off 10 threads if only one thread is going to be using 45 of those modules.
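As a rough sketch of that advice: a module that only one thread needs can be pulled in with require inside that thread's code instead of with use at the top of the script, so the other threads never get a copy of it. The worker names here are hypothetical, and Digest::MD5 just stands in for whatever heavy module a single thread might need.

```perl
use strict;
use warnings;
use threads;

# Only this worker needs Digest::MD5, so it is loaded at run time inside
# the thread rather than at the top of the script.
sub checksum_worker {
    my ($data) = @_;
    require Digest::MD5;
    return Digest::MD5::md5_hex($data);
}

# This worker needs no extra modules at all.
sub length_worker {
    my ($data) = @_;
    return length $data;
}

# Creating the threads in scalar context makes join return a scalar.
my $t1 = threads->create(\&checksum_worker, "hello");
my $t2 = threads->create(\&length_worker,  "hello");

my $digest = $t1->join;
my $length = $t2->join;

print "md5=$digest length=$length\n";
```

Anything loaded with use before the threads are created is cloned into every thread, so the saving only applies to modules loaded this way.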

-------------
Kirsle.net | Kirsle's Programs and Projects
