Sort program runs but is slow

ajmcello · May 8, 2010

What it does:

It runs through a file, sorts it, and splits it alphabetically into files of at most 200 words each.

In this example I'm using the dict file, and I use it several times, so excuse the redundancy; I didn't want to include the actual source files.

It works, but it takes a very long time.

Can someone help me get it working faster? Thanks much.

Gzip'd file:

http://www.mediafire.com/download.php?zykjlzmktjn

Code:

#!/usr/bin/perl -w

use Text::CSV;
use File::Copy;
use File::stat;

use POSIX qw(strftime);

my $green_dir  = "tmp/list/green";
my $file_green = "tmp/all_green.txt";

my $blue_dir  = "tmp/list/blue";
my $file_blue = "tmp/all_blue.txt";

my $green_blue_dir  = "tmp/list/green_blue";
my $file_green_blue = "tmp/all_green_blue.txt";

my $allelse_dir  = "tmp/list/allelse";
my $file_allelse = "tmp/allelse.txt";

my $all_dir  = "tmp/list/all";
my $file_all = "tmp/all.txt";

my $max = 200;

$cnt  = 0;
$cnt2 = 2;

sub rem_green {
    $buf = "rm -f $green_dir/*";
    system($buf);
}

sub rem_blue {
    $buf = "rm -f $blue_dir/*";
    system($buf);
}

sub rem_green_blue {
    $buf = "rm -f $green_blue_dir/*";
    system($buf);
}

sub rem_allelse {
    $buf = "rm -f $allelse_dir/*";
    system($buf);
}

sub rem_all {
    $buf = "rm -f $all_dir/*";
    system($buf);
}

@files = (
    'A', 'C', 'B', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
    'Y', 'Z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
);

@fl = (
    'A', 'C', 'B', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
    'Y', 'Z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
);

if ( $ARGV[0] eq "-o" ) {
    $file_name = $file_blue;
    $file_dir  = $blue_dir;
    rem_blue();
}

if ( $ARGV[0] eq "-p" ) {
    $file_name = $file_green;
    $file_dir  = $green_dir;
    rem_green();
}

if ( $ARGV[0] eq "-po" ) {
    $file_name = $file_green_blue;
    $file_dir  = $green_blue_dir;
    rem_green_blue();
}

if ( $ARGV[0] eq "-a" ) {
    $file_name = $file_all;
    $file_dir  = $all_dir;
    rem_all();
}

if ( $ARGV[0] eq "-ae" ) {
    $file_name = $file_allelse;
    $file_dir  = $allelse_dir;
    rem_allelse();
}

$files_cnt = 0;
$fl_cnt    = 0;
foreach (@files) {
    $file = $_;
    $file .= "_1.txt";
    chomp($file);
    unlink("$file_dir/$file");
    open( IN, "<", "$file_name" );
    while (<IN>) {
        $word = $_;
        chomp($word);
        foreach (@fl) {
            $fl = $_;
            chomp($fl);
            if ( $word =~ /^[$fl]/ && $file =~ /^[$fl]/ ) {
                if ( $cnt == $max ) {
                    $file =~ s/_.*//;
                    $file .= "_$cnt2.txt";
                    $cnt2++;
                    $cnt = 0;
                    unlink("$file_dir/$file");
                }
                print "$file_dir $file\n";
                open( OUT, ">>", "$file_dir/$file" ) or die $!;
                print OUT "$word\n";
                close(OUT);
                $cnt++;
                $fl_cnt++;
            }
            $fl_cnt = 0;
        }
        $files_cnt++;
    }
    $files_cnt = 0;
    $cnt       = 0;
    $cnt2      = 2;
}
close(IN);

rharsh · May 9, 2010

I would imagine that is pretty slow: it goes through the entire dictionary file once for every element in @files. Given that, plus all the files you're creating (the output file is reopened and closed for every word written), the I/O time is probably what's taking so long.
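To put rough numbers on that, here is a back-of-the-envelope sketch. It only does arithmetic based on the loop structure of the posted script; the helper name estimate_cost and the 100,000-line input size are assumptions made up for illustration, not measurements.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rough operation counts for the original nested-loop script.
#   $lines   : number of lines in the input file (hypothetical figure)
#   $buckets : number of entries in @files / @fl (36 in the posted script)
sub estimate_cost {
    my ($lines, $buckets) = @_;
    return {
        lines_read  => $buckets * $lines,             # whole file re-read once per bucket
        inner_loops => $buckets * $buckets * $lines,  # @fl scanned again for every line
        max_opens   => $lines,                        # at most one open/close per word written
    };
}

my $cost = estimate_cost(100_000, 36);    # 100_000 lines is an assumed size
printf "%-12s %d\n", $_, $cost->{$_} for sort keys %$cost;
```

A single-pass version reads each line once, so lines_read drops to the line count and the inner scan over @fl disappears entirely.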

The dictionary file you supplied looks like it's somewhere around 1 MB - is that typical? Even if it's not, is there enough memory on the system to safely read all the data in? If so, it will be faster if you read through the dictionary file once and store all the words in a hash of arrays where the hash keys are the first character from each word (so that they match elements in @files.)

I might do something like this:

Code:

my %seen;
open( IN, "<", $file_name ) or die "Cannot open $file_name: $!";
while (my $line = <IN>) {
	next if $line =~ /^\s*$/;	# Skip Blank Lines
	chomp $line;
	my $key = unpack 'a1', $line;
	push @{$seen{uc $key}}, $line;
}
close IN;

foreach (@files) {
	unless (defined $seen{$_}->[0]) {
		warn "No words matching pattern '$_'\n";
		next;
	}
	
	my $file_num = 0;
	my $count = 0;
	
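	foreach my $word (sort {uc($a) cmp uc($b)} @{$seen{$_}}) {
		if ($count % $max == 0) {
			# Start a new output file every $max words
			$file_num++;
			open OUT, ">", "$file_dir/${_}_${file_num}.txt"
				or die "Cannot open $file_dir/${_}_${file_num}.txt: $!";
		}
		print OUT $word, "\n";
		$count++;
	}
	close OUT;
}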

You might need to tweak that a bit depending on what you want to do with capitalization -- in other words, are 'A' and 'a' the same? The code above treats them that way.
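If it helps, here is a minimal sketch of the two choices side by side. The sub name bucket_words, the $fold_case flag, and the sample word list are all made up for illustration; the only real difference between the modes is whether the key is passed through uc and how each bucket is sorted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group words by first character. With $fold_case true, 'apple' and 'Apple'
# share bucket 'A' (as in the code above); with it false they land in
# separate buckets 'a' and 'A'.
sub bucket_words {
    my ($fold_case, @words) = @_;
    my %bucket;
    for my $word (@words) {
        next if $word =~ /^\s*$/;    # skip blank entries
        my $key = substr $word, 0, 1;
        $key = uc $key if $fold_case;
        push @{ $bucket{$key} }, $word;
    }
    # Sort each bucket to match the chosen notion of equality.
    for my $list (values %bucket) {
        if ($fold_case) {
            @$list = sort { uc($a) cmp uc($b) or $a cmp $b } @$list;
        }
        else {
            @$list = sort @$list;
        }
    }
    return \%bucket;
}

my @sample = qw(apple Apple avocado Banana banana 42nd);    # made-up sample data
for my $fold (1, 0) {
    my $buckets = bucket_words($fold, @sample);
    print $fold ? "case-insensitive:\n" : "case-sensitive:\n";
    print "  $_: @{ $buckets->{$_} }\n" for sort keys %$buckets;
}
```

Note that the case-sensitive mode produces lowercase keys, so @files would need 'a'..'z' added for those buckets to be written out.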

ajmcello · May 9, 2010

Thanks rharsh. I appreciate the response. Let me work with that some.

I wasn't too worried about memory or CPU. The script is running on a quad-core 3.2 GHz machine with 8 GB of RAM; it's just painfully slow.

The biggest file I have to work with is about 100k, so much smaller than the example I provided.

rharsh · May 10, 2010

All the repeated code was driving me nuts when I was looking at it before, so here, this might be of interest.

Code:

my ($file_name, $file_dir, %seen);
my $max = 200;
my @files = ('A'..'Z', 0..9);
# Same option-to-path mapping as the original script
my %options = (
    '-o'  => { 'path' => 'tmp/list/blue',       'filename' => 'tmp/all_blue.txt'       },
    '-p'  => { 'path' => 'tmp/list/green',      'filename' => 'tmp/all_green.txt'      },
    '-po' => { 'path' => 'tmp/list/green_blue', 'filename' => 'tmp/all_green_blue.txt' },
    '-ae' => { 'path' => 'tmp/list/allelse',    'filename' => 'tmp/allelse.txt'        },
    '-a'  => { 'path' => 'tmp/list/all',        'filename' => 'tmp/all.txt'            },
);

unless (defined $ARGV[0] && defined $options{$ARGV[0]}->{path}) {
	die "Invalid or missing command line option.\n";
}
$file_name = $options{$ARGV[0]}->{filename};
$file_dir = $options{$ARGV[0]}->{path};
&rem_files($file_dir);

open( IN, "<", $file_name ) or die "Cannot open $file_name: $!";
while (my $line = <IN>) {
    next if $line =~ /^\s*$/;    # Skip Blank Lines
    chomp $line;
    my $key = unpack 'A1', $line;
    push @{$seen{uc $key}}, unpack('A*', $line);
}
close IN;

foreach (@files) {
    unless (defined $seen{$_}->[0]) {
        warn "No words matching pattern '$_'\n";
        next;
    }
    
    my $file_num = 0;
    my $count = 0;
    
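    foreach my $word (sort {uc($a) cmp uc($b)} @{$seen{$_}}) {
        if ($count % $max == 0) {
            # Start a new output file every $max words
            $file_num++;
            open OUT, ">", "$file_dir/${_}_${file_num}.txt"
                or die "Cannot open $file_dir/${_}_${file_num}.txt: $!";
        }
        print OUT $word, "\n";
        $count++;
    }
    close OUT;
}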

sub rem_files {
	my $dir = shift || die "No path specified.";
	die "Bad path: $dir" unless -d $dir;
	system "rm -f ${dir}/*";
}
