Sort program runs but is slow

ajmcello

Technical User
Apr 20, 2010
7
US
What it does:

It runs through a file, sorts it, and splits it alphabetically into files of at most 200 words each.

In this example I'm using the dict file, and I use it several times, so excuse the redundancy; I didn't want to include the actual source files.

It works, but it takes a very long time.

Can someone help me get it working faster? Thanks much.

Gzip'd file:

Code:
#!/usr/bin/perl -w

use Text::CSV;
use File::Copy;
use File::stat;

use POSIX qw(strftime);

my $green_dir  = "tmp/list/green";
my $file_green = "tmp/all_green.txt";

my $blue_dir  = "tmp/list/blue";
my $file_blue = "tmp/all_blue.txt";

my $green_blue_dir  = "tmp/list/green_blue";
my $file_green_blue = "tmp/all_green_blue.txt";

my $allelse_dir  = "tmp/list/allelse";
my $file_allelse = "tmp/allelse.txt";

my $all_dir  = "tmp/list/all";
my $file_all = "tmp/all.txt";

my $max = 200;

$cnt  = 0;
$cnt2 = 2;

sub rem_green {
    $buf = "rm -f $green_dir/*";
    system($buf);
}

sub rem_blue {
    $buf = "rm -f $blue_dir/*";
    system($buf);
}

sub rem_green_blue {
    $buf = "rm -f $green_blue_dir/*";
    system($buf);
}

sub rem_allelse {
    $buf = "rm -f $allelse_dir/*";
    system($buf);
}

sub rem_all {
    $buf = "rm -f $all_dir/*";
    system($buf);
}

@files = (
    'A', 'C', 'B', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
    'Y', 'Z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
);

@fl = (
    'A', 'C', 'B', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
    'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
    'Y', 'Z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
);

if ( $ARGV[0] eq "-o" ) {
    $file_name = $file_blue;
    $file_dir  = $blue_dir;
    rem_blue();
}

if ( $ARGV[0] eq "-p" ) {
    $file_name = $file_green;
    $file_dir  = $green_dir;
    rem_green();
}

if ( $ARGV[0] eq "-po" ) {
    $file_name = $file_green_blue;
    $file_dir  = $green_blue_dir;
    rem_green_blue();
}

if ( $ARGV[0] eq "-a" ) {
    $file_name = $file_all;
    $file_dir  = $all_dir;
    rem_all();
}

if ( $ARGV[0] eq "-ae" ) {
    $file_name = $file_allelse;
    $file_dir  = $allelse_dir;
    rem_allelse();
}

$files_cnt = 0;
$fl_cnt    = 0;
foreach (@files) {
    $file = $_;
    $file .= "_1.txt";
    chomp($file);
    unlink("$file_dir/$file");
    open( IN, "<", "$file_name" );
    while (<IN>) {
        $word = $_;
        chomp($word);
        foreach (@fl) {
            $fl = $_;
            chomp($fl);
            if ( $word =~ /^[$fl]/ && $file =~ /^[$fl]/ ) {
                if ( $cnt == $max ) {
                    $file =~ s/_.*//;
                    $file .= "_$cnt2.txt";
                    $cnt2++;
                    $cnt = 0;
                    unlink("$file_dir/$file");
                }
                print "$file_dir $file\n";
                open( OUT, ">>", "$file_dir/$file" ) or die $!;
                print OUT "$word\n";
                close(OUT);
                $cnt++;
                $fl_cnt++;
            }
            $fl_cnt = 0;
        }
        $files_cnt++;
    }
    $files_cnt = 0;
    $cnt       = 0;
    $cnt2      = 2;
}
close(IN);
 
I would imagine that is pretty slow: it goes through the entire dictionary file once for every element in @files, and it opens and closes an output file for every word it writes. Given that, and all the files you're creating, the I/O time is probably what's taking so long.
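To put rough numbers on that, here is a back-of-the-envelope sketch. The sub names and the 10,000-word input size are made up for illustration; the 36 buckets and the limit of 200 come from the script above. It only counts operations, so treat it as an estimate rather than a benchmark.

```perl
use strict;
use warnings;

# Rough operation counts for the nested-loop script in the question.
# $words   = number of lines in the input file (assumed value)
# $buckets = number of entries in @files / @fl (36 in the original)
sub nested_loop_cost {
    my ($words, $buckets) = @_;
    return {
        lines_read  => $words * $buckets,             # whole file re-read per bucket
        inner_iters => $words * $buckets * $buckets,  # inner foreach over @fl
        file_opens  => $words,                        # at most one open/close per word written
    };
}

# Rough counts for a single pass that groups the words in memory first.
sub single_pass_cost {
    my ($words, $buckets, $max) = @_;
    return {
        lines_read => $words,
        # upper bound: every bucket may end with one partly filled file
        file_opens => int($words / $max) + $buckets,
    };
}

my $old = nested_loop_cost(10_000, 36);
my $new = single_pass_cost(10_000, 36, 200);
printf "lines read: %d vs %d\n", $old->{lines_read}, $new->{lines_read};
printf "file opens: %d vs %d\n", $old->{file_opens}, $new->{file_opens};
```

For those assumed inputs this prints 360000 vs 10000 lines read and 10000 vs 86 file opens, which is why reading the file once and keeping each output file open for its whole batch should help.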

The dictionary file you supplied looks like it's somewhere around 1 MB - is that typical? Even if it's not, is there enough memory on the system to safely read all the data in? If so, it will be faster if you read through the dictionary file once and store all the words in a hash of arrays where the hash keys are the first character from each word (so that they match elements in @files.)

I might do something like this:
Code:
my %seen;
open( IN, "<", $file_name ) or die "Cannot open $file_name: $!";
while (my $line = <IN>) {
	next if $line =~ /^\s*$/;	# Skip Blank Lines
	chomp $line;
	my $key = unpack 'a1', $line;
	push @{$seen{uc $key}}, $line;
}
close IN;

foreach (@files) {
	unless (defined $seen{$_}->[0]) {
		warn "No words matching pattern '$_'\n";
		next;
	}
	
	my $file_num = 0;
	my $count = 0;
	
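	foreach my $word (sort {uc($a) cmp uc($b)} @{$seen{$_}}) {
		if ($count % $max == 0) {
			# Start a new output file every $max words; reopening OUT closes the previous one
			$file_num++;
			my $out_name = "$file_dir/${_}_${file_num}.txt";
			open( OUT, ">", $out_name ) or die "Cannot write $out_name: $!";
		}
		print OUT $word, "\n";
		$count++;
	}
	close OUT;
}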
You might need to tweak that a bit depending on what you want to do with capitalization -- in other words, are 'A' and 'a' the same? The code above treats them that way.
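To make the capitalization question concrete, here is a small sketch. The sub name bucket_words, its $fold_case flag, and the sample words are all made up for illustration and are not part of the script above.

```perl
use strict;
use warnings;

# Group words by first character. With $fold_case true, 'apple' and 'Apple'
# share the key 'A'; otherwise they land under 'a' and 'A' separately.
sub bucket_words {
    my ($fold_case, @words) = @_;
    my %buckets;
    for my $word (@words) {
        my $key = substr($word, 0, 1);
        $key = uc $key if $fold_case;
        push @{ $buckets{$key} }, $word;
    }
    return \%buckets;
}

my @sample = qw(apple Apple banana 42nd);
my $folded = bucket_words(1, @sample);
my $exact  = bucket_words(0, @sample);

print join(',', sort keys %$folded), "\n";   # 4,A,B
print join(',', sort keys %$exact),  "\n";   # 4,A,a,b
```

Note that without the case folding, the lowercase keys would never match the uppercase entries in @files, so those words would be skipped by the output loop.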
 
Thanks rharsh. I appreciate the response. Let me work with that some.

I wasn't too worried about memory or CPU; the script is running on a quad-core 3.2 GHz machine with 8 GB of RAM. It's just painfully slow. :)

The biggest file I have to work with is about 100k, so much smaller than the example I provided.
 
All the repeated code was driving me nuts when I was looking at it before, so here, this might be of interest. :)
Code:
use strict;
use warnings;

my ($file_name, $file_dir, %seen);
my $max = 200;
my @files = ('A'..'Z', 0..9);
# Same mapping as the original script: -o selects blue, -p selects green
my %options = (
    '-o'  => { 'path' => 'tmp/list/blue',       'filename' => 'tmp/all_blue.txt'       },
    '-p'  => { 'path' => 'tmp/list/green',      'filename' => 'tmp/all_green.txt'      },
    '-po' => { 'path' => 'tmp/list/green_blue', 'filename' => 'tmp/all_green_blue.txt' },
    '-ae' => { 'path' => 'tmp/list/allelse',    'filename' => 'tmp/allelse.txt'        },
    '-a'  => { 'path' => 'tmp/list/all',        'filename' => 'tmp/all.txt'            },
);
			  
unless (defined $ARGV[0] && defined $options{$ARGV[0]}->{path}) {
	die "Invalid or missing command line option.\n";
}
$file_name = $options{$ARGV[0]}->{filename};
$file_dir = $options{$ARGV[0]}->{path};
&rem_files($file_dir);

open( IN, "<", $file_name ) or die "Cannot open $file_name: $!";
while (my $line = <IN>) {
    next if $line =~ /^\s*$/;    # Skip Blank Lines
    chomp $line;
    my $key = unpack 'A1', $line;
    push @{$seen{uc $key}}, unpack('A*', $line);
}
close IN;

foreach (@files) {
    unless (defined $seen{$_}->[0]) {
        warn "No words matching pattern '$_'\n";
        next;
    }
    
    my $file_num = 0;
    my $count = 0;
    
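    foreach my $word (sort {uc($a) cmp uc($b)} @{$seen{$_}}) {
        if ($count % $max == 0) {
            # Start a new output file every $max words; reopening OUT closes the previous one
            $file_num++;
            my $out_name = "$file_dir/${_}_${file_num}.txt";
            open( OUT, ">", $out_name ) or die "Cannot write $out_name: $!";
        }
        print OUT $word, "\n";
        $count++;
    }
    close OUT;
}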

sub rem_files {
	my $dir = shift || die "No path specified.";
	die "Bad path: $dir" unless -d $dir;
	# Remove old output files directly instead of shelling out to rm
	unlink glob("$dir/*");
}
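If it helps to test the splitting logic on its own, the "new file every $max words" step can be pulled out of the file I/O. This is only a sketch: chunk_words and the sample words are made-up names and data, and the file names are printed rather than written.

```perl
use strict;
use warnings;

# Split a list into consecutive chunks of at most $max items.
sub chunk_words {
    my ($max, @words) = @_;
    die "max must be a positive integer\n" unless $max > 0;
    my @chunks;
    # splice removes up to $max items from the front on each pass
    push @chunks, [ splice @words, 0, $max ] while @words;
    return @chunks;
}

my @sorted = sort { uc($a) cmp uc($b) } qw(avocado Apple apricot almond Acorn);
my @chunks = chunk_words(2, @sorted);

# Chunk N would go to "A_N.txt"; here the planned names are only printed.
my $n = 0;
for my $chunk (@chunks) {
    $n++;
    print "A_$n.txt: @$chunk\n";
}
```

With a limit of 2, the five sample words end up in three chunks, the last one holding a single word, which mirrors how the final file for each letter can be shorter than $max.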
 