
Parse large file and then go back through the files again

Status
Not open for further replies.

MightyJayDog

Technical User
May 29, 2007
11
US
I have a large data file that brigmar helped me split into smaller, more manageable chunks (it went from a single 12.86 GB file down to chunks of 500 MB - 1.6 GB each).

I now want to extend the Perl script so that it goes back through those smaller chunks, pulls out any individual invoice larger than 250 MB, and prints each one to its own file as well.

How do I go about doing that?

Here is what I am currently working with...

Code:
#!/usr/bin/perl -w
use strict;

my $chunksize = 500000000; # 500MB
my $filenumber = 0;
my $infile = "infile.dat";
my $outsize = 0;
my $eof = 0;

open INFILE, $infile or die "Can't open $infile: $!";
open OUTFILE, ">outfile_".$filenumber.".dat" or die "Can't open outfile_".$filenumber.".dat: $!";

while(<INFILE>)
{
        chomp;

        # Only start a new chunk once over the size limit AND the current line
        # starts a new invoice ("11" in columns 68-69)
        if( $outsize>$chunksize and /^.{67}11/ )
        {
                close OUTFILE;
                $outsize = 0;
                $filenumber++;
                open (OUTFILE, ">outfile_".$filenumber.".dat") or die "Can't open outfile_".$filenumber.".dat";
        }

        print OUTFILE "$_\n";
        $outsize += length;

}
close INFILE;
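
And here is a very rough, untested sketch of the second pass I have in mind: loop back over the chunk files and copy any oversized invoice out to its own file. The outfile_*.dat glob, the big_invoice_ file names, and the 250 MB threshold are just placeholders for whatever I end up using, and it holds a whole invoice in memory at once, which may be a problem for the really big ones.

Code:
#!/usr/bin/perl -w
use strict;

my $big_limit = 250_000_000;   # 250 MB per-invoice threshold (placeholder)
my $bignumber = 0;             # counter for the big-invoice output files

# Second pass: walk every chunk produced by the splitter above
foreach my $chunk ( glob 'outfile_*.dat' )
{
        open my $in, '<', $chunk or die "Can't open '$chunk': $!";

        my $invoice = '';
        while ( my $line = <$in> )
        {
                # "11" in columns 68-69 marks the start of a new invoice,
                # the same test the splitter uses
                if ( length($line) >= 69 and substr( $line, 67, 2 ) eq '11' )
                {
                        write_big_invoice( $invoice );
                        $invoice = '';
                }
                $invoice .= $line;
        }
        write_big_invoice( $invoice );   # last invoice in the chunk

        close $in or warn "Can't close '$chunk': $!";
}

# Copy an invoice out to its own file, but only if it is over the size limit
sub write_big_invoice
{
        my ($invoice) = @_;
        return unless length($invoice) > $big_limit;

        my $name = 'big_invoice_' . $bignumber++ . '.dat';
        open my $out, '>', $name or die "Can't open '$name': $!";
        print {$out} $invoice;
        close $out or warn "Can't close '$name': $!";
}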
 
What code changes have you tried? That looks like the code from your last question. You should try to figure it out on your own first :)
 
I see this question on several forums now.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
travs69 - thanks for the help :)

Yeah I posted it on other forums hoping that someone might actually help.

I didn't want to muck up the question with all the stuff that I have tried. This is actually different from my original request. My original request was solved and works great, but we never ended up addressing the other aspect, and I have also changed my approach: I need the files split first, and then I need to go back through the smaller chunks to pull out any large invoices.

Here are the variations of code that I am playing with if any of it is helpful...

Code:
use strict;
use constant
        {
                CHUNK_LIMIT     => 500 * 1024,   # 500 KB per chunk
                EXCEPTION_LIMIT =>  25 * 1024,   # 25 KB cut-off for "special" invoices
        };

my $in_filename = 'infile.dat';

# Filename generators
{
        my $filenumber = 0;
        sub next_output_filename
        {
                return sprintf( 'outfile_%d.dat', $filenumber++ );
        }
}

{
        my $filenumber = 0;
        sub next_special_filename
        {
                return sprintf( 'special_%d.dat', $filenumber++ );
        }
}

{
        my $outfile_buffer = '';
        sub write_invoice
        {
                my ( $invoice, $force_flush ) = @_;

                my $invoice_len = length $invoice;
                my $buffer_len  = length $outfile_buffer;

                # If the invoice is special, write it immediately to
                # a Special file, bypassing the $outfile_buffer queueing.
                if ( $invoice_len >= EXCEPTION_LIMIT )
                {
                        my $name = next_special_filename();
                        open( my $fh, '>', $name ) or die "Can't open '$name': $!";
                        print {$fh} $invoice;
                        close $fh;
                        return 1;
                }

                # If the invoice would make the $outfile_buffer too
                # big, flush it.
                my $too_big = $buffer_len + $invoice_len >= CHUNK_LIMIT;
                if ( $too_big or $force_flush )
                {
                        if ( $buffer_len )
                        {
                                my $name = next_output_filename();
                                open( my $fh, '>', $name ) or die "Can't open '$name': $!";
                                print {$fh} $outfile_buffer;
                                close $fh;
                                $outfile_buffer = '';
                        }
                }

                # Store the invoice with the rest waiting to be
                # written to file.
                $outfile_buffer .= $invoice if $invoice;
                return 1;
        }
}
open my $in_fh, '<', $in_filename
  or die "Can't open '$in_filename': $!";

my $invoice = '';
while ( <$in_fh> ) {
    if ( length($_) >= 69 and substr( $_, 67, 2 ) eq '11' ) {
        write_invoice( $invoice );
        $invoice = '';
    }
    $invoice .= $_;
}
close $in_fh
  or warn "Can't close '$in_filename': $!";

write_invoice( $invoice ) if $invoice;
write_invoice( '', 'FORCE' );

and...

Code:
use strict;
use List::Util 'sum';

my $chunksize = 500000; # 500 KB
my $infile = "infile.dat";
my ($cnt, @buffer);

open INFILE, $infile or die "Unable to open $infile for reading: $!";

while (<INFILE>)
{
        chomp;

        if (is_new_invoice($_))
        {
                # New invoice: decide whether the buffered invoices need flushing first.
                # sum() returns undef for an empty list, so seed it with 0
                my $size = sum(0, map { length } @buffer);
                if ($size > $chunksize)
                {
                        flush_buffer();
                        push @buffer, $_;
                }
                else
                {
                        push @buffer, $_;
                }
        }
        else
        {
                # Continuation line: append it to the invoice currently being built
                $buffer[-1] .= $_;
        }
}
flush_buffer();

sub flush_buffer
{
        # variables should probably be passed explicitly
        return if ! @buffer;
        ++$cnt;
        open (OUTFILE, ">outfile_".$cnt.".dat") or die "Unable to open outfile_".$cnt.".dat for writing: $!";
        my $size = 0;
        while (1)
        {
                my $invoice = shift @buffer;
                last if ! defined $invoice;
                my $len = length($invoice);

                # If the invoice by itself is larger than a chunk, write it straight to
                # its own file (separate handle so the current chunk file stays open)
                if ($len > $chunksize)
                {
                        ++$cnt;
                        open (SPECIAL, ">outfile_".$cnt.".dat") or die "Unable to open outfile_".$cnt.".dat for writing: $!";
                        print SPECIAL $invoice;
                        close SPECIAL;
                        next;
                }

                # If this $invoice puts the current file over the limit, close the current file
                if ($size + $len > $chunksize)
                {
                        close OUTFILE;

                        # Put the invoice back in the buffer and return if not at end of input
                        if (! eof INFILE)
                        {
                                unshift @buffer, $invoice;
                        }
                        else
                        {
                                # Write out whatever is left
                                ++$cnt;
                                open (OUTFILE, ">outfile_".$cnt.".dat") or die "Unable to open outfile_".$cnt.".dat for writing: $!";
                                print OUTFILE $invoice;
                                close OUTFILE;
                        }
                        return;
                }
                else
                {
                        # Add this invoice to the current open file and keep track of its size
                        print OUTFILE $invoice;
                        $size += $len;
                }
                last if ! @buffer;
        }
}

sub is_new_invoice
{
    my ($line) = @_;
    return 1 if length($line) >= 69 and substr($line, 67, 2) eq '11';
    return 0;
}
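
Since both variations need to dump a string straight into a named file in a couple of places, I have also been thinking about pulling that step out into a small helper along these lines (untested; the name write_file is just a placeholder):

Code:
# Write a string to the named file in one shot, with error checking
sub write_file
{
        my ( $filename, $data ) = @_;

        open my $fh, '>', $filename
          or die "Can't open '$filename' for writing: $!";
        print {$fh} $data;
        close $fh
          or warn "Can't close '$filename': $!";
        return 1;
}

# e.g. write_file( next_special_filename(), $invoice );
#      write_file( next_output_filename(),  $outfile_buffer );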
 
MJD

What's the end-game here? Rather than solving little problems in your steps towards your goal, maybe if we knew what the goal was, we might be able to suggest a better way to do it altogether...

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
Yeah I posted it on other forums hoping that someone might actually help.

ouch..... [bugeyed]

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 