
Parse large file and then go back through the files again

Status
Not open for further replies.

MightyJayDog

Technical User
May 29, 2007
11
US
I have a large data file that brigmar helped me split into smaller, more manageable chunks (it went from a single 12.86 GB file down to chunks of 500 MB - 1.6 GB each).

I now want to extend the Perl script so that it goes back through those smaller chunks, pulls out any individual invoice larger than 250 MB, and prints each one to its own file as well.

How do I go about doing that?

Here is what I am currently working with...

Code:
#!/usr/bin/perl -w
use strict;

my $chunksize = 500000000; # 500MB
my $filenumber = 0;
my $infile = "infile.dat";
my $outsize = 0;
my $eof = 0;

open INFILE, $infile or die "Can't open $infile: $!";
open OUTFILE, ">outfile_".$filenumber.".dat" or die "Can't open outfile_".$filenumber.".dat: $!";

while(<INFILE>)
{
        chomp;

        # Only start a new chunk once over the size limit AND the current line
        # starts a new invoice ("11" in columns 68-69)
        if( $outsize>$chunksize and /^.{67}11/ )
        {
                close OUTFILE;
                $outsize = 0;
                $filenumber++;
                open (OUTFILE, ">outfile_".$filenumber.".dat") or die "Can't open outfile_".$filenumber.".dat";
        }

        print OUTFILE "$_\n";
        $outsize += length;

}
close INFILE;
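
And here is a very rough, untested sketch of the second pass I have in mind: loop back over the chunk files and copy any oversized invoice out to its own file. The outfile_*.dat glob, the big_invoice_ file names, and the 250 MB threshold are just placeholders for whatever I end up using, and it holds a whole invoice in memory at once, which may be a problem for the really big ones.

Code:
#!/usr/bin/perl -w
use strict;

my $big_limit = 250_000_000;   # 250 MB per-invoice threshold (placeholder)
my $bignumber = 0;             # counter for the big-invoice output files

# Second pass: walk every chunk produced by the splitter above
foreach my $chunk ( glob 'outfile_*.dat' )
{
        open my $in, '<', $chunk or die "Can't open '$chunk': $!";

        my $invoice = '';
        while ( my $line = <$in> )
        {
                # "11" in columns 68-69 marks the start of a new invoice,
                # the same test the splitter uses
                if ( length($line) >= 69 and substr( $line, 67, 2 ) eq '11' )
                {
                        write_big_invoice( $invoice );
                        $invoice = '';
                }
                $invoice .= $line;
        }
        write_big_invoice( $invoice );   # last invoice in the chunk

        close $in or warn "Can't close '$chunk': $!";
}

# Copy an invoice out to its own file, but only if it is over the size limit
sub write_big_invoice
{
        my ($invoice) = @_;
        return unless length($invoice) > $big_limit;

        my $name = 'big_invoice_' . $bignumber++ . '.dat';
        open my $out, '>', $name or die "Can't open '$name': $!";
        print {$out} $invoice;
        close $out or warn "Can't close '$name': $!";
}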
 
What code changes have you tried? That looks like the code from your last question. You should try to figure it out on your own first :)
 
I see this question on several forums now.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
travs69 - thanks for the help :)

Yeah I posted it on other forums hoping that someone might actually help.

I didn't want to muck up the question with all the stuff that I have tried. This is actually different from my original request. My original request was solved and works great, but we never ended up addressing the other aspect, and I have also changed my approach: I need the files split first, and then I need to go back through the smaller chunks to pull out any large invoices.

Here are the variations of code that I am playing with if any of it is helpful...

Code:
use strict;
use constant
        {
                CHUNK_LIMIT     => 500 * 1024,   # 500 KB per chunk
                EXCEPTION_LIMIT =>  25 * 1024,   # 25 KB cut-off for "special" invoices
        };

my $in_filename = 'infile.dat';

# Filename generators
{
        my $filenumber = 0;
        sub next_output_filename
        {
                return sprintf( 'outfile_%d.dat', $filenumber++ );
        }
}

{
        my $filenumber = 0;
        sub next_special_filename
        {
                return sprintf( 'special_%d.dat', $filenumber++ );
        }
}

{
        my $outfile_buffer = '';
        sub write_invoice
        {
                my ( $invoice, $force_flush ) = @_;

                my $invoice_len = length $invoice;
                my $buffer_len  = length $outfile_buffer;

                # If the invoice is special, write it immediately to
                # a Special file, bypassing the $outfile_buffer queueing.
                if ( $invoice_len >= EXCEPTION_LIMIT )
                {
                        my $name = next_special_filename();
                        open( my $fh, '>', $name ) or die "Can't open '$name': $!";
                        print {$fh} $invoice;
                        close $fh;
                        return 1;
                }

                # If the invoice would make the $outfile_buffer too
                # big, flush it.
                my $too_big = $buffer_len + $invoice_len >= CHUNK_LIMIT;
                if ( $too_big or $force_flush )
                {
                        if ( $buffer_len )
                        {
                                my $name = next_output_filename();
                                open( my $fh, '>', $name ) or die "Can't open '$name': $!";
                                print {$fh} $outfile_buffer;
                                close $fh;
                                $outfile_buffer = '';
                        }
                }

                # Store the invoice with the rest waiting to be
                # written to file.
                $outfile_buffer .= $invoice if $invoice;
                return 1;
        }
}
open my $in_fh, '<', $in_filename
  or die "Can't open '$in_filename': $!";

my $invoice = '';
while ( <$in_fh> ) {
    if ( length($_) >= 69 and substr( $_, 67, 2 ) eq '11' ) {
        write_invoice( $invoice );
        $invoice = '';
    }
    $invoice .= $_;
}
close $in_fh
  or warn "Can't close '$in_filename': $!";

write_invoice( $invoice ) if $invoice;
write_invoice( '', 'FORCE' );

and...

Code:
use strict;
use List::Util 'sum';

my $chunksize = 500000; # 500 KB
my $infile = "infile.dat";
my ($cnt, @buffer);

open INFILE, $infile or die "Unable to open $infile for reading: $!";

while (<INFILE>)
{
        chomp;

        if (is_new_invoice($_))
        {
                # New invoice: decide whether the buffered invoices need flushing first.
                # sum() returns undef for an empty list, so seed it with 0
                my $size = sum(0, map { length } @buffer);
                if ($size > $chunksize)
                {
                        flush_buffer();
                        push @buffer, $_;
                }
                else
                {
                        push @buffer, $_;
                }
        }
        else
        {
                # Continuation line: append it to the invoice currently being built
                $buffer[-1] .= $_;
        }
}
flush_buffer();

sub flush_buffer
{
        # variables should probably be passed explicitly
        return if ! @buffer;
        ++$cnt;
        open (OUTFILE, ">outfile_".$cnt.".dat") or die "Unable to open outfile_".$cnt.".dat for writing: $!";
        my $size = 0;
        while (1)
        {
                my $invoice = shift @buffer;
                last if ! defined $invoice;
                my $len = length($invoice);

                # If the invoice by itself is larger than a chunk, write it straight to
                # its own file (separate handle so the current chunk file stays open)
                if ($len > $chunksize)
                {
                        ++$cnt;
                        open (SPECIAL, ">outfile_".$cnt.".dat") or die "Unable to open outfile_".$cnt.".dat for writing: $!";
                        print SPECIAL $invoice;
                        close SPECIAL;
                        next;
                }

                # If this $invoice puts the current file over the limit, close the current file
                if ($size + $len > $chunksize)
                {
                        close OUTFILE;

                        # Put the invoice back in the buffer and return if not at end of input
                        if (! eof INFILE)
                        {
                                unshift @buffer, $invoice;
                        }
                        else
                        {
                                # Write out whatever is left
                                ++$cnt;
                                open (OUTFILE, ">outfile_".$cnt.".dat") or die "Unable to open outfile_".$cnt.".dat for writing: $!";
                                print OUTFILE $invoice;
                                close OUTFILE;
                        }
                        return;
                }
                else
                {
                        # Add this invoice to the current open file and keep track of its size
                        print OUTFILE $invoice;
                        $size += $len;
                }
                last if ! @buffer;
        }
}

sub is_new_invoice
{
    my ($line) = @_;
    return 1 if length($line) >= 69 and substr($line, 67, 2) eq '11';
    return 0;
}
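
Since both variations need to dump a string straight into a named file in a couple of places, I have also been thinking about pulling that step out into a small helper along these lines (untested; the name write_file is just a placeholder):

Code:
# Write a string to the named file in one shot, with error checking
sub write_file
{
        my ( $filename, $data ) = @_;

        open my $fh, '>', $filename
          or die "Can't open '$filename' for writing: $!";
        print {$fh} $data;
        close $fh
          or warn "Can't close '$filename': $!";
        return 1;
}

# e.g. write_file( next_special_filename(), $invoice );
#      write_file( next_output_filename(),  $outfile_buffer );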
 
MJD

What's the end-game here? Rather than solving little problems in your steps towards your goal, maybe if we knew what the goal was, we might be able to suggest a better way to do it altogether...

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
Yeah I posted it on other forums hoping that someone might actually help.

ouch..... [bugeyed]

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 