Split large data file based on content

MightyJayDog · May 29, 2007

I am looking for suggestions on how to split a large 13 GB+ file into smaller, more manageable chunks based on content using Perl. I am trying to keep it fairly generic as to exactly what data or content to split on, and am just hoping that someone can point me in the right direction. I want to search through the file for a certain pattern (the invoice end point) and then, after a certain number of invoices or a certain size, find the end of the invoice and split everything up to that point off into its own file. Any ideas? I am pretty new to Perl and am being thrown into it to figure out this problem.

Jason

audiopro · May 29, 2007

What will you be doing with the data afterwards?
If it needs splitting, I assume you want some info from within it. While you are splitting it, you could split all the invoices into separate DB files and make the whole file much more accessible.

Keith

http://www.studiosoft.co.uk

MightyJayDog · May 29, 2007

I take 1-2 GB data files and process invoices (tens of thousands of invoices, if not more). I don't want to break it into one-invoice-per-file chunks; I am hoping to break the 12-13 GB file into 0.5-1 GB files so I can continue processing without hanging the server. I just need to know how to split it at a certain point in the file based on the text.

KevinADC · May 29, 2007

You probably want to use Perl's in-place editor for something like this. It should be faster and more efficient when working with such a big file. What is the pattern in the file you need to search for?

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]

audiopro · May 29, 2007

Tens of thousands of invoices all in one continuous file - is this an archive?

Perl has a split function but I am not sure what the size limit is. You could try it but I think your server would object. You mention an 'invoice end point', is that a string which could not appear anywhere else in the file?
Iterating through and copying it piece by piece to a series of new files would be my best suggestion but someone else may have a better solution.

Keith

http://www.studiosoft.co.uk

MightyJayDog · May 29, 2007

It is ASCII text in a .dat file. It comes from the customer in this format and I just need to break it out into smaller chunks, because 12-13 GB is just too big. The string or pattern that marks where each invoice begins and ends will differ with each client. I first need to go through the file and strip out or replace any funky characters and then split it, or at least that is what I am planning. I am just looking for some help on how to split it at, say, 500 MB, at the end of an invoice around that size.

KevinADC · May 29, 2007

I can't help if you don't answer my question. What is the pattern in the file you need to search for?

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]

brigmar · May 29, 2007

Outline code to split the file into approximately 500 MB chunks, ending on an end-of-invoice pattern or the end of the file:

Code:

my $chunksize = 500 * 1024 * 1024; # 500 MB
my $filenumber = 0;
my $infile = "infile.dat";
my $outsize = 0;
my $eof = 0;

open INFILE, $infile;
while(<INFILE>) {
  if($outsize == 0) {
    $filenumber++;
    open OUTFILE, ">outfile $filenumber.dat";
  }

  ## Do Transforms here...
  # tr/xxx/yyy/;

  ## Identify end of invoice or end of file
  ## end1..end4 are the different end of invoice identifiers

  $eof = eof INFILE or ( $outsize>$chunksize and /^(end1|end2|end3|end4)$/ );

  print OUTFILE;
  $outsize += length;

  if($eof) {
    close OUTFILE;
    $outsize = 0;
  }

}
close INFILE;

brigmar · May 29, 2007

bleurgh... the $eof line should be AFTER the print & outsize increment lines.

stevexff · May 29, 2007

MJD

While I appreciate that you want to chop the file up to make it easier to process, this still means that you've got to reassemble it at some time, an extra step to go wrong.

I'm guessing that you want to extract the pertinent bits of information for each invoice and write it to a (presumably much smaller) file. Given that most of the time spent processing a big file is I/O wait, you might be better off taking a single pass through the big file, writing a summary record at the end of each invoice, and doing all your subsequent processing on the smaller file.
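A rough sketch of that single-pass idea, under stated assumptions: the sub name `summarize`, the `END` marker and the summary fields (invoice number, line count, byte count) are all made up for illustration, since the real fields would depend on each client's layout.

```perl
use strict;
use warnings;

# Hypothetical sketch: one pass over the input, one summary line per invoice.
# $is_end is a caller-supplied test for the client's end-of-invoice line.
sub summarize {
    my ($in, $out, $is_end) = @_;
    my ($invoice, $lines, $bytes) = (1, 0, 0);
    while (my $line = <$in>) {
        $lines++;
        $bytes += length $line;
        if ($is_end->($line)) {
            # Write one tab-separated summary record, then reset the counters.
            print {$out} join("\t", $invoice, $lines, $bytes), "\n";
            $invoice++;
            ($lines, $bytes) = (0, 0);
        }
    }
    return $invoice - 1;   # number of complete invoices seen
}

# Made-up data: two invoices, each ending with a line that starts with "END".
my $data = "item a\nitem b\nEND 1\nitem c\nEND 2\n";
open my $in,  '<', \$data       or die "Cannot read sample: $!";
open my $out, '>', \my $summary or die "Cannot write summary: $!";
my $count = summarize($in, $out, sub { $_[0] =~ /^END/ });
close $out;
print $summary;
```

Passing the end-of-invoice test as a code reference keeps the loop the same for every client; only the test changes.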

This also has the benefit of allowing you to standardise the data extracted from invoice files from different clients which might not have the same parsing rules.

Steve

"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::PerlDesignPatterns)

MightyJayDog · May 30, 2007

The code that brigmar posted is a working script, but it doesn't seem to be splitting the file. The output file comes out the exact same size as the input file.

The end of the invoice identifier that I am using is CIBC.

What did you mean by "Do Transforms here..."? What do I need to put in there?

brigmar · May 30, 2007

Replace the open OUTFILE part:

Code:

  if($outsize == 0) {
    $filenumber++;
    open OUTFILE, ">outfile ".$filenumber.".dat";
  }

Which should create: 'outfile 1.dat', 'outfile 2.dat', etc.

You said you wanted to 'remove funky characters', and that is the reason for the transforms.

This:

Code:

tr/\n/X/;

replaces occurrences of "\n" with "X".

http://perldoc.perl.org/perlop.html#tr/SEARCHLIST/REPLACEMENTLIST/cds-tr-y-transliterate-/c-/d-/s

You could, for example, remove all tabs like this:

Code:

tr/\t//d;
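A minimal sketch of how `tr` behaves with and without the `/d` modifier. With an empty replacement list and no `/d`, `tr` only counts the matching characters and leaves the string unchanged; `/d` is what deletes them. The sub name `clean_line` and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: delete tabs and carriage returns from one line,
# keeping the trailing newline so line-based splitting still works.
sub clean_line {
    my ($line) = @_;
    $line =~ tr/\t\r//d;   # /d deletes every character in the search list
    return $line;
}

my $sample = "a\tb\tc\n";

(my $copy = $sample) =~ tr/\t//;    # no /d: the copy is left unchanged
my $count = ($sample =~ tr/\t//);   # return value is the number of tabs found

print "tabs found: $count\n";
print clean_line($sample);
```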

MightyJayDog · May 30, 2007

brigmar -- It is still just outputting one file. For testing purposes I am using a 200 MB file and trying to break it into 500 KB chunks...

Code:

#!/usr/bin/perl -w

use strict;

my $chunksize = 500 * 1024; # 500Kb
my $filenumber = 0;
my $infile = "infile.dat";
my $outsize = 0;
my $eof = 0;

open INFILE, $infile;
while(<INFILE>)
  {
  if($outsize == 0) {
    $filenumber++;
    open OUTFILE, ">outfile ".$filenumber.".dat";
  }

#----------------------------------------------------------
# Do Transforms here... (get rid of funky characters)
#----------------------------------------------------------

# tr/xxx/yyy/;

  print OUTFILE;
  $outsize += length;

#----------------------------------------------------------
# Identify the end of invoice or end of file
# CIBC is the string that I am looking for at the end
#----------------------------------------------------------

  $eof = eof INFILE or ( $outsize>$chunksize and /^(CIBC)$/ );

  if($eof)
  {
    close OUTFILE;
    $outsize = 0;
  }

}
close INFILE;

brigmar · May 30, 2007

Can you post some of your data, especially the portion around the 'CIBC' invoice terminator? (Scrub any sensitive info though!)

The regex I posted is looking for 'CIBC' alone on a line. The format of your data may not be exactly that, which would cause the regex to fail and not close the file until the end of the input file.
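A small sketch of the difference between the anchored pattern and an unanchored one. The two sample lines and the sub names are made up for illustration; only the `/^(CIBC)$/` pattern comes from the posted script.

```perl
use strict;
use warnings;

# Made-up sample lines: the marker alone on a line, and embedded in a record.
my $alone    = "CIBC\n";
my $embedded = "05/01/07  00.00    CIBC   X X X X\n";

# Anchored pattern: matches only when the whole line is the marker
# ($ also matches just before a trailing newline).
sub is_marker_line  { return $_[0] =~ /^(CIBC)$/ ? 1 : 0 }

# Unanchored pattern: matches the marker anywhere in the line.
sub contains_marker { return $_[0] =~ /CIBC/ ? 1 : 0 }

printf "%-9s anchored=%d unanchored=%d\n",
    'alone', is_marker_line($alone), contains_marker($alone);
printf "%-9s anchored=%d unanchored=%d\n",
    'embedded', is_marker_line($embedded), contains_marker($embedded);
```

If the marker sits inside a longer record, the anchored test never succeeds, so the output file would stay open until the end of the input.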

KevinADC · May 30, 2007

Is CIBC on a line by itself with nothing before or after it?

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]

MightyJayDog · May 30, 2007

No it isn't on a line by itself. Here is an example of the layout of one line of data. The X's represent names, addresses, account numbers, etc.

Code:

05/01/07        05/30/07                XXXXXXX                00.00    CIBC            X X X X - XXXXX                 XXXXXX XX XXXXXXXXXX            XXX XX XX XXXXXXXX XXXXX XXX    XXXXXXXXXXXX                    XX      XXX XXX         XX              XXX                                                    00.00

brigmar · May 30, 2007

Is the CIBC going to be in the same place in that line for each record, and is the data AFTER the CIBC part of the same invoice?

brigmar · May 30, 2007

If the answer to both of the above questions is yes, then:

Code:

  $eof = eof INFILE or ( $outsize>$chunksize and /^.{72}CIBC/ );

or, alternatively, using index() instead of a regex:

Code:

  $eof = eof INFILE or ( $outsize>$chunksize and index($_,"CIBC",72)==72 );
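A minimal sketch comparing the two fixed-position tests on their own, outside the splitting loop. The record is built with `sprintf` so that the marker lands at 0-based offset 72; its contents and the sub names are made up for illustration.

```perl
use strict;
use warnings;

# Made-up record: 72 characters of padded data, then the marker, then more data.
my $record  = sprintf("%-72s%s%s\n", "05/01/07        05/30/07", "CIBC", "   rest of record");
my $shifted = " " . $record;   # same record with the marker pushed to offset 73

# Regex form: exactly 72 characters of anything, then the marker.
sub ends_invoice_regex { return $_[0] =~ /^.{72}CIBC/ ? 1 : 0 }

# index() form: search from offset 72 and require the hit to be exactly there.
sub ends_invoice_index { return index($_[0], "CIBC", 72) == 72 ? 1 : 0 }

for my $line ($record, $shifted) {
    printf "regex=%d index=%d\n", ends_invoice_regex($line), ends_invoice_index($line);
}
```

Both tests agree on these inputs; the `index()` form avoids the regex engine, which may matter over a file of this size.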

MightyJayDog · May 30, 2007

The answer is yes to both. For test purposes I copied that same line over and over, so that it is the only line in the data file, which is now a 3 MB file. Each line is identical and the CIBC is in the same spot in each one. I tried your first example without success: it is still just spitting out one .dat file that is exactly the same as the original. Same with your second suggestion. Here is my current code...

Code:

#!/usr/bin/perl -w

use strict;

my $chunksize = 500 * 1024; # 500Kb
my $filenumber = 0;
my $infile = "infile.dat";
my $outsize = 0;
my $eof = 0;

open INFILE, $infile;
while(<INFILE>)
  {
  if($outsize == 0) {
    $filenumber++;
    open OUTFILE, ">outfile ".$filenumber.".dat";
  }

#-----------------------------------------------------------
# Do Transforms here... (get rid of funky characters)
#-----------------------------------------------------------

# tr/xxx/yyy/;

  print OUTFILE;
  $outsize += length;

#------------------------------------------------------------
# Identify the end of invoice or end of file
# CIBC is the string that I am looking for at the end
#------------------------------------------------------------

  $eof = eof INFILE or ( $outsize>$chunksize and /^.{72}CIBC/ );

  if($eof)
  {
    close OUTFILE;
    $outsize = 0;
  }

}
close INFILE;

brigmar · May 30, 2007

Strange:
This..

Code:

$eof = (($outsize>$chunksize) and /^.{72}CIBC/) or eof(INFILE);

works, but this..

Code:

$eof = eof(INFILE) or (($outsize>$chunksize) and /^.{72}CIBC/);

doesn't.
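A likely explanation, sketched here with stand-in subs rather than the real file tests: `or` has lower precedence than `=`, so Perl parses `$eof = A or B` as `($eof = A) or B`. In the version that fails, `$eof` only ever receives the result of `eof(INFILE)`. In the version that appears to work, `$eof` receives the chunk test and the `eof` result is discarded, which is harmless only because Perl closes the last output file when the program exits. The names `at_eof` and `chunk_full` are made up for illustration.

```perl
use strict;
use warnings;

# Stand-ins for the real tests: not at end of file, but the chunk is full.
sub at_eof     { return 0 }
sub chunk_full { return 1 }

my $low = 0;
$low = at_eof() or chunk_full();          # parsed as ($low = at_eof()) or chunk_full()

my $high  = at_eof() || chunk_full();     # || binds tighter than =, so the whole test is assigned
my $paren = (at_eof() or chunk_full());   # parentheses give the same result with "or"

print "low=$low high=$high paren=$paren\n";   # prints "low=0 high=1 paren=1"
```

Either `||` or an extra pair of parentheses around the whole right-hand side assigns the combined result.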

In either case, given the size of your file, and that you know EXACTLY where the CIBC occurs, you'd be better off with the index() solution than the regex:

Code:

$eof = (($outsize>$chunksize) and index($_,"CIBC",72)==72) or eof(INFILE);
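Putting the pieces together, a hedged sketch of the whole splitter as a reusable sub. The sub name `split_file`, its parameters and the returned list of file names are assumptions for illustration and not from the posts above; lexical filehandles and three-argument `open` with error checks are swapped in for the bareword handles used in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $infile into numbered chunk files, closing a chunk
# at the first end-of-invoice line seen once it has grown past $chunksize.
sub split_file {
    my ($infile, $prefix, $chunksize, $marker, $offset) = @_;

    open my $in, '<', $infile or die "Cannot open $infile: $!";

    my ($filenumber, $outsize, $out) = (0, 0, undef);
    my @written;

    while (my $line = <$in>) {
        if (!defined $out) {
            $filenumber++;
            my $name = "$prefix$filenumber.dat";
            open $out, '>', $name or die "Cannot open $name: $!";
            push @written, $name;
        }

        # Transforms (for example tr/\t//d) would go here.

        print {$out} $line;
        $outsize += length $line;

        # Test after writing, so the end-of-invoice line stays in this chunk.
        if ($outsize > $chunksize && index($line, $marker, $offset) == $offset) {
            close $out or die "Cannot close chunk $filenumber: $!";
            undef $out;
            $outsize = 0;
        }
    }

    # Close the final, possibly short, chunk explicitly.
    if (defined $out) {
        close $out or die "Cannot close chunk $filenumber: $!";
    }
    close $in;
    return @written;
}

# Example call with the values used in the thread:
# split_file("infile.dat", "outfile ", 500 * 1024 * 1024, "CIBC", 72);
```

Closing the last chunk after the loop removes the need for an `eof` test inside the condition, so the precedence question does not arise.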
