Split large data file based on content

Status
Not open for further replies.

MightyJayDog

Technical User
May 29, 2007
11
US
I am looking for suggestions on how to split a large (13 GB+) file into smaller, more manageable chunks based on content using Perl. I am trying to keep it generic as to exactly what data or content to split on, and I am hoping that someone can point me in the right direction. I want to search through the file for a certain pattern (the invoice end point) and then, after a certain number of invoices or a certain size, find the end of the current invoice and split everything up to that point off into its own file. Any ideas? I am pretty new to Perl and am being thrown into it to figure out this problem. :)

Jason
 
What will you be doing with the data afterwards?
If it needs splitting, I assume you want some info from within it. While you are splitting it, you could split all the invoices into separate DB files and make the whole file much more accessible.

Keith
 
I take 1-2 GB data files and process invoices (tens of thousands of invoices, if not more). I don't want to break it into one-invoice-per-file chunks. I am hoping to break the 12-13 GB file into 0.5-1 GB files so I can continue processing without hanging the server. I just need to know how to split it at a certain point in the file based on the text.
 
You probably want to use Perl's in-place editor for something like this. It should be faster and more efficient when working with such a big file. What is the pattern in the file you need to search for?

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Tens of thousands of invoices all in one continuous file - is this an archive?

Perl has a split function but I am not sure what the size limit is. You could try it but I think your server would object. You mention an 'invoice end point', is that a string which could not appear anywhere else in the file?
Iterating through and copying it piece by piece to a series of new files would be my best suggestion but someone else may have a better solution.


Keith
 
It is ASCII text in a .dat file. It comes from the customer in this format, and I just need to break it out into smaller chunks, because 12-13 GB is just too big. The string or pattern that marks where each invoice begins and ends will differ with each client. I first need to go through the file and strip out or replace any funky characters and then split it, or at least that is what I am planning. I am just looking for some help on how to split it at, say, 500 MB, at the end of an invoice around that size.
 
I can't help if you don't answer my question. What is the pattern in the file you need to search for?

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Outline code to split the file into chunks of approximately 500 MB, each ending on an end-of-invoice pattern or at the end of the file:

Code:
my $chunksize = 500 * 1024 * 1024; # 500 MB
my $filenumber = 0;
my $infile = "infile.dat";
my $outsize = 0;
my $eof = 0;

open INFILE, $infile;
while(<INFILE>) {
  if($outsize == 0) {
    $filenumber++;
    open OUTFILE, ">outfile $filenumber.dat";
  }

  ## Do Transforms here...
  # tr/xxx/yyy/;

  ## Identify end of invoice or end of file
  ## end1..end4 are the different end of invoice identifiers

  $eof = eof INFILE or ( $outsize>$chunksize and /^(end1|end2|end3|end4)$/ );

  print OUTFILE;
  $outsize += length;

  if($eof) {
    close OUTFILE;
    $outsize = 0;
  }

}
close INFILE;
 
bleurgh... the $eof line should be AFTER the print & outsize increment lines.
 
MJD

While I appreciate that you want to chop the file up to make it easier to process, this still means that you've got to reassemble it at some time, an extra step to go wrong.

I'm guessing that you want to extract the pertinent bits of information for each invoice and write it to a (presumably much smaller) file. Given that most of the time spent processing a big file is I/O wait, you might be better off taking a single pass through the big file, writing a summary record at the end of each invoice, and doing all your subsequent processing on the smaller file.

This also has the benefit of allowing you to standardise the data extracted from invoice files from different clients which might not have the same parsing rules.
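
A minimal sketch of that single-pass idea, under stated assumptions: the end-of-invoice marker `END-OF-INVOICE` and the summary fields (sequence number, line count, byte count) are hypothetical stand-ins, since the thread does not say which fields matter. The demo reads from an in-memory string rather than the real file.

```perl
use strict;
use warnings;

# Hypothetical end-of-invoice marker; the real one differs per client.
my $end_of_invoice = qr/^END-OF-INVOICE/;

# Read invoices from $in one line at a time and print one summary line
# per invoice to $out: sequence number, line count, byte count.
# Lines after the last marker are counted but never reported.
sub summarise {
    my ($in, $out, $end_re) = @_;
    my ($invoices, $lines, $bytes) = (0, 0, 0);
    while (my $line = <$in>) {
        $lines++;
        $bytes += length $line;
        if ($line =~ $end_re) {
            $invoices++;
            print {$out} join("\t", $invoices, $lines, $bytes), "\n";
            ($lines, $bytes) = (0, 0);
        }
    }
    return $invoices;
}

# Demo on in-memory data instead of the real 13 GB file.
my $data    = "line a\nline b\nEND-OF-INVOICE\nline c\nEND-OF-INVOICE\n";
my $summary = '';
open my $in,  '<', \$data    or die "Cannot read sample data: $!";
open my $out, '>', \$summary or die "Cannot open summary buffer: $!";
my $count = summarise($in, $out, $end_of_invoice);
close $in;
close $out;
print "$count invoices\n$summary";
```

Because only one line is held in memory at a time, the same loop should behave the same way on a very large input; real field extraction would replace the line and byte counters.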

Steve

"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)
 
The code that brigmar posted is a working script, but it doesn't seem to be splitting the file. The output file comes out the exact same size as the input file.

The end of the invoice identifier that I am using is CIBC.

What did you mean by "Do Transforms here..."? What do I need to put in there?
 
Replace the open OUTFILE part:
Code:
  if($outsize == 0) {
    $filenumber++;
    open OUTFILE, ">outfile ".$filenumber.".dat";
  }

Which should create: 'outfile 1.dat', 'outfile 2.dat', etc.

You said you wanted to 'remove funky characters', and that is the reason for the transforms.

This:
Code:
tr/\n/X/;
replaces occurrences of "\n" with "X".


You could, for example, remove all tabs like this (the /d modifier is what deletes them; without it, tr only counts the matches):
Code:
tr/\t//d;
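
A small sketch of what a "funky character" transform could look like. Which characters count as funky is an assumption here: the hypothetical `clean_line` helper deletes tabs and carriage returns, then replaces any remaining byte outside printable ASCII (newline excepted) with a space.

```perl
use strict;
use warnings;

# Clean one line: delete tabs and carriage returns, then replace any
# remaining byte outside printable ASCII (newline excepted) with a space.
# The choice of characters is an assumption made for this sketch.
sub clean_line {
    my ($line) = @_;
    $line =~ tr/\t\r//d;            # /d deletes the listed characters
    $line =~ tr/\x20-\x7E\n/ /c;    # /c targets everything NOT listed
    return $line;
}

my $sample  = "a\tb\x00c\xE9\r\n";
my $counted = ($sample =~ tr/\t//);  # no /d: counts tabs, changes nothing
my $cleaned = clean_line($sample);
print "tabs counted: $counted\n";
print $cleaned;
```

Inside the splitting loop the same two `tr` statements could be applied directly to `$_` before the `print OUTFILE` line.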
 
brigmar -- It is still just outputting one file. For testing purposes I am using a 200 MB file and trying to break it into 500 KB chunks...

Code:
#!/usr/bin/perl -w

use strict;

my $chunksize = 500 * 1024; # 500 KB
my $filenumber = 0;
my $infile = "infile.dat";
my $outsize = 0;
my $eof = 0;

open INFILE, $infile;
while(<INFILE>)
  {
  if($outsize == 0) {
    $filenumber++;
    open OUTFILE, ">outfile ".$filenumber.".dat";
  }

#----------------------------------------------------------
# Do Transforms here... (get rid of funky characters)
#----------------------------------------------------------

# tr/xxx/yyy/;

  print OUTFILE;
  $outsize += length;

#----------------------------------------------------------
# Identify the end of invoice or end of file
# CIBC is the string that I am looking for at the end
#----------------------------------------------------------

  $eof = eof INFILE or ( $outsize>$chunksize and /^(CIBC)$/ );

  if($eof)
  {
    close OUTFILE;
    $outsize = 0;
  }

}
close INFILE;
 
Can you post some of your data, especially the portion around the 'CIBC' invoice terminator? (Scrub any sensitive info, though!)

The regex I posted is looking for 'CIBC' alone on a line. The format of your data may not be exactly that, which would cause the regex to fail and not close the file until the end of the input file.
 
Is CIBC on a line by itself with nothing before or after it?

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
No it isn't on a line by itself. Here is an example of the layout of one line of data. The X's represent names, addresses, account numbers, etc.

Code:
05/01/07        05/30/07                XXXXXXX                00.00    CIBC            X X X X - XXXXX                 XXXXXX XX XXXXXXXXXX            XXX XX XX XXXXXXXX XXXXX XXX    XXXXXXXXXXXX                    XX      XXX XXX         XX              XXX                                                    00.00
 
Is the CIBC going to be in the same place in that line for each record, and is the data AFTER the CIBC part of the same invoice?
 
If the answers to the above questions are both yes, then:
Code:
  $eof = eof INFILE or ( $outsize>$chunksize and /^.{72}CIBC/ );

or, alternatively, using index() instead of a regex:

Code:
  $eof = eof INFILE or ( $outsize>$chunksize and index($_,"CIBC",72)==72 );
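
The two tests can be checked in isolation before running them over a whole file. This is a sketch on a made-up record with `CIBC` at offset 72; the `substr` variant is an extra option not mentioned in the thread. One difference worth noting: `index($_, "CIBC", 72)` scans forward from offset 72 to the end of the line when there is no match at 72, whereas `substr` looks only at the four characters in question.

```perl
use strict;
use warnings;

# Stand-in record with "CIBC" at offset 72 (hypothetical data).
my $line = ('X' x 72) . 'CIBC' . (' ' x 12) . "rest of record\n";

# Regex: exactly 72 characters, then the marker.
sub ends_invoice_regex {
    return $_[0] =~ /^.{72}CIBC/ ? 1 : 0;
}

# index(): first occurrence at or after offset 72 must be at 72.
sub ends_invoice_index {
    return index($_[0], 'CIBC', 72) == 72 ? 1 : 0;
}

# substr(): compare only the four characters at offset 72.
# The length guard avoids "substr outside of string" warnings.
sub ends_invoice_substr {
    return (length($_[0]) >= 76 && substr($_[0], 72, 4) eq 'CIBC') ? 1 : 0;
}

for my $test (\&ends_invoice_regex, \&ends_invoice_index, \&ends_invoice_substr) {
    print $test->($line), "\n";
}
```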
 
The answer is yes to both. For test purposes I actually copied that same line over and over, so that it is the only line in the data file, which is 3 MB. Each line is identical and the CIBC is in the same spot in each one. I tried your first example without success: it is still just spitting out one .dat file that is exactly the same as the original. Same with your second suggestion. Here is my current code...

Code:
#!/usr/bin/perl -w

use strict;

my $chunksize = 500 * 1024; # 500 KB
my $filenumber = 0;
my $infile = "infile.dat";
my $outsize = 0;
my $eof = 0;

open INFILE, $infile;
while(<INFILE>)
  {
  if($outsize == 0) {
    $filenumber++;
    open OUTFILE, ">outfile ".$filenumber.".dat";
  }

#-----------------------------------------------------------
# Do Transforms here... (get rid of funky characters)
#-----------------------------------------------------------

# tr/xxx/yyy/;

  print OUTFILE;
  $outsize += length;

#------------------------------------------------------------
# Identify the end of invoice or end of file
# CIBC is the string that I am looking for at the end
#------------------------------------------------------------

  $eof = eof INFILE or ( $outsize>$chunksize and /^.{72}CIBC/ );

  if($eof)
  {
    close OUTFILE;
    $outsize = 0;
  }

}
close INFILE;
 
Strange:
This..
Code:
$eof = (($outsize>$chunksize) and /^.{72}CIBC/) or eof(INFILE);
works, but this..
Code:
$eof = eof(INFILE) or (($outsize>$chunksize) and /^.{72}CIBC/);
doesn't.

In either case, given the size of your file, and that you know EXACTLY where the CIBC occurs, you'd be better off with the index() solution than the regex:

Code:
$eof = (($outsize>$chunksize) and index($_,"CIBC",72)==72) or eof(INFILE);
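
The difference comes down to operator precedence: `or` and `and` bind more loosely than `=`, so `$eof = eof(INFILE) or (...)` parses as `($eof = eof(INFILE)) or (...)`. `$eof` is then true only at the end of the input, and the result of the chunk test is thrown away. Using `||` and `&&`, which bind more tightly than `=`, avoids the problem regardless of operand order. The sketch below puts that together; the `split_file` name, its arguments, the `PREFIX.N.dat` naming scheme and the lexical filehandles are choices made for this example, not something taken from the posts above.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Split $infile into chunks of roughly $chunksize bytes. A chunk is closed
# only on a line for which $is_end->($line) is true, or at end of input.
# Returns the list of chunk file names.
sub split_file {
    my ($infile, $prefix, $chunksize, $is_end) = @_;
    open my $in, '<', $infile or die "Cannot open $infile: $!";
    my ($out, $outsize, @names) = (undef, 0);
    while (my $line = <$in>) {
        if (!defined $out) {
            push @names, sprintf('%s.%d.dat', $prefix, scalar(@names) + 1);
            open $out, '>', $names[-1] or die "Cannot open $names[-1]: $!";
        }
        print {$out} $line;
        $outsize += length $line;
        # || and && bind tighter than =, so the whole test lands in $done.
        my $done = eof($in) || ($outsize > $chunksize && $is_end->($line));
        if ($done) {
            close $out or die "Cannot close $names[-1]: $!";
            ($out, $outsize) = (undef, 0);
        }
    }
    close $in;
    return @names;
}

# Demo: ten identical 77-byte records, split at roughly 200 bytes.
my $dir    = tempdir(CLEANUP => 1);
my $infile = "$dir/infile.dat";
my $record = ('X' x 72) . "CIBC\n";
open my $fh, '>', $infile or die "Cannot create $infile: $!";
print {$fh} $record for 1 .. 10;
close $fh or die "Cannot close $infile: $!";

my @chunks = split_file($infile, "$dir/outfile", 200,
                        sub { index($_[0], 'CIBC', 72) == 72 });
print scalar(@chunks), " chunks written\n";
```

Passing the end-of-invoice test in as a code reference keeps the splitter generic, so a different client format only needs a different callback.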
 