extracting multiple documents from one document

cyphrix · Nov 24, 2006

I have a large text document that contains multiple reports in it. I want to be able to extract each report from the large text file and write them all to their own files. For example, say the text of the large file looks like this:

Report 1
this is just some random text for example purposes.
text text text text text text text text text text

Report 2
this is just some random text for example purposes.
text text text text text text text text text text

Report 3
this is just some random text for example purposes.
text text text text text text text text text text

There are just over 1200 reports in this one file and I need to be able to split them up into their own files. I have one script that is able to identify the beginning of each report, but I do not understand quite yet how to tell Perl to identify it, extract what is below it until it finds the instance of another report, and then write what it has to its own file. Below is code that I wrote that is able to identify the report. If someone could offer advice on how I could modify this code to extract each report, I would greatly appreciate it.

Code:

while (1) {    # loop until the user quits (Ctrl-C)
     print "\n\n\nFile To Be Processed: ";
     $File = <STDIN>;
     print "\n\nOutput File Name: ";
     $Outfile = <STDIN>;
     chomp($File);
     chomp($Outfile);
     open(INFILE, $File) or die "The file cannot be found.";
     open(OUTFILE, ">>$Outfile") or die "Cannot open $Outfile: $!";

     $count = 0;

     while (<INFILE>) {
          chomp($_);
          if (/(Report \d+)/) {
               $count++;
               # $1 is only valid right after the match, so log the heading here
               print OUTFILE "$1\n";
          }
     }

     print "There were $count reports in this document.\n";

     close INFILE;
     close OUTFILE;
}

Again, this merely counts each instance of "Report *" which tells me exactly how many reports are in the file. Just want to extract each one to its own file. Thanks in advance for any help!

Kirsle · Nov 24, 2006

Code:

# open the report file
open (INFILE, "$File") or die "Cannot open $File: $!";

# Read the file
my $count = 0;
my $begin = 0; # this will become true when
   # the first report is found and we can begin
   # writing to the other files

while (<INFILE>) {
   chomp;
   if (/(Report \d+)/) {
      $count++;

      # We've found the start of a new report.
      # Open a filehandle for it; close the
      # filehandle if it was already opened.
      if ($begin == 1) {
         close (OUTFILE);
      }

      $begin = 1;

      open (OUTFILE, ">${count}_$Outfile") or die "Cannot write ${count}_$Outfile: $!";
      next;
   }

   if ($begin == 1) {
      print OUTFILE "$_\n";
   }
}

That should get you on the right track. It sets $begin to 0 before looping. $begin tells the script whether it has found the first report yet, so that it doesn't try to write to a filehandle that hasn't been opened.

When it finds a record, if $begin is already 1 (meaning this is the 2nd or 3rd or 4th or ... record), it closes the current filehandle. Then it (re)sets $begin to 1 and opens a file named $count_$Outfile (i.e. if your outfile was "outfile.txt", it would save reports in "1_outfile.txt", "2_outfile.txt", "3_outfile.txt", etc.). Note that the "Report N" heading line itself is skipped and not written to the file.

And on every line which doesn't indicate a report, if $begin is true (meaning at least the first record was found and that a filehandle had been opened), it prints the line to the filehandle.

-------------
Kirsle.net | Kirsle's Programs and Projects

cyphrix · Nov 24, 2006

thanks Kirsle... will try this out pronto and let you know what turns out.

cyphrix · Nov 24, 2006

Kirsle,

The code worked absolutely perfectly. Thanks for the help.

I do have one more question about the code though. I forgot to mention this. I've done this in Python, but can't seem to get it to work with the code you provided.

While the above code separates each report and writes it to its own file, I would like a report to be written to its own file only if it contains certain text. For instance, say the fourth report it finds contains the word "computer" and the fifth report doesn't. While it's separating each report, I want the script to simultaneously scan the contents of each one and only write the report to its own file IF it contains the word "computer". Hope this isn't a bother. Again, I have done it in Python but I'd like to know how to do it in Perl. Thanks in advance.

KevinADC · Nov 24, 2006

cyphrix wrote: "I've done this in Python, but can't seem to get it to work with the code you provided."

What have you tried?

- Kevin, perl coder unexceptional!

cyphrix · Nov 24, 2006

Actually, I figured out how to do it right after I posted my last reply. Thanks for the help!

stevexff · Nov 27, 2006

Care to post it? It might help someone else...

Also, this gives me a chance to rant on a bit. (Cyphrix, this isn't directed at you, it just happened to be the one that pushed me over the edge).

When posting a problem, PLEASE post the problem you actually have, not a simplified version of it. It's a bit frustrating to post a solution that works perfectly (like Kirsle's) only to have the OP come back and say "That's cool, but my real problem is actually this instead". This can have a huge effect on the design of a solution.

For example, K's solution assumes that every time a 'Report ###' line is hit, it needs to open a new file. Now, with the addition of the 'real' requirement, we need to buffer each found report in an array so we can check every line to see if it contains a keyword, before deciding whether or not to write the report file. This is substantially different to the original requirements.
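As a rough sketch of the buffering approach described above: the following assumes headings of the form "Report N" and the keyword "computer" from the earlier post. The helper name reports_matching, the sample lines, and the report_N.txt file names are made up for illustration and are not from anyone's actual script.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref mapping report number => report text,
# keeping only the reports in which at least one line contains $keyword.
sub reports_matching {
    my ($lines, $keyword) = @_;
    my (%kept, @buffer, $number);

    # Decide what to do with the report buffered so far, then reset the buffer.
    my $flush = sub {
        if (defined $number && grep { /\Q$keyword\E/ } @buffer) {
            $kept{$number} = join '', @buffer;
        }
        @buffer = ();
    };

    for my $line (@$lines) {
        if ($line =~ /^Report (\d+)/) {
            my $next = $1;   # save the capture before any other match runs
            $flush->();      # the previous report is now complete
            $number = $next;
        }
        push @buffer, $line;
    }
    $flush->();              # do not forget the last report
    return \%kept;
}

# Made-up sample data standing in for the big input file.
my @lines = (
    "Report 1\n", "nothing of interest\n",
    "Report 2\n", "a computer was mentioned\n",
    "Report 3\n", "more filler text\n",
);

my $kept = reports_matching(\@lines, 'computer');

# Only the reports that passed the keyword check are written out.
for my $number (sort { $a <=> $b } keys %$kept) {
    open(my $out, '>', "report_$number.txt")
        or die "Cannot write report_$number.txt: $!";
    print {$out} $kept->{$number};
    close($out);
}

print scalar(keys %$kept), " of 3 reports contained the keyword\n";
```

With this sample data only Report 2 mentions the keyword, so a single file, report_2.txt, is written. The point of the design is that the decision to write is deferred until the whole report has been seen, which the open-a-file-at-each-heading version cannot do.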

OK, I feel much better now... [smile]

Steve

"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::PerlDesignPatterns)

cyphrix · Nov 27, 2006

Steve,

I understand the rant. I hear ya and I will be sure to abide by it. Now... haha, I have another question.

This is the code that I have now thanks to Kirsle:

Code:

open(INFILE, "textfile.txt") or die "File not found.";

$doccount = 0;
$repcount = 0;

while (<INFILE>) {
     if (/(\*\*\* Document)/g) {
          $doccount++;

          if ($begin == 1) {
               close (OUTFILE);
          }

          $begin = 1;
          next;
     }

     if (/([^US]+\W+?[-|+]?\d+[NEWS]+\W+?)+/m) {
          open (OUTFILE, ">>$doccount\.txt");
          print OUTFILE "$_";
     }

}

print "$doccount documents found.\n";

First, what this code does: it searches one text file that is chock full of reports. Each report begins with *** Document, so that is what it splits them by. It then counts each one, and for whichever ones contain anything that matches the pattern, it creates a file and prints the matching lines to it. What I cannot, CANNOT, for the life of me find is a way to simply print to screen how many reports contained matches. I have tried literally all day long. I tried incrementing the variable $repcount inside the second if statement; however, all that did was count the number of pattern matches and print that. I want to know how many reports had pattern matches and print that to screen. Any help is appreciated... still a newbie to Perl, so don't beat me up too bad. Thanks

stevexff · Nov 27, 2006

Code:

use strict;
use warnings;

my ($report, $keep, @data);   # "my" needs parentheses to declare a list
my $reports = 0;              # start at 0 so the final count prints cleanly

while (<DATA>) {
   print_it() if (/^Report\s+(\d+)/);
   push @data, $_;
   $keep++ if (/([^US]+\W+?[-|+]?\d+[NEWS]+\W+?)+/m);
}

print_it();

print "Wrote $reports reports\n";

sub print_it {
   if ($keep) {
      open(OUT, ">Report_$report.txt") or die "Error writing report $report: $!";
      print OUT join("", @data);
      close(OUT);
      $reports++;
   }
   $report = $1;
   $keep = 0;
   undef @data;
}


__DATA__
Report 1
1
2
X +3N
4
5
Report 2
x
x
x
Report 3
X +3N

Repeatedly opening the output file for append as you process each line is hugely inefficient. This one buffers the report in an array prior to writing it out.

Steve


rharsh · Nov 27, 2006

Here's a different approach that may be of some use:

Code:

open INPUT, "< bigfile.txt" or die;

my ($total_records, $good_records, $keep) = (0, 0, 0);
my @temp;

while (<INPUT>) {
    if (/\*\*\* Document/) {
        $total_records++;
        if (@temp) {
           if ($keep) {
               $good_records++;
               # the buffered record is the one before the header just counted
               print_array(\@temp, $total_records - 1);
           }
           @temp = ();
           $keep = 0;
        }
    } else {
        $keep = 1 if /z/;
    }
    push @temp, $_;
}

# Check the last record
if (@temp) {
    if ($keep) {
       $good_records++;
       print_array(\@temp, $total_records);
    }
}

print "Total Records: $total_records\n";
print "Good Records: $good_records\n";

sub print_array {   # name must match the print_array() calls above
    my ($aref, $file_num) = @_;
    open OUTFILE, "> record_${file_num}.txt" or die "Cannot create record_${file_num}.txt\n$!\n";
    foreach (@{$aref}) {
        print OUTFILE $_;
    }
    close OUTFILE;
}

The bigfile.txt looks like:

Code:

*** Document
a
z
c
*** Document
d
e
f
*** Document
g
h
z

In this case, the code looks for a 'z' and prints the record to a file if that character is present.
