
Splitting 1 file into multiple

Status
Not open for further replies.

gm199

Programmer
Aug 9, 2001
37
US
Hi,
I'm the most stupid programmer when using REGEX!
I have a huge HTML file and want to split it into n small files.
Each chunk of text to be saved in its own file begins with:
</p><hr width="75%"><center><tt>-387-</tt></center><br>
The text then runs over multiple lines up to the next line like this one. It seems I can use <hr width="75%"> as the delimiter, and the file name must come from <center><tt>-387-</tt></center>
(file387.txt, in this case.)
Can someone help me with the REGEX?
I can get the file name into $name:
($name) = /<tt>-(.*)-<\/tt><\/center><br>/;
but I have no idea how to get the multiple lines after it.
Thank you for any help.

full sample:

</p><hr width="75%"><center><tt>-330-</tt></center><br>
from them must travel unimaginable millions of miles
to reach him. As the world forces become impersonal
they become more majestic, and a deeper feeling is
evoked in their presence. Science aids true religion by
increasing awe, by increasing knowledge.

</p><hr width="75%"><center><tt>-331-</tt></center><br>
...
 
Code:
use strict;
use warnings;
my $sentinel;

while (<DATA>) {
    if (m[^</p><hr width="75%"><center><tt>-(\d+)-</tt></center><br>$]) {
        close(OUT) if $sentinel;
        $sentinel = open (OUT, ">file$1.txt") or die "Can't create file file$1.txt: $!";
    }
    print OUT "$_" if $sentinel;
}
close(OUT) if $sentinel;

__DATA__
any rubbish before 1st separator
</p><hr width="75%"><center><tt>-330-</tt></center><br>
  from them must travel unimaginable millions of miles     
  to reach him. As the world forces become impersonal     
  they become more majestic, and a deeper feeling is     
  evoked in their presence. Science aids true religion by     
  increasing awe, by increasing knowledge.

</p><hr width="75%"><center><tt>-331-</tt></center><br>
  from them must travel unimaginable millions of miles     
  to reach him. As the world forces become impersonal     
  they become more majestic, and a deeper feeling is     
  evoked in their presence. Science aids true religion by     
  increasing awe, by increasing knowledge.
ought to do it.
 
Thanks for the help, but it produces no results (no warnings, no errors, nothing).
Any suggestions?

Here is my script with your code:
Code:
use strict;
use warnings;
my $sentinel;
my $file;

$file = "personality.htm";

open(DATA,$file) or die "Can not open $file\n";

while (<DATA>) {
    if (m[^</p><hr width="75%"><center><tt>-(\d+)-</tt></center><br>$]) {
        close(OUT) if $sentinel;
        $sentinel = open (OUT, ">$1.htm") or die "Can't create file $1.htm: $!";
    }
    print OUT "$_" if $sentinel;
}
close(OUT) if $sentinel;

close (DATA);
 
Well, it's obviously opening the input file OK, as it would die otherwise. It must not be matching your separator.

Try removing the ^ and $ from the beginning and end of the regular expression respectively, which will make the matching rules less strict.

Maybe your data doesn't exactly match what you posted, as my example works with the posted data.
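
Here is a minimal sketch of how the anchors can stop a match. The three sample lines are made up: the thread never shows what the real file has around the separator, so the leading spaces and the carriage return are only guesses at typical causes.

```perl
use strict;
use warnings;

# Hypothetical lines: the first matches the posted sample exactly,
# the second has leading whitespace, the third ends in CR+LF.
my @lines = (
    qq{</p><hr width="75%"><center><tt>-330-</tt></center><br>\n},
    qq{  </p><hr width="75%"><center><tt>-331-</tt></center><br>\n},
    qq{</p><hr width="75%"><center><tt>-332-</tt></center><br>\r\n},
);

my $strict = qr{^</p><hr width="75%"><center><tt>-(\d+)-</tt></center><br>$};
my $loose  = qr{<hr width="75%"><center><tt>-(\d+)-</tt></center><br>};

# Report, for one line, the page number and whether the anchored pattern matches.
sub check_line {
    my ($line) = @_;
    my ($n) = $line =~ $loose;
    my $anchored = $line =~ $strict ? "yes" : "no";
    return "page $n: anchored match = $anchored";
}

print check_line($_), "\n" for @lines;
```

Only the first line matches with both anchors in place; the unanchored pattern finds the page number in all three.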
 
Surely it won't - the only print statement is to the OUT filehandle.

Try checking for the existence of any new files in your directory...


Kind Regards
Duncan
 
Oops, missing the obvious here - it won't produce any output to the screen, I see you've dropped the 'file' literal from the beginning of the output filenames. Just check the directory for files of the form n.htm, where n is the number...
 
YES, YES, YES!
I just got it to work by removing the ^. Since I'm using the command line, the output was just there like a baby after running.
Thank you all for the great help.
Best wishes

Gilnei Moraes
 
FINAL CODE AS SAMPLE TO OTHERS

This is my final script. It opens the big source HTML file and splits it into small ones with correct HTML syntax.

It searches for a line like <hr width="75%"><center><tt>-17-</tt></center><br> and then does the job.

Hope someone in need can use it to solve their problem as I did.

Code:
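use strict;
use warnings;

my $sentinel;   # true once the first output file is open
my $file;
my $gm;         # number of the page currently being written

$file = "personality.htm";

open(DATA, $file) or die "Can not open $file: $!\n";

while (<DATA>) {
    if (m[<hr width="75%"><center><tt>-(\d+)-</tt></center><br>$]) {
        # Finish the previous page before starting the next one.
        if ($sentinel) {
            print OUT footer($gm);
            close(OUT);
        }
        $sentinel = open(OUT, ">$1.htm") or die "Can't create file $1.htm: $!";
        $gm = $1;
    }

    if ($sentinel) {
        # Turn the separator into the opening markup of the new page.
        s{<hr width="75%">}{<html><head><title>Page $gm</title></head><body>\n};
        print OUT $_;
    }
}

# The last page needs its footer too.
if ($sentinel) {
    print OUT footer($gm);
    close(OUT);
}

close(DATA);

# Prev/Next links plus the closing tags for page number $n.
sub footer {
    my ($n) = @_;
    return "<table align=\"center\"><tr><td><a href=\"" . ($n - 1) . ".htm\">Prev</a></td>"
         . "<td><a href=\"" . ($n + 1) . ".htm\">Next</a></td></tr></table>\n</body></html>";
}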
 
If the HTML file is yours to edit (which it looks like), you could just put some "flags" in the HTML code to make it easy to break up:

Code:
<!--start-->
blah blah blah
<!--end-->
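
A sketch of what the splitting side might look like with such flags. It assumes each chunk sits between <!--start--> and <!--end--> marker lines, and the chunkN.txt file names are made up for illustration.

```perl
use strict;
use warnings;

# Collect the text between <!--start--> and <!--end--> marker lines.
# Returns a list of chunks, one string per marked section.
sub split_on_markers {
    my @lines = @_;
    my (@chunks, $inside);
    for my $line (@lines) {
        if ($line =~ /<!--start-->/) { $inside = 1; push @chunks, ''; next; }
        if ($line =~ /<!--end-->/)   { $inside = 0; next; }
        $chunks[-1] .= $line if $inside;
    }
    return @chunks;
}

my @chunks = split_on_markers(
    "<!--start-->\n", "blah blah blah\n", "<!--end-->\n",
    "ignored\n",
    "<!--start-->\n", "more text\n", "<!--end-->\n",
);

# Write each chunk to its own file (chunk1.txt, chunk2.txt, ...).
my $n = 0;
for my $chunk (@chunks) {
    $n++;
    open(my $out, '>', "chunk$n.txt") or die "Can't create chunk$n.txt: $!";
    print $out $chunk;
    close($out);
}
```

Anything outside a start/end pair is simply skipped, so there is no need to match the separator line exactly.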


 
I just hope you understand why your script is working now - you seem to have totally ignored the fact that your original post stated that your files were as follows:-

</p><hr width="75%"><center><tt>-330-</tt></center><br>
  from them must travel unimaginable millions of miles     
  to reach him. As the world forces become impersonal     
  they become more majestic, and a deeper feeling is     
  evoked in their presence. Science aids true religion by     
  increasing awe, by increasing knowledge.

</p><hr width="75%"><center><tt>-331-</tt></center><br>
...

Now you are looking for:-
<hr width="75%"><center><tt>-(\d+)-</tt></center><br>$

Where has the </p> gone!?

i.e. you have removed the ^ (caret - beginning-of-line marker) to get the script to work - without really understanding why. It's entirely up to you but you are not going to make much progress unless you understand why. One day you are going to need that beginning-of-line marker...
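
One way to find out why the ^ was getting in the way is to print the separator line with its hidden characters made visible. This is only a sketch: show_hidden is a made-up helper, and the sample line is a guess, since the thread never shows what actually precedes the </p> in the real file.

```perl
use strict;
use warnings;

# Return the line wrapped in brackets, with CR, LF and TAB shown as escapes,
# so leading or trailing whitespace is easy to spot.
sub show_hidden {
    my ($line) = @_;
    $line =~ s/\r/\\r/g;
    $line =~ s/\n/\\n/g;
    $line =~ s/\t/\\t/g;
    return "[$line]";
}

# Hypothetical separator line with a leading tab and a CR+LF ending.
my $sample = qq{\t</p><hr width="75%"><center><tt>-330-</tt></center><br>\r\n};

print show_hidden($sample), "\n" if $sample =~ m{<hr width="75%">};
```

Running something like this over the real file, for lines that contain the <hr width="75%">, shows exactly what sits between the start of the line and the </p>.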


Kind Regards
Duncan
 
Also, the if-afterwards construct
Code:
close(OUT) if $sentinel;
is a shorthand that's handy if you only have one statement to run conditionally. When there is more than one, put them in a block
Code:
if ($sentinel) {
   print OUT "verbose html line";
   close(OUT);
}
as it makes it easier to see what's going on.

Anyway, I'm glad you managed to get it working.
 