
Perl mass cut and paste into template 1

Status
Not open for further replies.

cnycsjohn

IS-IT--Management
Sep 21, 2010
17
Hello everyone!

Thank you for taking the time to look at this.

Basically, what I'm attempting to do is open every file in a directory, take the data between two statements (groups of HTML tags) that appear consistently in each document, and create a new file in a second folder with the same name as the original file. The new file should contain the contents of another document (the template), with the copied text inserted between two other tags in the template.

I know it's easy for those of you who know Perl. Unfortunately I'm not certain where to begin, and Google searches using my terminology aren't throwing up anything I can hack and slash at.

Thanks Again
- John
 
So... reading through the forum posts, I've hacked and slashed a little and ended up here:

#!/usr/bin/perl -w
use strict;

#local $/; # slurp mode

my @htmlFiles = glob(pathToFiles/*.html) or die "can't open directory $!\n";
my @linesFromFile=<DATA>;

foreach my $c(@htmlFiles){
open (HTML, "$c");


for ( @linesFromFile ) {

if ( /^StartTAGS/ ... /^EndTags/ ) {

print > /folder2/$c;

}


Does this look even remotely in the right ballpark?
 
You're in the ballpark, but your code might behave like a bowling ball.

- Your glob statement will collect a list of files into the array @htmlFiles. In your final code you'll want something like: my @htmlFiles = glob("$pathToFiles/*.html");
- The foreach and open statements look reasonable (no need for the quotes around $c).
- I'm guessing that @linesFromFile is meant to be a list of the tags you're looking for. If so, you'll need to add a __DATA__ line at the end of your code, followed by the actual tags.

After that, things start to become very muddled. If your tags happen to span more than one line of HTML, you'll have to consider matching across multiple lines (e.g. $html_data =~ /$rePattern/m; or $html_data =~ /$rePattern/s;), where /m lets ^ and $ match at each line boundary and /s lets . match a newline. I can hold my own with REs, but Perl's expanse of them is best dealt with the way one would memorize multiplication tables: you're either going to know them or you won't. There are entire books on this.

Another potential issue may be multiple common tags or embedded tags. Consider:
EX1:
<tagABC>foo A</tagABC>
<tagABC>foo B</tagABC>
If you search for m{<tagABC>.*</tagABC>}, then depending on how your data is read in, you may match 'foo A' then 'foo B' (reading line by line), or 'foo A</tagABC><tagABC>foo B' (reading the whole file at once with /s, because .* is greedy).
EX2:
<tagABC>foo A<tagDEF>foo B</tagDEF></tagABC>
The order in which you look for tagABC and tagDEF may affect the outcome.
EX3:
<tagABC>foo A
</tagABC>
Here your beginning and ending tags are not on the same line. It is possible to read the entire file in at one time, but for large files this may be problematic.

One way out of this could be recursive functions, but those can be a tad finicky at times. There may be another poster in here who has a killer app that parses this out. If so, good. I'm just thinking that this may be more difficult than it first appears.
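To illustrate the EX1 case, here is a minimal sketch of greedy versus non-greedy matching. The sample string is made up for the demo and stands in for a file that was read in whole.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data: the two lines from EX1, held in one string.
my $html = "<tagABC>foo A</tagABC>\n<tagABC>foo B</tagABC>\n";

# Greedy .* with /s: runs from the first opening tag to the LAST closing tag.
my ($greedy) = $html =~ m{<tagABC>(.*)</tagABC>}s;

# Non-greedy .*? with /g: stops at the nearest closing tag, once per pair.
my @lazy = $html =~ m{<tagABC>(.*?)</tagABC>}gs;

print "greedy: [$greedy]\n";
print "lazy:   [$_]\n" for @lazy;
```

The non-greedy form handles repeated tags like EX1, but it does not help with nested tags of the same name, which is where a real HTML parser starts to earn its keep.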
 
Thanks for the info :)

To be a little more clear , I have 385 files that have data contained between the following :

<table><tr>
<td width="200"><font face="Arial" size="2" color="Maroon">

and

</table>
</center>
</body>
</html>


I'd like to Insert the text between two editable region dreamweaver tags.

<!-- InstanceBeginEditable name="Body Of Site" -->

and

<!-- InstanceEndEditable --> that directly follows it. There are multiple instances of this in the file.
 
To make things simpler I'm thinking I can just take everything between

<td width="200"><font face="Arial" size="2" color="Maroon">

and </body>


 
#!/usr/bin/perl -w
use strict;

#local $/; # slurp mode

my @htmlFiles = glob($pathToFiles/*.html) or die "can't open directory $!\n";


foreach my $c(@htmlFiles){

open (HTML, $c);


for ( $linesFromFile ) {

if ( /^StartTags/gm .. /^EndTags/gm ) {

print > /folder2/$c;
}
}
 
Does that look a little more in the right direction? At least for stage 1 of the issue, getting the raw data out.

 
[jmackenz@phoenix perltest]$ perl removedata.pl
Global symbol "$linesFromFile" requires explicit package name at removedata.pl line 14.
Execution of removedata.pl aborted due to compilation errors.


 
#!/usr/bin/perl -w
use strict;

#local $/; # slurp mode

my @htmlFiles = glob("$~/perltest/*.html") or die "can't open directory $!\n";


foreach my $c(@htmlFiles){

open (HTML, $c);


#for ( $linesFromFile ) {

if ( /^StartTags/gm .. /^EndTags/gm ) {

print $1 > "~/folder2/$c";
#}
}}



results in :

[jmackenz@phoenix perltest]$ perl removedata.pl
Name "main::HTML" used only once: possible typo at removedata.pl line 11.
can't open directory No such file or directory
 
#!/usr/bin/perl -w
use strict;

#local $/; # slurp mode

my @htmlFiles = glob("~/perltest/*.html") or die "can't open directory $!\n";


foreach my $c(@htmlFiles){

open (HTML, $c);


#for ( $linesFromFile ) {

if ( /^StartTags/gm .. /^EndTags/gm ) {

print $1 > "~/folder2/$c";
#}
}}



results in :

[jmackenz@phoenix perltest]$ perl removedata.pl
Name "main::HTML" used only once: possible typo at removedata.pl line 11.
Use of uninitialized value in pattern match (m//) at removedata.pl line 16.
Use of uninitialized value in pattern match (m//) at removedata.pl line 16.

and also

#!/usr/bin/perl -w
use strict;

#local $/; # slurp mode

my @htmlFiles = glob("~/perltest/*.html") or die "can't open directory $!\n";


foreach my $c(@htmlFiles){

open (HTML, $c);


#for ( $linesFromFile ) {

#if ( /^StartTags/gm .. /^EndTags/gm ) {

# print $1 > "~/folder2/$c";
#}



while(/^StartTags(.+?)^EndTags/gms){

print$1; }


}


A different take on the same beast ,

Still has same Error :

[jmackenz@phoenix perltest]$ perl removedata.pl
Name "main::HTML" used only once: possible typo at removedata.pl line 11.
Use of uninitialized value in pattern match (m//) at removedata.pl line 23.
Use of uninitialized value in pattern match (m//) at removedata.pl line 23.


So maybe there is something wrong with my foreach or glob statements?
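A likely cause of the "uninitialized value in pattern match" warning: the pattern is tested against $_, but the file is only opened and never read, so $_ is never set. A minimal sketch of one way around it, assuming it is acceptable to read each file whole; the temporary file and its contents are invented for the demo.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample file standing in for one of the .html files.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "header\nStartTags\nkeep me\nEndTags\nfooter\n";
close $tmp_fh or die "Cannot close $tmp_name: $!";

# Read the whole file into one string; opening alone does not fill $_.
open my $in, '<', $tmp_name or die "Cannot open $tmp_name: $!";
my $content = do { local $/; <$in> };
close $in;

# Match against the string explicitly instead of an unset $_.
my @chunks;
while ($content =~ /^StartTags(.+?)^EndTags/gms) {
    push @chunks, $1;
}
print "found: [$_]\n" for @chunks;
```

Using a lexical handle ($in) that is actually read from also gets rid of the "used only once" warning about HTML.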
 
Also, can someone explain what the /gm and /gms actually mean?

I'm guessing my foreach statement is lacking a step?
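For reference, a small sketch of what the modifiers do: /g finds every match, /m makes ^ and $ match at each line boundary, and /s lets . match a newline. The sample text is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "one\ntwo\nthree\n";

# /g: find every match, not just the first.
my @words = $text =~ /(\w+)/g;                 # ("one", "two", "three")

# /m: ^ matches at the start of each line, not only the start of the string.
my @line_starts = $text =~ /^(\w)/gm;          # ("o", "t", "t")
my @no_m        = $text =~ /^(\w)/g;           # ("o")

# /s: . also matches a newline, so .+ can span lines.
my ($with_s)  = $text =~ /one(.+)three/s;      # "\ntwo\n"
my $without_s = $text =~ /one(.+)three/ ? 1 : 0;   # 0: . stops at "\n"

print "words: @words\n";
print "line starts with /m: @line_starts\n";
print "line starts without /m: @no_m\n";
print "match without /s: $without_s\n";
```

So /gms is just all three combined. Note that /m and /s only matter when the string being matched contains more than one line.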

 
Someone on another forum said :

"You'll eventually be matching HTML tags. Will these ever change or will they always be exactly as you've posted them? If they're likely to be any different at any stage, you'd probably be better off using a proper tag-aware HTML parser (such as HTML::TokeParser::Simple or similar) rather than using regexps."

If I want this as a reusable tool, what are your thoughts on something like this?
 
Code:
#!/usr/bin/perl -w
use strict;

#local $/; # slurp mode

my @html_files = glob("~/perltest/*.html") or die "can't open directory $!\n";
my $new_folder = 'theoutput/';

foreach my $doc   (@html_files)   {

        print "Processing $doc\n";
        my $new_file = "$new_folder$doc";


open STRIPME, "> $new_file" or die "Cannot open $new_file for writing: $!";


        while(/^StartTags(.+?)^EndTags/gms){

        print$1;  }

        close STRIPME;

}


Think this is closer to the right track ... but :

I get the following

[jmackenz@phoenix perltest]$ perl removedata.pl
Processing /home/jmackenz/perltest/test.html
Cannot open theoutput//home/jmackenz/perltest/test.html for writing: No such file or directory at removedata.pl line 15.

The folder is supposed to be /home/jmackenz/perltest/theoutput

and nothing ever gets created there anyway thus far
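The doubled path in the error comes from glob returning the full path of each file, which then gets appended to the folder name. A minimal sketch of one fix, assuming the standard File::Basename and File::Spec modules; the two paths are copied from the messages above purely as sample values.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Spec;

# Sample values mirroring the paths in the error message above.
my $doc        = '/home/jmackenz/perltest/test.html';
my $new_folder = '/home/jmackenz/perltest/theoutput';

# Keep only the file name, then join it to the output folder.
my $name     = basename($doc);
my $new_file = File::Spec->catfile($new_folder, $name);

print "$new_file\n";
```

The output folder also has to exist before open can create a file in it; open will not create missing directories.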
 
Code:
#!/usr/bin/perl -w
use strict;

#local $/; # slurp mode
my @html_files = glob("*.html") or die "can't open directory $!\n";
my $new_folder = 'theoutput/';

foreach my $original_doc   (@html_files)   {

        print "Processing $original_doc\n";
        my $new_file = "$new_folder$original_doc";
        print "new_file variable  is : $new_file\n";
        open (ORIGINAL, $original_doc);
        open STRIPME, "> $new_file" or die "Cannot open $new_file for writing: $!";
while (<ORIGINAL>) {
        if ( /^StartTags/gm .. /^EndTags/gm ) {

        print STRIPME "$_";}
}
}

Works Well !

OK... now, as far as handling the real tags: is that as simple as putting them in place of StartTags et al., or do I need to quote or format them somehow?

Also, can I add blank file handling? In the case of test2.html which is a non match
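On the quoting question: these particular tags contain no regex metacharacters, but wrapping an interpolated variable in \Q...\E is a safe habit, since it escapes anything that would otherwise be special. A minimal sketch with made-up input lines:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $StartTags = '<td width="200"><font face="Arial" size="2" color="Maroon">';
my $EndTags   = '</body>';

# Hypothetical sample input, one element per line of a file.
my @lines = (
    "<html><body>\n",
    "$StartTags\n",
    "some text\n",
    "$EndTags\n",
    "</html>\n",
);

my @kept;
for (@lines) {
    # \Q...\E escapes any regex metacharacters in the interpolated string.
    if (/^\Q$StartTags\E/ .. /^\Q$EndTags\E/) {
        push @kept, $_;
    }
}
print @kept;
```

Single quotes around the tag strings also mean the double quotes inside them need no backslashes.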
 
Code:
#!/usr/bin/perl -w
use strict;

#local $/; # slurp mode
my @html_files = glob("*.html") or die "can't open directory $!\n";
my $new_folder = 'theoutput/';
my $StartTags  = "\<td width=\"200\"\>\<font face=\"Arial\" size=\"2\" color=\"Maroon\"\>";
my $EndTags    = "\<\/body\>";


print "\n";
print "\n";
print "\n";

print "Searching for data between $StartTags & $EndTags\n";
print "\n";

foreach my $original_doc   (@html_files)   {

        print "Processing $original_doc\n";
        my $new_file = "$new_folder$original_doc";
#        print "new_file variable  is : $new_file\n";
        print "\n";
        open (ORIGINAL, $original_doc);
        open STRIPME, "> $new_file" or die "Cannot open $new_file for writing: $!";


        while (<ORIGINAL>) {
        if ( /^$StartTags/gm .. /^$EndTags/gm ) {

        print STRIPME "$_";}
}
}
 
the above worked to get the raw data out.
 
Code:
#!/usr/bin/perl -w
use strict;

#local $/; # slurp mode
my @html_files = glob("*.html") or die "can't open directory $!\n";
my $new_folder = 'theoutput/';
my $StartTags  = "\<td width=\"200\"\>\<font face=\"Arial\" size=\"2\" color=\"Maroon\"\>";
my $EndTags    = "\<\/body\>";
my $template1  = "part1";
my $template2  = "part2";

print "\n";
print "\n";
print "\n";

print "Searching for data between $StartTags & $EndTags\n";
print "\n";

foreach my $original_doc   (@html_files)   {

        print "Processing $original_doc\n";
        my $new_file = "$new_folder$original_doc";

#        print "new_file variable  is : $new_file\n";

        print "\n";
        open (ORIGINAL, $original_doc);
        open TO_NEW_FILE, "> $new_file" or die "Cannot open $new_file for writing: $!";

        open TEMPLATE1, "< $template1" or die "Cannot read from $template1: $!";
        open TEMPLATE2, "< $template2" or die "Cannot read from $template2: $!";

        print TO_NEW_FILE <TEMPLATE1>;

        while (<ORIGINAL>) {
                if ( /^$StartTags/gm .. /^$EndTags/gm ) {
                        print TO_NEW_FILE "$_";
                }
        }
        print TO_NEW_FILE <TEMPLATE2>;

}

This seems to work with my first half of the template in part1 and the 2nd in part2
 
You've been busy:
If this is working for you, I'll only add some think-about-this type comments:
1) The < and > characters don't have to be escaped, so the following may be easier to read:
Code:
my $StartTags  = "<td width=\"200\"><font face=\"Arial\" size=\"2\" color=\"Maroon\">";
2) Knowing that html accepts size="200" as well as size='200', it is possible that your source data could include it as well. Consider:
Code:
my $Q = '["' . "']" ; # this will represent ["'] both sets of quotes
my $StartTags  = "<td width=${Q}200${Q}><font face=${Q}Arial${Q} size=${Q}2${Q} color=${Q}Maroon${Q}>";
Some comments on the above code:
A) I know that the variable $Q isn't very descriptive, but the goal here is to make your definition of $StartTags easier to look at and modify if needed. It is left to the reader to substitute something like $SingleDoubleQuotePattern for $Q in the code and see whether that makes it easier to understand.
B) You'll also see that I'm using ${Q} rather than $Q. This is only to help the parser distinguish between a variable named $Q and one named $Q200.
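A quick way to check the $Q idea is to run the pattern against a few variants. This is only a sketch with a shortened tag and invented sample strings:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $Q = '["' . "']";   # character class matching either quote style
my $StartTags = "<td width=${Q}200${Q}>";

my @samples = (
    '<td width="200">',   # double quotes
    "<td width='200'>",   # single quotes
    '<td width=200>',     # no quotes: not matched by this pattern
);

my @results;
for my $s (@samples) {
    my $verdict = $s =~ /^$StartTags/ ? 'match' : 'no match';
    push @results, $verdict;
    printf "%-20s %s\n", $s, $verdict;
}
```

One caveat: this cannot be combined with \Q...\E around $StartTags, because that would escape the square brackets and turn the character class into literal text.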

I see above that you're looking for a way to flag non-matching files. One way would be to set a flag before reading the data (e.g. $ValidData = 0), then set it to 1 inside the if (/starttag/ .. /endtag/) block. If the code under the 'if' never executes, the flag is still 0 and you can use that to delete the file:
Code:
close(TO_NEW_FILE);
if ($ValidData == 0) {
   system "rm $new_file" ; # or "del $new_file" if DOS/WIN
};
FWIW - I have run into scenarios where the server os/hdwr combination would start trying to execute the 'rm/del' statement before the hard-drive had *fully* closed the file and I'd get errors. Adding a 'sleep 1' statement under the 'if' fixed that.
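As an alternative to shelling out, Perl's built-in unlink removes a file without caring whether the system uses rm or del. A minimal sketch, using a scratch directory and a made-up file name in place of the real output folder:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical output file in a scratch directory.
my $dir      = tempdir(CLEANUP => 1);
my $new_file = "$dir/test2.html";

open my $out, '>', $new_file or die "Cannot open $new_file for writing: $!";
my $ValidData = 0;   # would be set to 1 inside the start/end range test
close $out or die "Cannot close $new_file: $!";

if (!$ValidData) {
    # unlink is built in, so no shell is involved.
    unlink $new_file or warn "Could not remove $new_file: $!";
}
print -e $new_file ? "still there\n" : "removed\n";
```

Checking the return value of close before removing the file is a cheap way to confirm the handle really was closed first.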

One last thing to consider:
Code:
open (TO_NEW_FILE, ">$new_file") or die "Cannot open $new_file for writing: $!";
The ()'s help make it clear, to the parser and to the reader, that the code inside goes with the open command. Some minimalists will dispute this. Just choose your own style.
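Related to this, the three-argument form of open with a lexical filehandle is worth a look. The parentheses are optional when the check uses "or", which binds loosely, but they are required if you write "||" instead. A minimal sketch; the scratch directory and file name are only for the demo.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir      = tempdir(CLEANUP => 1);   # scratch directory for the demo
my $new_file = "$dir/out.html";

# Three-argument open: the mode is separate from the file name,
# and $out is a lexical handle that closes when it goes out of scope.
open(my $out, '>', $new_file)
    or die "Cannot open $new_file for writing: $!";
print {$out} "hello\n";
close($out) or die "Cannot close $new_file: $!";

# Read it back to confirm the write.
open(my $in, '<', $new_file)
    or die "Cannot open $new_file for reading: $!";
my $first_line = <$in>;
close($in);
print "read back: $first_line";
```

Keeping the mode separate also protects against file names that happen to begin with characters such as > or |.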
 