Removing text between < > characters

Status
Not open for further replies.

Guest_imported

I wrote this to get rid of HTML tags in a bunch of similarly named text files. It seems to access the first file and then get stuck. What is wrong with my method of opening the file and reading it? I'm pretty sure there's something wrong with my second "while".

#!/usr/bin/perl -w
use strict;

undef $/; #Makes it read a file at a time instead of the default line at a time.

my $x = 0;
my $filename2 = "exxonURL" . $x;
my $dollar_one = "";

while($x <= 365) {
$filename2 = "exxonURL" . $x;
$filename2 = $filename2 . "\.htm";
open (TagDeletingStream, "> $filename2") ||die "Can't open file: $!";
while ( <> ) {
m/(.*)/gis;
$dollar_one = $1;
$dollar_one =~ s/<[^>]+.//gis;
print TagDeletingStream $dollar_one;
}

$x++;
close (TagDeletingStream);
}
 
The generally accepted convention is for input/output filehandles to be upper-case, i.e. TAGDELETINGSTREAM. Following it will help other people debugging your code.

Why define $filename2 right at the beginning? You are just placing the exact same value into it on the first loop iteration. Why not declare $filename2 with my inside the loop? This would be safer and more efficient. Also, where you build $filename2 in the loop, why not write
Code:
$filename2 = "exxonURL" . $x . ".htm";
I don't understand why you are looping with $x; surely you only want to loop as many times as filenames are supplied!

I'm not sure what you are trying to achieve with your first reg-ex
Code:
/.*/gis
This should be the same as simply assigning the whole variable.

I would use something like the following:

Code:
#!/usr/bin/perl -w
# Script to remove HTML tags. Beware if < > are present in text!
# ie <B>TEXT HERE</B> becomes: TEXT HERE
use strict;
my $x = 0;

foreach (@ARGV) {
  local $/ = undef;              # Locally undefines $/, safer if you expand code later
  # Read the input before opening the output, in case both are the same file
  open (INPUT, '<', $_) or die "Can't read $_: $!";  # Open input file
  my $whole_file = <INPUT>;      # Slurp input file
  close(INPUT);                  # Close input file
  $whole_file =~ s/<.*?>//gis;   # Strip out HTML tags
  open (TDS, '>', "exxonURL$x.htm") or die "Can't write exxonURL$x.htm: $!";  # Open output file
  print TDS $whole_file;         # Output modified text
  close(TDS);                    # Close output file
  $x++;                          # Increment counter

Hope this helps,
Paul :)
 
Sorry, I missed the closing curly bracket in my previous post (mistake with cut & paste). The post should have read:

Why define $filename2 right at the beginning? You are just placing the exact same value into it on the first loop iteration. Why not declare $filename2 with my inside the loop? This would be safer and more efficient.
Also, where you build $filename2 in the loop, why not write
Code:
$filename2 = "exxonURL" . $x . ".htm";

I don't understand why you are looping with $x and the condition <=365, surely you only want to loop as many times as filenames are supplied!

I'm not sure what you are trying to achieve with your first reg-ex /.*/gis, this should be the same as simply assigning the whole variable.
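
A quick way to check the point about m/(.*)/s: with the /s modifier the dot also matches newlines, so the capture should end up holding the whole string. This is only a small sketch; the sample text is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample standing in for a slurped file
my $text = "line one\n<b>line two</b>\n";

# With /s, (.*) runs from the start of the string to its very end
my ($captured) = $text =~ m/(.*)/s;

print $captured eq $text ? "same\n" : "different\n";
```

If that prints "same", the match-and-capture step can simply be replaced by a plain assignment.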

I would use something like the following:

Hope this helps,
Paul :)

Code:
#!/usr/bin/perl -w
# Script to remove HTML tags. Beware if < > are present in text!
# ie <B>TEXT HERE</B> becomes: TEXT HERE
use strict;
my $x = 0;  # Output file counter

foreach (@ARGV) {
  local $/ = undef;              # Locally undefines $/, safer if you expand code later
  # Read the input before opening the output, in case both are the same file
  open (INPUT, '<', $_) or die "Can't read $_: $!";  # Open input file
  my $whole_file = <INPUT>;      # Slurp input file
  close(INPUT);                  # Close input file
  $whole_file =~ s/<.*?>//gis;   # Strip out HTML tags
  open (TDS, '>', "exxonURL$x.htm") or die "Can't write exxonURL$x.htm: $!";  # Open output file
  print TDS $whole_file;         # Output modified text
  close(TDS);                    # Close output file
  $x++;                          # Increment counter
}
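
To see what the substitution in the script above does without touching any files, it can be tried on a few strings. This is a rough sketch: strip_tags is a hypothetical helper name, and the sample strings are invented. The last call shows the caveat from the script's header comment about < and > appearing in ordinary text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove anything that looks like <...>, even across lines
sub strip_tags {
    my ($text) = @_;
    $text =~ s/<.*?>//gs;   # non-greedy: stop at the first > after each <
    return $text;
}

print strip_tags("<B>TEXT HERE</B>"), "\n";        # tags removed, text kept
print strip_tags("<a\nhref='x'>link</a>"), "\n";   # a tag split over two lines
print strip_tags("1 < 2 and 3 > 2"), "\n";         # caveat: "< 2 and 3 >" is eaten too
```

The non-greedy .*? matters here: a greedy .* would run from the first < to the last > in the file and delete the text in between.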
 
Nope --- you're not being lazy enough, nowhere near...

Have a look at the module:

HTML::Parser

"Objects of the HTML::Parser class will recognize markup and separate it from plain text (alias data content) in HTML documents. As different kinds of markup and text are recognized, the corresponding event handlers are invoked."

Mike
______________________________________________________________________
"Experience is the comb that Nature gives us after we are bald."

Is that a haiku?
I never could get the hang
of writing those things.
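
For anyone wanting to try the module route suggested in this post, here is a minimal sketch. It assumes HTML::Parser is installed from CPAN (it is not part of core Perl); html_to_text is a hypothetical helper name and the sample markup is invented. The idea is to register a handler for text events only and let the parser deal with the markup.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;   # CPAN module, assumed to be installed

# Hypothetical helper: collect the text events and ignore the markup
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the handler each text chunk with entities decoded
        text_h      => [ sub { $text .= shift }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;       # flush anything still buffered
    return $text;
}

print html_to_text("<p>Fuel prices &amp; <b>taxes</b></p>"), "\n";
```

Unlike the regex, this should cope with a > inside an attribute value, and it also turns entities such as &amp; back into plain characters.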
 
Thanks, Paul, but when I run it, it runs without errors and does nothing. And since I don't know what you're doing with that <INPUT> thing, I'm not sure why. Do you have any more helpful ideas?
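
One possible explanation, assuming the script was started without any file names on the command line (the post doesn't say how it was invoked): foreach (@ARGV) then has nothing to loop over, so the script exits quietly. As for <INPUT>, it reads from the filehandle named INPUT, and with $/ undefined a single read returns the whole file. The sketch below shows both points; slurp_file is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one string
sub slurp_file {
    my ($path) = @_;
    local $/ = undef;            # no record separator: one read returns everything
    open(INPUT, '<', $path) or die "Can't open $path: $!";
    my $whole_file = <INPUT>;    # <INPUT> reads from the INPUT filehandle
    close(INPUT);
    return $whole_file;
}

# With no arguments the loop below never runs, so say so instead of exiting silently
warn "Usage: $0 file1.htm [file2.htm ...]\n" unless @ARGV;

foreach my $file (@ARGV) {
    printf "%s: %d characters read\n", $file, length slurp_file($file);
}
```

Running it with one or more file names after the script name should report how many characters were read from each file. Running it with no arguments prints the usage line.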
 