$1 not set by regexp in loop

Status
Not open for further replies.

domster (Programmer, GB)
Oct 23, 2003
I have a rough HTML file (from Word 98) that I'm trying to convert into XML. I've split it into lines on </P> tags. I then take each line in turn, strip off the tags at the start, and set a variable $italic whenever an <I> or </I> tag is found:

Code:
foreach $line (@lines) {
	$line =~ s|^\s+||;
	print RPT "$line\n";
	while ($line =~ s|^(<[^>]*>)||g) {
		$tag = $1;
		if ($tag eq "<I>") {
			$italic = 1;
			print RPT "<I> found - italic 1\n";
		} elsif ($tag eq "</I>") {
			$italic = 0;
			print RPT "</I> found - italic 0\n";
		} else {
			print RPT "$tag\n";
		}
	}
	
	print RPT "$italic\t$line\n";
....
}

My problem is that at some point, about a quarter of the way through the file, $1 suddenly starts coming back empty. Here's the extract of the RPT file where it goes wrong:

Code:
<P>Cold weather in autumn or winter that is dry and pleasant is sometimes described as <B>crisp</B>:</P>
<P>
0	Cold weather in autumn or winter that is dry and pleasant is sometimes described as <B>crisp</B>:
<I><P>We walked through the forest on a <B>crisp</B> autumn day.</P>
<I> found - italic 1

1	We walked through the forest on a <B>crisp</B> autumn day.
<B><P>complain</P>


1	complain
<P>Other ways of saying</I> complain</P></B>

1	Other ways of saying</I> complain</B>

As you can see, the tags are still getting stripped off, but $1 isn't getting set to their value, and I get a whole load of 'uninitialized value' errors. I've tried with $& as well, same problem.

Any ideas? I've never seen anything like this before. I've checked for strange characters at the point where it disappears, and there's nothing. I'm stumped. Any help or advice would be most gratefully accepted.
 
Not sure if this is it - but the first thing I notice is that you probably don't want a greedy quantifier. Change:

s|^(<[^>]*>)||g

To

s|^(<[^>]*?>)||g

--jim
 
Makes no difference, I'm afraid. I thought the non-greedy quantifier only mattered with wildcard characters like . ? Since the character class [^>] can never run past a closing angle bracket, the question mark isn't needed here.
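
For anyone checking this point, here is a minimal sketch comparing the four variants. The sample line is taken from the RPT extract above; the variable names are made up for the illustration. With a negated character class the greedy and non-greedy forms should capture the same tag, and the difference only shows up with the . wildcard:

```perl
use strict;
use warnings;

# Sample line from the RPT extract in the first post.
my $line = '<I><P>We walked through the forest on a <B>crisp</B> autumn day.</P>';

# Negated character class: the match cannot run past the first '>',
# so greedy and non-greedy quantifiers capture the same text.
my ($class_greedy) = $line =~ /^(<[^>]*>)/;
my ($class_lazy)   = $line =~ /^(<[^>]*?>)/;

# Dot wildcard: greedy runs to the LAST '>', non-greedy stops at the first.
my ($dot_greedy) = $line =~ /^(<.*>)/;
my ($dot_lazy)   = $line =~ /^(<.*?>)/;

print "class greedy: $class_greedy\n";
print "class lazy:   $class_lazy\n";
print "dot greedy:   $dot_greedy\n";
print "dot lazy:     $dot_lazy\n";
```

So the ? is harmless in the original pattern, but it is not what decides whether $1 gets set.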
 
Whew, I am rusty. You are right. If I have a break today I'll copy your test data and try to debug it, but I'm sure someone will beat me to it.

Sorry about the uselessness of my previous post.

--jim
 
Actually it worked without the chomp also.

You can try checking whether there are any special characters that Word introduced into the input file.

When I copied the input into a normal text file, it worked for me.
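
One way to do that check, as a rough sketch: report every character outside printable ASCII (tab and newline allowed) together with its offset and hex code. find_odd_chars is a hypothetical helper name, not something from the original script, and the sample string is invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: list each character that is not printable ASCII,
# a tab, or a newline, with its offset in the line and its hex code.
sub find_odd_chars {
    my ($line) = @_;
    my @hits;
    while ($line =~ /([^\x20-\x7E\t\n])/g) {
        # pos() points just past the matched character, so subtract 1.
        push @hits, sprintf('offset %d: 0x%02X', pos($line) - 1, ord $1);
    }
    return @hits;
}

# Invented sample: a NUL byte and a carriage return hidden in the text.
my @report = find_odd_chars("ab\x00c\r\n");
print "$_\n" for @report;
```

Running this over each element of @lines before the tag loop would show whether anything unusual sits near the point where the output goes wrong.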



I just commented the prints.

Code:
foreach $line (@lines) {
	#chomp($line);
	$italic="";
	$line =~ s|^\s+||;
	print "Main: $line\n";
	while ($line =~ s|^(<[^>]*>)||g) {
		$tag = $1;
		if ($tag eq "<I>") {
			$italic = 1;
			print "1C; <I> found - italic 1\n";
		} elsif ($tag eq "</I>") {
			$italic = 0;
			print "2C; </I> found - italic 0\n";
		} else {
			print "3C; $tag\n";
		}
	}

	print "last: $italic\t$line\n";

}
 
It worked for me up to a point in the file, then stopped working, and I couldn't see anything in the file that would break it. I've got around it now by changing the s/// to an m// and removing the tag with substr:

Code:
foreach $line (@lines) {
	$line =~ s|^\s+||;
	print RPT "$line\n";
	$bold = 0;
	while ($line =~ m|^(<.*?>)|g) {
		$tag = $&;
		if ($tag eq "<I>") {
			$italic = 1;
			print RPT "<I> found - italic 1\n";
		} elsif ($tag eq "</I>") {
			$italic = 0;
			print RPT "</I> found - italic 0\n";
		} elsif ($tag eq "<B>") {
			$bold = 1;
			print RPT "<B>\n";
		} elsif ($tag eq "</B>") {
			$bold = 0;
			print RPT "</B>\n";
		} else {
			print RPT "$tag\n";
		}
		substr($line, 0, length($tag)) = "";
	}
...
}

Very very weird, although I remember another instance when m// was more robust than s///. I'd be very interested if anyone can shed any more light on this.
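
This does not explain why $1 came back empty, but for comparison here is a sketch of a variant that reads neither $1 nor $&. It takes the capture straight from a list-context match and drops the /g flag, since a pattern anchored with ^ (without /m) can only match at the start of the string on each pass. strip_leading_tags is a hypothetical helper name, and the bold tracking and RPT output are left out for brevity.

```perl
use strict;
use warnings;

# Hypothetical helper: strip leading tags from one line and update the
# italic flag. Returns the new flag, the remaining text, and the tags seen.
sub strip_leading_tags {
    my ($line, $italic) = @_;
    my @tags;
    $line =~ s/^\s+//;
    # In list context the match returns its capture directly; the loop
    # ends when the match fails and the list is empty.
    while (my ($tag) = $line =~ /^(<[^>]*>)/) {
        push @tags, $tag;
        if    ($tag eq '<I>')  { $italic = 1 }
        elsif ($tag eq '</I>') { $italic = 0 }
        substr($line, 0, length $tag, '');   # remove the tag just read
    }
    return ($italic, $line, @tags);
}

my ($italic, $rest, @tags) = strip_leading_tags('  <I><P>We walked', 0);
print "$italic\t$rest\n";
print "tags: @tags\n";
```

Because the tag lands in a lexical variable at the moment of the match, a later regex cannot overwrite it before it is used.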
 