I have a rough HTML file (from Word 98) that I'm trying to convert into XML. I've split it into lines on </P> tags; I then take each line in turn, strip off the tags at the start, and set a variable $italic whenever an <I> or </I> tag is found, like so:
Code:
foreach $line (@lines) {
    $line =~ s|^\s+||;
    print RPT "$line\n";
    while ($line =~ s|^(<[^>]*>)||g) {
        $tag = $1;
        if ($tag eq "<I>") {
            $italic = 1;
            print RPT "<I> found - italic 1\n";
        } elsif ($tag eq "</I>") {
            $italic = 0;
            print RPT "</I> found - italic 0\n";
        } else {
            print RPT "$tag\n";
        }
    }
    print RPT "$italic\t$line\n";
    ....
}
My problem is that at some point, about a quarter of the way through the file, $1 suddenly starts coming back empty. Here's the extract of the RPT file where it goes wrong:
Code:
<P>Cold weather in autumn or winter that is dry and pleasant is sometimes described as <B>crisp</B>:</P>
<P>
0 Cold weather in autumn or winter that is dry and pleasant is sometimes described as <B>crisp</B>:
<I><P>We walked through the forest on a <B>crisp</B> autumn day.</P>
<I> found - italic 1
1 We walked through the forest on a <B>crisp</B> autumn day.
<B><P>complain</P>
1 complain
<P>Other ways of saying</I> complain</P></B>
1 Other ways of saying</I> complain</B>
As you can see, the tags are still getting stripped off, but $1 isn't being set to their value, and I get a whole load of 'uninitialized value' warnings. I've tried $& as well, with the same result.
Any ideas? I've never seen anything like this before. I've checked for strange characters at the point where it disappears, and there's nothing. I'm stumped. Any help or advice would be most gratefully accepted.
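For anyone who wants to try this in isolation, here is a minimal, self-contained sketch of the same tag-stripping loop. It is not a confirmed diagnosis. It makes two changes that are assumptions on my part rather than known fixes: the /g modifier is dropped (the pattern is anchored with ^, so it can only ever match once per pass anyway), and all variables are lexical. The helper name strip_leading_tags and the two sample lines fed to it are hypothetical; the sample lines are copied from the RPT extract above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): removes the tags at
# the start of $line one at a time and tracks the italic state.
sub strip_leading_tags {
    my ($line, $italic) = @_;
    my @other_tags;

    $line =~ s/^\s+//;

    # No /g: with the ^ anchor each pass removes exactly one leading tag.
    # $1 is copied into a lexical straight away, before anything else runs.
    while ($line =~ s/^(<[^>]*>)//) {
        my $tag = $1;
        if ($tag eq '<I>') {
            $italic = 1;
        } elsif ($tag eq '</I>') {
            $italic = 0;
        } else {
            push @other_tags, $tag;
        }
    }

    return ($italic, $line, @other_tags);
}

# Two of the lines from the RPT extract, as a quick check.
my $italic = 0;
my @lines = (
    '<I><P>We walked through the forest on a <B>crisp</B> autumn day.</P>',
    '<B><P>complain</P>',
);

for my $line (@lines) {
    my ($new_italic, $rest, @tags) = strip_leading_tags($line, $italic);
    $italic = $new_italic;
    print "$italic\t$rest\t[@tags]\n";
}
```

If the tags print correctly here but not in the full script, that would suggest the problem lies somewhere in the part of the loop not shown above rather than in the substitution itself.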