Ramnarayan
Programmer
Hi folks,
I am trying to count the number of words in an XML file. The tricky part is that only the words outside the tags should be counted, not anything inside the tags themselves. Here is an example:
<editor>Edited by
<persName>
<foreName>John</foreName>
<surname>Barry</surname>
</persName> and
<persName>
<foreName full="init" type="first">E.</foreName>
<foreName type="middle">Gene</foreName>
</persName>
</editor>
I want to count the words "Edited by", "John", "Barry", "and", "E." and "Gene" in the example above. Anything starting with "<" and ending with ">" is a tag, such as <editor>...</editor>.
Here is the script I wrote, but it counts "Edited by" as only one word when in fact it should count two. Where am I going wrong?
$line =~ s,<[/]?.*?>,,g;
if ($line =~ m/(\s\b(.+)\b\s)/g)
{
    ++$wc if ( defined $1);
}
else
{
    fatal ("cannot pull word from $.:\n $!");
}
print "Word count: $wc\n";
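For comparison, here is a minimal sketch of one way the count could be done. Two things in the script above limit it to one count per line: the "if" runs the match only once, so $wc is incremented at most once, and the pattern needs whitespace on both sides of the word, so the first word on a line is skipped. The sketch assumes the whole document fits in one string and that no attribute value contains a literal ">". The name count_words is a hypothetical helper, not part of the original script.

```perl
use strict;
use warnings;

# count_words is a hypothetical helper: it returns the number of
# whitespace-separated words that lie outside <...> tags.
sub count_words {
    my ($text) = @_;

    # Replace every tag with a space so that words on either side of a
    # tag stay separated. [^>]* also matches newlines, so a tag that
    # spans several lines is removed as well.
    $text =~ s/<[^>]*>/ /g;

    # split ' ' skips leading whitespace and splits on runs of
    # whitespace, so each remaining token is counted as one word.
    my @words = split ' ', $text;
    return scalar @words;
}

# The sample from the post, held in a single string.
my $xml = <<'XML';
<editor>Edited by
<persName>
<foreName>John</foreName>
<surname>Barry</surname>
</persName> and
<persName>
<foreName full="init" type="first">E.</foreName>
<foreName type="middle">Gene</foreName>
</persName>
</editor>
XML

my $wc = count_words($xml);
print "Word count: $wc\n";    # prints "Word count: 7"
```

With this sample, "Edited by" contributes two words and the total is seven. For XML that may contain comments, CDATA sections or ">" inside attribute values, a real XML parser would be a safer choice than a regular expression.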