Puzzling word count after removing tags

Status
Not open for further replies.

Ramnarayan

Programmer
Jan 15, 2003
56
US
Hi folks,

I am trying to count the words in an XML file. The tricky part is that we only need to count the words outside the tags, not those inside them. Here is an example:

<editor>Edited by
<persName>
<foreName>John</foreName>
<surname>Barry</surname>
</persName> and
<persName>
<foreName full="init" type="first">E.</foreName>
<foreName type="middle">Gene</foreName>
</persName>
</editor>

Now, I am trying to count the words "Edited by", "John", "Barry", "and", "E." and "Gene" from the above sample. As you can see, anything starting with "<" and ending with ">" is a tag, like <editor>...</editor>.

Here is the script I wrote, but it counts only one word for "Edited by", when in fact it should count two. Where am I going wrong?

$line =~ s,<[/]?.*?>,,g;
if ($line =~ m/(\s\b(.+)\b\s)/g)
{
++$wc if ( defined $1);
}
else
{
fatal ("cannot pull word from $.:\n $!");
}
print "Word count: $wc\n";
 
Why write your own XML parser, there are a few available on
HTH
--Paul

It's important in life to always strike a happy medium, so if you see someone with a crystal ball, and a smile on their face ...
 
It's not pretty but this works:

$str = "
<editor>Edited by
<persName>
<foreName>John</foreName>
<surname>Barry</surname>
</persName> and
<persName>
<foreName full=\"init\" type=\"first\">E.</foreName>
<foreName type=\"middle\">Gene</foreName>
</persName>
</editor>
";

print "$str\n\n";

$str =~ s/<.*?>//g;    # removes both opening and closing tags
$str =~ s/^\s+//;      # drop leading whitespace so split gives no empty first field

@words = split (/\s+/, $str);

print "words = ".@words."\n";

$count = 0;
foreach (@words) {
$count++;
print "$count - $_\n";
}
 
Or, if you want to do it as it comes in from the file,

Code:
$wc = 0;
while (<XMLFILE>) {
    chomp;
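    s!</?.+?/?>! !g;  # replace each tag with a space so adjacent words don't merge
    if (/\S/) {  # quick sanity check for any non-whitespace character
        @words = split ' ';  # split ' ' skips leading whitespace, so no empty first field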
        $wc += scalar @words;
    }
}

Also, I could be wrong, but I'm pretty sure strict xml doesn't allow things like

<tag1>someval<tag2></tag2></tag1>

so that may make using an xml parser difficult. :)

Brad Gunsalus
bardley90@hotmail.com
 
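When you use /g in list context, the match returns a list of all the matches rather than a single true/false value. In scalar context, as in your if condition, it finds only one match per evaluation, so your counter is incremented at most once per line.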

Code:
  {
    $line =~ s,<[/]?.*?>,,g;
    @a = ($line =~ m/\w+/g);  
    $wc += scalar(@a); # or $wc += @a;
  }
  print "Word count: $wc\n";

or if you want to be very fancy ;-), you can use an expansion ...

Code:
  {
    $line =~ s,<[/]?.*?>,,g;
    $wc += @{[($line =~ /\w+/g)]};
  }
  print "Word count: $wc\n";

Cheers!
 
Why not do the opposite of finding the words you want, and instead reject the parts you don't want?

Code:
#!/usr/bin/perl

undef $/;

$xml = <DATA>;

$xml =~ s|<[^>]+>||g;

print $xml;

__DATA__
<editor>Edited by
  <persName>
    <foreName>John</foreName>
    <surname>Barry</surname>
  </persName> and
  <persName>
    <foreName full="init" type="first">E.</foreName>
    <foreName type="middle">Gene</foreName>
  </persName>
</editor>


Kind Regards
Duncan
 
Thanks to everyone for their responses. Duncan's response is very good:
$xml =~ s|<[^>]+>||g;
But you still need to pass the resulting $xml into an array and count the words. This worked correctly compared to the pattern searches given by others.

As for Paul, Bardley is right. XML::Parser is hard to use when you need to calculate the number of words and do other in-depth analysis of tagging. XML::Parser works well when you need to parse XML files with certain limitations.
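
For anyone landing here later, the two steps (strip the tags, then put what is left into an array and count it) might be combined roughly as follows. The helper name count_words_outside_tags is hypothetical, and the sketch assumes tags never contain a literal ">" inside an attribute value and that there are no comments or CDATA sections. It also replaces each tag with a space rather than nothing, so that words separated only by tags are not glued together.

```perl
use strict;
use warnings;

# Hypothetical helper: removes anything between < and >, then counts the
# whitespace-separated tokens that remain.
sub count_words_outside_tags {
    my ($xml) = @_;
    $xml =~ s/<[^>]+>/ /g;          # each tag becomes a space so neighbours stay separate
    my @words = split ' ', $xml;    # split ' ' ignores leading whitespace
    return scalar @words;
}

# The sample from the original question.
my $sample = <<'END_XML';
<editor>Edited by
<persName>
<foreName>John</foreName>
<surname>Barry</surname>
</persName> and
<persName>
<foreName full="init" type="first">E.</foreName>
<foreName type="middle">Gene</foreName>
</persName>
</editor>
END_XML

print "Word count: ", count_words_outside_tags($sample), "\n";
```

Splitting on whitespace counts "E." as one word; matching /\w+/g instead would give the same total here but treats punctuation differently, so pick whichever definition of "word" fits the analysis.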
 