Puzzling word count after removing tags

Ramnarayan · Jul 21, 2004

Hi folks,

I am trying to find the # of words in an XML file. The tricky part is that we need to count only the words outside the tags, not the ones inside them. Here is an example:

<editor>Edited by
<persName>
<foreName>John</foreName>
<surname>Barry</surname>
</persName> and
<persName>
<foreName full="init" type="first">E.</foreName>
<foreName type="middle">Gene</foreName>
</persName>
</editor>

Now, I am trying to count the words "Edited by", "John", "Barry", "and", "E." and "Gene" from the above list. As you can see, anything starting with "<" and ending with ">" counts as a tag, like <editor>...</editor>.

Here is the script I wrote. But it counts only one word, i.e. only "Edited by". In fact it should count two words there. Where am I going wrong?

$line =~ s,<[/]?.*?>,,g;
if ($line =~ m/(\s\b(.+)\b\s)/g)
{
++$wc if ( defined $1);
}
else
{
fatal ("cannot pull word from $.:\n $!");
}
print "Word count: $wc\n";

PaulTEG · Jul 22, 2004

Why write your own XML parser? There are a few available on

http://search.cpan.org

HTH
--Paul

It's important in life to always strike a happy medium, so if you see someone with a crystal ball, and a smile on their face ...

figmatalan · Jul 22, 2004

It's not pretty but this works:

$str = "
<editor>Edited by
<persName>
<foreName>John</foreName>
<surname>Barry</surname>
</persName> and
<persName>
<foreName full=\"init\" type=\"first\">E.</foreName>
<foreName type=\"middle\">Gene</foreName>
</persName>
</editor>
";

print "$str\n\n";

$str =~ s/<.*?>//g;   # removes opening and closing tags alike
$str =~ s/^\s+//;     # drop leading whitespace so split yields no empty first field

@words = split (/\s+/, $str);

print "words = ".@words."\n";

$count = 0;
foreach (@words) {
$count++;
print "$count - $_\n";
}

bardley · Jul 22, 2004

Or, if you want to do it as it comes in from the file,

Code:

$wc = 0;
while (<XMLFILE>) {
    chomp;
    s!\s*</?.+?/?>\s*!!g;
    if (/\w/) {  # quick sanity check: skip lines with no word characters left
        @words = split /\s+/;
        $wc += scalar @words;
    }
}
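One thing to watch in the loop above: the substitution also removes the whitespace around each tag, so inline markup such as `one <b>two</b> three` collapses to `onetwothree` and is counted once. In addition, `split /\s+/` returns an empty first field when a line starts with whitespace. Here is a minimal sketch of a variant, assuming it is acceptable to replace each tag with a single space (which would split a word like `<b>un</b>done` in two). The helper name `count_words_in_line` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count whitespace-separated words on one line
# after replacing every <...> tag with a single space.
sub count_words_in_line {
    my ($line) = @_;
    $line =~ s/<[^>]*>/ /g;        # tag -> space, so neighbouring words stay apart
    my @words = split ' ', $line;  # the special ' ' form skips leading whitespace
    return scalar @words;
}

print count_words_in_line('one <b>two</b> three'), "\n";         # prints 3
print count_words_in_line('  <foreName>John</foreName>'), "\n";  # prints 1
```

Inside the `while (<XMLFILE>)` loop this would be used as `$wc += count_words_in_line($_);`.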

Also, I could be wrong, but I'm pretty sure strict xml doesn't allow things like

<tag1>someval<tag2></tag2></tag1>

so that may make using an xml parser difficult.

Brad Gunsalus
bardley90@hotmail.com

laserBeam · Jul 22, 2004

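When you use /g in list context, the match returns a list of all the matches rather than a single true/false value. Inside your "if" the match is in scalar context, so it only finds one match and $wc is incremented once.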

Code:

  {
    $line =~ s,<[/]?.*?>,,g;
    @a = ($line =~ m/\w+/g);  
    $wc += scalar(@a); # or $wc += @a;
  }
  print "Word count: $wc\n";

or if you want to be very fancy ;-), you can use an expansion ...

Code:

  {
    $line =~ s,<[/]?.*?>,,g;
    $wc += @{[($line =~ /\w+/g)]};
  }
  print "Word count: $wc\n";
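To make the scalar versus list context difference concrete, here is a small sketch. The sample string and variable names are only for illustration; note that `\w+` treats "E." as the word "E" and would split "don't" in two.

```perl
use strict;
use warnings;

my $text = 'Edited by John Barry';

# List context: /g returns every match at once.
my @all        = $text =~ /\w+/g;
my $list_count = @all;

# Same thing without a named array: assign to an empty list,
# then take the number of elements that were assigned.
my $idiom_count = () = $text =~ /\w+/g;

# Scalar context (as inside "if"): one evaluation finds only one match.
my $if_count = 0;
$if_count++ if $text =~ /\w+/g;

# Scalar context in a loop: each evaluation continues from the last match.
pos($text) = 0;    # restart the search at the beginning of the string
my $loop_count = 0;
$loop_count++ while $text =~ /\w+/g;

print "list: $list_count, idiom: $idiom_count, if: $if_count, loop: $loop_count\n";
# prints: list: 4, idiom: 4, if: 1, loop: 4
```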

Cheers!

duncdude · Jul 23, 2004

Why not do the opposite? Instead of finding the words you want, reject the parts you don't want.

Code:

#!/usr/bin/perl

undef $/;

$xml = <DATA>;

$xml =~ s|<[^>]+>||g;

print $xml;

__DATA__
<editor>Edited by
  <persName>
    <foreName>John</foreName>
    <surname>Barry</surname>
  </persName> and
  <persName>
    <foreName full="init" type="first">E.</foreName>
    <foreName type="middle">Gene</foreName>
  </persName>
</editor>

Kind Regards
Duncan

Ramnarayan · Jul 23, 2004

Thanks to everyone for the responses. Duncan's response is very good:
$xml =~ s|<[^>]+>||g;
But you need to pass the resulting $xml to an array and count the words. This pattern worked correctly compared to the pattern searches given by others.

As for Paul, Bardley is right. XML::Parser is hard to use when you need to calculate the # of words and do other in-depth analysis of the tagging. XML::Parser really works when you need to parse XML files with certain limitations.
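Putting the two steps together (strip the tags, then put the remaining text into an array and count it) might look like the sketch below. It keeps Duncan's substitution unchanged. The in-memory filehandle is an assumption standing in for the real XML file, and the approach does not handle comments or CDATA sections containing ">", or entities such as `&amp;`.

```perl
use strict;
use warnings;

my $sample = <<'END_XML';
<editor>Edited by
  <persName>
    <foreName>John</foreName>
    <surname>Barry</surname>
  </persName> and
  <persName>
    <foreName full="init" type="first">E.</foreName>
    <foreName type="middle">Gene</foreName>
  </persName>
</editor>
END_XML

# In-memory filehandle standing in for the real XML file.
open my $fh, '<', \$sample or die "cannot open sample: $!";
my $xml = do { local $/; <$fh> };   # slurp: $/ is undefined only inside this block
close $fh;

$xml =~ s|<[^>]+>||g;               # Duncan's substitution; [^>] also spans newlines
my @words = split ' ', $xml;        # split ' ' ignores leading whitespace
my $wc    = @words;                 # array in scalar context gives the element count

print "Word count: $wc\n";          # prints: Word count: 7
```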

duncdude · Jul 23, 2004

Thanks dude!

Kind Regards
Duncan
