Sentence RegExp

Status
Not open for further replies.

Aarem

Programmer
Oct 11, 2004
69
US
What is the standard RegExp for a sentence?
 
Start by defining what you consider a regular sentence. Remember that the regex itself can be described in a sentence: a capital letter, then just about anything printable, followed by a period (full stop).
Code:
/^[A-Z].*\.$/

But I wouldn't swear by that. If you were talking about finance, an amount like $3.60 (three dollars, sixty cents) puts a period in the middle of the sentence and would make a muck of that by ending it prematurely.
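
To make the pattern above concrete, here is a minimal sketch that tries it against a few strings. The helper name looks_like_sentence and the sample strings are made up for illustration, and the anchored form only checks whether a whole string looks like one sentence; it does not pull sentences out of longer text.

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): true if the whole string
# starts with a capital letter and ends with a period.
sub looks_like_sentence {
    my ($text) = @_;
    return $text =~ /^[A-Z].*\.$/ ? 1 : 0;
}

# Sample strings, assumed for illustration only.
my @samples = (
    'It costs three dollars.',
    'it costs three dollars.',
    'It costs $3.60 today',
);

for my $text (@samples) {
    my $verdict = looks_like_sentence($text) ? "match" : "no match";
    print "$verdict: $text\n";
}
```

Because the pattern is anchored at both ends, a period inside the string does not stop the match; the trouble with amounts like $3.60 shows up once you try to find sentence boundaries inside a longer text.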

Would you care to narrow the specification?

--Paul

Serious about the regex though, not my strong suit

Nancy Griffith - songstress extraordinaire,
and composer of the snipers anthem "From a distance ...
 
I'm not aware that there is a standard regex for a sentence.
As Paul says above, how do you define a sentence?

Spurred on by your post, though, I came up with the following, which works with the data I used, anyway. I've defined a sentence as any string of chars terminating in a ., ?, or !, and followed by 2 or more spaces or the end of the string.

If anyone has other ideas, I'd be interested to hear about them.
Code:
#!perl
use strict;
use warnings;

my $data;
{
    local $/ = "";
    $data = <DATA>;
}
$data =~ s/\n(?!$)/ /sg;
my @sentences = $data =~ /(.+?[.?!])(?:\s\s+|$)/g;
for (my $i=0; $i<@sentences; $i++) {
    print "$i: $sentences[$i]\n";
}

__DATA__
There are 5 rooms in Mr. Tanaka's house.  On the first floor are the living
room and the kitchen.  Outside the living room window, we see a garden.  In
the garden, we see Mr. Tanaka playing ball with his son.  On the second floor
are the master bedroom, a bathroom, and the children's bedroom.  There is no
TV in the children's bedroom, but there are a stereo and a computer.
Output:
Code:
0: There are 5 rooms in Mr. Tanaka's house.
1: On the first floor are the living room and the kitchen.
2: Outside the living room window, we see a garden.
3: In the garden, we see Mr. Tanaka playing ball with his son.
4: On the second floor are the master bedroom, a bathroom, and the children's bedroom.
5: There is no TV in the children's bedroom, but there are a stereo and a computer.


 
I am looking for a sentence that matches the following:

-Case-insensitive
-Ends with period, semicolon, question mark, or exclamation point.
-Has a whitespace character after the end punctuation

I would also like to know if the sentence contains a colon, or whatever I specify. And if it does, then add a <SPAN> to the beginning and a </SPAN> to the end of the sentence.
 
My earlier code can be easily adapted to do this. I changed the sentence terminators to include a semicolon and decreased the minimum number of whitespaces after a terminator to 1, and added a sub. Note that this new definition of a sentence gives us different results, though, as some sentences in my example contain a period followed by a space. Your data doesn't, hopefully.
Code:
#!perl
use strict;
use warnings;

my $data;
{
    local $/ = "";
    $data = <DATA>;
}
$data =~ s/\n(?!$)/ /sg;
my @sentences = $data =~ /(.+?[.;?!])(?:\s+|$)/g;
for (my $i=0; $i<@sentences; $i++) {
    $sentences[$i] = process($sentences[$i], ":");
    print "$i: $sentences[$i]\n";
}

sub process {
    my ($string, $pattern) = @_;
    $string =~ /$pattern/? "<SPAN>".$string."</SPAN>": $string;
}

__DATA__
There are 5 rooms in Mr. Tanaka's house.  On the first floor: the living
room and the kitchen.  Outside the living room window, we see a garden.  In
the garden, we see Mr. Tanaka playing ball with his son.  On the second floor:
the master bedroom, a bathroom, and the children's bedroom.  There is no
TV in the children's bedroom; but there are a stereo and a computer.
Output:
Code:
0: There are 5 rooms in Mr.
1: Tanaka's house.
2: <SPAN>On the first floor: the living room and the kitchen.</SPAN>
3: Outside the living room window, we see a garden.
4: In the garden, we see Mr.
5: Tanaka playing ball with his son.
6: <SPAN>On the second floor: the master bedroom, a bathroom, and the children's bedroom.</SPAN>
7: There is no TV in the children's bedroom;
8: but there are a stereo and a computer.


 
Okay. Now I need some more help. I need to figure out how to stop it from highlighting a sentence twice because it contains two clues. Can you help?
 
You can also do it like this:

Code:
#!perl

use strict;
use warnings;

my $data;
{
    local $/ = "";
    $data = <DATA>;
}

$data =~ s/\n/ /g;
my @sent = split(/(\.|\?|\!)\s/,$data);

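# split returns sentence text and captured terminator alternately, so step by 2;
# stopping at $#sent keeps the last sentence even without trailing whitespace
for(my $i=0;$i<$#sent;$i+=2){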
	if ($sent[$i] =~ /:/){
		print "$i: <SPAN>$sent[$i]$sent[$i+1]</SPAN>\n";
	}else{
		print "$i: $sent[$i]$sent[$i+1]\n";
	}

}

__DATA__
There are 5 rooms in Mr Tanaka's house. On the first floor: the living
room and the kitchen. Outside the living room window, we see a garden. In
the garden, we see Mr Tanaka playing ball with his son? On the second floor:
the master bedroom, a bathroom, and the children's bedroom! There is no
TV in the children's bedroom; but there are a stereo and a computer.
Output:
Code:
Sentence:   There are 5 rooms in Mr Tanaka's house.
Sentence:   <SPAN>On the first floor: the living room and the kitchen.</SPAN>
Sentence:   Outside the living room window, we see a garden.
Sentence:   In the garden, we see Mr Tanaka playing ball with his son?
Sentence:   <SPAN>On the second floor: the master bedroom, a bathroom, and the children's bedroom!</SPAN>
Sentence:   There is no TV in the children's bedroom; but there are a stereo and a computer.
But if sentences are separated by only one space and you have a sentence like this
'There are 5 rooms in Mr. Tanaka's house.'
then I don't think that can be done, because the split will divide the sentence into two different ones at "Mr.".
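
One possible workaround for the abbreviation problem, as a rough sketch only: if the abbreviations are known in advance, negative lookbehinds can stop the split after them. The helper name split_sentences and the abbreviation list (Mr., Mrs., Dr.) are assumptions made for illustration; any abbreviation not in the list will still split the sentence.

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split text into sentences at
# whitespace that follows ., ? or !, unless the period belongs to one of a
# few known abbreviations. The list below is an assumption; extend as needed.
sub split_sentences {
    my ($text) = @_;
    $text =~ s/\s+/ /g;        # collapse line breaks and runs of spaces
    $text =~ s/^\s+|\s+$//g;   # trim leading and trailing whitespace
    return split /(?<=[.?!])(?<!\bMr\.)(?<!\bMrs\.)(?<!\bDr\.)\s+/, $text;
}

# Sample data, assumed for illustration only.
my $data = "There are 5 rooms in Mr. Tanaka's house. On the first floor: the living\n"
         . "room and the kitchen. Is Dr. Sato home?\n";

my @sentences = split_sentences($data);
for my $i (0 .. $#sentences) {
    print "$i: $sentences[$i]\n";
}
```

Each lookbehind has a fixed length, which keeps the pattern usable on older Perl versions that do not accept variable-length lookbehind.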
 
Sorry, in the print "$i:.......... statements,
replace the '$i:' with 'Sentence:'.
I apologise for this.

But if you really want the sentences numbered,

then do it like this:
Code:
my $j = 1;
# step by 2: each sentence is followed by its captured terminator
for(my $i=0;$i<$#sent;$i+=2){
    if ($sent[$i] =~ /:/){
        print "$j: <SPAN>$sent[$i]$sent[$i+1]</SPAN>\n";
    }else{
        print "$j: $sent[$i]$sent[$i+1]\n";
    }
    $j++;
}
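
On the earlier question about a sentence being highlighted twice when it contains two clues: one way to avoid that, sketched here under assumptions, is to test all clues in a single pass and wrap the sentence at most once. The helper name highlight_once and the example clues are hypothetical and not from the posts above.

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): wrap a sentence in <SPAN> tags
# once if it contains any of the given clues, however many of them match.
sub highlight_once {
    my ($sentence, @clues) = @_;
    return $sentence unless @clues;
    # quotemeta keeps characters such as "?" or "." literal inside the regex
    my $alternation = join "|", map { quotemeta($_) } @clues;
    return $sentence =~ /$alternation/i
        ? "<SPAN>$sentence</SPAN>"
        : $sentence;
}

# Example clues and sentences, assumed for illustration only.
my @clues = (":", "first floor");
my @sentences = (
    "On the first floor: the living room and the kitchen.",
    "Outside the living room window, we see a garden.",
);

for my $sentence (@sentences) {
    my $marked = highlight_once($sentence, @clues);
    print "$marked\n";
}
```

The /i flag makes the clue match case-insensitive, in line with the requirements listed earlier in the thread.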
 