Regex find sentence 3

Status
Not open for further replies.

three57m

Programmer
Jun 13, 2006
202
US
Hi

Does anyone know of a regular expression pattern to find sentences?
I came up with this one based on several web searches. However, it only finds one sentence even though I have Global set to True.


"[a-zA-Z].*([.!?]\s|[.!?])"

I am using like so:
Code:
MyRegExp.IgnoreCase = True
MyRegExp.Global = True
MyRegExp.Pattern = "[a-zA-Z].*([.!?]\s|[.!?])" 
Set MyMatches = MyRegExp.Execute(RichTextBox1.SelText)
SentenceCounter = MyMatches.Count

Any help is appreciated. Thank you.

Ron
 
Hi,

How about this pattern?
Code:
"[a-zA-Z].*?[\.!\?]"
The ? after the star makes it lazy, so it stops at the first sentence-ending character it finds (rather than the last, as your current pattern does). I've also escaped the . and the ? with \, since both are special characters in regular expressions.
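To see the difference in counts, here is a minimal sketch that runs both patterns through the same VBScript.RegExp object. The helper name CountMatches and the sample text are made up for illustration; they are not from the original post.

```vbscript
Option Explicit

' Count the matches of a pattern in a text (hypothetical helper).
Function CountMatches(ByVal pattern, ByVal text)
    Dim re
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True
    re.Pattern = pattern
    CountMatches = re.Execute(text).Count
End Function

Dim sample
sample = "One. Two! Three?"   ' made-up sample text

' Greedy .* runs on to the last punctuation mark: 1 match
WScript.Echo CountMatches("[a-zA-Z].*([.!?]\s|[.!?])", sample)

' Lazy .*? stops at the first punctuation mark: 3 matches
WScript.Echo CountMatches("[a-zA-Z].*?[\.!\?]", sample)
```

Saved as a .vbs file, this can be run with cscript //nologo. In VB6, Debug.Print would stand in for WScript.Echo.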

Hope this helps

HarleyQuinn
---------------------------------
The most overlooked advantage to owning a computer is that if they foul up there's no law against wacking them around a little. - Joe Martin

Get the most out of Tek-Tips, read FAQ222-2244 before posting.
 
I'll take a stab at it and guess that it's failing because * is greedy, so the pattern matches your entire text.

Let's analyze the pattern:
1. [a-zA-Z] is any lowercase or uppercase letter
2. .* is any number of any character
3. ([.!?]\s|[.!?]) is sentence-ending punctuation, optionally followed by a whitespace character

We're already guessing that the problem is at 2, but I don't know whether the VBScript implementation recognizes \s, so keep that in mind as a possible point of failure.

Now let's try to fix the pattern:
1. [a-z] start with any alpha character (you're already ignoring case)
2. [^\.!?]* is any number of characters that aren't punctuation. . is special, so it must be escaped. I don't know whether ! or ? are special in this implementation.
3. [\.!?] is sentence-ending punctuation. Let's try it first without bothering about whitespace

So we have
"[a-z][^\.!?]*[\.!?]"

Start with that and see what you get. (I'm obviously not testing this in VB.)
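For anyone who does want to try it, a small sketch along these lines lists each match so the boundaries are visible. ListMatches and the sample text are hypothetical names chosen for this example. The second call drops the backslash inside the brackets: within a bracket expression . and ? are literal, so the escape is optional there and both forms should give the same matches.

```vbscript
Option Explicit

' Return every match of a pattern in brackets, one per line (hypothetical helper).
Function ListMatches(ByVal pattern, ByVal text)
    Dim re, m, result
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True
    re.Pattern = pattern
    result = ""
    For Each m In re.Execute(text)
        result = result & "[" & m.Value & "]" & vbCrLf
    Next
    ListMatches = result
End Function

Dim sample
sample = "Is it late? Yes. Go home!"   ' made-up sample text

' Pattern as posted, with the escaped dot
WScript.Echo ListMatches("[a-z][^\.!?]*[\.!?]", sample)

' Same pattern without the backslash inside the brackets
WScript.Echo ListMatches("[a-z][^.!?]*[.!?]", sample)
```

Each call lists three matches: [Is it late?], [Yes.] and [Go home!].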
 
harebrain said:
I'll take a stab at it and guess that it's failing because * is greedy, so the pattern matches your entire text.
Yep [smile]

harebrain said:
I'm obviously not testing this in VB.
It works though [wink]

Cheers

HarleyQuinn
 
And your explanation of the pattern (and why ours work and the OP's didn't) was a lot better than mine [wink]

HarleyQuinn
 
I should have guessed that \? would be needed. And, before we break our arms patting each other on the back :) I think both will fail if there's an ellipsis in the text.
 
Technically they'll both still work (they will still pick out the sentences correctly, and the OP seems to want counts); the match just won't include the last "..".

If it wasn't about 1am here I'd write one that does, but my bed is calling me [sleeping]

HarleyQuinn
 
Thank you both. However, the example paragraph below
returns 4 instead of the desired 3.

Test paragraph:
Storytelling has existed as long as humanity has had language. It's the world of myth, of history, of the imagination...it explains life. Every culture has its stories, legends, and every culture has its storytellers, often revered figures with the magic of the tale in their voices and minds.
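The count of 4 can be traced to the ellipsis. Here is a sketch that prints each match for this paragraph, using the negated-class pattern suggested earlier in the thread. The variable names are illustrative only.

```vbscript
Option Explicit

Dim re, m, para
para = "Storytelling has existed as long as humanity has had language. " & _
       "It's the world of myth, of history, of the imagination...it explains life. " & _
       "Every culture has its stories, legends, and every culture has its " & _
       "storytellers, often revered figures with the magic of the tale in " & _
       "their voices and minds."

Set re = CreateObject("VBScript.RegExp")
re.Global = True
re.IgnoreCase = True
re.Pattern = "[a-z][^\.!?]*[\.!?]"

' Print the number of matches, then each match in brackets
WScript.Echo re.Execute(para).Count
For Each m In re.Execute(para)
    WScript.Echo "[" & m.Value & "]"
Next
```

The second match ends at the first dot of "imagination...", and "it explains life." is then picked up as a match of its own, which is where the extra sentence comes from.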
 
A sentence can be a really complicated object to define. Assuming that [1] a sentence ends with a period, exclamation mark, or question mark followed by blank space, and [2] it starts with an upper-case letter, you can do this.
Code:
With MyRegExp
    .Global = True
    .IgnoreCase = False
    .Pattern = "[A-Z][\w\W]*?(\.|!|\?)(\s+|$)"
End With
This captures the trailing space(s) as part of the match. You can use a lookahead (supported) to avoid that, or use Trim on the match results.
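A rough sketch of the lookahead variant follows. The trailing (\s+|$) group is replaced by (?=\s|$), which checks for whitespace or the end of the text without making it part of the match. The helper name SentenceList and the sample text are made up for this example.

```vbscript
Option Explicit

' Return each sentence in brackets, one per line (hypothetical helper).
' A sentence here is: a capital letter, then as little as possible, up to
' . ! or ? that is followed by whitespace or the end of the text.
Function SentenceList(ByVal text)
    Dim re, m, result
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = False
    re.Pattern = "[A-Z][\w\W]*?[.!?](?=\s|$)"
    result = ""
    For Each m In re.Execute(text)
        result = result & "[" & m.Value & "]" & vbCrLf
    Next
    SentenceList = result
End Function

' Made-up sample with an ellipsis in the middle of a sentence
WScript.Echo SentenceList("It was late...too late. We left!")
```

This lists [It was late...too late.] and [We left!], with no trailing space inside the brackets. The dots of the ellipsis are skipped as sentence ends because none of them is followed by whitespace.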
 
tsuji

This suits the need perfectly.

Thank You.
 
>Watch out for numbers, though.


Could you elaborate?
 
myself said:
Now let's try to fix the pattern:

1. [a-z] start with any alpha character (you're already ignoring case)

Numeric characters at the beginning of a "sentence" as we've defined it will be ignored. This can be easily remedied by changing [a-z] to [a-z0-9]. If you expect any other punctuation or special characters at the beginning of a sentence, perhaps a quote or parenthesis, you can throw that in there too. Just keep in mind that some of those characters, such as a paren, are special in regexes and will need to be escaped with a \.
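As a sketch of that change, the snippet below compares the first match with and without the wider starting class. FirstMatch and the sample text are hypothetical. The double quote is doubled because it sits inside a VB string literal, and the paren is escaped with \ as described above.

```vbscript
Option Explicit

' Return the first match of a pattern, or "" when nothing matches
' (hypothetical helper).
Function FirstMatch(ByVal pattern, ByVal text)
    Dim re, matches
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True
    re.Pattern = pattern
    Set matches = re.Execute(text)
    If matches.Count > 0 Then
        FirstMatch = matches(0).Value
    Else
        FirstMatch = ""
    End If
End Function

Dim sample
sample = "3 men left. (They returned!)"   ' made-up sample text

' Letters only: the leading "3 " is dropped from the match
WScript.Echo FirstMatch("[a-z][^.!?]*[.!?]", sample)

' Letters, digits, double quote and opening paren may start a sentence
WScript.Echo FirstMatch("[a-z0-9""\(][^.!?]*[.!?]", sample)
```

The first call prints "men left." and the second prints "3 men left.". The sentence count is the same for both patterns here; only the start of the matched text changes.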
 