
regex pita 2

Status
Not open for further replies.

jrig

Technical User
Sep 5, 2006
36
US
Code:
$ln=~ s/\.\s+/\.  /g;
I'm looping through the lines of a file; the aim is to ensure all sentences are separated by 2 spaces. The period isn't necessarily at the end of a line.

I would like to modify the code so it does not pad 'Mr.' or 'Dr.', or really any period that isn't the end of a sentence (that may be a bit much). I realize that
Code:
if ($ln !~/[MD]r\./){$ln=~ s/\.\s+/\.  /g;}
would do the trick, but it seems pretty inefficient/incorrect, since a line could contain 'Dr.' as well as a sentence-ending '.'. Is there a better way to accomplish this? Thanks.
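
To make that concern concrete, here is a minimal sketch comparing the guarded and unguarded substitutions. The sample line and the sub names `pad_guarded` and `pad_all` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample line: one honorific plus one real sentence break.
my $sample = "Dr. Watson arrived. He sat down.";

# Guarded version from the post: skips the whole line if it contains Mr. or Dr.
sub pad_guarded {
    my ($ln) = @_;
    if ($ln !~ /[MD]r\./) { $ln =~ s/\.\s+/.  /g; }
    return $ln;
}

# Unguarded version: pads after every period that is followed by whitespace.
sub pad_all {
    my ($ln) = @_;
    $ln =~ s/\.\s+/.  /g;
    return $ln;
}

print pad_guarded($sample), "\n";  # line comes back unchanged, so "arrived." is never padded
print pad_all($sample), "\n";      # pads after "Dr." as well as after "arrived."
```

Neither result is what is wanted: the guard drops the real sentence break, and the unguarded form pads the honorific.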
 
The primary concern here is catching too many cases. So I would suggest that you narrow down your logic, specifically to the case of a sentence followed by only one space.

To avoid honorifics, you can basically take one of two approaches. You can specifically list all the ones that you want to avoid, or you could instead change sentence endings only after words longer than three characters. The second approach would skip things like St. and Ave. as well as honorifics.

Code:
# Simplest Regex
$ln =~ s/\b\. \b/.  /g;

# Avoid Specific Honorifics and Abbreviations
$ln =~ s/(?<!Dr|Mr|Mrs|Ms|St|Ave)\b\. \b/.  /g;

# Avoid Generic Honorifics (only catches words 4 characters or longer)
$ln =~ s/(?<=\w{4})\b\. \b/.  /g;

This should give you some ideas. Definitely avoid using \s+. That's likely to catch way too many cases. If you really want to catch the case of three spaces, look for it specifically.
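
As a rough illustration of the third variant, the sketch below wraps it in a sub (`pad_long_words` is a made-up name) and runs it on two invented sample lines.

```perl
use strict;
use warnings;

# Pad only when the period follows 4+ word characters, is followed by
# exactly one space, and that space is followed by a word character.
sub pad_long_words {
    my ($ln) = @_;
    $ln =~ s/(?<=\w{4})\b\. \b/.  /g;
    return $ln;
}

# "Dr." is skipped (only 2 word characters), "arrived." is padded.
print pad_long_words("Dr. Watson arrived. He sat down."), "\n";

# False negative: "is" is a short word, so this sentence break is missed.
print pad_long_words("It is. We know."), "\n";

# Already-padded text is left alone, because the pattern matches one space only.
print pad_long_words("He arrived.  He sat."), "\n";
```

The second line shows the price of the length rule: short sentence-final words are never padded.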
 
The second option won't work, as you're trying to use a variable-length lookbehind: the alternatives differ in length (Dr is two characters, Mrs is three).

You could use multiple lookbehinds though, since they're zero-width:
Code:
$ln =~ s/(?<!Dr)(?<!Mr)\b\. \b/.  /g;
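
A small sketch of the chained lookbehinds on an invented sample line (the sub name `pad_skip_titles` is hypothetical):

```perl
use strict;
use warnings;

# Each negative lookbehind is fixed-length and zero-width, so they can be
# stacked: the period must not be preceded by "Dr" and not by "Mr".
sub pad_skip_titles {
    my ($ln) = @_;
    $ln =~ s/(?<!Dr)(?<!Mr)\b\. \b/.  /g;
    return $ln;
}

# Only the break after "Watson." should gain a second space.
print pad_skip_titles("Mr. Holmes met Dr. Watson. They talked."), "\n";
```

Since every lookbehind is tested at the same position, the order in which they are written does not change the result.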
 
Good catch, ishnid. I also forgot to make those checks match whole words rather than just word endings. So, adding a little regex construction step, that changes my examples to the following:

Code:
# Simplest Regex
$ln =~ s/\b\. \b/.  /g;

# Avoid Specific Honorifics and Abbreviations
# Builds one fixed-length negative lookbehind per abbreviation
my $noAbbrevs = join '', map {"(?<!\\b$_)"} qw(Dr Mr Mrs Ms St Ave);
$ln =~ s/$noAbbrevs\b\. \b/.  /g;

# Avoid Generic Honorifics (only catches words 4 characters or longer)
$ln =~ s/(?<=\w{4})\b\. \b/.  /g;
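
To see what the construction step produces, this sketch prints the generated pattern fragment and applies it to an invented sample line. Precompiling with `qr//` and the variable names other than `$noAbbrevs` are my own choices, not part of the post.

```perl
use strict;
use warnings;

my @abbrevs   = qw(Dr Mr Mrs Ms St Ave);
my $noAbbrevs = join '', map { "(?<!\\b$_)" } @abbrevs;

# The joined string is a chain of fixed-length negative lookbehinds:
# (?<!\bDr)(?<!\bMr)(?<!\bMrs)(?<!\bMs)(?<!\bSt)(?<!\bAve)
print "$noAbbrevs\n";

# Compile once so the pattern is not rebuilt for every line.
my $re = qr/$noAbbrevs\b\. \b/;

my $ln = "Meet Mrs. Hudson on Baker St. today. She is in.";
(my $padded = $ln) =~ s/$re/.  /g;
print "$padded\n";   # only the break after "today." is padded
```

The `\b` inside each lookbehind is what restricts the check to a whole word such as "St" rather than any word that merely ends in those letters.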
 
What about splitting sentences that end with ? and !

Even splitting on [\.?!]\s+[A-Z] on the idea that a sentence terminator followed by one or more spaces and a capital letter might work falls down on the "Dr. Watson" example.

Whatever you do, there's going to be a big trade-off between false positives and false negatives.
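
A quick sketch of that failure, using invented sample text. It uses lookarounds so that `split` keeps the terminator and the capital letter, which is an adaptation of the pattern above rather than the exact expression.

```perl
use strict;
use warnings;

my $text = "Dr. Watson arrived. Was he late? No.";

# Split after a terminator, on whitespace that is followed by a capital letter.
my @parts = split /(?<=[.?!])\s+(?=[A-Z])/, $text;

# "Dr." comes out as its own "sentence", which is the false positive.
print "[$_]\n" for @parts;
```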

Steve

"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)
 
stevexff said:
Whatever you do there's going to be a big trade off between false positives and false negatives.

That's slightly alarmist. There is a trade-off, yes, but only a minor one.

I would argue that false positives are of greater concern than false negatives. If you don't catch all cases, that's ok as at least you're not making things worse. That's why I narrowed down the logic in my regex to only a single space boundary and either a 4 character last word limit or specific filtering for abbreviations and honorifics.

If you want to catch other punctuation characters, then simply add another regex for those.

Code:
$ln =~ s/\b(\!|\?) \b/$1  /g;

It's not perfect, no. But it will make things considerably better, and it is definitely more complete than the regex that he was using.
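
For illustration, here is the same idea written with a character class, which should match the same text as the alternation. The sub name `pad_other` and the sample line are made up.

```perl
use strict;
use warnings;

# [!?] matches the same characters as (\!|\?); the capture group
# lets the replacement put back whichever terminator was found.
sub pad_other {
    my ($ln) = @_;
    $ln =~ s/\b([!?]) \b/$1  /g;
    return $ln;
}

# Periods are untouched here; they are handled by the separate regex.
print pad_other("Really? Yes! Go on. Fine."), "\n";
```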
 
Thanks for all the great responses; I learned just as much by reading them. I guess a regex's power is directly proportional to its complexity!
Here's what I'm going with:
Code:
$ln =~ s/(?<!Dr)(?<!Mr)(?<!Mrs)\b(\.|\!|\?) \b/$1  /g;
I realize it's expensive, imperfect (and ugly?), but given the size and requirements of the app, it'll work.
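
A minimal sketch of that substitution inside a line-by-line loop. An in-memory filehandle stands in for the real file, and the sample text is invented.

```perl
use strict;
use warnings;

# Stand-in for the real input file (hypothetical content).
my $text = "Mrs. Hudson called Dr. Watson. Is he in?\n"
         . "Yes! He is. Mr. Holmes is out.\n";

open my $fh, '<', \$text or die "cannot open in-memory file: $!";

my @out;
while (my $ln = <$fh>) {
    # Pad after . ! or ? unless the preceding word ends in Dr, Mr or Mrs.
    $ln =~ s/(?<!Dr)(?<!Mr)(?<!Mrs)\b(\.|\!|\?) \b/$1  /g;
    push @out, $ln;
}
close $fh;

print @out;
```

A terminator at the very end of a line is left alone, because the pattern requires a space and a word character after it.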
 