regex pita 2

jrig · Jan 26, 2007

Code:

$ln=~ s/\.\s+/\.  /g;

I'm looping through the lines in a file; the aim is to ensure that all sentences are separated by two spaces. The period isn't necessarily at the end of a line.

I would like to modify the code so it will not pad 'Mr.' or 'Dr.', or really any period that isn't the end of a sentence (that may be a bit much). I realize that

Code:

if ($ln !~/[MD]r\./){$ln=~ s/\.\s+/\.  /g;}

would do the trick, but it seems pretty inefficient/incorrect, since a line could have 'Dr.' as well as a terminating '.' in it. Is there a better way to accomplish this? Thanks.
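To make the concern concrete, here is a small sketch of how the line-level guard behaves. The sub name `pad_with_guard` and the sample lines are made up for illustration; the two regexes are the ones from the post.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the guard from the post:
# leave the whole line alone if it mentions "Mr." or "Dr.".
sub pad_with_guard {
    my ($ln) = @_;
    if ($ln !~ /[MD]r\./) {
        $ln =~ s/\.\s+/.  /g;
    }
    return $ln;
}

# No honorific on the line: every ". " becomes ".  "
print pad_with_guard("One. Two. Three."), "\n";
# One.  Two.  Three.

# A single "Dr." switches off padding for the real sentence breaks too
print pad_with_guard("Ask Dr. Smith. He knows. Really."), "\n";
# Ask Dr. Smith. He knows. Really.
```

The second line comes back untouched, which is the "incorrect" half of the worry: the test is made per line, not per period.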

MillerH · Jan 26, 2007

The primary concern here is catching too many cases, so I would suggest narrowing your logic to the specific case of a sentence followed by exactly one space.

To avoid honorifics, you can take one of two approaches: list all the ones you want to skip, or only change sentence endings after words longer than three characters. The second approach would also skip abbreviations like St. and Ave. as well as honorifics.

Code:

# Simplest Regex
$ln =~ s/\b\. \b/.  /g;

# Avoid Specific Honorifics and Abbreviations
$ln =~ s/(?<!Dr|Mr|Mrs|Ms|St|Ave)\b\. \b/.  /g;

# Avoid Generic Honorifics (only catches words 4 characters or longer)
$ln =~ s/(?<=\w{4})\b\. \b/.  /g;

This should give you some ideas. Definitely avoid using \s+. That's likely to catch way too many cases. If you really want to catch the case of three spaces, look for it specifically.
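A rough sketch of what "too many cases" can look like with \s+, next to the narrower patterns. The helper names and sample strings are assumptions for illustration only.

```perl
use strict;
use warnings;

# Each hypothetical helper applies one pattern to a copy of the line.
sub pad_greedy { my ($ln) = @_; $ln =~ s/\.\s+/.  /g;             return $ln }
sub pad_narrow { my ($ln) = @_; $ln =~ s/\b\. \b/.  /g;           return $ln }
sub pad_long   { my ($ln) = @_; $ln =~ s/(?<=\w{4})\b\. \b/.  /g; return $ln }

# \s also matches "\n", so an unchomped line loses its newline.
my $greedy = pad_greedy("End of line.\n");    # "End of line.  "

# The narrow pattern wants one space with a word character on each side.
my $narrow = pad_narrow("End of line.\n");    # unchanged, newline kept

# The length test skips "Dr." but also skips a real break after a short word.
my $long = pad_long("Dr. Watson arrived. Go on. Sit down.");

print "$long\n";
# Dr. Watson arrived.  Go on. Sit down.
```

The last line shows the cost of the length heuristic: "Go on. Sit" is a genuine sentence break that stays single-spaced.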

ishnid · Jan 26, 2007

The second option won't work, as you're trying to use a variable-length lookbehind.

You could use multiple lookbehinds though, since they're zero-width:

Code:

$ln =~ s/(?<!Dr)(?<!Mr)\b\. \b/.  /g;
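As a quick check of the stacked form, here is a minimal sketch with a made-up sample line. Each lookbehind has a fixed length of its own, which is what the alternation version lacked (how strictly Perl enforces that may depend on the version you run).

```perl
use strict;
use warnings;

my $ln = "Mr. Jones met Dr. Smith. They talked. Mrs. Jones left.";

# Every (?<!...) is tested at the same position, just before the period.
(my $out = $ln) =~ s/(?<!Dr)(?<!Mr)\b\. \b/.  /g;

print "$out\n";
# Mr. Jones met Dr. Smith.  They talked.  Mrs.  Jones left.
```

"Mrs." still gets padded here because only "Dr" and "Mr" are listed; each further abbreviation needs its own lookbehind.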

MillerH · Jan 26, 2007

Good catch, ishnid. I also forgot to anchor those "specific" checks to whole words rather than possible word endings. Adding a small regex construction step changes my examples to the following:

Code:

# Simplest Regex
$ln =~ s/\b\. \b/.  /g;

# Avoid Specific Honorifics and Abbreviations
# Builds one fixed-length negative lookbehind per abbreviation
my $noAbbrevs = join '', map {"(?<!\\b$_)"} qw(Dr Mr Mrs Ms St Ave);
$ln =~ s/$noAbbrevs\b\. \b/.  /g;

# Avoid Generic Honorifics (only catches words 4 characters or longer)
$ln =~ s/(?<=\w{4})\b\. \b/.  /g;
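To see what the construction step produces and what the \b inside each lookbehind buys, here is a short sketch. The sample sentence is invented; the abbreviation list is the one from the post.

```perl
use strict;
use warnings;

my @abbrevs   = qw(Dr Mr Mrs Ms St Ave);
my $noAbbrevs = join '', map { "(?<!\\b$_)" } @abbrevs;

print "$noAbbrevs\n";
# (?<!\bDr)(?<!\bMr)(?<!\bMrs)(?<!\bMs)(?<!\bSt)(?<!\bAve)

# "ATMs" merely ends in "Ms"; "Ms" and "St" are whole words.
my $ln = "Use the ATMs. Then see Ms. Lee on Elm St. today.";
$ln =~ s/$noAbbrevs\b\. \b/.  /g;

print "$ln\n";
# Use the ATMs.  Then see Ms. Lee on Elm St. today.
```

Without the \b, the lookbehind for "Ms" would also block the break after "ATMs".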

stevexff · Jan 26, 2007

What about splitting sentences that end with ? and !?

Even splitting on [\.?!]\s+[A-Z], on the idea that a sentence terminator followed by one or more spaces and a capital letter marks a sentence break, falls down on the "Dr. Watson" example.

Whatever you do there's going to be a big trade off between false positives and false negatives.

Steve

"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::PerlDesignPatterns)
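The "Dr. Watson" problem with the terminator-plus-capital pattern can be seen in a few lines. This is only a sketch with an invented sample string; it lists every place the pattern would treat as a sentence break.

```perl
use strict;
use warnings;

my $text = "Dr. Watson arrived! Was he late? No. He was early.";

# Collect each spot the candidate pattern takes for a sentence break.
my @hits = $text =~ /([.?!]\s+[A-Z])/g;

print join(" | ", @hits), "\n";
# . W | ! W | ? N | . H
```

The first hit is the false positive after "Dr"; the other three are real breaks.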

MillerH · Jan 26, 2007

stevexff said:
Whatever you do there's going to be a big trade off between false positives and false negatives.

That's slightly alarmist. There is a trade-off, yes, but only a minor one.

I would argue that false positives are of greater concern than false negatives. If you don't catch all cases, that's ok as at least you're not making things worse. That's why I narrowed down the logic in my regex to only a single space boundary and either a 4 character last word limit or specific filtering for abbreviations and honorifics.

If you want to catch other punctuation characters, then simply add another regex for those.

Code:

$ln =~ s/\b(\!|\?) \b/$1  /g;

It's not perfect, no, but it will make things considerably better, and it is definitely more complete than the regex he was using.
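A small sketch of the extra substitution in action, written with a character class instead of the alternation (the two should behave the same here). The sample line is made up.

```perl
use strict;
use warnings;

my $ln = 'Really? Yes! He said "stop!" Then he left.';

# [!?] stands in for (\!|\?); the captured mark is put back via $1.
(my $out = $ln) =~ s/\b([!?]) \b/$1  /g;

print "$out\n";
# Really?  Yes!  He said "stop!" Then he left.
```

The break after the closing quote is left alone, since the "!" there is followed by a quote rather than a space. That is a false negative, which fits the stated preference for missing a case over making things worse.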

jrig · Jan 29, 2007

Thanks for all the great responses, learned just as much by reading them. I guess a regex's power is directly proportional to its complexity!
Here's what I'm going with:

Code:

$ln =~ s/(?<!Dr)(?<!Mr)(?<!Mrs)\b(\.|\!|\?) \b/$1  /g;

I realize it's expensive, imperfect (and ugly?), but given the size and requirements of the app, it'll work.
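For anyone reusing this, a table of sample lines is a cheap way to see where the final regex stands. The sketch below wraps the substitution in a hypothetical sub, `pad_sentences`, and the test lines are invented.

```perl
use strict;
use warnings;

# The final substitution from the post, wrapped in a hypothetical helper.
sub pad_sentences {
    my ($ln) = @_;
    $ln =~ s/(?<!Dr)(?<!Mr)(?<!Mrs)\b(\.|\!|\?) \b/$1  /g;
    return $ln;
}

# Each entry: input line, then what the regex actually produces.
my @cases = (
    [ "It works. Mostly.",       "It works.  Mostly."       ],
    [ "Ask Mrs. Jones. Or not.", "Ask Mrs. Jones.  Or not." ],
    [ "See Ms. Lee today.",      "See Ms.  Lee today."      ],  # "Ms" is not listed
    [ "Use tabs, e.g. in code.", "Use tabs, e.g.  in code." ],  # "e.g." is padded too
);

for my $case (@cases) {
    my ($in, $want) = @$case;
    my $got = pad_sentences($in);
    printf "%-4s %s\n", ($got eq $want ? "ok" : "DIFF"), $got;
}
```

The last two rows are the known gaps: anything not in the lookbehind list is treated as a sentence end.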
