Get 1st sentence with regex

Status
Not open for further replies.

TipGiver

Programmer
Sep 1, 2005
1,863
Hi everyone.

I would like to get the first sentence of a variable. This PHP variable actually holds HTML text. Previously I have used a combination of the functions strpos and substr to get the text up to the first '.'
The problem is that if, for example, there is an image tag before the first real period, the image won't show.

$var = "Blah blah <img src=f1/f2/f3/image.gif/> blah blah.";

The first period is the one in the image's file name.
So substr($var, 0, strpos($var, '.'))
will return "Blah blah <img src=f1/f2/f3/image"
and the image won't show (after echoing the var).

Can anyone help fix that?
 
Hi jpadie
I have thought of it too. The problems with that are
- I may make a mistake and not hit the spacebar, so ". " will not be found.
- Dots and spaces are generally allowed in URLs, although hardly anyone will create a URL containing ". "
- Most important, if the user has typed some sentences that of course end with "." (no space after) and then hits enter (<p> HTML tags), there won't be any ". " at all.

As you will have understood, $var contains data that someone has typed in (and stored somewhere, in a database for example).
 
Hi

TipGiver said:
- I may make a mistake and not hit the spacebar, so ". " will not be found.
'. ' is quite restrictive. I would use the regular expression '[.!?][[:space:]]', which means any end-of-sentence punctuation followed by any whitespace character.
TipGiver said:
- Dots and spaces are generally allowed in URLs, although hardly anyone will create a URL containing ". "
Spaces are not allowed in URLs; they have to be URL-encoded as '+' or '%20'.
TipGiver said:
- Most important, if the user has typed some sentences that of course end with "." (no space after) and then hits enter (<p> HTML tags), there won't be any ". " at all.
If possible, I usually process the text before converting it to HTML. In that case my previous regular expression will work. Otherwise you can change it to '[.!?]([[:space:]]|<p>)' or, even better, to '[.!?]([[:space:]]|</?(p|br)[[:space:]]?/?>)'.

Feherke.
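
For reference, here is a minimal sketch of how the last expression above could be used with preg_match. The function name first_sentence_regex is hypothetical, and two details are my own assumptions rather than part of the suggestion: the boundary is checked with a lookahead so the tag is not consumed, and end-of-string is also accepted as a boundary. It assumes PHP 7 or later because of the type declarations.

```php
<?php
// Hypothetical helper (not from the thread): returns everything up to and
// including the first '.', '!' or '?' that is followed by whitespace, by a
// <p>/<br> tag (opening, closing or self-closing), or by the end of the text.
function first_sentence_regex(string $html): string
{
    // '~' is used as the delimiter so that '/' needs no escaping.
    $pattern = '~[.!?](?=[[:space:]]|</?(?:p|br)[[:space:]]?/?>|$)~i';

    if (preg_match($pattern, $html, $m, PREG_OFFSET_CAPTURE)) {
        // $m[0][1] is the byte offset of the punctuation mark itself.
        return substr($html, 0, $m[0][1] + 1);
    }

    // No boundary found: treat the whole text as a single sentence.
    return $html;
}

$var = 'Blah blah <img src=f1/f2/f3/image.gif/> blah blah. More text.';
echo first_sentence_regex($var), "\n";
```

The period in "image.gif" is skipped because it is followed by "g", not by whitespace or a tag, so the img tag stays intact in the result.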
 
If you really want the image as well, I'm afraid it may not be possible without specifying what you want in specific circumstances. For instance, the alt attribute of an image may easily contain a few sentences. I would strip all HTML from the text (there is a function to do that in PHP, but I do not remember what it is called) and search that.
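
The PHP function for stripping HTML is strip_tags. A rough sketch of the strip-then-search idea follows; the function name first_sentence_plain is hypothetical, and collapsing runs of whitespace is my own assumption about what is wanted. It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the thread): strips the markup first, then
// returns the first sentence of the remaining plain text.
function first_sentence_plain(string $html): string
{
    // strip_tags() removes the tags together with their attributes, so file
    // names and alt text can no longer supply a false period.
    $text = strip_tags($html);

    // Removing a tag can leave two spaces side by side; collapse them.
    $text = trim(preg_replace('~[[:space:]]+~', ' ', $text));

    // Split once at the first whitespace that follows '.', '!' or '?'.
    $parts = preg_split('~(?<=[.!?])[[:space:]]+~', $text, 2);

    return $parts[0];
}

$var = 'Blah blah <img src="f1/f2/f3/image.gif" alt="A cat. Sleeping."> blah blah. Second sentence.';
echo first_sentence_plain($var), "\n";
```

Note that the image is lost in the result, and that text like "One.<p>Two" becomes "One.Two" after stripping, so paragraph breaks without surrounding whitespace are not seen as sentence boundaries in this sketch.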

Even so, abbreviations, names and acronyms can spoil quite a lot. If you abbreviate General Pardon (Holland's most famous general) to "gen. Pardon", you have a period that can only be recognized as not ending a sentence by knowing what the text means.


+++ Despite being wrong in every important aspect, that is a very good analogy +++
Hex (in Darwin's Watch)
 
Thank you all for your answers. Hm, the abbreviations can be a problem. Well, maybe instead of getting the first sentence, I should get the first 100 words!

I'll think about what to do.
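
One possible way to take the first 100 words is sketched below. The function name first_words is hypothetical, and stripping the HTML first is an assumption on my part: without it, the cut could land in the middle of a tag. The trailing '...' is also just an illustrative choice. It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the thread): returns the first $limit words
// of the plain-text content, with '...' appended when the text was cut.
function first_words(string $html, int $limit = 100): string
{
    $text = trim(strip_tags($html));

    // Split on any run of whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces.
    $words = preg_split('~[[:space:]]+~', $text, -1, PREG_SPLIT_NO_EMPTY);

    if (count($words) <= $limit) {
        return implode(' ', $words);
    }

    return implode(' ', array_slice($words, 0, $limit)) . '...';
}

echo first_words('<p>one two three four</p>', 2), "\n";
```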
 