Given Position in string, find 1st word boundary before & after

Status
Not open for further replies.

sen5241b

IS-IT--Management
Sep 27, 2007
199
US
What is the easiest way, given a position in a string, to find the first word boundary before & after the position?

I've tried searching for whitespace in both directions, but this seems clunky and inelegant, and I've had trouble finding a Unicode list of ALL possible whitespace characters.
 
I've been using these characters:

Code:
$WhitespaceArray = array("\x20", "\x09", "\x00", "\x0A", "\x0D", "\x0B", "\xA0");
Space, tab, LF, CR and a few others.
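
On the problem of listing every whitespace character by hand: PCRE can do the classification itself. The following is only a minimal sketch, assuming the string is valid UTF-8 and the position is a byte offset. The helper name wordBoundsAt is hypothetical, not something from the PHP manual.

```php
<?php
// Hypothetical helper: returns the byte offsets [start, end) of the run of
// non-whitespace that contains byte offset $pos, or false if $pos sits on
// whitespace or outside the string. Assumes $str is valid UTF-8.
function wordBoundsAt($str, $pos)
{
    // \s covers tab, newline and friends; \p{Z} adds the Unicode separator
    // category (no-break space, em space, ...). The /u modifier makes PCRE
    // treat the subject as UTF-8.
    if (!preg_match_all('/[^\s\p{Z}]+/u', $str, $m, PREG_OFFSET_CAPTURE)) {
        return false;
    }
    foreach ($m[0] as $hit) {
        list($word, $start) = $hit;
        $end = $start + strlen($word);   // offsets are in bytes, even with /u
        if ($pos >= $start && $pos < $end) {
            return array('start' => $start, 'end' => $end);
        }
    }
    return false;
}

// "café", a UTF-8 no-break space, then "au lait"
$s = "caf\xC3\xA9\xC2\xA0au lait";
print_r(wordBoundsAt($s, 7));   // position inside "au"
```

Note that this treats punctuation as part of a word. If that is not wanted, \w+ with the /u modifier may be a closer fit.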

The str_word_count function also finds word boundaries. Maybe I could use this function with format 2 (the second parameter) to get the word start positions before and after the given position.

From the PHP manual:

Code:
<?php
$str = "Hello fri3nd, you're
       looking          good today!";
print_r(str_word_count($str, 1));
print_r(str_word_count($str, 2));
print_r(str_word_count($str, 1, 'àáãç3'));
echo str_word_count($str);
?>

output:


Array
(
[0] => Hello
[1] => fri
[2] => nd
[3] => you're
[4] => looking
[5] => good
[6] => today
)

Array
(
[0] => Hello
[6] => fri
[10] => nd
[14] => you're
[29] => looking
[46] => good
[51] => today
)

Array
(
[0] => Hello
[1] => fri3nd
[2] => you're
[3] => looking
[4] => good
[5] => today
)

7
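
The str_word_count idea can be sketched roughly as follows. This is an illustration under assumptions rather than a tested solution: wordAtPosition is a hypothetical helper name, the position is taken to be a byte offset, and str_word_count works on single bytes and is locale dependent, so it is not Unicode-safe.

```php
<?php
// Hypothetical helper: uses str_word_count() format 2, whose array keys are
// the byte offsets at which each word starts.
function wordAtPosition($str, $pos, $charlist = '')
{
    foreach (str_word_count($str, 2, $charlist) as $start => $word) {
        $end = $start + strlen($word);      // first byte after the word
        if ($pos >= $start && $pos < $end) {
            return array('word' => $word, 'start' => $start, 'end' => $end);
        }
    }
    return false;                           // $pos falls between words
}

$str = "Hello fri3nd, you're looking good today!";
print_r(wordAtPosition($str, 8));           // inside "fri"
print_r(wordAtPosition($str, 8, '3'));      // '3' now counts as a word character
```

The third argument mirrors the charlist parameter from the manual example, so digits or accented characters can be made part of a word when needed.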
 
i'm still not entirely sure why you would want to do this but it can be done.

instead of doing the whole thing, i wonder whether this would be enough for you. It captures each word separately and also stores its offset, so you can derive the position of the word boundaries from it. however, it clearly does not work if you are specifying a mid-point based on some criterion other than a word. But if it were a specific word, we could significantly tighten up this function. Basically, as ever in these forums, you've not given us enough information about what you are trying to achieve for us to be able to suggest an optimum solution.

Code:
$words = getWordBreaks ("This is a long string of text");
echo "<pre>". print_r($words, true) . "</pre>";

function getWordBreaks($string){
 $pattern = '/\b(\w+)\b/imx';
 $result = preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE);
 return ($result) ? $matches[1] : false;
}

this should output
Code:
Array
(
    [0] => Array
        (
            [0] => This
            [1] => 0
        )

    [1] => Array
        (
            [0] => is
            [1] => 5
        )

    [2] => Array
        (
            [0] => a
            [1] => 8
        )

    [3] => Array
        (
            [0] => long
            [1] => 10
        )

    [4] => Array
        (
            [0] => string
            [1] => 15
        )

    [5] => Array
        (
            [0] => of
            [1] => 22
        )

    [6] => Array
        (
            [0] => text
            [1] => 25
        )

)
 
That function works well but str_word_count with option 2 essentially does the same thing.
 
so how do you determine the point at which you need to measure the before and after?
 
I'm using this:

Code:
function findWSPC($checkstr, $posINcheckstr)
{
$WhitespaceArray = array("\x20", "\x09", "\x00", "\x0A", "\x0D", "\x0B", "\xA0");
$NumOfWhitespaces = count($WhitespaceArray);
// nearest whitespace at or after the position; end of string if there is none
$endpos = strlen($checkstr);
for ($x=0; $x < $NumOfWhitespaces; $x++) 
	{
	// search forward from the position itself (assumed offset)
	$tmppos = strpos($checkstr, $WhitespaceArray[$x], $posINcheckstr);
	if ($tmppos !== false and $tmppos < $endpos) { $endpos = $tmppos; }
	}
// first character after the nearest whitespace before the position; 0 if there is none
$startpos = 0;
for ($x=0; $x < $NumOfWhitespaces; $x++) 
	{
	$tmppos = rstrpos($checkstr, $WhitespaceArray[$x], $posINcheckstr);  
	if ($tmppos !== false and $tmppos > $startpos) { $startpos = $tmppos; }
	}
return array($startpos, $endpos);
}

// searches backwards from $offset; returns the index just after the match, or false
function rstrpos ($haystack, $needle, $offset)
{
    $size = strlen ($haystack);
    $pos = strpos (strrev($haystack), $needle, $size - $offset);
    if ($pos === false)
        return false;
    return $size - $pos;
}
 
would this not do the same? (i've not tested this script as i'm currently away from my desk and webserver!)

Code:
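$words = getWordBreaks ("This is a long string of text", "string");
print_r($words);

function getWordBreaks($haystack, $string){
 // match the given word as a whole word; the two empty groups capture the boundary offsets
 $pattern = '/(\b)' . preg_quote($string, '/') . '(\b)/i';
 $result = preg_match($pattern, $haystack, $match, PREG_OFFSET_CAPTURE);
 if (!$result) { return false; }
 return array('before'=>$match[1][1], 'after'=>$match[2][1]);
}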
 
That is simpler and cleaner but I wonder how much extra overhead is incurred by using a REGEX (in your example) versus strpos (in mine).
 
in php5 there is some overhead. how much in real terms depends on your webserver. in practice over the internet the relative latency makes the use of regex on a reasonably modern server negligible. this is my perception rather than any measurable reality. I use WordPress a lot, which in turn makes use of a lot of regex calls, and although WP is slower than a standard static site, it's not unresponsive.

by contrast, my understanding is that php6 will run with PCRE pre-cached and so there will be little initial overhead. i have not looked at how the pre-caching/loading is done but i suspect that it is evidence that the developers also think that the overhead of doing so on modern systems is less relevant than it previously was. Guesswork here though.

if you were interested, it would be easy enough to create a script to benchmark these two methods.
 
i wrote a timer benchmarking class a few months back and posted it in this forum. you can find it here. might save you a few minutes coding.
 
not really. preg_match vs strpos will always result in strpos winning. but if you are trying to find all instances of the match within a long string, or trying to find a more complex set of rules, then preg_* based solutions will become quicker.

furthermore your code does quite a bit more processing than a single strpos, so these examples are not comparing like for like. that's not a reasonable benchmark. the only sane test is for you to benchmark the two code snippets yourself across a few hundred iterations.
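
A benchmark along those lines might look something like the sketch below. It is not the timer class mentioned earlier in the thread. timeIt, the sample text, and both closures are hypothetical stand-ins, simplified to search for a single space character, and the numbers will vary by server and PHP version.

```php
<?php
// Hypothetical harness: runs a callable $iterations times and returns the
// elapsed wall-clock time in seconds.
function timeIt($fn, $iterations = 500)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

$text = str_repeat("This is a long string of text ", 200);
$pos  = (int) (strlen($text) / 2);

// strpos-style: nearest space after and before the position
$viaStrpos = function () use ($text, $pos) {
    $end   = strpos($text, ' ', $pos);
    $start = strrpos(substr($text, 0, $pos), ' ');
    return array($start === false ? 0 : $start + 1,
                 $end === false ? strlen($text) : $end);
};

// regex-style: capture every word with its offset, then pick the one at $pos
$viaRegex = function () use ($text, $pos) {
    preg_match_all('/\S+/', $text, $m, PREG_OFFSET_CAPTURE);
    foreach ($m[0] as $hit) {
        $end = $hit[1] + strlen($hit[0]);
        if ($pos >= $hit[1] && $pos < $end) {
            return array($hit[1], $end);
        }
    }
    return false;
};

printf("strpos: %.6f s\n", timeIt($viaStrpos));
printf("regex:  %.6f s\n", timeIt($viaRegex));
```

Both closures return the same boundaries for this sample text, so the comparison is at least like for like on the result, if not on the amount of work done.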
 