Given Position in string, find 1st word boundary before & after

sen5241b · Oct 23, 2008

What is the easiest way, given a position in a string, to find the first word boundary before & after the position?

I've tried searching for whitespace in both directions, but this seems clunky and inelegant, and I've had trouble finding a Unicode list of ALL possible whitespace characters.

jpadie · Oct 23, 2008

how do you define a word boundary?

sen5241b · Oct 23, 2008

I've been using these characters:

Code:

$WhitespaceArray = array("\x20", "\x09", "\x00", "\x0A", "\x0D", "\x0B", "\xA0");

Space, tab, NUL, LF, CR, vertical tab and the non-breaking space (0xA0 in Latin-1).
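
On the trouble of listing every Unicode whitespace by hand: one possible alternative, sketched here under the assumption that the string is valid UTF-8, is to let PCRE's Unicode separator property do the work. The helper name nextWhitespace is made up for illustration. Note that the offsets are byte offsets, and that with the u modifier the starting offset has to fall on the first byte of a character.

```php
<?php
// Hypothetical helper: byte offset of the first whitespace at or after $pos, or false.
// Assumes $utf8 is valid UTF-8 and $pos points at the start of a character.
function nextWhitespace(string $utf8, int $pos)
{
    // \s covers the ASCII whitespace; \p{Z} adds Unicode separators
    // such as U+00A0 (no-break space) and U+2003 (em space).
    if (preg_match('/[\s\p{Z}]/u', $utf8, $m, PREG_OFFSET_CAPTURE, $pos)) {
        return $m[0][1];
    }
    return false;
}

// "caf" is 3 bytes, "é" is 2 bytes, so the no-break space starts at byte 5.
var_dump(nextWhitespace("caf\u{00E9}\u{00A0}au lait", 0)); // int(5)
```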

The str_word_count function also finds word boundaries. Maybe I could call it with format 2, which returns each word keyed by its position in the string, to get the word boundary positions before and after the given position.

From the PHP manual:

Code:

<?php
$str = "Hello fri3nd, you're
       looking          good today!";
print_r(str_word_count($str, 1));
print_r(str_word_count($str, 2));
print_r(str_word_count($str, 1, 'àáãç3'));
echo str_word_count($str);
?>

output:

Array
(
[0] => Hello
[1] => fri
[2] => nd
[3] => you're
[4] => looking
[5] => good
[6] => today
)

Array
(
[0] => Hello
[6] => fri
[10] => nd
[14] => you're
[29] => looking
[46] => good
[51] => today
)

Array
(
[0] => Hello
[1] => fri3nd
[2] => you're
[3] => looking
[4] => good
[5] => today
)

7
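
The idea of using format 2 to get the boundaries around a position could look roughly like this. It is only a sketch: wordBoundsAt is a hypothetical name, the positions are byte offsets, and "word" means whatever str_word_count considers a word (so "fri3nd" would be split unless extra characters are passed in).

```php
<?php
// Hypothetical helper: start and end offset of the word covering $pos,
// or null when $pos is not inside a word.
function wordBoundsAt(string $str, int $pos): ?array
{
    // Format 2 returns an array keyed by each word's offset in the string.
    foreach (str_word_count($str, 2) as $start => $word) {
        $end = $start + strlen($word); // offset just past the last character
        if ($pos >= $start && $pos < $end) {
            return ['start' => $start, 'end' => $end];
        }
    }
    return null; // $pos sits on whitespace or punctuation
}

print_r(wordBoundsAt("This is a long string of text", 17)); // start 15, end 21
```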

jpadie · Oct 24, 2008

I'm still not entirely sure why you would want to do this, but it can be done.

Instead of doing the whole thing, I wonder whether this would be enough for you. It captures each word separately and also stores its offset, so you can derive the position of the word boundaries from it. However, it clearly does not work if you are specifying a mid-point based on some criterion other than a word. If it were a specific word, we could significantly tighten up this function. Basically, as ever in these forums, you've not given us enough information about what you are trying to achieve for us to suggest an optimum solution.

Code:

$words = getWordBreaks ("This is a long string of text");
echo "<pre>". print_r($words, true) . "</pre>";

function getWordBreaks($string){
 $pattern = '/\b(\w+)\b/imx';
 $result = preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE);
 return ($result) ? $matches[1] : false;
}

this should output

Code:

Array
(
    [0] => Array
        (
            [0] => This
            [1] => 0
        )

    [1] => Array
        (
            [0] => is
            [1] => 5
        )

    [2] => Array
        (
            [0] => a
            [1] => 8
        )

    [3] => Array
        (
            [0] => long
            [1] => 10
        )

    [4] => Array
        (
            [0] => string
            [1] => 15
        )

    [5] => Array
        (
            [0] => of
            [1] => 22
        )

    [6] => Array
        (
            [0] => text
            [1] => 25
        )

)
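
Deriving the boundaries from those offsets might go something like the following. This is a sketch, not a tested part of the function above: boundariesAround is a made-up name, and it treats the start and end of every \w+ match as a boundary, falling back to the start and end of the string.

```php
<?php
// Hypothetical helper: nearest word boundary at or before $pos,
// and nearest word boundary strictly after $pos.
function boundariesAround(string $text, int $pos): array
{
    preg_match_all('/\w+/', $text, $m, PREG_OFFSET_CAPTURE);

    $before = 0;              // fallback: start of string
    $after  = strlen($text);  // fallback: end of string

    foreach ($m[0] as [$word, $start]) {
        $end = $start + strlen($word);
        // Each word contributes two boundaries: where it starts and where it ends.
        foreach ([$start, $end] as $boundary) {
            if ($boundary <= $pos) {
                $before = max($before, $boundary);
            } else {
                $after = min($after, $boundary);
            }
        }
    }
    return ['before' => $before, 'after' => $after];
}

print_r(boundariesAround("This is a long string of text", 17)); // before 15, after 21
```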

sen5241b · Oct 25, 2008

That function works well but str_word_count with option 2 essentially does the same thing.

jpadie · Oct 25, 2008

so how do you determine the point at which you need to measure the before and after?

sen5241b · Oct 26, 2008

I'm using this:

Code:

function findWSPC($checkstr, $posINcheckstr)
{
$WhitespaceArray = array("\x20", "\x09", "\x00", "\x0A", "\x0D", "\x0B", "\xA0");
// End of the word: nearest whitespace at or after the position,
// or the end of the string if there is none.
$endpos = strlen($checkstr);
foreach ($WhitespaceArray as $ws)
	{
	$tmppos = strpos($checkstr, $ws, $posINcheckstr);
	if ($tmppos !== false and $tmppos < $endpos) { $endpos = $tmppos; }
	}
// Start of the word: one past the nearest whitespace before the position,
// or 0 if there is none.
$startpos = 0;
foreach ($WhitespaceArray as $ws)
	{
	$tmppos = rstrpos($checkstr, $ws, $posINcheckstr);
	if ($tmppos !== false and $tmppos > $startpos) { $startpos = $tmppos; }
	}
return array($startpos, $endpos);
}

// Searches backwards from $offset; returns the index just after the
// last $needle found before $offset, or false if there is none.
function rstrpos ($haystack, $needle, $offset)
{
    $size = strlen ($haystack);
    $pos = strpos (strrev($haystack), $needle, $size - $offset);
    if ($pos === false)
        return false;
    return $size - $pos;
}
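
For comparison, the same search can probably be done without looping over the whitespace array, because strcspn accepts a whole set of stop characters at once. A sketch only: wordSpan is a hypothetical name, the default character set is the same seven bytes as above, and it assumes 0 <= $pos < strlen($s).

```php
<?php
// Hypothetical helper: [start, end] offsets of the run of non-whitespace
// bytes around $pos. Assumes 0 <= $pos < strlen($s).
function wordSpan(string $s, int $pos, string $ws = " \t\0\n\r\x0B\xA0"): array
{
    // Forward: count bytes from $pos up to the next whitespace byte.
    $end = $pos + strcspn($s, $ws, $pos);

    // Backward: reverse the part before $pos and count the same way.
    $start = $pos - strcspn(strrev(substr($s, 0, $pos)), $ws);

    return [$start, $end];
}

print_r(wordSpan("This is a long string of text", 17)); // [15, 21]
```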

jpadie · Oct 26, 2008

would this not do the same? (i've not tested this script as i'm currently away from my desk and webserver!)

Code:

$words = getWordBreaks ("This is a long string of text", "string");
print_r($words);

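function getWordBreaks($haystack, $string){
 // look for $string as a whole word in $haystack; the two empty (\b)
 // groups record the offsets of the boundaries before and after it
 $pattern = '/(\b)' . preg_quote($string, '/') . '(\b)/i';
 $result = preg_match($pattern, $haystack, $match, PREG_OFFSET_CAPTURE);
 return ($result) ? array('before'=>$match[1][1], 'after'=>$match[2][1]) : false;
}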

sen5241b · Oct 27, 2008

That is simpler and cleaner but I wonder how much extra overhead is incurred by using a REGEX (in your example) versus strpos (in mine).

jpadie · Oct 27, 2008

in php5 there is some overhead. how much in real terms depends on your webserver. in practice over the internet the relative latency makes the use of regex on a reasonably modern server negligible. this is my perception rather than any measurable reality. I use WordPress a lot, which in turn makes use of a lot of regex calls, and although WP is slower than a standard static site, it's not unresponsive.

by contrast, my understanding is that php6 will run with PCRE pre-cached and so there will be little initial overhead. i have not looked at how the pre-caching/loading is done but i suspect that it is evidence that the developers also think that the overhead of doing so on modern systems is less relevant than it previously was. Guesswork here though.

if you were interested, it would be easy enough to create a script to benchmark these two methods.

sen5241b · Oct 27, 2008

I'll try that with microtime.

jpadie · Oct 27, 2008

i wrote a timer benchmarking class a few months back and posted it in this forum. you can find it here. might save you a few minutes coding.

sen5241b · Oct 27, 2008

This developer has answered the question better than I could:

http://lzone.de/articles/php-string-search.htm

sen5241b · Oct 28, 2008

http://www.improvedsource.com/view.php/PHP_v5_2_vs_PHP_v5_1/11/

and

http://dreamfall.blogspot.com/2008/02/php-benchmarks-strpos-vs-pregmatchall.html

This question is over-answered.

jpadie · Oct 29, 2008

not really. preg_match vs strpos will always result in strpos winning. but if you are trying to find all instances of the match within a long string, or trying to find a more complex set of rules, then preg_* based solutions will become quicker.

furthermore your code does quite a bit more processing than a single strpos call, so these examples are not comparing like for like. that's not a reasonable benchmark. the only sane test is for you to benchmark the two code snippets yourself across a few hundred iterations.
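
A bare-bones version of such a benchmark, using microtime as suggested earlier in the thread, might look like this. Everything here is illustrative: timeIt is a hypothetical helper, the test string, position and iteration count are arbitrary, and the two callbacks should be swapped for the real code being compared so that both sides do the same amount of work.

```php
<?php
// Hypothetical harness: runs $fn $iterations times and returns the elapsed seconds.
function timeIt(callable $fn, int $iterations = 500): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

// Arbitrary test data: a 3000-byte string and a position in the middle of it.
$text = str_repeat("This is a long string of text ", 100);
$pos  = 1500;

$strposTime = timeIt(function () use ($text, $pos) {
    strpos($text, ' ', $pos);
});

$pregTime = timeIt(function () use ($text, $pos) {
    preg_match('/\s/', $text, $m, PREG_OFFSET_CAPTURE, $pos);
});

printf("strpos: %.6f s, preg_match: %.6f s\n", $strposTime, $pregTime);
```

The numbers will vary from run to run and from server to server, so it is worth repeating the run a few times before drawing any conclusion.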

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!
