Help extracting section of text 1

solepixel · Aug 30, 2007

I'm performing a search on a text file (specifically, the OFAC report on the US Treasury site). I fetch the file and put its contents into a variable via file_get_contents, then run strstr with the search text against that variable. If the string is found, I want to extract all the text from the double line break before the match to the double line break after it. For example, here's what I'm trying to do:

Code:

this is some text that would be found on the ofac report. it's a paragraph of it's own. it is separate from other paragraphs.

this is another paragraph similar to one that would be found on the ofac report. it's much longer than the first paragraph, and when i say longer, i mean it has more words in it. this paragraph contains more letters, spaces, and punctuation than the paragraph before it. i guess that's what makes it longer. i could've used lorem ipsum but i decided this would be funner.

this is the last paragraph. it's not long at all.

So in the above example, if the user searched for "when i say longer", I would want to get the entire second paragraph. If they searched for "separate from other paragraphs", I would want to display the entire first paragraph. Does that make sense? Is this possible?

jpadie · Aug 30, 2007

this code will return all paragraphs that have the matching text within them.

you do have to be confident that the report uses standard line terminators. you might need to finesse a bit if there is any old Mac-style line termination going on (carriage return without line feed). other variants are catered for.

Code:

<?php 
$text = "
this is some text that would be found on the ofac report. it's a paragraph of it's own. it is separate from other paragraphs.

this is another paragraph similar to one that would be found on the ofac report. it's much longer than the first paragraph, and when i say longer, i mean it has more words in it. this paragraph contains more letters, spaces, and punctuation than the paragraph before it. i guess that's what makes it longer. i could've used lorem ipsum but i decided this would be funner.

this is the last paragraph. it's not long at all.";

function getPara ($search, $text){
	$return = array();
	//convert to a consistent line break notation
	$_text = str_replace("\r", '', $text);
	$pattern = "/(.*?$search.*?)\\n\\n/im";
	preg_match_all($pattern, $_text, $matches);
	//cleanse results
	unset($matches[0]);
	foreach ($matches as $match){
		if (!empty($match[0])){
			$return[] = $match[0];
		}
	}
	return count($return) > 0 ? $return : "No results found";
}

print_r(getPara("when I say longer", $text));
?>

solepixel · Aug 30, 2007

I've tried modifying this just a bit, but I can't figure out how to get it to work. Here's the modification (not much):

Code:

function getPara ($needle, $haystack, $num=0){
	$return = array();
	//convert to a consistent line break notation
	$_text = str_replace("\r", '', $haystack);
	$pattern = "/(.*?$search.*?)\\n\\n/im";
	preg_match_all($pattern, $_text, $matches);
	//cleanse results
	unset($matches[0]);
	foreach ($matches as $match){
		if (!empty($match[0])){
			$return[] = $match[0];
		}
	}
	return $return[$num];
}

Basically, I want it to return the first paragraph found by default. Then I can step through each result if I want. The problem I'm having is that the right results aren't coming up. My guess is it's because of one of these reasons:

1. The search is case sensitive. Just before I do strstr, I lowercase everything to ensure it finds the string in various ways; however, the results come from an unmodified version of the text file.

2. It's possible the file contains non-standard line breaks that I'm not aware of. I wouldn't think they're non-standard. It's a txt file. You can view it here:

http://www.ustreas.gov/offices/enforcement/ofac/sdn/sdnlist.txt
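
A side note on the modified function above: the parameter was renamed to $needle, but the pattern still interpolates $search, which is now undefined and therefore interpolates as an empty string. A minimal sketch of the effect, using a made-up two-paragraph sample ($sample is an assumption for illustration, not the OFAC file):

```php
<?php
// Made-up sample text; each paragraph is a single line followed by a blank line.
$sample = "first paragraph\n\nsecond paragraph with the needle\n\n";

// Stand-in for the undefined $search: PHP substitutes an empty string
// (and raises a notice or warning) when an undefined variable is interpolated.
$search = '';
$pattern = "/(.*?$search.*?)\\n\\n/im";

preg_match_all($pattern, $sample, $matches);

// With no search term in the pattern, every line followed by a blank line matches,
// so the first capture is the first paragraph whatever was searched for.
echo $matches[1][0], "\n";
```

Using $needle inside the pattern (ideally wrapped in preg_quote) would be the first thing to try before looking at case sensitivity or line breaks.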

jpadie · Aug 30, 2007

if you just want the first result i'd use the code i posted and just use the first element in the output array

this code works for me. note that the search is case insensitive.

Code:

<?php 
function getPara ($search, $text){
	$return = array();
	//convert to a consistent line break notation
	$_text = str_replace("\r", '', $text);
	$pattern = "/(.*?$search.*?)\\n\\n/ims";
	preg_match_all($pattern, $_text, $matches);
	//cleanse results
	unset($matches[0]);
	foreach ($matches as $match){
		if (!empty($match[0])){
			$return[] = $match[0];
		}
	}
	return count($return) > 0 ? $return : "No results found";
}

$results = getPara("Specially Designated Nationals", file_get_contents("http://www.ustreas.gov/offices/enforcement/ofac/sdn/sdnlist.txt"));
if (is_array($results)){
	echo $results[0];
} else {
	echo $results;
}
?>

jpadie · Aug 30, 2007

in fact the text has line breaks at the end of every "line" rather than at the end of every paragraph. this is going to make life difficult.

so try this code instead. it cleanses the text first.

Code:

function getPara ($search, $text){
	$return = array();
	//convert to a consistent line break notation
	$_text = str_replace(array("\r", "\n\n", "\n"), array('','{PP}',' '), $text);
	$_text = "\n".str_replace('{PP}', "\n", $_text);

	$pattern = "/\\n(.*?$search.*?)\\n/ims";
	preg_match_all($pattern, $_text, $matches);
	//cleanse results
	unset($matches[0]);
	foreach ($matches as $match){
		if (!empty($match[0])){
			$return[] = $match[0];
		}
	}
	return count($return) > 0 ? $return : "No results found";
}

solepixel · Aug 30, 2007

Try this:

$results = getPara("bin laden", file_get_contents("http://www.ustreas.gov/offices/enforcement/ofac/sdn/sdnlist.txt"));

if (is_array($results)){
echo nl2br($results[0]);
} else {
echo $results;
}

solepixel · Aug 30, 2007

my fault about not putting it in code blocks:

Code:

$results = getPara("bin laden", file_get_contents("http://www.ustreas.gov/offices/enforcement/ofac/sdn/sdnlist.txt"));
if (is_array($results)){
    echo nl2br($results[0]);
} else {
    echo $results;
}

solepixel · Aug 30, 2007

Ok, I just noticed you changed the function a bit.

After trying the new function, when I use "bin laden" as my search phrase, it will display the entire document up to the end of the paragraph containing my search phrase. Before, it was just giving me the entire document.
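
One possible explanation, sketched on a made-up sample rather than the real file: with the s modifier the dot also matches newlines, so the lazy .*? starts at the first newline in the text and keeps growing until it reaches the search phrase. Without s, the dot stops at each newline, which is what the cleansed one-paragraph-per-line text needs. The m modifier only changes how ^ and $ behave, and these patterns use neither.

```php
<?php
// Made-up sample in the cleansed format: one paragraph per line, leading newline.
$sample = "\npara one\npara two with bin laden\npara three\n";

// With "s": the capture starts at the first paragraph and runs to the match.
preg_match('/\n(.*?bin laden.*?)\n/is', $sample, $withS);

// Without "s": the dot cannot cross a newline, so only the matching line is captured.
preg_match('/\n(.*?bin laden.*?)\n/i', $sample, $withoutS);

var_dump($withS[1], $withoutS[1]);
```

If this is what is happening with the real file, dropping the s from the pattern in the cleansing version of getPara would be worth a try.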

jpadie · Aug 30, 2007

that's funny. with my function as last posted, I get a nil response for "bin laden". I wonder whether we are seeing a platform difference in the PCRE implementation?

can you try removing the multiline switch (the m just before the end of the pattern) and seeing what happens?

i've spent too long on this tonight and have crashed my regex editor a couple of times with the length of the input document. I'll take another look in the morning if you have not solved the problem.

there are alternatives to using a regular expression. you could, for example, use strpos to:
determine the position of the first match,
determine the immediately preceding newline
determine the immediately following newline

then use substr to excise the paragraph by its position.

but note that you will still have to reparse the text to get rid of the new line character at the end of each new line.
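
The reparsing step could be sketched roughly like this. unwrapLines() is a hypothetical helper name, and it assumes that a blank line (two consecutive newlines) marks a paragraph break while a single newline is only a wrapped line:

```php
<?php
// Hypothetical helper: joins wrapped lines with a space, keeps blank-line paragraph breaks.
function unwrapLines(string $text): string {
    // Normalise Windows (\r\n) and old Mac (\r) line endings to \n first.
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    // Replace a newline only when it has no newline directly before or after it.
    return preg_replace('/(?<!\n)\n(?!\n)/', ' ', $text);
}

echo unwrapLines("line one\nline two\n\nnext paragraph");
```

A single trailing newline at the very end of the input would also become a space, so trimming the result may be wanted.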

solepixel · Aug 31, 2007

Removing the M didn't seem to do anything different.

I'd like to possibly look into the alternate method you described. I started working on that yesterday before I left. I will see what I can come up with.

solepixel · Aug 31, 2007

Ok, what I got is a working version, without manipulating the original string, that finds the first occurrence. Now I just need to adapt it so it finds all occurrences and puts each paragraph into an array so I can step through them.

Code:

function getMidText($needle, $haystack){
	$mid_text = array();
	$lower_text = strtolower($haystack);
	$search_text = strtolower($needle);
	
	if(strstr($lower_text, $search_text)){
		//determine the position of the first match
		$match_location = strpos($lower_text,$search_text);
		//determine the immediately preceding newline
		$reversed = substr($haystack,0,$match_location + strlen($search_text));
		$reversed = strrev($reversed);
		$prev_nl = strpos($reversed,"\n\n");
		$prev_nl = $match_location - $prev_nl;
		$prev_nl = $prev_nl + strlen($needle);
		//determine the immediately following newline
		$newtext = substr($haystack,$prev_nl,strlen($haystack));
		$next_nl = strpos($newtext,"\n\n");
		
		$section = substr($haystack, $prev_nl, $next_nl);
		$section = highlight($section, $needle);
		$mid_text[] = nl2br($section);
	} else {
		$mid_text = 'No results found.';
	}
	
	return $mid_text;
}

jpadie · Aug 31, 2007

well done.

preg_match_all is a better solution though. it's still not clear to me why the pattern i wrote is not working.

solepixel · Aug 31, 2007

Ok, so I know this isn't the best way to do it, but to avoid infinite loops I'm just going to limit this to 10 items. For some reason, though, I can't get it to move forward to the next item. What am I doing wrong?

Code:

function getSections($needle, $haystack){
   $mid_text = array();
   $lower_text = strtolower($haystack);
   $search_text = strtolower($needle);
   
   if(strstr($lower_text, $search_text)){
      $prev_section = "";
      for($i=0; $i <= 10; $i++){
         //determine the position of the first match
         $match_location = ($i == 0) ? strpos($lower_text,$search_text) : strpos($lower_text,$search_text,strpos($match_location)+strlen($needle));
         //determine the immediately preceding newline
         $reversed = substr($haystack,0,$match_location + strlen($search_text));
         $reversed = strrev($reversed);
         if(strpos($reversed,"\n\n")){
            $prev_nl = strpos($reversed,"\n\n");
            $prev_nl = $match_location - $prev_nl;
            $prev_nl = $prev_nl + strlen($needle);
         } else {
            $prev_nl = 0;
         }
         //determine the immediately following newline
         $newtext = substr($haystack,$prev_nl,strlen($haystack));
         $next_nl = (strpos($newtext,"\n\n")) ? strpos($newtext,"\n\n") : strlen($haystack);
         
         $section = substr($haystack, $prev_nl, $next_nl);
         if($prev_section != $section){
            $prev_section = $section;
            $section = highlight($section, $needle);
            $mid_text[] = nl2br($section);
         }
      }
   } else {
      $mid_text = 'No results found.';
   }
   
   return $mid_text;
}

I think it has something to do with the $match_location line
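
For what it's worth, strpos() takes the haystack, the needle and an optional start offset, so strpos($match_location) with a single argument cannot work. A rough sketch of stepping through every match by passing an offset follows. allParagraphs() is a hypothetical name, stripos() needs PHP 5 or later, paragraphs are assumed to be separated by "\n\n", and the highlighting step is left out:

```php
<?php
// Hypothetical helper: returns every paragraph that contains $needle (case-insensitive).
function allParagraphs(string $needle, string $haystack): array {
    $found = [];
    $offset = 0;
    while ($needle !== '' && ($pos = stripos($haystack, $needle, $offset)) !== false) {
        // Last paragraph break before the match; none means the text start.
        $before = strrpos(substr($haystack, 0, $pos), "\n\n");
        $start = ($before === false) ? 0 : $before + 2;
        // First paragraph break at or after the match; none means the text end.
        $after = strpos($haystack, "\n\n", $pos);
        $end = ($after === false) ? strlen($haystack) : $after;
        $found[] = substr($haystack, $start, $end - $start);
        // Resume after this paragraph so each paragraph is reported once,
        // and always move past the current match so the loop terminates.
        $offset = max($end, $pos + strlen($needle));
    }
    return $found;
}

print_r(allParagraphs("apple", "one apple\n\ntwo pear\n\nthree Apple and apple\n\nfour"));
```

Because the loop ends when stripos() returns false, no fixed cap of 10 iterations is needed.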

jpadie · Aug 31, 2007

here is a different approach that might be neater.

the secondary function is to cater for those readers that are using PHP < v5.

I have also been experimenting with another pattern. this pattern:

Code:

\n\n((?!\n\n).)*bin laden((?!\n\n).)*

works perfectly in my regex application but, for some reason, does not similarly perform in PHP. i can't determine why...
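
One guess at the difference, not a confirmed diagnosis: some regex tools let the dot match newlines by default, whereas PCRE in PHP needs the s modifier for that, and the wrapped lines inside each paragraph contain single newlines. Below is a sketch of the same lookahead idea with s, preg_quote() and an error check added. matchingParagraphs() is a hypothetical name, and the anchoring with \A is an addition so that the first paragraph can match too:

```php
<?php
// Sketch only: matchingParagraphs() is a hypothetical name, not from the thread.
function matchingParagraphs(string $needle, string $haystack): array {
    $text = str_replace("\r", '', $haystack);
    // preg_quote() escapes regex metacharacters in the search phrase.
    $quoted = preg_quote($needle, '/');
    // (?:(?!\n\n).) matches one character that does not start a blank line;
    // the "s" modifier lets "." match the single newlines inside a paragraph.
    $pattern = '/(?:\A|\n{2,})((?:(?!\n\n).)*?' . $quoted . '(?:(?!\n\n).)*)/is';
    if (preg_match_all($pattern, $text, $matches) === false) {
        // preg_last_error() reports limits such as PREG_BACKTRACK_LIMIT_ERROR.
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }
    return $matches[1];
}

$sample = "first para\nline two\n\nsecond has Bin Laden here\nand more\n\nthird";
print_r(matchingParagraphs("bin laden", $sample));
```

On a document as long as the SDN list, checking preg_last_error() matters, since a PCRE limit being hit would otherwise just look like "no results".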

anyway, here is the alternative code.

Code:

<?php 
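function getPara ($needle, $haystack){
	$return = array();
	//explode the text into an array of paragraphs
	$haystack = explode("\n\n", $haystack);
	foreach ($haystack as $para){
		if (_stristr($para, $needle)){
			$return[] = $para;
		}
	}
	//an empty array means nothing matched
	if (count($return) > 0){
		return $return;
	}else{
		return "no matches found";
	}
}
function _stristr($haystack, $needle){
	if (function_exists("stristr")){
		//compare against false so a match such as "0" still counts
		return stristr($haystack, $needle) !== false;
	} else {
		//escape the needle so characters like . or ( are treated literally
		return preg_match("/" . preg_quote($needle, "/") . "/ims", $haystack);
	}
}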

$text = file_get_contents("http://www.ustreas.gov/offices/enforcement/ofac/sdn/sdnlist.txt");
echo "<pre>".print_r(getPara('bin laden', $text), true). "</pre>";
?>

solepixel · Aug 31, 2007

Wow, that is way smarter than the way I did it

Seems to work pretty good. Nice work!

solepixel · Sep 4, 2007

ok, one last question about this topic. Would a comma (,) cause this to fail for some reason? I've tried searching for lastname, firstname and get weird results. Does this need to be modified to allow for a comma?

jpadie · Sep 4, 2007

nope. it should handle it just fine.

i still think that preg_match is better though! wish I could get to the bottom of why it's not working as expected...

solepixel · Sep 4, 2007

Ok. I was doing something wrong (as usual). Let me know if you ever try to get the preg_match version working. Honestly, though, I wouldn't worry about it; this way works fine. Thanks again.
