Help extracting section of text

Status
Not open for further replies.

solepixel

Programmer
May 30, 2007
111
US
I'm searching a text file (specifically, the OFAC report on the US Treasury site). I fetch the file and put its contents into a variable via file_get_contents, then run strstr with the search text against the OFAC variable. If the string is found, I want to extract all the text starting at the double line break before the term and ending at the double line break after it. For example, here's what I'm trying to do:

Code:
this is some text that would be found on the ofac report. it's a paragraph of it's own. it is separate from other paragraphs.

this is another paragraph similar to one that would be found on the ofac report. it's much longer than the first paragraph, and when i say longer, i mean it has more words in it. this paragraph contains more letters, spaces, and punctuation than the paragraph before it. i guess that's what makes it longer. i could've used lorem ipsum but i decided this would be funner.

this is the last paragraph. it's not long at all.

So in the above example, if the user searched for "when i say longer", I would want to get the entire second paragraph. If they searched for "separate from other paragraphs", I would want to display the entire first paragraph. Does that make sense? Is this possible?
 
this code will return all paragraphs that have the matching text within them.

you do have to be confident that the report uses standard line terminators. you might need to finesse a bit if there is any old Mac-style line termination going on (carriage return without line feed). other variants are catered for.

Code:
<?php 
$text = "
this is some text that would be found on the ofac report. it's a paragraph of it's own. it is separate from other paragraphs.

this is another paragraph similar to one that would be found on the ofac report. it's much longer than the first paragraph, and when i say longer, i mean it has more words in it. this paragraph contains more letters, spaces, and punctuation than the paragraph before it. i guess that's what makes it longer. i could've used lorem ipsum but i decided this would be funner.

this is the last paragraph. it's not long at all.";

function getPara ($search, $text){
	$return = array();
	//convert to a consistent line break notation
	$_text = str_replace("\r", '', $text);
	$pattern = "/(.*?$search.*?)\\n\\n/im";
	preg_match_all($pattern, $_text, $matches);
	//cleanse results
	unset($matches[0]);
	foreach ($matches as $match){
		if (!empty($match[0])){
			$return[] = $match[0];
		}
	}
	return count($return) > 0 ? $return : "No results found";
}

print_r(getPara("when I say longer", $text));
?>
 
I've tried modifying this just a bit, but I can't figure out how to get it to work. Here's the modification (not much):

Code:
function getPara ($needle, $haystack, $num=0){
	$return = array();
	//convert to a consistent line break notation
	$_text = str_replace("\r", '', $haystack);
	$pattern = "/(.*?$search.*?)\\n\\n/im";
	preg_match_all($pattern, $_text, $matches);
	//cleanse results
	unset($matches[0]);
	foreach ($matches as $match){
		if (!empty($match[0])){
			$return[] = $match[0];
		}
	}
	return $return[$num];
}

Basically, I want it to return the first paragraph found by default. Then I can step through each result if I want. The problem I'm having is that no results are coming up. My guess is it's because of one of these reasons:

1. The search is case sensitive. Just before I do strstr, I lowercase everything to ensure it finds the string in its various forms; however, the results come from an unmodified version of the text file.

2. It's possible the file contains non-standard line breaks that I'm not aware of. I wouldn't think they're non-standard. It's a txt file. You can view it here:
 
if you just want the first result i'd use the code i posted and just use the first element in the output array

this code works for me. note that the search is case insensitive.
Code:
<?php 
function getPara ($search, $text){
	$return = array();
	//convert to a consistent line break notation
	$_text = str_replace("\r", '', $text);
	$pattern = "/(.*?$search.*?)\\n\\n/ims";
	preg_match_all($pattern, $_text, $matches);
	//cleanse results
	unset($matches[0]);
	foreach ($matches as $match){
		if (!empty($match[0])){
			$return[] = $match[0];
		}
	}
	return count($return) > 0 ? $return : "No results found";
}

$results = getPara("Specially Designated Nationals", file_get_contents("http://www.ustreas.gov/offices/enforcement/ofac/sdn/sdnlist.txt"));
if (is_array($results)){
	echo $results[0];
} else {
	echo $results;
}
?>
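
a side note on the modified getPara from the previous post: the parameter was renamed to $needle, but the pattern still interpolates $search, which is not defined inside that function, so the pattern is built without the search text. below is a minimal sketch of a "give me result number N" helper. the name getParaAt and the sample text are made up for illustration, and it assumes each paragraph sits on a single line, as in the sample posted earlier.

```php
<?php
// Hypothetical helper (not from the thread): return the $num-th line that
// contains $needle, or null when there is no such match.
// Assumes one paragraph per line, as in the sample text.
function getParaAt($needle, $haystack, $num = 0) {
    // normalise line endings so every line ends in "\n"
    $text = str_replace(array("\r\n", "\r"), "\n", $haystack);
    // preg_quote() keeps characters such as "." or "(" in the needle literal;
    // with /m, ^ and $ anchor at line boundaries, and "." stops at newlines
    $pattern = '/^.*' . preg_quote($needle, '/') . '.*$/im';
    preg_match_all($pattern, $text, $matches);
    // $matches[0] holds every full match, one entry per matching line
    return isset($matches[0][$num]) ? $matches[0][$num] : null;
}

// made-up sample text
$sample = "first paragraph, separate from others.\n\nsecond one, when i say longer.\n\nlast one.";
echo getParaAt("when I say longer", $sample), "\n";
```

returning null for "no such result" keeps the return type predictable, so the caller does not have to tell an array apart from an error string.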
 
in fact the text has line breaks at the end of every "line" rather than at the end of every paragraph. this is going to make life difficult.

so try this code instead. it cleanses the text first.
Code:
function getPara ($search, $text){
	$return = array();
	//convert to a consistent line break notation
	$_text = str_replace(array("\r", "\n\n", "\n"), array('','{PP}',' '), $text);
	$_text = "\n".str_replace('{PP}', "\n", $_text);

	$pattern = "/\\n(.*?$search.*?)\\n/ims";
	preg_match_all($pattern, $_text, $matches);
	//cleanse results
	unset($matches[0]);
	foreach ($matches as $match){
		if (!empty($match[0])){
			$return[] = $match[0];
		}
	}
	return count($return) > 0 ? $return : "No results found";
}
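
for comparison, the cleansing step on its own can probably be done in a single pass. this is only a sketch under the assumption that a lone newline is a wrapped line and two or more newlines in a row mark a paragraph break. the function name unwrapLines is made up.

```php
<?php
// Hypothetical helper (not from the thread): join hard-wrapped lines with a
// space while leaving blank-line paragraph breaks untouched.
function unwrapLines($text) {
    // normalise Windows and old Mac line endings to "\n"
    $text = str_replace(array("\r\n", "\r"), "\n", $text);
    // replace a newline only when it is neither preceded nor followed by
    // another newline, i.e. when it is not part of a blank line
    return preg_replace('/(?<!\n)\n(?!\n)/', ' ', $text);
}
```

after this step the paragraphs are still separated by "\n\n", so any of the blank-line based searches in this thread can run on the result.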
 
my fault about not putting it in code blocks:
Code:
$results = getPara("bin laden", file_get_contents("http://www.ustreas.gov/offices/enforcement/ofac/sdn/sdnlist.txt"));
if (is_array($results)){
    echo nl2br($results[0]);
} else {
    echo $results;
}
 
Ok, I just noticed you changed the function a bit.

After trying the new function, when I use "bin laden" as my search phrase, it will display the entire document up to the end of the paragraph containing my search phrase. Before, it was just giving me the entire document.
 
that's funny. with my function as last posted, I get a nil response for "bin laden". I wonder whether we are seeing a platform difference in the PCRE implementation?

can you try removing the multiline switch (the m just before the end of the pattern) and seeing what happens?

i've spent too long on this tonight and have crashed my regex editor a couple of times with the length of the input document. I'll take another look in the morning if you have not solved the problem.

there are alternatives to using a regular expression. you could, for example, use strpos to:
determine the position of the first match,
determine the immediately preceding newline
determine the immediately following newline

then use substr to excise the paragraph by its position.

but note that you will still have to reparse the text to get rid of the new line character at the end of each new line.
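
a rough sketch of those steps, assuming paragraphs are separated by "\n\n" and PHP 5 or later for stripos. the function name paragraphAround is made up, and the single newlines inside the paragraph are left in place.

```php
<?php
// Hypothetical helper (not from the thread): return the paragraph around the
// first case-insensitive match of $needle, or null if there is no match.
function paragraphAround($needle, $haystack) {
    $pos = stripos($haystack, $needle);   // position of the first match
    if ($pos === false) {
        return null;
    }
    // last blank line before the match; false means the match is in the
    // first paragraph
    $before = strrpos(substr($haystack, 0, $pos), "\n\n");
    $start  = ($before === false) ? 0 : $before + 2;
    // first blank line after the match; false means the match is in the
    // last paragraph
    $after = strpos($haystack, "\n\n", $pos);
    $end   = ($after === false) ? strlen($haystack) : $after;
    // cut the paragraph out by position
    return substr($haystack, $start, $end - $start);
}
```

comparing against false with === matters here, because a match at position 0 is a valid result that would otherwise be treated as "not found".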
 
Removing the M didn't seem to do anything different.

I'd like to possibly look into the alternate method you described. I started working on that yesterday before I left. I will see what I can come up with.
 
Ok, what I've got is a working version, without manipulating the original string, that will find the first occurrence. Now I just need to adapt it so it finds all occurrences and puts each paragraph into an array so I can step through them.

Code:
function getMidText($needle, $haystack){
	$mid_text = array();
	$lower_text = strtolower($haystack);
	$search_text = strtolower($needle);
	
	if(strstr($lower_text, $search_text)){
		//determine the position of the first match
		$match_location = strpos($lower_text,$search_text);
		//determine the immediately preceding newline
		$reversed = substr($haystack,0,$match_location + strlen($search_text));
		$reversed = strrev($reversed);
		$prev_nl = strpos($reversed,"\n\n");
		$prev_nl = $match_location - $prev_nl;
		$prev_nl = $prev_nl + strlen($needle);
		//determine the immediately following newline
		$newtext = substr($haystack,$prev_nl,strlen($haystack));
		$next_nl = strpos($newtext,"\n\n");
		
		$section = substr($haystack, $prev_nl, $next_nl);
		$section = highlight($section, $needle);
		$mid_text[] = nl2br($section);
	} else {
		$mid_text = 'No results found.';
	}
	
	return $mid_text;
}
 
well done.

preg_match_all is a better solution though. it's still not clear to me why the pattern i wrote is not working.
 
Ok, so I know this isn't the best way to do it, but to avoid infinite loops I'm just going to limit this to 10 items. For some reason, though, I can't get it to move forward to the next item. What am I doing wrong?
Code:
function getSections($needle, $haystack){
   $mid_text = array();
   $lower_text = strtolower($haystack);
   $search_text = strtolower($needle);
   
   if(strstr($lower_text, $search_text)){
      $prev_section = "";
      for($i=0; $i <= 10; $i++){
         //determine the position of the first match
         $match_location = ($i == 0) ? strpos($lower_text,$search_text) : strpos($lower_text,$search_text,strpos($match_location)+strlen($needle));
         //determine the immediately preceding newline
         $reversed = substr($haystack,0,$match_location + strlen($search_text));
         $reversed = strrev($reversed);
         if(strpos($reversed,"\n\n")){
            $prev_nl = strpos($reversed,"\n\n");
            $prev_nl = $match_location - $prev_nl;
            $prev_nl = $prev_nl + strlen($needle);
         } else {
            $prev_nl = 0;
         }
         //determine the immediately following newline
         $newtext = substr($haystack,$prev_nl,strlen($haystack));
         $next_nl = (strpos($newtext,"\n\n")) ? strpos($newtext,"\n\n") : strlen($haystack);
         
         $section = substr($haystack, $prev_nl, $next_nl);
         if($prev_section != $section){
            $prev_section = $section;
            $section = highlight($section, $needle);
            $mid_text[] = nl2br($section);
         }
      }
   } else {
      $mid_text = 'No results found.';
   }
   
   return $mid_text;
}
I think it has something to do with the $match_location line
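
One thing that stands out on that line: strpos($match_location) is called with a single argument, but strpos needs at least a haystack and a needle, and the offset only has to be the previous match position plus the needle length. Here is a sketch of the loop under that reading. The name findSections is made up, it assumes "\n\n" separates paragraphs, and highlight() and nl2br() are left out because highlight() isn't shown in this thread.

```php
<?php
// Hypothetical rewrite (not the code posted above): collect every paragraph
// that contains $needle, scanning forward with stripos() and an offset.
function findSections($needle, $haystack, $limit = 10) {
    $sections = array();
    $offset   = 0;
    $length   = strlen($haystack);
    while ($needle !== '' && $offset < $length && count($sections) < $limit) {
        $pos = stripos($haystack, $needle, $offset);
        if ($pos === false) {
            break;                       // no further matches
        }
        // paragraph start: just after the last blank line before the match
        $before = strrpos(substr($haystack, 0, $pos), "\n\n");
        $start  = ($before === false) ? 0 : $before + 2;
        // paragraph end: the next blank line, or the end of the text
        $after = strpos($haystack, "\n\n", $pos);
        $end   = ($after === false) ? $length : $after;
        $sections[] = substr($haystack, $start, $end - $start);
        // resume after this paragraph so it is not reported twice
        $offset = $end;
    }
    return $sections;
}
```

Because the loop stops as soon as stripos returns false, the limit is only a cap on the number of results rather than a guard against looping forever.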
 
here is a different approach that might be neater.

the secondary function is to cater for those readers that are using PHP < v5.

I have also been experimenting with another pattern. this pattern:
Code:
\n\n((?!\n\n).)*bin laden((?!\n\n).)*
works perfectly in my regex application but, for some reason, does not similarly perform in PHP. i can't determine why...

anyway, here is the alternative code.
Code:
<?php 
function getPara ($needle, $haystack){
	//start with an empty result list so the check below is always valid
	$return = array();
	//explode the text into an array
	$haystack = explode("\n\n", $haystack);
	foreach ($haystack as $para){
		if (_stristr($para, $needle)){
			$return[] = $para;
		}
	}
	if (count($return) > 0){
		return $return;
	}else{
		return "no matches found";
	}
}
function _stristr($haystack, $needle){
	if (function_exists("stristr")){
		return stristr($haystack, $needle);
	} else {
		//escape the needle so it is matched literally
		return preg_match("/".preg_quote($needle, "/")."/ims", $haystack);
	}
}

$text = file_get_contents("http://www.ustreas.gov/offices/enforcement/ofac/sdn/sdnlist.txt");
echo "<pre>".print_r(getPara('bin laden', $text), true). "</pre>";
?>
 
Wow, that is way smarter than the way I did it :)
Seems to work pretty good. Nice work!
 
ok, one last question about this topic. Would a comma (,) cause this to fail for some reason? I've tried searching for lastname, firstname and get weird results. Does this need to be modified to allow for a comma?
 
nope. it should handle it just fine.

i still think that preg_match is better though! wish I could get to the bottom of why it's not working as expected...
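
a possible explanation, offered as a guess rather than a diagnosis: with the s modifier the dot also matches newlines, so the leading .*? in the earlier patterns is free to start at the top of the text and stretch across blank lines until it reaches the search text. that would be consistent with the "entire document up to the end of the paragraph" result reported above. on a file this size the lookahead pattern could also plausibly run into PCRE's backtracking or recursion limits, which preg_last_error() should reveal. the sketch below avoids both by splitting on blank lines first and then filtering with preg_grep. the function name getParaRegex is made up.

```php
<?php
// Hypothetical helper (not from the thread): return every paragraph that
// contains $needle, where paragraphs are separated by blank lines.
function getParaRegex($needle, $text) {
    // normalise Windows and old Mac line endings to "\n"
    $text  = str_replace(array("\r\n", "\r"), "\n", $text);
    // split on blank lines first, so no match can span two paragraphs
    $paras = preg_split('/\n{2,}/', $text, -1, PREG_SPLIT_NO_EMPTY);
    // preg_grep() keeps the paragraphs matching the pattern; preg_quote()
    // makes sure the needle is matched literally
    $hits  = ($paras === false)
        ? false
        : preg_grep('/' . preg_quote($needle, '/') . '/i', $paras);
    if ($hits === false) {
        // report the PCRE error code instead of silently returning nothing
        return 'PCRE error code: ' . preg_last_error();
    }
    // preg_grep() preserves the original keys, so reindex from 0
    return array_values($hits);
}
```

a search such as "lastname, firstname" goes through unchanged, since a comma has no special meaning in a pattern and preg_quote() takes care of the characters that do.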
 
Ok. I was doing something wrong (as usual). Let me know if you ever get the preg_match version working. Honestly, though, I wouldn't worry about it; this way works fine. Thanks again.
 