regular expression help. I think?

idtstudios2 · Dec 16, 2005

The following code is supposed to search a page ($html) for tags. It starts by gathering all links (<a></a>) and then checks whether each one has rel="tag". It then returns all tags in the following format: hi,wall,brick

Now, the script "technically" works, except that it processes every link instead of just the ones with rel="tag". I'm pretty sure the problem comes down to a regular expression mistake, but I can't find it. Any help or input would be appreciated. Thanks.

function get_tag($url)
{
    $p_url = parse_url($url);
    if (!empty($p_url['path'])):
        $path = explode ("/", trim ($p_url['path'], "/")); //gets rid of starting and trailing slash then splits the path string into an array separated by the slash
        return ($path[count($path)-1]); // returns the last element of the array
    else:
        die("error");
    endif;
}

function get_tags($html, $url){
    if(!$html or !$location){
        return false;
    }else{
        #search through the HTML, save all <a> tags
        # and store each link's attributes in an associative array
        preg_match_all('/<a\s+(.*?)\s*\/?>/si', $html, $matches);
        $links = $matches[1];
        $final_links = array();
        $link_count = count($links);
        for($n=0; $n<$link_count; $n++){
            $attributes = preg_split('/\s+/s', $links[$n]);
            foreach($attributes as $attribute){
                $att = preg_split('/\s*=\s*/s', $attribute, 2);
                if(isset($att[1])){
                    $att[1] = preg_replace('/([\'"]?)(.*)\1/', '$2', $att[1]);
                    $final_link[strtolower($att[0])] = $att[1];
                }
            }
            $final_links[$n] = $final_link;
        }
        #now figure out which one contains the directory tag
        for($n=1; $n<$link_count; $n++){
            if(strtolower($final_links[$n]['rel']) == 'tag'){
                $href = $final_links[$n]['href'];
                $temp_tag = get_tag($href);
                if ($temp_tag == "error") {
                    //do nothing
                } else {
                    $tags = $tags . "" . $temp_tag . ",";
                }
            }
        }

        if(substr($tags, -1) == ','){
            $tags = rtrim($tags, ",");
        }
        return $tags;
    }
}
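
A note on the code above, judging only from what was posted: $final_link is never reset inside the outer loop, so attributes parsed from one link (including rel) carry over to every later link that does not set them itself. Once a single link has rel="tag", all following links without their own rel appear to have it too, which would explain every link being processed. The sketch below shows the same parsing loop with the reset added. parse_link_attributes is a hypothetical helper name, and the whitespace-based attribute splitting is kept as naive as in the original (it breaks on quoted values that contain spaces).

```php
<?php
// Hypothetical helper (not from the original post): parse the attributes
// of every <a ...> tag into one associative array per link.
function parse_link_attributes($html)
{
    preg_match_all('/<a\s+(.*?)\s*\/?>/si', $html, $matches);
    $final_links = array();
    foreach ($matches[1] as $link) {
        $final_link = array(); // reset per link so attributes do not leak over
        foreach (preg_split('/\s+/s', $link) as $attribute) {
            $att = preg_split('/\s*=\s*/s', $attribute, 2);
            if (isset($att[1])) {
                // strip one pair of surrounding quotes, if present
                $att[1] = preg_replace('/^([\'"]?)(.*)\1$/s', '$2', $att[1]);
                $final_link[strtolower($att[0])] = $att[1];
            }
        }
        $final_links[] = $final_link;
    }
    return $final_links;
}

// Made-up sample input: only the first link carries rel="tag".
$html  = '<a href="http://example.com/tag/wall" rel="tag">wall</a> '
       . '<a href="http://example.com/about">about</a>';
$links = parse_link_attributes($html);
var_dump(isset($links[1]['rel'])); // the second link no longer inherits rel
```

Two other details in the posted function are worth checking: $location is not defined anywhere in it, and the second loop starts at index 1, which skips the first link.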

jpadie · Dec 17, 2005

i have never been able to get my head around regex. having said that I'm convinced that your problem can be solved with a single expression...

i *think* that this code below does what you want for the href attribute. for other attributes you can just change the relevant text.

Code:

<?php
$html = file_get_contents("c:/html.txt");

#search through the HTML, save all <a> tags
# and store each link's attributes in an associative array
$pattern = "/href\s*=\s*\"*[^\">]*/i";
preg_match_all('/<a\s+(.*?)\s*\/?>/si', $html, $matches);
$links = $matches[1];
$matches = ""; //release memory
foreach ($links as $var):
	$pos=strpos($var,"HREF=");
	if ($pos === false):  //strpos returns false if no match (and 0 for a match at the very start, hence ===). faster than ereg/preg
		# do nothing
	else:
		$new_array[] = $var;
	endif;
endforeach;
//now parse the $new_array
	//look for the first "space" or ">" after the incidence of res
if (is_array($new_array)):
	foreach ($new_array as $var): 
		preg_match_all ($pattern, $var, $newmatches[]);
	endforeach;
else:
	die ("nothing found");
endif;
foreach ($newmatches as $var):
	$pen_ult[] = $var[0][0];  //not sure why the variable is so deep here.
endforeach;
$newmatches = "";
foreach ($pen_ult as $var):
	$a = ltrim($var,"HREF=\"");
	$ult[] = ltrim($a,"href=\""); //not sure if ltrim is case sensitive
endforeach;
echo "<pre>";
print_r($ult);
echo "</pre>";

?>
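
On the depth of $var[0][0] in the code above: by default preg_match_all fills its third argument with one array per capture group, with index 0 holding the full-pattern matches. Because the loop appends each of those results to $newmatches, there is one more level on top. A small sketch with a made-up attribute string (the sample value is an assumption, not taken from the thread):

```php
<?php
$pattern = "/href\s*=\s*\"*[^\">]*/i";
$var = 'href="http://example.com/a" rel="tag"'; // made-up sample input

preg_match_all($pattern, $var, $result);

// $result[0] is the list of full-pattern matches (there are no capture
// groups here, so it is the only entry); $result[0][0] is the first match.
$first = $result[0][0];
echo $first, "\n";
```

Passing $result directly and reading $result[0][0] avoids building the extra $newmatches level.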

idtstudios2 · Dec 17, 2005

Not only does your method work, it is also about 50% faster. Thank you very much.

For some odd reason, though, ltrim was chopping off the "h" in http as well as the "href="" prefix. The links came out in the following format:

ttp:google.com

I have no clue why ltrim is doing this, but replacing:

$a = ltrim($var,"HREF=\"");
$ult[] = ltrim($a,"href=\""); //not sure if ltrim is case sensitive

with:

$a = str_replace ("href=\"","", $var);
$ult[] = str_replace ("HREF=\"","", $a);

fixes the problem.
Thanks,
Andrew

jpadie · Dec 18, 2005

str_replace is much better than ltrim. i think ltrim treats the needle as a character list rather than a complete string.
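
To illustrate the character-list behaviour: the second argument of ltrim is a set of characters to strip, so it keeps removing leading characters for as long as each one is in that set. The sketch below uses a made-up input string and shows one way to remove a literal prefix instead; the variable names are only for illustration.

```php
<?php
$var = 'href="http://google.com';

// ltrim strips while the leading character is one of h, r, e, f, = or ",
// so the "h" of "http" is removed as well.
$trimmed = ltrim($var, 'href="');

// Removing a literal prefix instead (case-insensitive check on the prefix only).
$prefix   = 'href="';
$stripped = (strncasecmp($var, $prefix, strlen($prefix)) === 0)
    ? substr($var, strlen($prefix))
    : $var;

echo $trimmed, "\n";  // ttp://google.com
echo $stripped, "\n"; // http://google.com
```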

better than str_replace is str_ireplace which will handle both cases. ditto i would change the strpos call to stripos to guarantee picking up both cases.
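
A rough sketch of both suggestions, assuming PHP 5 or later (where stripos and str_ireplace are available) and using made-up sample strings:

```php
<?php
// Made-up attribute strings, as captured from inside <a ...> tags.
$links = array(
    'HREF="http://example.com/a"',
    'rel="tag"',
    'class="x" href="http://example.com/b"',
);

$with_href = array();
foreach ($links as $var) {
    // stripos returns false when nothing matches and 0 for a match at the
    // very start of the string, so the result must be compared with !==.
    if (stripos($var, 'href=') !== false) {
        $with_href[] = $var;
    }
}

// str_ireplace removes the needle regardless of its case.
$cleaned = str_ireplace('href="', '', 'HREF="http://example.com/a');
echo $cleaned, "\n"; // http://example.com/a
```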

i still think you should be able to extract the href attribute directly from the html soup using a single regex.
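
One possible shape for such a single expression, offered as a sketch and not as a full HTML parser: a lookahead checks that the tag contains rel="tag" (in either attribute order), and the capture group picks up the href value. The function name get_tags_single_regex is hypothetical. The pattern assumes the rel value starts with "tag" as a whole word, that attribute values contain no ">", and it would also accept attributes whose names merely end in "-rel" or "-href".

```php
<?php
// Hypothetical function name; the pattern is an illustration only.
function get_tags_single_regex($html)
{
    $pattern = '/<a\s(?=[^>]*\brel\s*=\s*["\']?tag\b)[^>]*\bhref\s*=\s*["\']?([^"\'\s>]+)/i';
    preg_match_all($pattern, $html, $matches);

    $tags = array();
    foreach ($matches[1] as $href) {
        $path = parse_url($href, PHP_URL_PATH);
        if (!empty($path)) {
            $parts  = explode('/', trim($path, '/'));
            $tags[] = $parts[count($parts) - 1]; // last path segment
        }
    }
    return implode(',', $tags);
}

// Made-up sample page: two tag links (different attribute order) and one plain link.
$html = '<a href="http://example.com/tag/wall" rel="tag">wall</a> '
      . '<a href="http://example.com/about">about</a> '
      . "<a rel='tag' href='http://example.com/tag/brick/'>brick</a>";

echo get_tags_single_regex($html), "\n"; // wall,brick
```

Collecting the tags in an array and joining them with implode also removes the need to trim a trailing comma afterwards.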
