Script for detecting broken links?

Status
Not open for further replies.

trickyray

Programmer
May 18, 2006
8
CA
Is it possible to detect whether a given URL is valid or not in PHP?
I know there are programs that do this, but I need it as part of a script that I am working on.

I could try connecting to the site and waiting for a time-out, but this would not work for sites that serve a 404 page...
Any ideas?

Thanks for your time
 
Maybe this will be close:

Code:
<?php
function check_url ($url)
{
	$retval = FALSE;
	
	$url_parts = parse_url ($url);
	
	$fp = fsockopen($url_parts['host'], 80, $errno, $errstr, 30);

	if ($fp !== FALSE)
	{
		$out = 'GET ';
		if (!isset($url_parts['path']))
		{
			$out .= '/';
		}
		else
		{
			$out .= $url_parts['path'];
		}
		
		$out .= " HTTP/1.1\r\n";
		$out .= "Host: 10.0.1.133\r\n";
		$out .= "Connection: Close\r\n\r\n";
		
		fwrite($fp, $out);
		
		$response = fgets($fp);
		
		$response_array = explode (' ', $response);
		
		$retval = $response_array[1];
		
		fclose($fp); // only close a socket that actually opened
	}
	return $retval;
}


$urls = array
(
	'http://www.mit.edu/', //this one is good
	'http://localhost/phpinfo.php', //this one is good
	'http://localhost/test_farg.php'  //this one is bad
);

print '<html><body>';

foreach ($urls as $url)
{
	print $url . ' : ';
	
	$response = check_url ($url);
	
	switch ($response)
	{
		case '200':
			print 'good';
			break;
		case FALSE:
		case '404':
			print 'not found';
			break;
	}
	print '<br>';
}

print '</body></html>';
?>

The function check_url() returns FALSE if something goes wrong, or the HTTP response code that the server generates for the requested URL.

This may or may not work. If you are running PHP at a hosting site, the site may not allow you to use fsockopen().
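
As a rough illustration of the "response code or FALSE" idea, here is a minimal sketch that pulls the code out of a status line. The helper name parse_status_code() is made up for this example and is not part of the script above; it only assumes that the first line of the response looks like "HTTP/1.1 200 OK".

```php
<?php
// Hypothetical helper (not in the original script): extract the status
// code from the first line of an HTTP response, or return FALSE.
function parse_status_code($status_line)
{
	// A status line looks like "HTTP/1.1 404 Not Found".
	$pattern = '#^HTTP/\d+(?:\.\d+)?\s+(\d{3})(?:\s|$)#';

	if (preg_match($pattern, trim((string) $status_line), $matches))
	{
		return $matches[1];
	}

	// Anything else (empty read, garbage, FALSE from fgets) is a failure.
	return FALSE;
}
```

Unlike a plain explode() on spaces, this returns FALSE for a line that is not an HTTP status line instead of whatever happens to be the second word.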



Want the best answers? Ask the best questions! TANSTAAFL!
 
Thanks
sleipnir214's solution was exactly what I was looking for :)
 
Oops...

There is one change I recommend you make to the script I posted.

Replace the line which reads:

$out = 'GET ';

with this line:

$out = 'HEAD ';


According to the HTTP spec, the response to a HEAD request should be identical to the response to a GET request, except that a HEAD response does not include the body. The original version of the script was having the server generate the response body, but never fetching it from the connection.

The change might make the script a little faster, since the response body is never sent over the connection. It also makes you a better net citizen by using fewer resources on these servers.
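
To make the GET/HEAD difference concrete, here is a small sketch that builds the request text for either method. The function name build_request() is invented for this example. It assumes the URL contains a host, takes the Host header from the URL being checked, and also keeps the query string, which the path alone does not include.

```php
<?php
// Hypothetical helper: build the raw request text for a given method.
// Only the method word differs between the GET and HEAD variants.
function build_request($method, $url)
{
	$parts = parse_url($url);

	// Fall back to "/" when the URL has no path component.
	$target = isset($parts['path']) ? $parts['path'] : '/';

	// Keep the query string, if there is one.
	if (isset($parts['query']))
	{
		$target .= '?' . $parts['query'];
	}

	$out  = $method . ' ' . $target . " HTTP/1.1\r\n";
	$out .= 'Host: ' . $parts['host'] . "\r\n";
	$out .= "Connection: Close\r\n\r\n";

	return $out;
}
```

Calling build_request('HEAD', $url) and build_request('GET', $url) gives two requests that differ only in the first word.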



 
OK, so I'm trying to use your script, but I've found that it isn't 100% accurate.
For example, the following URLs all return 404 using your script but work fine when I try them manually in the browser:


Is there any fix for this?
 
My apologies....

I posted an interim version of the code, not the final version.

This line:

$out .= "Host: 10.0.1.133\r\n";

should be replaced with

$out .= "Host: " . $url_parts['host'] . "\r\n";


Additional functionality needs to be added to the switch statement. At least one of the URLs you posted will return a 300-series response, and the script will not react at all to that.
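
One possible way to extend the switch is to branch on the first digit of the response code, so that every 300-series code is treated as a redirect. This is only a sketch: classify_response() and the category names are made up for illustration.

```php
<?php
// Hypothetical helper: map a check_url() result to a coarse category.
function classify_response($response)
{
	// FALSE means the connection or the read failed.
	if ($response === FALSE)
	{
		return 'unreachable';
	}

	// The first digit of the status code identifies its class.
	switch (substr((string) $response, 0, 1))
	{
		case '2':
			return 'good';
		case '3':
			return 'redirect';
		case '4':
			return 'not found or refused';
		case '5':
			return 'server error';
		default:
			return 'unknown';
	}
}
```

The strict comparison against FALSE is deliberate, because a switch compares loosely and would otherwise treat an empty string the same way as FALSE.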




 
That did it, thanks again =)

One last problem... Like you said, it's not handling 300-series responses, which I now understand to mean that the given URL is acting as a redirect.

This is a problem because the redirected URL is never actually checked for a 404, so my script misses a large portion of the broken links.

I can't seem to figure out how to get the redirected URL
 
When a web server responds to a web request with a status in the 300s, the server will generally send a "Location:" header, which tells the browser which URL to go to.

When your script gets a response code in the 300s, it should read more lines from the server, looking for "Location:". When it finds it, it should try that URL.

Keep in mind that the URL in the "Location:" header can be partial. If your script gets back "Location: /foo/bar/baz.php", it must assume the host of the URL it was trying to fetch.
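
Here is a rough sketch of that host-borrowing step. The function name resolve_location() and the example.com URLs in the usage note are placeholders, not anything from the thread. It assumes the requested URL has a host, and it only covers complete URLs and paths that start with "/"; ports are ignored.

```php
<?php
// Hypothetical helper: turn a Location header value into a full URL,
// borrowing the scheme and host from the URL that was requested.
function resolve_location($requested_url, $location)
{
	$base   = parse_url($requested_url);
	$target = parse_url($location);

	// A Location with its own scheme and host is already complete.
	if (isset($target['scheme']) && isset($target['host']))
	{
		return $location;
	}

	// Otherwise reuse the scheme and host of the original request.
	$scheme = isset($base['scheme']) ? $base['scheme'] : 'http';
	$url    = $scheme . '://' . $base['host'];
	$url   .= isset($target['path']) ? $target['path'] : '/';

	if (isset($target['query']))
	{
		$url .= '?' . $target['query'];
	}

	return $url;
}
```

For example, resolve_location('http://example.com/old.php', '/foo/bar/baz.php') gives 'http://example.com/foo/bar/baz.php'.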




 
I've noodled around with following 3xx redirects. This is now my version of the function check_url():

Code:
function check_url ($url)
{
	$retval = FALSE;
	
	$url_parts = parse_url ($url);
	
	$fp = fsockopen($url_parts['host'], 80, $errno, $errstr, 30);

	if ($fp !== FALSE)
	{
		$out = 'HEAD ';
		if (!isset($url_parts['path']))
		{
			$out .= '/';
		}
		else
		{
			$out .= $url_parts['path'];
		}
		
		$out .= " HTTP/1.1\r\n";
		$out .= "Host: " . $url_parts['host'] . "\r\n";
		$out .= "Connection: Close\r\n\r\n";
		
		fwrite($fp, $out);
		
		$response = fgets($fp);
		
		$response_array = explode (' ', $response);
		
		if ($response_array[1][0] == '3')
		{
			$keep_looking = TRUE;
			while ($keep_looking and $line = fgets($fp))
			{
				$line = trim($line);
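				// Match only a header that starts with "Location: " (any case),
				// so that "Content-Location: " is not mistaken for it.
				if (stripos ($line, 'Location: ') === 0)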
				{
					$keep_looking = FALSE;
				}
			}

			if (!$keep_looking)
			{
				$location_array = explode (' ', $line);
				
				$location_url = parse_url($location_array[1]);
				
				if (isset($location_url['scheme']))
				{
					$redirect_url = $location_url['scheme'] . '://';
				}
				else
				{
					$redirect_url = 'http://';
				}
				
				if (isset($location_url['host']))
				{
					$redirect_url .= $location_url['host'];
				}
				else
				{
					$redirect_url .= $url_parts['host'];
				}
				
				// A Location such as "http://host" has no path component.
				$redirect_url .= isset($location_url['path']) ? $location_url['path'] : '/';
				
				$retval = check_url($redirect_url);
			}
		}
		else
		{
			$retval = $response_array[1];
		}
		fclose($fp); // only close a socket that actually opened
	}
	return $retval;
}

Keep in mind that this function can handle redirects to locations like:

/index.html

but it will not handle things like:

../../foo.php

which some people simply cannot resist using.
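
If you do need to cope with the "../../foo.php" style, one way is to resolve the relative path against the path of the page that sent the redirect before rebuilding the URL. The sketch below is an assumption about how that could look, not part of the function above; resolve_relative_path() is a made-up name, and it simplifies things by dropping any trailing slash.

```php
<?php
// Hypothetical helper: resolve a relative path such as "../../foo.php"
// against the path of the page that issued the redirect.
function resolve_relative_path($base_path, $relative)
{
	if ($relative !== '' && $relative[0] === '/')
	{
		// A leading slash means the path is already rooted at the host.
		$combined = $relative;
	}
	else
	{
		// Drop the file name from the base path, keep its directory.
		$pos = strrpos($base_path, '/');
		$dir = ($pos === FALSE) ? '/' : substr($base_path, 0, $pos + 1);
		$combined = $dir . $relative;
	}

	$resolved = array();
	foreach (explode('/', $combined) as $segment)
	{
		if ($segment === '..')
		{
			array_pop($resolved);      // step up one directory
		}
		elseif ($segment !== '.' && $segment !== '')
		{
			$resolved[] = $segment;    // ordinary path segment
		}
	}

	return '/' . implode('/', $resolved);
}
```

With this, a redirect from /a/b/c.php to ../../foo.php resolves to /foo.php, and the result can be appended to the scheme and host as before.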



 
Wow. Works like a charm now :D

Except... this one URL:

Which crashes the script with the following error:
=================================================
Checking URL: Fatal error: Allowed memory size of 10485760 bytes exhausted (tried to allocate 8192 bytes) in (file) on line 32
=================================================

It refers to the line "$response = fgets($fp);"
I'm totally lost on this one...

Sorry, I hate to keep finding bugs and bothering you with them :(
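
The URL that triggers this is not shown, so the cause is a guess. One possibility is a redirect loop: check_url() calls itself for every 3xx response, with no limit and before closing its socket, so a site that keeps redirecting could use up memory. Below is a sketch of following redirects in a loop with a hop limit. The name follow_redirects() and the $fetch callable are assumptions made for this example; $fetch stands in for the fsockopen() work and returns array(status code, Location value or NULL).

```php
<?php
// Hypothetical sketch: follow redirects iteratively with a hop limit.
// $fetch is an assumed callable that returns array(status_code, location)
// for one URL; in the thread's script that work is done with fsockopen().
function follow_redirects($url, $fetch, $max_hops = 5)
{
	// One initial request plus at most $max_hops redirects.
	for ($hop = 0; $hop <= $max_hops; $hop++)
	{
		list($code, $location) = call_user_func($fetch, $url);

		// Anything other than a 3xx with a Location is a final answer.
		if ($code === FALSE
			|| substr((string) $code, 0, 1) !== '3'
			|| $location === NULL)
		{
			return $code;
		}

		$url = $location;   // try the redirect target next
	}

	return FALSE;           // gave up: too many redirects
}
```

Because nothing recurses, a looping site ends in FALSE after a few requests instead of piling up open connections.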
 
hmm yeah that bug looks similar to my problem

I tried everything suggested there but I'm still stuck :(
 