Programmatically identifying an error page

ChrisRChamberlain · Apr 19, 2005

Hi all

If you navigate, programmatically or otherwise, to a non-existent webpage, your browser should then display an error page.

One somewhat clumsy and not necessarily reliable way of determining programmatically that you have an error page displayed is to check whether the document.title string, or the document.body.innerText string, contains "404".

Better ideas, please?

TIA

FAQ184-2483 - answering getting answered.

Chris

PDFcommander^tm.net
PDFcommander^tm.co.uk

manarth · Apr 19, 2005

Hi Chris,

Having 404 in the title string, or in the page body, is a visual representation of the error. Custom error pages may not mention 404 at all.

The standard for determining the status of a server response is in the HTTP headers.

Here are Tek-Tips' headers for a 404. The significant line is the status line, "HTTP/1.0 404 Not Found":

Code:

Request: GET http://www.tek-tips.com/foo.htm HTTP/1.0
HTTP/1.0 404 Not Found
Server: Microsoft-IIS/5.0
Date: Tue, 19 Apr 2005 07:49:05 GMT
Content-Length: 4040
Content-Type: text/html
Age: 74
Proxy-Connection: close
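
As a rough illustration (not part of the original post), the status code can be pulled out of that first response line with a few lines of Python. The function names here are made up for the example, and the sketch assumes a well-formed status line of the form "version code reason".

```python
from typing import Tuple


def parse_status_line(line: str) -> Tuple[str, int, str]:
    """Split an HTTP status line into (version, code, reason phrase)."""
    version, code, *reason = line.strip().split(" ", 2)
    return version, int(code), reason[0] if reason else ""


def is_not_found(line: str) -> bool:
    """True when the status line reports a 404."""
    return parse_status_line(line)[1] == 404
```

For the response above, `parse_status_line("HTTP/1.0 404 Not Found")` gives the version string, the integer 404, and the reason phrase, so the check no longer depends on what the page body happens to say.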

If you simply want to test programmatically whether a page exists, please do not use HTTP GET. As you only need the headers, use an HTTP HEAD request instead. It saves bandwidth and avoids skewing logfiles.
Another good practice in automatic checkers is to provide a contact (e.g. an email address) in your User-Agent string. If your script starts running away and DoSing someone, they can contact you to let you know!
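
A minimal sketch of such a check in Python, using the standard library's `http.client`. The function name, the User-Agent text and the contact address are placeholders invented for this example, and the sketch does not follow redirects or retry on errors.

```python
import http.client
from urllib.parse import urlsplit

# Hypothetical contact details; replace with your own.
USER_AGENT = "link-checker/0.1 (contact: you@example.com)"


def head_status(url: str, timeout: float = 10.0) -> int:
    """Send an HTTP HEAD request and return the numeric status code."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    # The request line carries only the path and query, not the host.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    try:
        conn.request("HEAD", path, headers={"User-Agent": USER_AGENT})
        return conn.getresponse().status
    finally:
        conn.close()
```

Calling `head_status("http://www.tek-tips.com/foo.htm")` would return the status code the server sends for that URL, with no page body transferred.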

Have a read of the W3C's header response reference - it lists the various codes you might see in addition to 404.

<marc>

ChrisRChamberlain · Apr 19, 2005

Marc

Thanks for your response - uncertain as to how to implement an HTTP HEAD request.

Any pointers, please?

Chris

BillyRayPreachersSon · Apr 19, 2005

You should NEVER use a method such as the one you pointed out in your first post, Chris... It is very flawed. Consider this page:

Code:

<html>
<head>
   <title>How to write a custom error 404 page</title>

By your logic, because the title contains "404", it would be an error page - which is clearly not the case.
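
To make the false positive concrete, here is a small Python sketch of the title check (an illustration only; the function name is invented, and the closing tags in the sample page are added so the string is complete).

```python
import re


def naive_is_error_page(html: str) -> bool:
    """The heuristic from the first post: does the <title> contain '404'?"""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return bool(match) and "404" in match.group(1)


page = "<html><head><title>How to write a custom error 404 page</title></head></html>"
print(naive_is_error_page(page))  # True, although this is an ordinary page
```

The same function returns False for a custom error page titled "Page not found", so the heuristic fails in both directions.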

I don't have an answer for you on how to do this client-side... but I really would advise against the method you have proposed.

Hope this helps,
Dan

D'ya think I got where I am today because I dress like Peter Pan here?

ChrisRChamberlain · Apr 19, 2005

Dan

Chris said:
...not necessarily reliable way...

It is flawed, hence the question. :-)

Chris

BillyRayPreachersSon · Apr 19, 2005

This is where a "delete your own post" button would come in handy ;o)

Dan


manarth · Apr 19, 2005

ChrisRChamberlain said:
uncertain as to how to implement an HTTP HEAD request.

Any pointers, please?

Pick a language, any language...

It's generally fairly easy to implement, as it simply involves opening a TCP socket, sending a bunch of text down it, and waiting for the response. Simple, no?!

Code:

Sample GET request
GET / HTTP/1.1
Host: www.google.com

Sample HEAD request
HEAD / HTTP/1.1
Host: www.google.com

Request format
ReqType URI Protocol
Host: HostName
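
As a sketch of the socket approach in Python: open a TCP connection, send the HEAD request text, and read back the status line. The function names are invented for this example. The "Connection: close" header is an addition not shown in the sample requests above; it asks the server to close the connection once it has replied. Plain HTTP on port 80 is assumed, with no TLS.

```python
import socket


def build_head_request(host: str, path: str = "/") -> bytes:
    """Build the request text: request line, headers, then a blank line."""
    lines = [
        f"HEAD {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",  # addition: ask the server to close afterwards
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def head_status_raw(host: str, port: int = 80, path: str = "/") -> int:
    """Open a TCP socket, send a HEAD request, return the status code."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_head_request(host, path))
        data = b""
        # Read until the status line (first line of the response) is complete.
        while b"\r\n" not in data:
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    status_line = data.split(b"\r\n", 1)[0].decode("iso-8859-1")
    # The status code is the second space-separated field of the status line.
    return int(status_line.split(" ", 2)[1])
```

Each line of the request ends with CR LF, and the empty line after the headers tells the server the request is complete.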

I was surprised to find Wikipedia had a reasonable quick reference to HTTP requests.

<marc>

jstreich · Apr 19, 2005

Generally speaking, if you're just going to download the site if it does exist, and it has existed in the past, a GET is probably what you'd want to do to save bandwidth.


ChrisRChamberlain · Apr 20, 2005

The reason for attempting to establish from the page content that the page may be an error page can be seen from the following code snippet.

In pseudo code

Create an instance of IE
Navigate to a url
Wait for the download to complete

Code:

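* Create an instance of IE and start navigating.
* The leading-dot references (.cURL, .lTimedOut) presumably rely on an enclosing WITH block not shown here.
oMainObject.oIE = CREATEOBJECT([InternetExplorer.Application])
oMainObject.oIE.Navigate(.cURL)

* Wait until IE is no longer busy and ReadyState is 4 (READYSTATE_COMPLETE), or give up after USER.timeout seconds.
lnSeconds = SECONDS()
DO WHILE oMainObject.oIE.Busy ;
	OR oMainObject.oIE.ReadyState # 4

	IF SECONDS() > lnSeconds + USER.timeout
		.lTimedOut = .T.
		EXIT
	ENDIF
ENDDO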

If timed out, then Messagebox with error message and abort
If not, verify that the page is the one required: if OK, do this, if not, do that.

End of pseudo code

As I understand it, HTTP HEAD is identical to HTTP GET but returns only header info. I see a problem in that the URL would have to be parsed so that the server and the path can be sent separately.

So, as an alternative, given that the entire page is available in the IE object, what other means of identification may there be?
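
For what it is worth, that parsing step tends to be short in languages with a URL library. A sketch in Python (the thread's code is VFP, so this only illustrates the idea; the function name is made up and the default ports assume plain http or https URLs):

```python
from urllib.parse import urlsplit


def split_for_request(url: str):
    """Return (host, port, request_target) for an http(s) URL."""
    parts = urlsplit(url)
    # Fall back to the scheme's default port when none is given.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return parts.hostname, port, target


print(split_for_request("http://www.tek-tips.com/foo.htm"))
# ('www.tek-tips.com', 80, '/foo.htm')
```

The host goes into the Host header (and the socket connection), while the request target goes into the request line.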

Chris

manarth · Apr 20, 2005

Looks like you're using WSH? See MSDN's Internet Explorer Object reference.

The NavigateError Event may be of some help. It doesn't fire when the server provides a 404 redirect, but coupling it with a comparison of your target URL and actual URL you're on may do the trick.

Unfortunately, I can't see that the IE object provides a method for finding out the status of a URL.

<marc>

ChrisRChamberlain · Apr 20, 2005

manarth said:
Looks like you're using WSH?

The language is VFP.

I am aware of the NavigateError() event but unsure as to how to trap it. I did start a thread, thread1253-1045644, to pursue that option.

As I read it, probably incorrectly, it does fire and returns the method's pDispParams parameter or Status Code value HTTP_STATUS_NOT_FOUND, (404), indicating page not found.

The IE object has a property .LocationURL which becomes available once the download is complete.

This may differ from the target URL should a webpage redirect be in effect, so it is useless for the purpose of determining whether or not the page is an error page.

Am working on a class to implement manarth's suggestion of HTTP HEAD - it's a pity that the page content apparently does not provide an answer.

Chris
