list all of a site's pages

Status
Not open for further replies.

peacecodotnet

Programmer
May 17, 2004
123
US
Is there a way to connect to another server's port and list all of the pages it has? For example, I might want to see what pages a given site has, and it would give me something like:

index.php
blog.php
face.jpg

Or something like that. Can this be done?

Peace out,
Peace Co.
 
You would need to crawl the site, starting at the index page. Write a spider application that extracts all the links from each page's HTML and follows them, then assemble a site map from what it finds.

As for a port, no, there isn't.
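To make the spider idea concrete, here is a minimal sketch in Python using only the standard library. It assumes the site links its pages together with ordinary href/src attributes, so anything that is never linked will not be found. The names LinkParser, fetch_html, crawl and the max_pages limit are my own choices for illustration, not part of any existing tool.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the targets of href/src attributes found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def fetch_html(url):
    """Return the page body as text, or "" if it is not HTML or the fetch fails."""
    try:
        with urlopen(url, timeout=10) as response:
            if "html" not in response.headers.get("Content-Type", ""):
                return ""
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def crawl(start_url, fetch=fetch_html, max_pages=200):
    """Breadth-first crawl of one host; returns the sorted URLs discovered."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        parser = LinkParser()
        parser.feed(fetch(url))
        for link in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute = urldefrag(urljoin(url, link)).url
            # Stay on the starting host and skip anything already queued.
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return sorted(seen)
```

Calling crawl() with the address of a site's index page returns the list of URLs it discovered on that host. The fetch parameter is there so the network part can be swapped out, for example to add delays between requests or to try the crawl logic against canned pages.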
 
Good idea, thanks. I have another question now, though. When I look at my web site's statistics, it shows all the robots/spiders that have crawled my site and lists them by name, like Googlebot or Inktomi Slurp. How would I make my spider appear with a name like that?

Peace out,
Peace Co.
 
HTTP has the concept of a "user agent": information the client (in your case, your spider) supplies with each request.
14.43 User-Agent

The User-Agent request-header field contains information about the user agent originating the request. This is for statistical purposes, the tracing of protocol violations, and automated recognition of user agents for the sake of tailoring responses to avoid particular user agent limitations. User agents SHOULD include this field with requests. The field can contain multiple product tokens (section 3.8) and comments identifying the agent and any subproducts which form a significant part of the user agent. By convention, the product tokens are listed in order of their significance for identifying the application.

User-Agent = "User-Agent" ":" 1*( product | comment )

Example:

User-Agent: CERN-LineMode/2.15 lib
This is from RFC 2616, section 14.43.
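In practice this means sending a User-Agent header with every request the spider makes. A minimal sketch with Python's standard urllib.request follows. The name "PeaceCoBot/0.1" and the example.com URL in the comment part are made up for illustration, and build_request and fetch are hypothetical helper names, so substitute your own.

```python
from urllib.request import Request, urlopen

# Hypothetical spider name: a product token followed by a comment,
# matching the 1*( product | comment ) grammar quoted above.
USER_AGENT = "PeaceCoBot/0.1 (+http://example.com/bot.html)"


def build_request(url, user_agent=USER_AGENT):
    """Create a request that identifies itself with the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Fetch a page while identifying as the spider."""
    with urlopen(build_request(url), timeout=10) as response:
        return response.read()
```

Whatever string is sent is what the server records in its access log, so it should be the name that statistics packages reading that log report for the spider.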
 
By the way, there are existing applications for this, such as "siteripper", which I have personally tested.

You can often specify what you want it to rip, how many levels of links it should follow, whether it should go off-site, and so on.


Olav Alexander Mjelde
Admin & Webmaster
 