Snoopy & Search Engines in PHP


chvol

Programmer
Oct 3, 2002
25
US
1. If all I want is the HTML from a given URL, what is the advantage of using the Snoopy class vs. simply $html = join("", file($url))?

2. How can I write a search engine in PHP, other than by calling another search engine? I can walk through the links, of course, but how do I find all the starting point URLs?

Charlie
chvol@aol.com
 
I don't know what Snoopy is, so I'll assume it's some kind of web site vacuum.

1. I imagine the difference would be that Snoopy would download the HTML and also all the images, etc., that the HTML file references. To accomplish the same thing in PHP, you'd have to use fopen() to get the HTML, then parse the HTML, then fetch and save all the associated content.
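
To do that by hand, something roughly like this would be the equivalent (an untested sketch; the URL and the image regex are placeholders only, and relative image paths would need resolving):

<?php
// Fetch the HTML, then fetch every image it references.
$url   = "http://www.example.com/index.html";   // placeholder page
$lines = @file($url);
$html  = $lines ? implode("", $lines) : "";

// Pull the src attribute out of each <img> tag.
preg_match_all('/<img[^>]+src=["\']?([^"\' >]+)/i', $html, $matches);

foreach ($matches[1] as $src) {
    $img = @file($src);                          // fetch each referenced image
    if ($img) {
        // ... write implode("", $img) to disk here ...
    }
}
?>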

2. You'll have to write a spider, which goes out across IP addresses and domain-name URLs fetching content.

Want the best answers? Ask the best questions: TANSTAAFL!!
 
"You'll have to write a spider, which goes out across IP addresses and domain-name URLs fetching content"

And how do I do that?

How about some PHP code?

(Why do people say "oh, just do XYZ" and not explain how to???)

chvol@aol.com
 
1. A spider is program code running on a server. It fetches a web page, parses and stores that page's content, making note of all other content pointed to in links on that page. It then recursively performs this same action on all the content pointed-to in links on the previous page.

Fetching pages: PHP has many methods of fetching content from a web page, among them its fopen() function.
Parsing an HTML document and storing its significant keywords is something you can make a career of -- the folks at Google are a prime example. If you do not already know how to do it, I recommend you research the construction of recursive descent parsers -- they're generally described in any good textbook on compiler construction.
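
To make point 1 concrete, a rough, untested sketch of those steps together might look like the following. The starting URL, regexes, and depth limit are placeholders, and a real parser/indexer needs far more care than this:

<?php
// Fetch a page, pull out crude "keywords", follow its links, recurse.
$visited = array();

function crawl($url, $depth) {
    global $visited;
    if ($depth <= 0 || isset($visited[$url])) {
        return;
    }
    $visited[$url] = true;

    $lines = @file($url);                 // one of PHP's several fetch methods
    if (!$lines) {
        return;
    }
    $html = implode("", $lines);

    // Crude keyword extraction: strip the markup and count the words.
    $text = strtolower(strip_tags($html));
    preg_match_all('/[a-z0-9]{4,}/', $text, $words);
    $freq = array_count_values($words[0]);
    // ... store $url and $freq in your index here ...

    // Follow every absolute link on the page.
    preg_match_all('/href=["\']?(http:\/\/[^"\' >]+)/i', $html, $links);
    foreach ($links[1] as $link) {
        crawl($link, $depth - 1);
    }
}

crawl("http://www.example.com/", 2);      // placeholder starting URL and depth
?>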

2. Tek-Tips is not a code repository. If you want advice on writing your own code, ask questions. If you want someone to hand you code, I recommend you start searching SourceForge.net. I, for one, get paid to write PHP code for people.

3. People tell you what to do without explaining how because they do not know you. Thus, they cannot know your level of expertise. Given that lack of intimate knowledge of your capabilities, how are people on a site such as Tek-Tips supposed to know which things you need noted and which things you need explained in depth? Telepathy? Precognition?

Want the best answers? Ask the best questions: TANSTAAFL!!
 
Well said Sleipnir214! (Points 2 and 3). Thanks for all your help in the forum!
 
It is easy to access the HTML of a given URL and parse it to determine its contents and links to follow recursively. That is not the problem. Here is the problem:

How do I determine the initial URL(s) that I have to start with which connect via links to all URLs that are out there? For example, if all of the URLs are in one big tree, then all I need to know is the single URL at the root. If there are multiple trees for which there is not a single URL that links to all of them, then I need the roots of all of these individual trees.

I need to know the mechanism that will return these URLs that are the starting points of the spider. I do not need help programming in PHP. I already know that. But without having access to the URLs to start with, I can't walk through them.

It is the values of the URLs that are lacking, not knowledge of how to process them.

For example, perhaps there is a construct that, each time it is evaluated, returns the next URL that is needed (as described above), and evaluates to "" when there are no more.
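
In other words, something along these lines is the kind of construct I mean (purely hypothetical, since where that queue of URLs would come from in the first place is exactly my question):

<?php
// Hypothetical: a function that hands back the next starting URL,
// and "" when there are no more.
$url_queue = array(/* ... the starting URLs I do not know how to obtain ... */);

function next_url() {
    global $url_queue;
    if (count($url_queue) == 0) {
        return "";                 // evaluates to "" when there are no more
    }
    return array_shift($url_queue);
}

while (($url = next_url()) != "") {
    // spider $url here
}
?>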

Charlie Volkstorf
chvol@aol.com
 
Create three databases: one for visited URLs, one for URLs to be visited, one for domain names found.

Spawn a legion of spiders that attempt to go to port 80 on IP addresses 0.0.0.1 through 255.255.255.254. Have each spider record in the "visited URLs" database the IP address it visited. Have each spider record every domain name found in all web content in the "found domain names" database. Have every spider record every URL it finds in the "URLs to be visited" database.
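
One such spider, reduced to a single probe, might look roughly like this (untested; the IP address shown is a documentation-range placeholder, and scanning the whole address space this way is impractical in practice):

<?php
// Probe port 80 on one address and, if something answers, pull the root document.
function probe($ip) {
    $fp = @fsockopen($ip, 80, $errno, $errstr, 2);   // 2-second timeout
    if (!$fp) {
        return "";                                   // nothing listening
    }
    fputs($fp, "GET / HTTP/1.0\r\nHost: $ip\r\n\r\n");
    $response = "";
    while (!feof($fp)) {
        $response .= fgets($fp, 1024);
    }
    fclose($fp);
    return $response;        // headers + body, ready to parse and record
}

$page = probe("192.0.2.10");
?>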

Now launch another legion of spiders which will start pulling domain names out of the domain-name database. If the http:// and www. forms of a domain have not been visited, visit those sites and record information found as the IP-marching slurping spiders do.

Now launch another legion of spiders which will start pulling URLs out of the URL database (checking first, of course, to see whether the URL has been previously visited) and slurping content, recording it as the other two legions of spiders do.
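
Reduced to a single untested sketch, with plain arrays standing in for the three databases (a real run would use persistent storage and many processes, not one script), the loop each of those spiders runs would look something like this:

<?php
// Pull URLs from the "to visit" queue, skip anything already visited,
// slurp the content, and record the URLs and domain names found.
$to_visit = array("http://www.example.com/");   // seeded by the other spiders
$visited  = array();
$domains  = array();

while (count($to_visit) > 0) {
    $url = array_shift($to_visit);
    if (isset($visited[$url])) {
        continue;                                // already slurped
    }
    $visited[$url] = true;

    $lines = @file($url);
    if (!$lines) {
        continue;
    }
    $html = implode("", $lines);
    // ... record/index the content here, as the other legions do ...

    preg_match_all('/href=["\']?(http:\/\/[^"\' >]+)/i', $html, $m);
    foreach ($m[1] as $link) {
        $to_visit[] = $link;
        $parts = parse_url($link);
        $domains[$parts['host']] = true;
    }
}
?>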

If you want to do this for all the internet, I recommend you take out a large huge immense brobdingnagian bank loan. Google uses, I believe, 20,000 (yes that's twenty thousand) Linux servers to run their search site and database. Their initial installation was 6,000, if I recall correctly.

To be honest, I don't think PHP is the best language for this. I recommend you use a compiled programming language, like C or C++.

Want the best answers? Ask the best questions: TANSTAAFL!!
 
