Create three databases: one for visited URLs, one for URLs to be visited, one for domain names found.
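A minimal sketch of those three stores in C++ (the kind of compiled language I recommend below). The struct and field names are purely illustrative, and a real crawler would back these with an on-disk database rather than in-memory containers:

#include <queue>
#include <string>
#include <unordered_set>

// Three stores, one per database described above. Names are illustrative only.
struct CrawlState {
    std::unordered_set<std::string> visited_urls;   // URLs (or IPs) already fetched
    std::queue<std::string>         urls_to_visit;  // URLs waiting to be crawled
    std::unordered_set<std::string> found_domains;  // domain names seen in content
};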
Spawn a legion of spiders that attempt to connect to port 80 on IP addresses 0.0.0.1 through 255.255.255.254. Have each spider record the IP address it visited in the "visited URLs" database. Have each spider record every domain name it finds in the web content in the "found domain names" database. Have every spider record every URL it finds in the "URLs to be visited" database.
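A rough sketch of what one such spider's fetch step might look like in C++, using plain BSD sockets and an HTTP/1.0 request. The function name is made up for illustration; real code would add timeouts and run many of these in parallel:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <string>

// Fetch whatever the host at `ip` serves on port 80; returns the raw response,
// or an empty string if nothing is listening there.
std::string fetch_port_80(const std::string& ip) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return "";

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(80);
    if (inet_pton(AF_INET, ip.c_str(), &addr.sin_addr) != 1 ||
        connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
        close(fd);
        return "";                       // bad address or no web server here
    }

    std::string request = "GET / HTTP/1.0\r\nHost: " + ip + "\r\n\r\n";
    send(fd, request.c_str(), request.size(), 0);

    std::string response;
    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        response.append(buf, n);

    close(fd);
    return response;
}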
Now launch another legion of spiders which will start pulling domain names out of the domain-name database. If those domain names have not been visited, visit those sites and record the information found, as the IP-marching, slurping spiders do.
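Sketched roughly, that second legion's loop might look like this in C++. The container types and the `fetch` parameter are assumptions standing in for whatever database access and HTTP-fetching code the crawler actually uses:

#include <functional>
#include <queue>
#include <string>
#include <unordered_set>

// Pull a domain name, skip it if already visited, otherwise fetch and mark it.
void crawl_domains(std::queue<std::string>& domains,
                   std::unordered_set<std::string>& visited,
                   const std::function<std::string(const std::string&)>& fetch) {
    while (!domains.empty()) {
        std::string url = "http://" + domains.front() + "/";
        domains.pop();
        if (visited.count(url))
            continue;                 // already crawled under this name
        visited.insert(url);
        std::string content = fetch(url);
        // ...scan `content` for further domain names and URLs here...
    }
}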
Now launch another legion of spiders which will start pulling URLs out of the URL database (checking first, of course, to see whether the URL has been previously visited) and slurping content, recording it as the other two legions of spiders do.
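And a rough sketch of the recording step all three legions share: scanning fetched content for URLs and domain names. A real crawler would use an HTML parser and handle relative links; the regular expression and function name here are only meant to show the idea:

#include <regex>
#include <string>
#include <unordered_set>

// Pull absolute http:// URLs out of a page and record both the full URL
// and the domain name it points at.
void record_links(const std::string& content,
                  std::unordered_set<std::string>& urls_to_visit,
                  std::unordered_set<std::string>& found_domains) {
    static const std::regex link(R"(http://([A-Za-z0-9.-]+)[^\s"'<>]*)");
    for (std::sregex_iterator it(content.begin(), content.end(), link), end;
         it != end; ++it) {
        urls_to_visit.insert(it->str(0));   // full URL goes to the URL database
        found_domains.insert(it->str(1));   // captured host goes to the domain database
    }
}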
If you want to do this for all the internet, I recommend you take out a large, huge, immense, brobdingnagian bank loan. Google uses, I believe, 20,000 (yes, that's twenty thousand) Linux servers to run their search site and database. Their initial installation was 6,000, if I recall correctly.
To be honest, I don't think PHP is the best language for this. I recommend that you use a compiled programming language, like C or C++, for this.
Want the best answers? Ask the best questions: TANSTAAFL!!