PHP for BOT or Crawler?

admoore (IS-IT--Management)
May 17, 2002
I need to create a web bot or crawler to collect information from a group of web pages. I am NOT interested in harvesting email addresses for spamming or anything nearly as nefarious as that, and I will happily comply with robots.txt files. However, most of the bot script samples I have found are in Java or Perl. As I am already familiar with PHP and want to parse the results into a MySQL database, I would prefer PHP.

All that aside, I am specifically looking for either an example file in PHP that I could alter, or a hint on how I "get" or open a specific URL in order to parse it.

TIA for any suggestions,

-Allen
 
Have a look at the function list on the PHP website. The cURL (Client URL Library) functions can fetch remote files, which you can then scan using other functions.
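
Something along these lines, just as a rough sketch (the URL is a placeholder, and it assumes the curl extension is loaded):

<?php
// Fetch one page with cURL and return its HTML as a string.
$ch = curl_init('http://www.example.com/somepage.html'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body rather than echoing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
curl_setopt($ch, CURLOPT_USERAGENT, 'MyCrawler/0.1'); // identify your bot politely
$html = curl_exec($ch);
if ($html === false) {
    echo 'cURL error: ' . curl_error($ch) . "\n";
}
curl_close($ch);
?>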

Depending on what information you are trying to collect, the DOM functions might be useful for extracting it from the page (they are for reading, editing and creating elements, attributes, values and so on in XML documents), and you might then want to run the regular expression functions over the extracted text.
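
For example, something like this (only a sketch: it assumes PHP 5's DOM extension, that $html holds the page source fetched above, and the price pattern is just an illustration of the regex step):

<?php
// Parse the fetched HTML into a DOM tree (the @ hides the warnings
// that real-world, badly-formed HTML tends to produce).
$doc = new DOMDocument();
@$doc->loadHTML($html);

// DOM lookup: pull out the page title.
$titles = $doc->getElementsByTagName('title');
$title  = $titles->length ? $titles->item(0)->nodeValue : '';

// Regular expression on the raw text, e.g. anything that looks like a price.
if (preg_match_all('/\$\d+(\.\d{2})?/', $html, $matches)) {
    print_r($matches[0]);
}
?>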

If you are trying to scan for other web pages (to do the actual crawling), you could search a page for all the links on it and add them to an array (which you could write to a database for storage if you wanted). Then you search each of those pages in turn, perhaps setting a limit on depth (depending on how much of the web you want to crawl).
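
A rough outline of that loop might look like the following (fetch_page() is a hypothetical wrapper around the cURL code above, and the link regex is deliberately crude; relative URLs would still need resolving):

<?php
// Pull every href out of a chunk of HTML.
function extract_links($html)
{
    $links = array();
    if (preg_match_all('/<a\s[^>]*href=["\']?([^"\'\s>]+)/i', $html, $m)) {
        foreach ($m[1] as $href) {
            // skip in-page anchors and mailto links
            if ($href[0] == '#' || strpos($href, 'mailto:') === 0) {
                continue;
            }
            $links[] = $href;
        }
    }
    return array_unique($links);
}

// Breadth-first crawl with a depth limit.
$queue     = array(array('http://www.example.com/', 0)); // (url, depth) - placeholder start page
$visited   = array();
$max_depth = 2;

while ($queue) {
    list($url, $depth) = array_shift($queue);
    if (isset($visited[$url]) || $depth > $max_depth) {
        continue;
    }
    $visited[$url] = true;

    $html = fetch_page($url);                 // hypothetical cURL wrapper
    foreach (extract_links($html) as $link) {
        $queue[] = array($link, $depth + 1);
    }
}
?>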

One suggestion I'd make, whatever you do: cache your scripts in a compiled state (Turck MMCache seems to be the best; it's an extension you can get from SourceForge, though there are many others). This will speed up execution, since the scripts won't be recompiled each time they are called (which, if you are crawling lots of pages, will be often).

Keeping a persistent connection open to MySQL is probably a good idea too, since you don't want to be creating connections all the time.
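
With the old mysql extension that might look like this (host, credentials and table names are placeholders; mysql_pconnect() reuses an existing link where it can instead of opening a new one each time):

<?php
// Persistent connection: reused across requests in the same server process.
$link = mysql_pconnect('localhost', 'crawler', 'secret')
    or die('Could not connect: ' . mysql_error());
mysql_select_db('crawl_db', $link);

// Store a crawled URL (escaping it before it goes into the query).
mysql_query(
    "INSERT INTO pages (url, fetched_at) VALUES ('"
    . mysql_real_escape_string($url, $link) . "', NOW())",
    $link
);
?>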

Good luck!

Marcus.
 