Since many of you are longtime members of the Perl community, I am hoping someone can offer some guidance.
I am looking for a search engine script that runs on a Unix server. We have no illusions about becoming another Yahoo or Google, so we're not looking for a top-of-the-line custom product ... in fact, ideally we'd find one reasonably priced "off-the-shelf".
The important thing is that it will index the HTML pages located at a long list of *external* URLs that I would enter (perhaps a couple thousand), as opposed to only indexing the pages within our own site (none of the external URLs would be hosted in our own server's directory).
Can anyone make a recommendation?
Thanks....
=========================================
Here are some more details, if you think you know of a script...
=========================================
[1] As mentioned, because all the pages it would index are located outside of our directory, the script must be capable of crawling/indexing specifically defined external URLs;
[2] I'd prefer the option to add all new URLs myself, so we could review each site's content before adding it to the index (i.e., inclusion is not automatic);
[3] It would be useful if we could set how deep the crawl would be on these external sites (to keep the size of the database manageable);
[4] Going along with #3, the spider would *only* go through pages located on the submitted domain. In other words, it would not continue to spider other sites linked from the submitted URL (there's a rough sketch of what I mean just after this list);
[5] I'd like to be able to set the "look" of the results pages via HTML templates, and it would be best if we could adjust other settings via a built-in control panel (or at the very least, clearly explained admin pages);
[6] It should initially be able to handle at least a couple thousand URLs (perhaps 20,000 to 30,000 pages?).
[7] It should rank the results of a search query by relevance.
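To make items [3] and [4] a bit more concrete, here is a rough sketch of the crawl behavior I'm describing. It's purely illustrative, not something we've written: I'm assuming the LWP::UserAgent, HTML::LinkExtor, and URI modules, and the depth limit and starting URL are made up.

#!/usr/bin/perl
# Illustration only: a depth-limited crawl that stays on the submitted domain.
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

my $MAX_DEPTH = 3;    # item [3]: how deep to follow links on each site
my $ua = LWP::UserAgent->new( timeout => 15 );

sub crawl {
    my ( $url, $depth, $seen ) = @_;
    return if $depth > $MAX_DEPTH or $seen->{$url}++;

    my $resp = $ua->get($url);
    return unless $resp->is_success and $resp->content_type eq 'text/html';

    print "indexing: $url\n";    # the real script would index the page text here

    my $base = URI->new($url);
    my @links;
    my $extor = HTML::LinkExtor->new(
        sub {
            my ( $tag, %attr ) = @_;
            push @links, $attr{href} if $tag eq 'a' and $attr{href};
        },
        $base    # absolutize relative links against the page URL
    );
    $extor->parse( $resp->decoded_content );

    for my $link (@links) {
        my $uri = URI->new($link);
        next unless $uri->scheme and $uri->scheme =~ /^https?$/;
        # item [4]: never leave the submitted domain
        next unless $uri->host eq $base->host;
        crawl( $uri->as_string, $depth + 1, $seen );
    }
}

# made-up starting URL; the real list would come from our own submissions
crawl( 'http://www.example.com/', 0, {} );

The real script would obviously hand each fetched page to its indexer rather than just printing the URL; the depth check and the same-host check are the two restrictions I care about.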
--- These next features would be nice to have, but not essential ---
[1] We could set which parts of a page to index (meta tags, body text, alt tags, etc.);
[2] Some type of filtering feature (to block spamming, for example);
[3] The script could import URLs from a delimited database, and export in the same manner (a small illustration of the format I have in mind follows this list);
[4] We'd ideally want to set an automatic re-indexing schedule, to update content and remove dead links on a regular basis (though if we had to do this manually, that would be ok).
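On the import/export point ([3] just above), all I have in mind is a flat delimited file, something along these lines (the pipe delimiter, file names, and field layout are just an example, not a requirement):

#!/usr/bin/perl
# Illustration only: read a pipe-delimited URL list, and write it back out
# in the same format.  The field layout here is made up.
use strict;
use warnings;

my @sites;
open my $in, '<', 'sites.txt' or die "cannot open sites.txt: $!";
while ( my $line = <$in> ) {
    chomp $line;
    next if $line =~ /^\s*$/ or $line =~ /^#/;    # skip blanks and comments
    my ( $url, $max_depth, $last_indexed ) = split /\|/, $line;
    push @sites, { url => $url, max_depth => $max_depth, last_indexed => $last_indexed };
}
close $in;

# ... crawl / re-index each site here ...

open my $out, '>', 'sites_export.txt' or die "cannot write sites_export.txt: $!";
print {$out} join( '|', @{$_}{qw(url max_depth last_indexed)} ), "\n" for @sites;
close $out;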
--------- Additional notes ---------
* It is not important for this script to categorize its results from these external sites;
* It is not important that this search engine have the ability to search the web at large if it could not find results in our own database of indexed pages;
* People from external sites will not be accessing any kind of account, so password-protected access is not necessary.
==========================