Hi,
I'm trying to create a special kind of crawler that reads the HTML pages on a website and extracts certain kinds of information, such as headings and copyright notices. This information could be generated from a database, so it would appear in the same place on multiple HTML pages, or it might differ from page to page. The extracted information will need to be sent, along with other data, to a database on a central server, where it will form the basis of a special kind of search engine.
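To make the idea concrete, here is a rough sketch in Python of the extraction and upload step as I picture it. The endpoint URL and the copyright heuristic are just placeholders I made up, not a finished design:

    import json
    import re
    import urllib.request
    from html.parser import HTMLParser

    class HeadingParser(HTMLParser):
        """Collects the text of h1-h6 elements."""
        def __init__(self):
            super().__init__()
            self.in_heading = False
            self.headings = []

        def handle_starttag(self, tag, attrs):
            if re.fullmatch(r"h[1-6]", tag):
                self.in_heading = True
                self.headings.append("")

        def handle_endtag(self, tag):
            if re.fullmatch(r"h[1-6]", tag):
                self.in_heading = False

        def handle_data(self, data):
            if self.in_heading:
                self.headings[-1] += data.strip()

    def extract(url):
        html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        parser = HeadingParser()
        parser.feed(html)
        # Copyright notices are usually plain text, e.g. "(c) 2004 Example Corp".
        notices = re.findall(r"(?:&copy;|\(c\)|copyright)[^<\r\n]*", html, re.I)
        return {"url": url, "headings": parser.headings, "copyright": notices}

    def send_to_server(record, endpoint="http://example.com/submit"):
        # "endpoint" is a stand-in for the central server's real address.
        data = json.dumps(record).encode("utf-8")
        req = urllib.request.Request(endpoint, data=data,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)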
The pages that will be read by the spider could come from anyone, and the person who sets up the spider could be using any OS, so the spider needs to be cross-platform. I am guessing that a Flash movie is going to be the easiest way to make everything user-friendly and cross-platform. The user would set up the crawler within a special Flash browser, which would record all the information needed to crawl the pages with the user's settings. However, I'm not certain that Flash is capable of doing everything I'm looking for.
The things that I need to know are:
1) Is Flash capable of reading HTML files and extracting information from them?
2) Is Flash capable of storing information internally (as strings, variables, etc.) so that it can be used or exported later?
3) Could it be set up to highlight certain parts of the HTML, such as a table or a paragraph, so that users could more easily see which parts they were extracting?
4) Can Flash run iterative processes, so that once one page has been set up, pages with a similar structure can be processed automatically?
5) If Flash can do 4), is it possible to set a similarity threshold (a percentage), so that pages with similar but not identical structure could be handled automatically?
6) If Flash can't be automated to scan multiple pages that have the same layout, how easy would it be to export some variables to a spider written in a conventional language, which would then do the searching? (I sketch what I mean just below this list.)
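To illustrate question 6, this is roughly the hand-off I have in mind: the Flash front end would export the user's choices to a small settings file, and a spider in a conventional language would do the repetitive crawling. The file name and the settings format here are purely hypothetical:

    import json
    import re
    import urllib.request

    def load_settings(path="crawl_settings.json"):
        # Hypothetical format: {"start_urls": [...], "pattern": "<h1>(.*?)</h1>"}
        with open(path) as fh:
            return json.load(fh)

    def crawl(settings):
        results = {}
        for url in settings["start_urls"]:
            html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
            # Apply the same extraction rule the user set up on the first page.
            results[url] = re.findall(settings["pattern"], html, re.S)
        return results

    if __name__ == "__main__":
        print(crawl(load_settings()))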
I know of an Internet Explorer plugin called WebTweezers that is a good example of highlighting HTML for the user to extract, but it can only handle pages individually. Another program, Visual Web Task, has a more comprehensive set of search facilities, but it is not very user-friendly, costs a lot, and is platform-specific. I need something that is easy for any user to set up, can run the same search on multiple pages, and works on any platform.
Any comments or answers will be most gratefully received.