Can Flash read HTML files to extract information?

Status
Not open for further replies.

Maccaday

Technical User
Dec 9, 2003
71
GB
Hi,

I'm trying to create a special kind of crawler that reads the HTML pages on a website to extract certain kinds of information, such as headings and copyright information. This information could be generated by a database, so it might appear in the same place on multiple HTML pages, or it might differ from page to page. The extracted information will need to be sent to a database on a central server, along with other information, which will form the basis of a special kind of search engine.
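
For illustration only, the kind of extraction described above could be sketched in Python (not Flash) with nothing but the standard library. The class and function names here (`HeadingAndCopyrightParser`, `extract_info`) are made up for this example, and the copyright check is a rough guess based on matching "©", "(c)" or "copyright" in the page text. It does not cover sending the results to a central database.

```python
import re
from html.parser import HTMLParser


class HeadingAndCopyrightParser(HTMLParser):
    """Collects h1-h6 heading text and text that looks like a copyright notice.

    Hypothetical helper for illustration; the copyright pattern is an assumption.
    """

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    COPYRIGHT_RE = re.compile(r"(©|\(c\)|copyright)", re.IGNORECASE)

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.headings = []          # list of (tag, text) pairs
        self.copyright_lines = []   # text runs that match COPYRIGHT_RE
        self._current_heading = None  # text pieces while inside a heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._current_heading = []

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._current_heading is not None:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._current_heading).split())
            if text:
                self.headings.append((tag, text))
            self._current_heading = None

    def handle_data(self, data):
        if self._current_heading is not None:
            self._current_heading.append(data)
        text = " ".join(data.split())
        if text and self.COPYRIGHT_RE.search(text):
            self.copyright_lines.append(text)


def extract_info(html):
    """Return the headings and copyright-like text found in an HTML string."""
    parser = HeadingAndCopyrightParser()
    parser.feed(html)
    parser.close()
    return {"headings": parser.headings, "copyright": parser.copyright_lines}


if __name__ == "__main__":
    page = "<h1>Welcome <b>home</b></h1><p>&copy; 2003 Example Ltd</p>"
    print(extract_info(page))
```

A real crawler would also need to fetch the pages and cope with badly formed markup, which this sketch leaves out.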

The pages that will be read by the spider could be from anyone, and the person who sets up the spider could be using any OS, so the spider needs to be cross-platform. I am guessing that a Flash movie is going to be the easiest way to make everything user-friendly and cross-platform. The user would set up the crawler within a special Flash browser, which would record all the relevant information needed to crawl the pages with the user's settings. However, I'm not certain that Flash is capable of doing everything that I'm looking for.

The things that I need to know are:

1) Is Flash capable of reading HTML files, to extract some information from them?

2) Is Flash capable of saving information internally (as strings, variables etc), so that it can be used or exported later?

3) Could it be set up to highlight certain parts of HTML code, such as a table, or a paragraph, so that users would find it easier to know which parts of the HTML they were extracting?

4) Can Flash run iterative processes, so that once one page has been set up, pages with a similar structure can be extracted automatically?

5) If Flash can do 4, is it possible to select a percentage of similarity, so that pages with similar, but not identical structure, could be dealt with automatically?

6) If Flash couldn't be automated to scan multiple pages that have the same layout, then how easy would it be to export some variables to a spider written in a conventional computer language, which would then do the searching?
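
As a rough idea of what questions 4 and 5 might look like in a conventional language, here is a minimal Python sketch that scores two pages by how similar their tag structure is and compares the score against a chosen percentage. The names (`tag_sequence`, `structural_similarity`, `is_similar`) and the 80% default are assumptions made for the example, and comparing the flat sequence of opening tags is only one possible definition of "similar structure".

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Records the name of every opening tag in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names in an HTML string."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.tags


def structural_similarity(html_a, html_b):
    """Return a 0-100 score for how alike the two tag sequences are."""
    matcher = SequenceMatcher(
        None, tag_sequence(html_a), tag_sequence(html_b), autojunk=False
    )
    return 100.0 * matcher.ratio()


def is_similar(html_a, html_b, threshold_percent=80.0):
    """True if the pages meet the chosen similarity percentage (assumed default)."""
    return structural_similarity(html_a, html_b) >= threshold_percent


if __name__ == "__main__":
    template = "<html><body><h1>Title</h1><p>Text</p></body></html>"
    candidate = "<html><body><h1>Other</h1><p>One</p><p>Two</p></body></html>"
    print(structural_similarity(template, candidate))
    print(is_similar(template, candidate, threshold_percent=80.0))
```

Pages that pass the threshold could then be run through the same extraction settings as the page that was set up by hand.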


I know of a plugin for Internet Explorer called WebTweezers that is a good example of highlighting HTML for a user to extract, but it can only deal with pages individually. Another program called Visual Web Task has a more comprehensive selection of search facilities, but is not very user-friendly, costs a lot, and is platform-specific. I need something that is easy to set up for any user, can run the same search on multiple pages, and can be read on any platform.

Any comments or answers will be most gratefully received.
 
Crawlers make my life miserable enough as it is. Don't tell me I'll now have to fight Flash crawlers?
No, it's not possible with Flash. At least, I dearly hope not! [evil]
 