Scraping an HTML file for specific data?

1DMF · Nov 13, 2007

I have managed to get what I need from WWW::Mechanize.

The problem I now have is how do I process the final returned HTML.

I've looked at HTML::Parser, but I can't understand what it actually does, and to me it seems to be simply something for generating HTML from a file.

Maybe I'm wrong, but this is another typical piece of CPAN module documentation; it might as well be written in Mandarin Chinese as far as I'm concerned.

So can someone advise how I traverse an HTML document returned by WWW::Mechanize, to strip out the required data?

many thanks

1DMF.

"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"

ishnid · Nov 13, 2007

HTML::Parser is extremely complicated. HTML::TokeParser::Simple will probably do what you need it to, and is far easier to use.

1DMF · Nov 13, 2007

hey ishnid,

Yes, I'm finding HTML::Parser far too complicated. I've tried HTML::TreeBuilder.

I've ended up with an array of hashes after parsing the document, and again I don't understand what it represents. I did a loop -> print and got this:

key = _done , value = 1
key = xmlns , value = http://www.w3.org/1999/xhtml
key = _implicit_tags , value = 1
key = _tighten , value = 1
key = _head , value = HTML::Element=HASH(0x325ae64)
key = _store_comments , value = 0
key = _content , value = ARRAY(0x324fbb0)
key = _body , value = HTML::Element=HASH(0x32347f0)
key = _ignore_unknown , value = 1
key = _decl , value = HTML::Element=HASH(0x32053f4)
key = _pos , value =
key = _ignore_text , value = 0
key = xml:lang , value = en
key = _no_space_compacting , value = 0
key = _implicit_body_p_tag , value = 0
key = _warn , value = 0
key = _p_strict , value = 0
key = _hparser_xs_state , value = SCALAR(0x325e5fc)
key = _element_count , value = 3
key = _store_declarations , value = 1
key = _tag , value = html
key = _store_pis , value = 0
key = _element_class , value = HTML::Element

It makes no sense to me and still looks nothing like an HTML document I can traverse to grab data. I'm thinking of just rolling a regex and being done with it, but I'll give your HTML::TokeParser::Simple a whirl first.
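For context, the dump above is the set of internal fields of the root HTML::Element object (note `_tag = html`), not the document content itself. The tree is normally queried through methods rather than by looping over the hash. A minimal sketch, assuming the goal is the text of every `<h2>` element; the sample markup in `$cont` is a hypothetical stand-in for the page returned by WWW::Mechanize:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample standing in for the HTML returned by WWW::Mechanize.
my $cont = '<html><body><h2>First heading</h2><p>Some text</p>'
         . '<h2>Second heading</h2></body></html>';

# Parse the string into a tree of HTML::Element objects.
my $tree = HTML::TreeBuilder->new_from_content($cont);

# In list context, look_down returns every matching element in document order;
# as_text returns the text contained in an element.
my @headings = map { $_->as_text } $tree->look_down( _tag => 'h2' );

print "$_\n" for @headings;

# Release the tree explicitly once it is no longer needed.
$tree->delete;
```

Other criteria, such as attribute values (for example `class => 'price'`, a made-up name), can be passed to `look_down` in the same call to narrow the match.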

1DMF · Nov 13, 2007

I give up, ishnid. I can't make head nor tail of how I traverse each node and strip things away until all that is left is the data I'm after.

I tried HTML::TokeParser::Simple, and used the example method:

my $parser = HTML::TokeParser::Simple->new(string => $cont);

while ( my $token = $parser->get_token ) {
    next unless $token->is_tag('h2');
    print $token->as_is, "<br />";
}

and all that got printed was <h2></h2><br />; the actual data between the tags is missing, exactly the opposite of what I want!

I want to grab this data, not remove it!!
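The likely reason is that a token stream treats the start tag, the text, and the end tag as separate tokens, and `is_tag('h2')` matches only the two tag tokens. A minimal sketch of one way to collect the text instead, assuming the `get_trimmed_text` method that HTML::TokeParser::Simple inherits from HTML::TokeParser; the sample markup in `$cont` is hypothetical:

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample standing in for the HTML returned by WWW::Mechanize.
my $cont = '<html><body><h2> First heading </h2><p>Some text</p>'
         . '<h2>Second <em>heading</em></h2></body></html>';

my $parser = HTML::TokeParser::Simple->new( string => $cont );

my @headings;
while ( my $token = $parser->get_token ) {
    # Act only on the opening <h2>; its text arrives as later tokens.
    next unless $token->is_start_tag('h2');

    # Read all text up to the closing </h2>, skipping nested tags
    # and collapsing surrounding whitespace.
    push @headings, $parser->get_trimmed_text('/h2');
}

print "$_\n" for @headings;
```

The same pattern should work for other elements by changing the tag name in both calls.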

I can't work out what I'm doing with these parsers, so I'm opting for a regex, which is far simpler.

Any chance you could help with that thread?

Thanks,
1DMF.