Scraping an HTML file for specific data?

Status
Not open for further replies.

1DMF

Programmer
Jan 18, 2005
8,795
GB
I have managed to get what I need from
The problem I now have is how to process the final returned HTML.

I've looked at HTML::Parser, but I can't understand what it actually does, and to me it seems to be simply something for generating HTML from a file.

Maybe I'm wrong, but this is another typical piece of CPAN module documentation; it might as well be written in Mandarin Chinese as far as I'm concerned.

So can someone advise how I traverse an HTML document returned by Mechanize to strip out the required data?

Many thanks,

1DMF.

"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"
 
HTML::Parser is extremely complicated. HTML::TokeParser::Simple will probably do what you need, and it is far easier to use.
 
hey ishnid,

Yes, I'm finding HTML::Parser far too complicated. I've also tried HTML::TreeBuilder.

I've ended up with an array of hashes after parsing the document, and again I don't understand what it represents. I did a loop -> print and got this:
key = _done , value = 1
key = xmlns , value =
key = _implicit_tags , value = 1
key = _tighten , value = 1
key = _head , value = HTML::Element=HASH(0x325ae64)
key = _store_comments , value = 0
key = _content , value = ARRAY(0x324fbb0)
key = _body , value = HTML::Element=HASH(0x32347f0)
key = _ignore_unknown , value = 1
key = _decl , value = HTML::Element=HASH(0x32053f4)
key = _pos , value =
key = _ignore_text , value = 0
key = xml:lang , value = en
key = _no_space_compacting , value = 0
key = _implicit_body_p_tag , value = 0
key = _warn , value = 0
key = _p_strict , value = 0
key = _hparser_xs_state , value = SCALAR(0x325e5fc)
key = _element_count , value = 3
key = _store_declarations , value = 1
key = _tag , value = html
key = _store_pis , value = 0
key = _element_class , value = HTML::Element

It makes no sense to me and still looks nothing like an HTML document I can traverse to grab data. I'm thinking of just rolling a regex and being done with it, but I'll give your HTML::TokeParser::Simple a whirl first.
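
For anyone landing here with the same dump: those keys look like the internal fields of the root HTML::Element object (the tree object is itself a blessed hash), so looping over them shows the parser's bookkeeping rather than the page. The tree is meant to be queried through methods such as look_down and as_text. Below is a minimal sketch, assuming the goal is the text of every h2 element; the sample HTML and the heading_texts helper are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample page; in practice this would be the content
# fetched earlier in the script.
my $html = '<html><head><title>Demo</title></head><body>'
         . '<h2>First heading</h2><p>Some text</p>'
         . '<h2>Second heading</h2></body></html>';

# Hypothetical helper: returns the text inside every <h2> element.
sub heading_texts {
    my ($content) = @_;

    # Parse the whole string into a tree of HTML::Element objects.
    my $tree = HTML::TreeBuilder->new_from_content($content);

    # In list context, look_down returns every element that matches
    # the criteria; as_text returns the text content of an element.
    my @texts = map { $_->as_text } $tree->look_down( _tag => 'h2' );

    # Explicitly free the tree (required on older HTML::Tree releases).
    $tree->delete;

    return @texts;
}

print "$_\n" for heading_texts($html);
```

The same look_down call accepts attribute criteria as well, for example class => 'price', if the wanted data sits in elements marked up that way.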
 
I give up, ishnid. I can't make head nor tail of how to traverse each node and strip things away until all that is left is the data I'm after.

I tried HTML::TokeParser::Simple and used the example method:
my $parser = HTML::TokeParser::Simple->new(string => $cont);

while ( my $token = $parser->get_token ) {
    next unless $token->is_tag('h2');
    print $token->as_is, "<br />";
}

All that got printed was <h2></h2><br />. The actual data between the tags is missing, exactly the opposite of what I want!

I want to grab this data, not remove it!!

I can't work out what I'm doing with these parsers, so I'm opting for a regex, which is far simpler.

Any chance you could help with that thread?

Thanks,
1DMF.
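
A note on the result above: is_tag('h2') matches only the opening and closing h2 tags, and the text between them arrives as separate text tokens, which the "next unless" line skips. That would explain the empty <h2></h2> output. Below is a minimal sketch of one way round it, assuming the aim is the text inside each h2; the h2_texts helper and the sample string are hypothetical and not part of the original script.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: collects the text found between <h2> and </h2>.
sub h2_texts {
    my ($content) = @_;
    my $parser = HTML::TokeParser::Simple->new( string => $content );

    my @texts;
    my $in_h2 = 0;    # true while we are between <h2> and </h2>

    while ( my $token = $parser->get_token ) {
        if ( $token->is_start_tag('h2') ) {
            $in_h2 = 1;
            push @texts, '';    # start a new entry for this heading
        }
        elsif ( $token->is_end_tag('h2') ) {
            $in_h2 = 0;
        }
        elsif ( $in_h2 && $token->is_text ) {
            # Text tokens are separate from tag tokens; as_is returns
            # the raw text (entities are not decoded here).
            $texts[-1] .= $token->as_is;
        }
    }
    return @texts;
}

# Made-up sample input for illustration only.
my $sample = '<h2>One</h2><p>skip me</p><h2>Two <b>bold</b></h2>';
print "$_\n" for h2_texts($sample);
```

Tags nested inside the heading, such as the <b> in the sample, are dropped and only their text is kept, which is usually what you want when scraping.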