Web Scraping 1

Status
Not open for further replies.

yeddish

IS-IT--Management
Jul 18, 2007
40
US
I am trying to do some web scraping to extract some numbers from various financial websites and I was wondering if anyone had any recommendations on good ways to do this.

I had code that would just parse the page in an inflexible fashion and extract the data, but I would like for this to be a bit more reliable than such a "brutish" method.

I am working with HTML::TreeBuilder::XPath right now and am having some problems, but I just wanted to pose the question and get some feedback.

Thanks!
-Joel
 
What's your question? I'm sure someone here can help.
 
He's tried web scraping with XPath, but was wondering whether anyone has used something better or has any better suggestions.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those Who Say It Cannot Be Done Are Usually Interrupted by Someone Else Doing It; Give the wrong symptoms, get the wrong solutions;
 
Yeah, what travs69 said... Sorry, I suppose I didn't make myself very clear. %}

-Joel
 
You will need to be more specific about what you are scraping; otherwise, search through the HTML:: family of modules on CPAN and find one that suits your specific needs. In general, though, using an HTML parser is the way to go.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Ok. What I need to do is to be able to get some numbers (floats, mostly) from money.msn.com and other similar sites. These sites, of course, have a lot of tables and things of that nature.
I want to extract the data from these web pages and use the numbers to create some data that is more relevant than just the raw data provided. I also want the package to be somewhat resilient to changes in the basic format of the page.

If someone knows of a package better than HTML::TreeBuilder::XPath for doing this, I would like to know what that package is.

Does that describe the situation better? :)
-Joel
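
Whichever parser ends up pulling the cells out, the text still has to be turned into floats before any calculation. The following is only a minimal sketch of that step: the helper name to_number and the sample formats (thousands separators, accounting-style parentheses for negatives, a trailing percent sign) are assumptions about how such sites display figures, not something taken from money.msn.com.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a displayed figure such as "1,234.56",
# "(45.10)" or "12.5%" into a plain number. Returns undef when the
# text is not numeric (for example "N/A").
sub to_number {
    my ($text) = @_;
    return undef unless defined $text;
    (my $s = $text) =~ s/^\s+|\s+$//g;        # trim surrounding whitespace
    my $negative = $s =~ s/^\((.*)\)$/$1/;    # "(45.10)" means -45.10
    $s =~ s/[,\$%]//g;                        # drop separators and symbols
    return undef unless $s =~ /^-?(?:\d+\.?\d*|\.\d+)$/;
    my $n = $s + 0;
    return $negative ? -$n : $n;
}

for my $raw ('1,234.56', '(45.10)', '12.5%', 'N/A') {
    my $n = to_number($raw);
    print "$raw => ", (defined $n ? $n : 'not a number'), "\n";
}
```

Note that a percentage is returned as the displayed figure (12.5, not 0.125); whether to divide by 100 depends on what the later calculation expects.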
 
I think HTML::TokeParser is pretty popular for this sort of thing.
I'm not sure if I'm using it correctly in this example, as I didn't have much time to read the documentation, but this will extract the last trade price for Google Inc. from a saved Yahoo Finance page by looking for the span with the id "yfs_l10_goog". That id is used twice on the page, so you'll get two results.
Code:
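use strict;
use warnings;
use HTML::TokeParser;

# Parse a saved copy of the Yahoo Finance quote page.
my $p = HTML::TokeParser->new("yahoo.htm") or die "Can't open: $!";

# Visit every <span> and print the text of those carrying the quote id.
while (my $token = $p->get_tag("span")) {
    my $id = $token->[1]{id};
    # Skip spans without an id to avoid "uninitialized value" warnings.
    next unless defined $id && $id eq "yfs_l10_goog";
    print $p->get_trimmed_text("/span"), "\n";
}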
 
That looks handy. I will check that one out.
Thank you very much.
-Joel
 
See.. I knew there was someone here who knew something [yoda]

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those Who Say It Cannot Be Done Are Usually Interrupted by Someone Else Doing It; Give the wrong symptoms, get the wrong solutions;
 
Maybe HTML::TableExtract, if the data is inside HTML tables.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
HTML::TableExtract is actually what I'm using now. It does what it is designed for very well and is easy to use, but if a table is added anywhere above my data, my extraction breaks.

Thank you for the suggestion, though!
-Joel
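
For the breakage described above, one option worth trying is HTML::TableExtract's headers mode, which locates a table by its column headings rather than by its position on the page. This is a rough sketch only: the HTML, the column names, and the book_values helper are invented for illustration and do not reflect any real site's markup.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical page: an unrelated table sits above the one we care about.
my $html = <<'HTML';
<table><tr><td>Advertisement</td></tr></table>
<table>
  <tr><th>Year</th><th>Book Value/Share</th></tr>
  <tr><td>2006</td><td>55.47</td></tr>
  <tr><td>2005</td><td>31.87</td></tr>
</table>
HTML

# Return [year, value] pairs from any table whose headers match.
# Header strings are matched as patterns against the header cells,
# so 'Book Value' also matches "Book Value/Share".
sub book_values {
    my ($content) = @_;
    my $te = HTML::TableExtract->new(headers => ['Year', 'Book Value']);
    $te->parse($content);
    my @rows;
    for my $table ($te->tables) {
        for my $row ($table->rows) {
            my ($year, $value) = @$row;
            next unless defined $year && defined $value;
            push @rows, [$year, $value];
        }
    }
    return @rows;
}

print "$_->[0]: $_->[1]\n" for book_values($html);
```

Because the match is on headings, adding or removing tables elsewhere on the page should not matter; a renamed column heading would still break it.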
 
Ok. Now that I got HTML::TreeBuilder::XPath to function properly, it is the best thing that I can use.
The problems that I was coming across were my own stupid fault (as usual). This is really a nice package.

Thanks for all the suggestions and help. :)
-Joel
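
Joel did not post his working code, so the following is only a guess at the general shape of an HTML::TreeBuilder::XPath lookup. It anchors on a row's label text rather than on table position, which is one way to stay resilient to layout changes. The markup, the labels, and the ratio helper are all made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical markup; real pages will differ.
my $html = <<'HTML';
<html><body>
<table><tr><td>Unrelated</td><td>1</td></tr></table>
<table>
  <tr><td>ROIC 1 Yr</td><td>21.5</td></tr>
  <tr><td>ROIC 5 Yr</td><td>18.2</td></tr>
</table>
</body></html>
HTML

# Return the text of the cell immediately to the right of the cell
# whose text equals $label, or an empty string if there is no match.
sub ratio {
    my ($content, $label) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($content);
    $tree->eof;
    my $value = $tree->findvalue(
        qq{//td[normalize-space(.)="$label"]/following-sibling::td[1]}
    );
    $tree->delete;    # free the parse tree
    return $value;
}

print "$_ = ", ratio($html, $_), "\n" for 'ROIC 1 Yr', 'ROIC 5 Yr';
```

The label is interpolated straight into the XPath expression, so this sketch assumes labels that contain no double quotes.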
 
Just a thought to make the process easier: Parse the XML from their RSS feeds (assuming that the data you want is in those feeds). At least then you're not dealing with presentation code.

 
That's a good idea, but I wanted to be able to get the raw numbers for any company, given the stock ticker symbol.
Like, I type in GOOG and it kicks me back the Book Value per Share for the last six years and the ROIC growth rates for 1 and 5 years, etc.
There's data from several pages of a company's info that I like to use, but I have to click around all over to get it. I just want to see all the numbers relevant to MY trading style on one page.

I'm actually jamming really well on this right now. I've got all my funky kinks worked out.

Thanks for all the help!
-Joel
 
Google doesn't have the info that I'm looking for on their site, actually, but in looking just now, I did notice that they had something else that I am interested in...

Thank you very much for the suggestion.
-Joel
 
Looks like they don't provide an API for Google Finance, but you can access it with Google Spreadsheets, then access the spreadsheets with the spreadsheet API.
 