Web Scraping 1

yeddish · Jul 19, 2007

I am trying to do some web scraping to extract some numbers from various financial websites and I was wondering if anyone had any recommendations on good ways to do this.

I had code that would just parse the page in an inflexible fashion and extract the data, but I would like for this to be a bit more reliable than such a "brutish" method.

I am working with HTML::TreeBuilder::XPath right now and am having some problems, but I just wanted to pose the question and get some feedback.

Thanks!
-Joel

chazoid · Jul 19, 2007

What's your question? I'm sure someone here can help.

travs69 · Jul 19, 2007

He has tried web scraping with XPath:

http://www.tek-tips.com/viewthread.cfm?qid=1389211&page=1

but was wondering whether anyone has used something better or has any better suggestions.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]

Travis - Those Who Say It Cannot Be Done Are Usually Interrupted by Someone Else Doing It; Give the wrong symptoms, get the wrong solutions;

yeddish · Jul 19, 2007

Yeah, what travs69 said... Sorry, I suppose I didn't make myself very clear. %}

-Joel

KevinADC · Jul 19, 2007

You will need to be more specific about what you are scraping; otherwise, search through the HTML:: family of modules and find one that suits your needs. In general, though, using an HTML parser is the way to go.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]

yeddish · Jul 19, 2007

Ok. What I need to do is to be able to get some numbers (floats, mostly) from money.msn.com and other similar sites. These sites, of course, have a lot of tables and things of that nature.
I want to extract the data from these web pages and use the numbers to create some data that is more relevant than just the raw data provided. I also want the package to be somewhat resilient to changes in the basic format of the page.

If someone knows of a package better than HTML::TreeBuilder::XPath for doing this, I would like to know what that package is.

Does that describe the situation better?

-Joel

chazoid · Jul 19, 2007

I think HTML::TokeParser is pretty popular for this sort of thing.
I'm not sure I'm using it correctly in this example, since I didn't have much time to read the documentation. It extracts the last trade price for Google Inc. from a saved Yahoo Finance page by looking for the span with the id "yfs_l10_goog". That id is used twice on the page, so you'll get two results.

Code:

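use strict;
use warnings;
use HTML::TokeParser;

# Parse a saved copy of the Yahoo Finance quote page.
my $p = HTML::TokeParser->new("yahoo.htm") or die "Can't open: $!";

# Walk every <span>; print the text of those carrying the last-trade id.
while (my $token = $p->get_tag("span")) {
    my $id = $token->[1]{id};
    next unless defined $id && $id eq "yfs_l10_goog";   # many spans have no id
    print $p->get_trimmed_text("/span"), "\n";
}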

yeddish · Jul 19, 2007

That looks handy. I will check that one out.
Thank you very much.
-Joel

travs69 · Jul 19, 2007

See.. I knew there was someone here who knew something [yoda]


KevinADC · Jul 19, 2007

Maybe HTML::TableExtract, if the data is inside HTML tables.

yeddish · Jul 19, 2007

HTML::TableExtract is actually what I'm using now. It does what it is designed for very well and is easy to use, but if a table is added anywhere above my data, my extraction breaks.

Thank you for the suggestion, though!
-Joel
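
One possible way around the "table added above my data" problem: HTML::TableExtract can pick a table by its column headers (the headers option) instead of by its position on the page. The following is only a minimal sketch under stated assumptions. The sample HTML, the column names, the numbers, and the book_values helper are all made up for illustration; a real page will need its own header strings.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up sample page: an unrelated table sits above the one we want.
my $html = <<'HTML';
<table><tr><td>Ad banner</td></tr></table>
<table>
  <tr><th>Year</th><th>Book Value/Share</th></tr>
  <tr><td>2005</td><td>31.87</td></tr>
  <tr><td>2006</td><td>55.41</td></tr>
</table>
HTML

# Hypothetical helper: returns a hashref mapping year to book value.
sub book_values {
    my ($content) = @_;

    # Header strings are matched against cell text, so 'Book Value'
    # also matches a cell that reads 'Book Value/Share'.
    my $te = HTML::TableExtract->new( headers => [ 'Year', 'Book Value' ] );
    $te->parse($content);

    my %value_for;
    for my $table ( $te->tables ) {
        for my $row ( $table->rows ) {
            my ( $year, $value ) = @$row;
            next unless defined $year && defined $value;
            s/^\s+|\s+$//g for $year, $value;    # trim stray whitespace
            $value_for{$year} = $value;
        }
    }
    return \%value_for;
}

my $values = book_values($html);
print "$_: $values->{$_}\n" for sort keys %$values;
```

Because the table is chosen by its headers, inserting or removing other tables above it should not change the result, although renaming the columns on the site still would.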

yeddish · Jul 20, 2007

Ok. Now that I got HTML::TreeBuilder::XPath to function properly, it is the best thing that I can use.
The problems that I was coming across were my own stupid fault (as usual). This is really a nice package.

Thanks for all the suggestions and help.

-Joel
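
For anyone reading along, here is a rough sketch of the HTML::TreeBuilder::XPath approach. It is not Joel's actual code. It assumes the element of interest carries a stable id (the "yfs_l10_goog" id mentioned earlier in the thread); the sample HTML, the price in it, and the last_trade helper are invented for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up sample page with extra markup above the value we want.
my $html = <<'HTML';
<html><body>
  <div class="promo">Extra markup added above the data</div>
  <table>
    <tr><td>Last Trade:</td><td><span id="yfs_l10_goog">512.51</span></td></tr>
  </table>
</body></html>
HTML

# Hypothetical helper: returns the text of the span with the given id.
sub last_trade {
    my ($content) = @_;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content($content);

    # Select by id rather than by position, so added tables or divs
    # elsewhere on the page do not affect the match.
    my $price = $tree->findvalue('//span[@id="yfs_l10_goog"]');

    $tree->delete;      # free the parse tree
    return "$price";    # force a plain string
}

print last_trade($html), "\n";
```

Anchoring the XPath expression on an id or a nearby label, rather than on something like "the third table", is what gives some resilience to layout changes.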

brigmar · Jul 22, 2007

Just a thought to make the process easier: Parse the XML from their RSS feeds (assuming that the data you want is in those feeds). At least then you're not dealing with presentation code.

yeddish · Jul 23, 2007

That's a good idea, but I wanted to be able to get the raw numbers for any company, given the stock ticker symbol.
Like, I type in GOOG and it kicks me back the Book Value per Share for the last six years and the ROIC growth rates for 1 and 5 years, etc.
There's data from several pages of a company's info that I like to use, but I have to click around all over to get it. I just want to see all the numbers relevant to MY trading style on one page.

I'm actually jamming really well on this right now. I've got all my funky kinks worked out.

Thanks for all the help!
-Joel

brigmar · Jul 23, 2007

Google itself has what looks like very parsable pages at

http://finance.google.com/

yeddish · Jul 25, 2007

Google doesn't have the info that I'm looking for on their site, actually, but in looking just now, I did notice that they had something else that I am interested in...

Thank you very much for the suggestion.
-Joel

chazoid · Jul 25, 2007

It looks like they don't provide an API for Google Finance, but you can pull the data into Google Spreadsheets and then read the spreadsheets through the spreadsheet API.
