
screen scraping

Status
Not open for further replies.

LAdProg2005

Programmer
Feb 20, 2006
56
US
hello

I am new to both Perl and screen scraping.

I need to get data from a table on a web page. It is a very simple page with an HTML table that has 7 columns and several rows of data.

I need starting pointers, an example, or anything else that can guide me through this task while I learn how it actually works: how to start and end the scraping, how to store the data in variables, and so on.

any help is appreciated
thanks
 
I suggest you start by taking a look at a perl module called WWW::Mechanize.
You should be able to use it to collect the page in question from the site and then maybe with regular expressions collect the specific data you're looking for.
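
A minimal sketch of the regular-expression step, assuming the page is a simple, regular table. The markup in $html is a made-up stand-in for the fetched page; with WWW::Mechanize the string would come from $mech->get($url) followed by $mech->content. Regexes are brittle on real-world HTML, so treat this as a starting point only.

```perl
use strict;
use warnings;

# Hypothetical sample markup standing in for the fetched page content.
my $html = <<'HTML';
<table align="center" border="1">
<tr align="CENTER" class="title">
<td width="10%" class="title">1st</td>
<td width="20%" class="title">Location</td>
</tr>
<tr align="CENTER">
<td width="10%">
1
</td>
<td width="20%">
IL
</td>
</tr>
</table>
HTML

my @rows;
# Walk over each <tr>...</tr> block, then pull the cells out of that block.
while ($html =~ m{<tr\b[^>]*>(.*?)</tr>}gis) {
    my $tr = $1;
    # In list context with /g, the match returns every captured cell body.
    # The \s* on both sides drops the newlines around each value.
    my @cells = $tr =~ m{<td\b[^>]*>\s*(.*?)\s*</td>}gis;
    push @rows, \@cells;
}

print join('|', @$_), "\n" for @rows;
# prints:
# 1st|Location
# 1|IL
```

Each element of @rows is an array reference holding one table row, so $rows[1][0] is the first cell of the first data row.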

Have a go and if you get stuck come back to us with what you have and we'll try to help further.



Trojan.
 
I have never used WWW::Mechanize, but I have used LWP for this. I can't compare the two, but I know LWP will do it too.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
OK, thanks for the replies. I checked out both.

Keeping it simple, here is what I came up with, but I am still weak at the scraping and comparing part:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TableExtract;   # the module actually used below (HTML::parser was the wrong name)

my $url = 'http:testsite';
my $content = get($url);
die "Couldn't get $url" unless defined $content;

# Pick the table by a subset of its header cells; only these columns are returned
my $htex = HTML::TableExtract->new(headers => ['1st', '2nd', 'Date']);
$htex->parse($content);

# Examine all matching tables (tables() replaces the deprecated table_states())
foreach my $table ($htex->tables) {
    # print "Table (", join(',', $table->coords), "):\n";
    foreach my $row ($table->rows) {
        # print join(',', @$row), "\n";
        # $row is an array reference, so use $row->[N] for a single cell
        print $row->[0], $row->[1], $row->[2], "\n";
    }
}
On the site the table looks like the following:
Testing...
1st  Location  2nd  Defined  Date        AT  Test
1    IL        2    Yes                      done
2    IL        3    Yes      03/10/2012  3   not done
3    IL        4    Yes                  4   done
4    IL        5    Yes                      done


When I view the source, it has:
<p align="center">Testing...</p>

<p align="center">
<table align="center" width="60%" border="1">
<tr align="CENTER" class="title">
<td width="10%" class="title">1st</td>
<td width="20%" class="title">Location</td>
<td width="10%" class="title">2nd</td>
<td width="10%" class="title">Defined</td>
<td width="10%" class="title">Date</td>
<td width="30%" class="title">AT</td>
<td width="10%" class="title">Test</td>
</tr>


<tr align="CENTER">
<td width="10%">
1
</td>
<td width="20%">
IL
</td>
<td width="10%">
2
</td>
<td width="10%">
Yes
</td>
<td width="10%">
&nbsp;
</td>
<td width="30%">
&nbsp;
</td>
<td width="10%">
done
</td>
</tr>

<tr align="CENTER">
<td width="10%">
2
</td>
<td width="20%">
IL
</td>
<td width="10%">
3
</td>
<td width="10%">
Yes
</td>
<td width="10%">
03/10/2012
</td>
<td width="30%">
3
</td>
<td width="10%">
not done
</td>
</tr>
</table>
</p>

and perl script output looks like:

Table (1,1):

1



Â

and so on with newline and spacing..

The issues I see are the stray characters, and that where the date is empty there is another character instead of the null that I would need to pass to the query. What I need to do is check the first and second elements against the ones in the DB.

My thinking is to save every row[0] in one array, every row[1] in another, and so on, then loop over the arrays one element at a time and pass each to the query to check.

Any pointers? If I can get a dummy example or something based on the above code, I would appreciate it. Thanks.
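
A minimal sketch of that plan, assuming the rows arrive as array references the way $table->rows hands them back. The sample data, the clean_cell helper and the array names are made up for illustration. The stray character is most likely the non-breaking space that &nbsp; decodes to (character 0xA0), so the helper strips it along with ordinary whitespace and turns an empty cell into undef. It needs Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;

# Hypothetical rows, shaped like the result of $table->rows for the headers
# '1st', '2nd', 'Date': one array reference per row, cells still untrimmed.
my @rows = (
    [ "\n1\n", "\n2\n", "\n\xA0\n" ],
    [ "\n2\n", "\n3\n", "\n03/10/2012\n" ],
);

# Trim whitespace and non-breaking spaces; return undef for an empty cell.
sub clean_cell {
    my ($cell) = @_;
    return undef unless defined $cell;
    $cell =~ s/^[\s\xA0]+|[\s\xA0]+$//g;
    return length($cell) ? $cell : undef;
}

# One array per column, kept in step so index $i is the same row in each.
my (@first, @second, @date);
for my $row (@rows) {
    my ($c1, $c2, $c3) = map { clean_cell($_) } @$row;
    push @first,  $c1;
    push @second, $c2;
    push @date,   $c3;
}

for my $i (0 .. $#first) {
    printf "%s,%s,%s\n", $first[$i], $second[$i], $date[$i] // 'NULL';
}
# prints:
# 1,2,NULL
# 2,3,03/10/2012
```

If the values are then passed to the database through DBI placeholders, an undef bind value is sent as NULL, so the empty dates need no special casing in the query itself.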
 
In case someone is looking to get rid of whitespace: s/^\s+|\s+$//g; works great. Thanks for the replies, they helped me get started.
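
For anyone landing here later, a short sketch of that substitution applied to a whole row at once. The sample cells are made up. It handles ordinary whitespace; a non-breaking space left over from &nbsp; may need \xA0 added to the character class.

```perl
use strict;
use warnings;

# Hypothetical cells as they might come out of the parser, padded with newlines.
my @row = ("\n2\n", "\nIL\n", "\n  not done \n");

# The alternation matches twice per string (once at each end), so /g is needed;
# without it only the leading whitespace would be removed.
# "for" aliases $_ to each element, so the array is changed in place.
s/^\s+|\s+$//g for @row;

print join(',', @row), "\n";
# prints: 2,IL,not done
```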
 