How to extract hrefs from a table?

Status
Not open for further replies.

vvv (Programmer)
Jan 11, 2001
Can I extract hrefs from a table
using the modules TableExtract and LinkExtor?

Thanks.
 
Sorry, I don't know anything about those two modules. However, if you explain your question, I can try to give you a method to achieve it. You have an HTML table and you want to retrieve the links from it?
Sincerely,

Tom Anderson
CEO, Order amid Chaos, Inc.
 
Yes, you're right.

I've got an HTML document which contains tables:


<table .....
<td
<div
<a href="...."> TEXT DATA1 </a>
<a href="...."> TEXT DATA2 </a>
<a href="...."> TEXT DATA3 </a>
and so on

The CPAN module HTML::TableExtract makes it easy to extract only the cell text (TEXT DATA).
How can I quickly pull the links out of a given table?
 
A little pattern matching will do this fairly quickly, with fewer system resources than loading and using the modules, and it is pretty easy once you've played with pattern matching a little. Maybe this should be my next FAQ.

open(HTML, "<HTMLFILE_to_Open") or die "Failed to open file, $!\n";
while (<HTML>) { $buffer .= $_; }
close HTML;

# while we match <table ... some stuff... /table>, catch the table chunk in $&.
# do this in a 'while' in case there are multiple tables.
# <table>.....</table>
while ($buffer =~ /<table.*?\/table>/gis) # find and match all table chunks
{
$table = $&;
# inside the table chunk, catch each <a href......>....</a>,
# however many there are per cell.
while ($table =~ /(<a href.*?\/a>)/gis)
{
$href = $1;
print "$href\n"; # do something with what we caught.
}
}

I have not run this, but I think it is good. It might take a little tweaking to make it match the exact structure of the file you are trying to parse.

Hope this helps.


keep the rudder amid ship and beware the odd typo
 
I would change one thing:

<a href.*?

to:

<a.*?href.*?

You never know how many spaces these silly web designers put between their a and href, or whether they put href first at all.
adam@aauser.com
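For anyone who wants to try the looser match, here is a minimal sketch that goes one step further and captures the href value and the link text separately. The sample markup and the variable names are made up for illustration, and the pattern is only a rough guess at what real pages contain, so expect to tweak it.

```perl
use strict;
use warnings;

# Made-up sample cell: extra whitespace, an attribute before href, mixed case.
my $table = <<'HTML';
<td><div>
<a   href="one.html"> TEXT DATA1 </a>
<A class="x" HREF="two.html"> TEXT DATA2 </A>
<a
  href='three.html'> TEXT DATA3 </a>
</div></td>
HTML

my @links;
# <a, then any other attributes, then href = "value" (single or double quoted),
# then the rest of the tag, then the link text up to the closing </a>.
while ($table =~ /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1[^>]*>(.*?)<\/a>/gis) {
    my ($url, $text) = ($2, $3);
    $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    push @links, [$url, $text];
}

print "$_->[0] => $_->[1]\n" for @links;
```

Using [^>]* rather than .*? between <a and href keeps the match from running past the end of the tag, which matters when a page has anchors without any href.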
 
Yup. Good idea. I expect that there will need to be a few other tweaks when applied to the specific file structure vvv is parsing. But, thanks for the critique.


keep the rudder amid ship and beware the odd typo
 
I'm not sure how TableExtract works or what its functions return, but if you are using it to get the content of your cells, then you won't need to match the entire table as shown above, since the module will do that for you. Instead, just do the matching on the content returned by your TableExtract functions. For example, let's say that TableExtract::function() (a placeholder name) returned the content of one of your cells; then:

my $content = TableExtract::function();

while ($content =~ /(<a.*?href.*?\/a>)/gis)
{
my $href = $1;
print "$href\n"; # do something with what we caught.
}

Of course, if you understand the pattern matching code written by goBoating and you haven't yet committed yourself to using the module, then I would suggest using that instead since you will have more power and flexibility than with using a module.
Sincerely,

Tom Anderson
CEO, Order amid Chaos, Inc.
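For the module route asked about at the top of the thread, the following is a rough sketch rather than a tested recipe. It assumes an installed HTML::TableExtract that supports the keep_html option (so cells come back as raw HTML instead of text only) and HTML::LinkExtor from the HTML::Parser distribution. The sample document and the variable names are invented for illustration.

```perl
use strict;
use warnings;
use HTML::TableExtract;    # CPAN
use HTML::LinkExtor;       # CPAN, part of the HTML::Parser distribution

# Made-up sample document; one link sits outside the table on purpose.
my $html = <<'HTML';
<a href="outside.html">not in a table</a>
<table><tr><td><div>
<a href="one.html"> TEXT DATA1 </a>
<a href="two.html"> TEXT DATA2 </a>
</div></td></tr></table>
HTML

# keep_html => 1 asks TableExtract for the raw cell HTML, tags included.
my $te = HTML::TableExtract->new(keep_html => 1);
$te->parse($html);
$te->eof;

my @hrefs;
for my $ts ($te->tables) {
    for my $row ($ts->rows) {
        for my $cell (grep { defined } @$row) {
            # Fresh LinkExtor per cell; links() returns [tag, attr => url, ...].
            my $lx = HTML::LinkExtor->new;
            $lx->parse($cell);
            $lx->eof;
            for my $link ($lx->links) {
                my ($tag, %attr) = @$link;
                push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
            }
        }
    }
}

print "$_\n" for @hrefs;
```

The point of the two-step design is that TableExtract decides which HTML belongs to a table, and LinkExtor only ever sees that HTML, so links elsewhere on the page are left out.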
 