parsing issues 2

Status
Not open for further replies.

CristianLuca

Programmer
Oct 25, 2007
36
RO
How can I extract the root of a URL using a pattern match?

URL type:
<h2 class=r><a href=\"
I use this:
<h2 class=r><a href=\"http:\/\/(.+?)\/\"/

But it grabs a lot of things that I do not want.
For example: <h2 class=r><a href=\"http:
It matches this as correct. How can I make sure that the URL has the pattern " ?
xxx = {com, net, org, ...}

Thanks ,
Cristian
 
Do this using two steps.

First, extract the hrefs using HTML::LinkExtor, as travs suggests, or any other CPAN module such as HTML::Parser. Only then worry about the contents of each link.
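A minimal sketch of that first step, assuming HTML::LinkExtor is installed from CPAN. The sample markup and the helper name extract_hrefs are made up for illustration, since the real page is not shown in this thread.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical sample input; the real page markup is not shown in the thread.
my $html = <<'HTML';
<h2 class=r><a href="http://www.example.com/">Example</a></h2>
<h2 class=r><a href="http://www.example.org/some/page.html">Other</a></h2>
<img src="http://www.example.net/logo.png">
HTML

# Hypothetical helper: return the href of every <a> tag in an HTML string.
sub extract_hrefs {
    my ($content) = @_;
    my $parser = HTML::LinkExtor->new;   # no callback, so links are collected internally
    $parser->parse($content);
    $parser->eof;

    my @hrefs;
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;       # each entry is [tag, attr => url, ...]
        push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
    }
    return @hrefs;
}

print "$_\n" for extract_hrefs($html);
```

The parser deals with quoting and attribute order, so no regex has to guess where the href ends. Links from other tags (the img here) are skipped by the tag check.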

I suggest you use the URI class to process each href and determine the domain, if that's what you want to do.
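A rough sketch of the second step, assuming the URI module from CPAN. The allow-list stands in for the "com, net, org ..." set from the question, and allowed_host is a hypothetical helper name, not something from URI itself.

```perl
use strict;
use warnings;
use URI;

# Hypothetical allow-list standing in for the "com, net, org ..." set in the question.
my %allowed_tld = map { $_ => 1 } qw(com net org);

# Hypothetical helper: return the lowercased host of an http(s) URL
# if its last label is in the allow-list, otherwise undef.
sub allowed_host {
    my ($href) = @_;
    my $uri    = URI->new($href);
    my $scheme = $uri->scheme;
    return undef unless defined $scheme && $scheme =~ /^https?$/;

    my $host = $uri->host;               # undef for a truncated URL such as "http:"
    return undef unless defined $host && length $host;

    my ($tld) = $host =~ /\.([^.]+)$/;   # text after the last dot
    return undef unless defined $tld && $allowed_tld{ lc $tld };
    return lc $host;
}

for my $href ('http://www.example.com/', 'http:', 'http://www.example.xyz/') {
    my $host = allowed_host($href);
    print "$href => ", (defined $host ? $host : 'rejected'), "\n";
}
```

A truncated href like "http:" has no host, so it is rejected here instead of slipping through the way it does with the regex.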

Basically, you should always try to isolate your logic into its respective parts. In this case, the first step is processing HTML. The second step is processing URIs.

- Miller
 