parsing issues 2

CristianLuca · Jan 15, 2008

How can I extract a root URL using a pattern match?

URL type:
<h2 class=r><a href=\"http://www.something.com/"

I use this:
<h2 class=r><a href=\"http:\/\/(.+?)\/\"/

But it grabs a lot of things that I do not want. For example:
<h2 class=r><a href=\"http://www.axcassd.com/sjdlksajds/djfjsdflksdf/lkdsjfkd/"

It matches this as correct. How can I make sure that the pattern of the URL is "http://wwww.something.xxx/"?
xxx = {com,net,org ...}

Thanks,
Cristian
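
The pattern in the question over-matches because `(.+?)` is still allowed to cross `/`, so it keeps expanding until it reaches the final `/"`. A minimal regex-only sketch of a tighter check follows. It assumes the quote in the real HTML is a plain `"` (the `\"` above looks like string escaping), that the host starts with `www.`, and that the accepted top-level domains are just com, net and org. The names `@lines`, `$tld`, `$root_re` and `root_urls` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input, modelled on the two links in the question.
my @lines = (
    '<h2 class=r><a href="http://www.something.com/"',
    '<h2 class=r><a href="http://www.axcassd.com/sjdlksajds/djfjsdflksdf/lkdsjfkd/"',
    '<h2 class=r><a href="http://www.example.info/"',
);

# Assumed list of accepted top-level domains; extend as needed.
my $tld = qr/(?:com|net|org)/;

# [^/"]+ cannot cross a slash or the closing quote, so the URL has to end
# at the first slash after the host name.
my $root_re = qr{<h2 class=r><a href="(http://www\.[^/"]+\.$tld/)"};

# Return the captured root URL of every line that matches.
sub root_urls {
    my @found;
    for my $line (@_) {
        push @found, $1 if $line =~ $root_re;
    }
    return @found;
}

print "$_\n" for root_urls(@lines);
```

With this sample input only the first link is printed; the second is rejected for its extra path segments and the third for its top-level domain.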

travs69 · Jan 15, 2008

You might want to look at

http://search.cpan.org/~bdfoy/HTML-SimpleLinkExtor-1.19/lib/SimpleLinkExtor.pm

And are you sure you want http://wwww.?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;

MillerH · Jan 15, 2008

Do this using two steps.

First, extract the hrefs using HTML::LinkExtor, like travs suggests, or any other CPAN module such as HTML::Parser. Only then worry about the contents of the link.

I suggest you use the URI class to process each href and determine the domain, if that's what you want to do.

Basically, you should always try to isolate your logic into its respective parts. In this case, the first step is processing HTML. The second step is processing URIs.

- Miller
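
The two steps described above could look roughly like the sketch below. It assumes the HTML::LinkExtor and URI modules from CPAN are installed, that "root" means an http URL whose path is just `/` with no query string, and that the accepted top-level domains are com, net and org. The sample page and the names `@hrefs`, `is_root_url` and `@roots` are made up for illustration.

```perl
use strict;
use warnings;
use HTML::LinkExtor;   # from the HTML-Parser distribution on CPAN
use URI;               # from the URI distribution on CPAN

# Hypothetical sample page.
my $html = <<'HTML';
<h2 class=r><a href="http://www.something.com/">one</a></h2>
<h2 class=r><a href="http://www.axcassd.com/sjdlksajds/djfjsdflksdf/lkdsjfkd/">two</a></h2>
<h2 class=r><a href="http://www.example.info/">three</a></h2>
HTML

# Step 1: HTML processing. Collect the href of every <a> tag.
my @hrefs;
my $extor = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
});
$extor->parse($html);
$extor->eof;

# Step 2: URI processing. Accept only http links that point at a site root
# on one of the assumed top-level domains.
sub is_root_url {
    my ($href) = @_;
    my $uri = URI->new($href);
    return 0 unless defined $uri->scheme && $uri->scheme eq 'http';
    return 0 unless $uri->path eq '/' || $uri->path eq '';
    return 0 if defined $uri->query;
    return $uri->host =~ /\.(?:com|net|org)\z/i ? 1 : 0;
}

my @roots = grep { is_root_url($_) } @hrefs;
print "$_\n" for @roots;
```

Keeping the URL test in its own function means the rule for what counts as a root can change without touching the HTML parsing.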
