
RegEx to extract URLs in HTML

Status
Not open for further replies.

LucL

Programmer
Jan 23, 2006
Hey Guys,

Can anyone provide me with sample Java code to extract URLs (anything following href=, between quotes)? I've been messing with it all day and can't get it to work. All the code samples out there are for PHP.

I am using a Matcher, but it always comes back empty.

Thanks!
Luc L.
 
Hi

Regular expressions are quite similar in all languages (and tools too). If you port one of the code samples you found, the regular expression will be the easiest part.
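As a rough illustration of such a port, here is a minimal sketch using java.util.regex. The class name HrefFinder, the method name extract and the sample HTML are made up for this example. The expression is what a PHP sample would typically use; only the string escaping is Java-specific. A Matcher reports nothing until find() is called, and matches() tests the whole input rather than searching inside it, which is one common reason for an "empty" result. Note that this sketch picks up a double-quoted href on any tag, not only on anchors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefFinder {
    // Hypothetical pattern for this sketch: a double-quoted href value.
    // The backslashes only escape the quotes inside the Java string literal.
    private static final Pattern HREF =
        Pattern.compile("href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns every double-quoted href value found in the given HTML.
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        // find() advances to the next match; group(1) is the text between the quotes.
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com/\">one</a>\n"
                    + "<a class=\"x\" href=\"/two.html\">two</a></p>";
        for (String url : extract(html)) {
            System.out.println(url);
        }
    }
}
```

Run as is, this prints http://example.com/ and /two.html on separate lines.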

But because the market-leading browser has pampered amateurs and sloppy tools by rendering any mess as a web page, extracting an [tt]href[/tt] attribute with a regular expression may not be easy in certain circumstances.

Of course, if the HTML is yours and you know it is standards-compliant, then it is easy.

The code below is able to extract the URLs from this Tek-Tips page. For simplicity, the page source is assumed to be already loaded into the [tt]html[/tt] variable.
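Code:
// Split on closing anchor tags. The -1 limit keeps a trailing empty part,
// so the last link is not lost when the source ends with </a>.
String[] part = html.split("</[aA]\\s*>", -1);
for (int i = 0; i < part.length - 1; i++) {
  String tag = part[i]
      .replaceFirst("(?is).*<a\\b", "")  // drop everything up to the last <a
      .replaceFirst("(?s)>.*", "");      // drop everything after the tag's >
  // (?s) lets . cross line breaks; anchors without a quoted href are skipped.
  if (tag.matches("(?is).*href=\"[^\"]*\".*"))
    System.out.println(tag.replaceFirst("(?is).*href=\"([^\"]*)\".*", "$1"));
}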
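
The snippet above expects double-quoted values. Where the markup is less tidy, a sketch along the following lines may cope a little better: it looks only inside [tt]<a ...>[/tt] tags and also accepts single-quoted and unquoted values. The class name LooseHrefFinder, the method name extract and the sample input are assumptions made for this illustration, and it is still no substitute for a real HTML parser (it would, for instance, also accept an attribute such as data-href).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LooseHrefFinder {
    // Inside an <a ...> tag, capture an href value that is double-quoted (group 1),
    // single-quoted (group 2) or unquoted (group 3). [^>]* may span line breaks.
    private static final Pattern A_HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>\"']+))",
        Pattern.CASE_INSENSITIVE);

    // Returns the href value of every anchor tag that has one.
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = A_HREF.matcher(html);
        while (m.find()) {
            // Exactly one of the three groups is non-null for each match.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    urls.add(m.group(g));
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.html\">1</a><A HREF='b.html'>2</A>"
                    + "<a class=c href=d.html>3</a><link href=\"e.css\">";
        System.out.println(extract(html));
    }
}
```

With the sample input, the three anchor values are collected and the [tt]<link>[/tt] tag is ignored.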

Feherke.
 