find all links on a page

Status
Not open for further replies.

peacecodotnet

Programmer
May 17, 2004
123
US
You need a PCRE regular expression, which you then pass to preg_match_all().
Code:
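<?php
// Sample input: an anchor tag with other attributes around href.
$text = "<A name=\"whatever\" class=\"myclass\" href=\"here.com/test.html\" id=\"whatever\">";
// Match each <a ...> tag and capture the quoted href value in subpattern 1.
$pattern = '/<a[^>]*href\s*=\s*[\'"]([^\'"]*)[\'"][^>]*>/i';
preg_match_all($pattern, $text, $result);
// $result[0] holds the full matches, $result[1] the captured href values.
// Escape the dump so the browser shows the matched tag instead of rendering it.
echo "<pre>" . htmlspecialchars(print_r($result, true)) . "</pre>";
?>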
Pattern explanation:
Code:
/    => opening delimiter
<a   => literal left angle bracket plus the letter a
[    => begin character class definition
^    => negation: anything but the following chars
>    => literal right angle bracket
]    => end character class definition
*    => zero or more occurrences
href => literal href attribute
\s*  => zero or more whitespace characters
=    => literal equal sign
\s*  => zero or more whitespace characters
[    => begin character class definition
\'   => (escaped) single quote; the backslash keeps PHP from ending the single-quoted pattern string here
"    => double quote
]    => end character class definition
     => the char class means a single or a double quote
     => if you need to catch bad HTML with missing quotes, follow the class with an asterisk to allow zero or more occurrences
(    => begin capture subpattern - that's the thing we really want
[    => begin character class definition
^    => negation: anything but the following chars
\'   => (escaped) single quote; the backslash keeps PHP from ending the single-quoted pattern string here
"    => double quote
]    => end character class definition
*    => zero or more occurrences
)    => end capture subpattern
[    => begin character class definition
\'   => (escaped) single quote; the backslash keeps PHP from ending the single-quoted pattern string here
"    => double quote
]    => end character class definition
     => the char class means a single or a double quote
     => if you need to catch bad HTML with missing quotes, follow the class with an asterisk to allow zero or more occurrences
[    => begin character class definition
^    => negation: anything but the following chars
>    => literal right angle bracket
]    => end character class definition
*    => zero or more occurrences
     => this is anything else left inside the tag
>    => literal right angle bracket, the end of the tag
/    => end delimiter for pattern definition
i    => case insensitive

Patterns are like compositions in the world of music: there are many ways to write one. Hope this helps.
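
To make the shape of the result a bit more concrete, here is a minimal sketch of pulling just the URLs out of what preg_match_all() hands back. The sample markup and the variable names `$html`, `$count`, and `$links` are made up for illustration; only the pattern is the one from the post above.

```php
<?php
// Hypothetical sample markup; only the pattern comes from the post above.
$html = '<p><a href="one.html">One</a> and <A class="x" HREF=\'two.html\' id="y">Two</A></p>';
$pattern = '/<a[^>]*href\s*=\s*[\'"]([^\'"]*)[\'"][^>]*>/i';

// preg_match_all() returns the number of full matches (or false on error).
$count = preg_match_all($pattern, $html, $result);

// With the default ordering (PREG_PATTERN_ORDER), $result[0] lists every
// full tag match and $result[1] lists what the first subpattern captured.
$links = $result[1];

foreach ($links as $link) {
    echo $link, "\n";
}
```

If you would rather have one array per tag, passing PREG_SET_ORDER as the fourth argument groups the full match and its captures together instead.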
 
Ok, I don't understand that crap all too much, but this should work for my solution?
If so, Thanks so much. I'll try right now!
 
Depending on how much control you have over the pages you're reading, you may or may not want to make the single and double quotes around the href value optional, although the pattern then gets much trickier: when the quotes are missing, the value ends at whitespace or the closing bracket, so the capturing class becomes
Code:
 [^\s>] rather than [^'"]

Additionally, you can throw in a backreference to make sure that the opening and closing quotes match properly. For example, valid HTML (though it would never go anywhere) could be

Code:
<a href="www.go'ogle.com">
and, with the quote styles swapped (single-quoted attribute values are also valid HTML),
<a href='www.go"ogle.com'>

And the above pattern would return www.go for both of them.
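
A rough sketch of the backreference idea, assuming we keep the rest of the original pattern unchanged: the opening quote is captured in its own group, and `\1` insists on the same character to close the value. The variable names `$html` and `$urls` are just for illustration.

```php
<?php
// Group 1 captures the opening quote and \1 requires the same quote to close.
// Group 2 is the URL; the lazy .*? stops at the first matching closing quote.
$pattern = '/<a[^>]*href\s*=\s*([\'"])(.*?)\1[^>]*>/i';

// The two tags from the post above, placed in one string.
$html = '<a href="www.go\'ogle.com"> <a href=\'www.go"ogle.com\'>';

preg_match_all($pattern, $html, $result);

// Because the quote now occupies group 1, the URLs have moved to group 2.
$urls = $result[2];
print_r($urls);
```

With this variant both tags give back the whole value, stray quote included, instead of stopping at www.go.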

And of course, if you're looking at the average page you can expect to run into
<a href=www.google.com>
which that pattern would miss.
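
One possible way to also catch the unquoted case is to try three alternatives in turn: a double-quoted value, a single-quoted value, and finally a bare value that runs up to whitespace or the closing bracket. The sketch below uses PCRE's branch-reset group `(?|...)` so the URL lands in group 1 whichever alternative matched. The sample markup and variable names are assumptions for illustration, and this is not a substitute for a real HTML parser.

```php
<?php
// (?| ... ) is a branch-reset group: each alternative numbers its capture
// from 1, so the href value is always in group 1.
$pattern = '/<a[^>]*href\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s>"\']+))/i';

// Hypothetical markup mixing unquoted, double-quoted and single-quoted hrefs.
$html = '<a href=www.google.com>G</a> <a href="a.html">A</a> <a class=x href=\'b.html\'>B</a>';

preg_match_all($pattern, $html, $result);

$urls = $result[1];
print_r($urls);
```

One caveat with this sketch: because `[^>]*` is greedy, a tag with a second attribute whose name ends in href (say, a data-href) would have that later value picked up instead, so treat it as a starting point.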

I only mention that because this question often comes from people writing link checkers and error finders, so I thought it might be pertinent.
 