Regular expression

Status
Not open for further replies.

us111

Programmer
Feb 18, 2005
2
LU
Hello,

I'm looking for a way to parse URLs.
I have a web page whose content I can read, and I'd like to
retrieve all the URLs from it.

1. <a href="a_link#a_spefific_anchor" ......>this is a test</a>

The result I want:
- the link
- the anchor
- the text

2. <a href="a_link#a_spefific_anchor" ......><img src="mypicture" ........... /></a>

The result I want:
- the link
- the anchor
- the picture link

Any ideas?
Many thanks.
 
Good Day,

Here is a rough start for #1:

Code:
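# The double quotes inside the anchor are replaced with single quotes
set anchor "<a href='a_link#a_spefific_anchor'abcasdad>this is a test</a>"

# whole = full match, href = the opening "<a href='" prefix,
# link = part before '#', sp_anchor = part after '#', text = the link text
regexp -nocase {(<a href=')([a-z0-9_]*)#([a-z0-9_]*)'[a-z0-9_]*>([ a-z0-9_]*)</a>} $anchor whole href link sp_anchor text

# Show the captured link text
set text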

-- Dan
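
Case #2 from the question (an image inside the anchor) can be handled along the same lines. The following is only a sketch under assumptions: the elided attributes in the question are replaced by made-up ones (class, alt), the href is assumed to come first and to be double-quoted, and the variable names pat and src are illustrative, not from the thread.

```tcl
# Assumed sample input: the elided parts of the question are filled
# with made-up attributes (class, alt) for illustration only.
set anchor {<a href="a_link#a_spefific_anchor" class="x"><img src="mypicture" alt="pic" /></a>}

# First group: link up to the hash sign. Second group: the anchor name.
# Third group: the value of the img src attribute.
set pat {<a\s+href="([^"#]*)#([^"]*)"[^>]*>\s*<img\s+[^>]*src="([^"]*)"[^>]*>\s*</a>}

if {[regexp -nocase $pat $anchor whole link sp_anchor src]} {
    puts "link:    $link"
    puts "anchor:  $sp_anchor"
    puts "picture: $src"
}
```

A regular expression like this breaks easily on real-world HTML (attributes in a different order, unquoted values, nested tags), so treat it as a starting point for well-behaved input only.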
 
Thanks, it seems to work for one anchor, but what if I want to retrieve all the URLs?


set anchor "something blabla....<a href='a_link#a_spefific_anchor'abcasdad>this is a test</a>something blabla........<a href='a_link2#a_spefific_anchor2'abcasdad>this is a test</a>something blabla........ "

I would like to get a list of all the URLs.
 
The way to do this, IMHO, is the common approach of advancing the string index toward the end of the line while checking for the pattern at each step:
Code:
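   # $pat must contain at least one capture group so that "out" is set
   while {[gets $fd line] > -1} {
       # -indices stores {first last} index pairs instead of the matched text
       while {[regexp -indices $pat $line all out]} {
           puts [string range $line [lindex $out 0] [lindex $out 1]]
           # Drop everything up to the end of this match, then search again
           set line [string range $line [expr {[lindex $all 1] + 1}] end]
       }
   }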

Not tested and I don't have time to do other than conceptualize.
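
As an alternative to advancing through the string by hand, regexp can collect every match in one call through its -all and -inline switches. This is a minimal sketch, assuming the single-quoted sample string from the earlier post; the pattern is a loosened variant of Dan's, and the names html, pat and urls are made up for the example.

```tcl
set html "something blabla....<a href='a_link#a_spefific_anchor'abcasdad>this is a test</a>something blabla........<a href='a_link2#a_spefific_anchor2'abcasdad>this is a test</a>something blabla........ "

# Same idea as the single-anchor pattern, with looser character classes
set pat {<a href='([^'#]*)#([^']*)'[^>]*>([^<]*)</a>}

# -all -inline returns a flat list: the whole match followed by each
# sub-match, repeated once per occurrence found in the string.
set urls {}
foreach {whole link sp_anchor text} [regexp -all -inline -nocase $pat $html] {
    lappend urls [list $link $sp_anchor $text]
}
puts $urls
```

Each element of urls is a three-item list (link, anchor, text), so a further foreach over urls gives access to the individual parts.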
 