Regular expression

us111 · Feb 18, 2005

Hello

I'm looking for a way to parse URLs.
I have a web page. I can read its content, but I'd like to
retrieve all the URLs in it.

1. <a href="a_link#a_spefific_anchor" ......>this is a test</a>

the result:
- the link
- the anchor
- the text

2. <a href="a_link#a_spefific_anchor" ......><img src="mypicture" ........... /></a>

the result:
- the link
- the anchor
- the picture link

Any ideas?
Many thanks

ddrillich · Feb 20, 2005

Good Day,

Here is a rough start for #1:

Code:

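# The double quotes inside the tag are written as single quotes here,
# so the whole tag can sit inside a double-quoted Tcl string.
set anchor "<a href='a_link#a_spefific_anchor'abcasdad>this is a test</a>"

# Match variables: whole match, the <a href=' prefix, link, anchor name, link text.
regexp -nocase {(<a href=')([a-z0-9_]*)#([a-z0-9_]*)'[a-z0-9_]*>([ a-z0-9_]*)</a>} $anchor whole href link sp_anchor text

# Show the captured link text.
set text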
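
For case #2 from the original question, the same idea might be extended along these lines. This is only a sketch: the `class` and `alt` attributes are made-up placeholders for the elided attributes, the variable names `pat` and `picture` are mine, and it assumes single-quoted attribute values with `href` as the first attribute of the `<a>` tag.

```tcl
# Sample input for case #2. The class and alt attributes are hypothetical
# stand-ins for the attributes elided in the question.
set anchor "<a href='a_link#a_spefific_anchor' class='x'><img src='mypicture' alt='pic' /></a>"

# Group 1: the link, everything before the hash sign.
# Group 2: the anchor name, everything up to the closing quote.
# Group 3: the src value of the img tag nested inside the link.
set pat {<a\s+href='([^'#]*)#([^']*)'[^>]*>\s*<img[^>]*\ssrc='([^']*)'[^>]*>\s*</a>}

if {[regexp -nocase $pat $anchor whole link sp_anchor picture]} {
    puts "link:    $link"
    puts "anchor:  $sp_anchor"
    puts "picture: $picture"
}
```

Using negated character classes such as `[^']*` instead of `[a-z0-9_]*` lets the values contain dots, slashes and other characters that real URLs usually have.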

-- Dan

us111 · Feb 21, 2005

Thanks, it looks like it works for one anchor, but what if I want to retrieve all the URLs?

set anchor "something blabla....<a href='a_link#a_spefific_anchor'abcasdad>this is a test</a>something blabla........<a href='a_link2#a_spefific_anchor2'abcasdad>this is a test</a>something blabla........ "

I would like to get a list of all the URLs.

marsd · Feb 21, 2005

The way to do this, IMHO, is the common technique of advancing the string index to the end of the line while checking for the pattern.

Code:

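   while {[gets $fd line] > -1} {
          # $pat needs one capture group; -indices stores first/last index pairs in all and out
          while {[regexp -indices $pat $line all out]} {
                 puts [string range $line [lindex $out 0] [lindex $out 1]]
                 # Keep only the text after the whole match, then search again
                 set line [string range $line [expr {[lindex $all 1] + 1}] end]
          }
   }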

Not tested, and I don't have time to do more than sketch the concept.
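
To make the idea concrete, here is a self-contained sketch run against the sample string posted earlier in the thread. It shows two variants: moving a start offset forward with `regexp -start`, and letting `regexp -all -inline` collect every match in one call. The pattern and the names `html`, `pat`, `slice`, `found` and `found2` are my own assumptions, not something from the posts above, and the pattern only handles single-quoted `href` values that contain a `#`.

```tcl
# Sample input from the thread (single quotes inside the Tcl string).
set html "something blabla....<a href='a_link#a_spefific_anchor'abcasdad>this is a test</a>something blabla........<a href='a_link2#a_spefific_anchor2'abcasdad>this is a test</a>something blabla........ "

# Three capture groups: link, anchor name, link text.
set pat {<a\s+href='([^'#]*)#([^']*)'[^>]*>([^<]*)</a>}

# Hypothetical helper: return the part of s covered by a first/last index pair.
proc slice {s idx} {
    return [string range $s [lindex $idx 0] [lindex $idx 1]]
}

# Variant 1: advance a start offset past each whole match.
# With -start, the reported indices still count from the beginning of the string.
set found {}
set start 0
while {[regexp -nocase -indices -start $start $pat $html all iLink iAnchor iText]} {
    lappend found [list [slice $html $iLink] [slice $html $iAnchor] [slice $html $iText]]
    set start [expr {[lindex $all 1] + 1}]
}

# Variant 2: -all -inline returns a flat list holding, for each match,
# the whole match followed by the three groups.
set found2 {}
foreach {whole link sp_anchor text} [regexp -all -inline -nocase $pat $html] {
    lappend found2 [list $link $sp_anchor $text]
}

foreach item $found {
    puts "link=[lindex $item 0] anchor=[lindex $item 1] text=[lindex $item 2]"
}
```

Both variants should end up with the same two entries for this input. The second is shorter; the first is closer to the loop above and is handy when you also need the position of each match.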
