REGEX multiple matches in one line

Status
Not open for further replies.

NetworkGhost

I wrote a script to fetch Google results. When I parse the file, it seems Google does not include carriage returns, so I need a regex that pulls all the links out of the file. If there are multiple matches in one line, I need every occurrence of the string.

This is what I was working with:

regexp {a.*craig[^>]*>(.*?)</a>} $line url1

I have tried a few things, like outputting to more variables, but that doesn't seem to work. Can regex do this for me? Any help is appreciated. The other option I thought of is to split the results at every hyperlink; I will try that next.

 
I'm sure you can form a regular expression that does what you want. What I don't know is what you want. What do you want regexp to find? That is, what are the characteristics of the strings you're looking for?

_________________
Bob Rashkin
 
What I did was a wget of a Google search. The information returned doesn't have carriage returns, so I get one big line like the one below:

<a href="list.org&um=1&ie=UTF-8&sa=N&tab=wb">Blogs</a></span> <span class=gb2><div></div></a></span> <span class=gb2><a href="nt=firefox-a&rls=org.mozilla:en-US:official&q=Cisco+ASA+cpg+2008-03-02+site:craigslist.org&um=1&ie=UTF-8&sa=N&tab=w1">YouTube</a></span> <span class=gb2><a hr
ef=" <span class=gb2><a href="hl=en&client=firefox-a&rls=org.mozilla:en-US:official&q=Cisco+ASA+cpg+2008-03-02+site:craigslist.org&um=1&ie=UTF-8&sa=N&tab=wq">Photos</a></span> <span class=
gb2><a href=" <span class=gb2><a href="firefox-a&rls=org.mozilla:en-US:official&q=Cisco+ASA+cpg+2008-03-02+site:craigslist.org&um=1&ie=UTF-8&sa=N&tab=wy">Reader</a></span> <span class=gb2><div></di
v></a></span> <span class=gb2><a href=" more &raquo;</a></span> </nobr></div><div class=gbh style=left:0></div><di
v class=gbh style=right:0></div><div align=right id=guser style="font-size:84%;padding:0 0 4px" width=100%><nobr><a href="n?continue=253Acraigslist.org%26btnG%3DSearch&hl=en">Sign in</a></nobr></div><table class=tb style=clear:left width=100%><tr><form name=gs method=GET action=/search><td
class=tc valign=top><a id=logo href=" title="Go to Google Home">Google<span></span></a></td><td style="padding:0 0 7px;paddi
ng-left:8px" valign=top width=100%><table class=tb style=margin-top:25px><tr><td class=tc nowrap><input type=hidden name=hl value="en"><input type=hidden nam
e=client value="firefox-a"><input type=hidden name=rls value="org.mozilla:en-US:official"><input type=hidden name=hs value="4uf"><input type=text name=q size=
49 maxlength=2048 value="Cisco ASA cpg 2008-03-02 site:craigslist.org" title="Search"> <input type=submit name="btnG" value="Search"></td><td class=tc nowrap
width=100%><span id=ap>&nbsp;&nbsp;<a href=/advanced_search?q=Cisco+ASA+cpg+2008-03-02+site:craigslist.org&hl=en&client=firefox-a&rls=org.mozilla:en-US:offici
al&hs=4uf>Advanced Search</a><br>&nbsp; <a href=/preferences?q=Cisco+ASA+cpg+2008-03-02+site:craigslist.org&hl=en&client=firefox-a&rls=org.mozilla:en-US:offic
ial&hs=4uf>Preferences</a></span></td></tr></table></td></tr></form></table><table border=0 cellpadding=0 cellspacing=0 width=100% class="t bt"><tr><td nowrap
><span id=sd>&nbsp;Web&nbsp;</span></td><td align=right nowrap><font size=-1>Results <b>1</b> - <b>1</b> of <b>1</b> from <b>craigslist.org</b> for <b>Cisco A
SA cpg 2008-03-02</b>. (<b>0.08</b> seconds)&nbsp;</font></td></tr></table> <div id=res><!--a--><div><div class=g><!--m--><h2 class=r><a href="craigslist.org/sby/cpg/593782938.html" class=l onmousedown="return clk(this.href,'','','res','1','')">need help setting up a <b>cisco asa</b> 5505</a><


What I want to do is pull all hyperlinks from that line and then create a list.

 
So let's say you want a list of strings, each element of which is the substring lying between "<a href=" and ">", right?

I'm really no expert in regular expressions (far from it), so maybe someone else will chime in.

Let's say you have a string:
<a href="abc.123.bbb all 4"> lkjlkj; aii3 aldjkl;j;<a href="abc.123.bbb all 4">
lkjlkj; aii3 aldjkl;j;<a href="abc.123.bbb all 4"> lkjlkj; aii3 aldjkl;j;<a href
="abc.123.bbb all 4"> lkjlkj; aii3 aldjkl;j;<a href="abc.123.bbb all 4"> lkjlkj;
aii3 aldjkl;j;<a href="abc.123.bbb all 4"> lkjlkj; aii3 aldjkl;j;<a href="abc.1
23.bbb all 4"> lkjlkj; aii3 aldjkl;j;<a href="abc.123.bbb all 4"> lkjlkj; aii3 a
ldjkl;j;


If we do the following substitution:
regsub -all {(a href=")([^"]*)(\">)} $a "&\n" b
then "b" will now be:
<a href="abc.123.bbb all 4">
lkjlkj; aii3 aldjkl;j;<a href="abc.123.bbb all 4">
lkjlkj; aii3 aldjkl;j;<a href="abc.123.bbb all 4">
lkjlkj; aii3 aldjkl;j;<a href="abc.123.bbb all 4">
lkjlkj; aii3 aldjkl;j;<a href="abc.123.bbb all 4">
lkjlkj; aii3 aldjkl;j;<a href="abc.123.bbb all 4">
lkjlkj; aii3 aldjkl;j;<a href="abc.123.bbb all 4">
lkjlkj; aii3 aldjkl;j;<a href="abc.123.bbb all 4">
lkjlkj; aii3 aldjkl;j;
Not quite what you want but there's now one link per line.
Now for each line (foreach line [split $b \n] {...}), we can take just the part from "<" to ">":
set i1 [string first < $line]
set i2 [string first > $line]
set subline [string range $line $i1 $i2]


Now you can do the same thing with the quotes:
set i3 [string first {="} $subline]
set i4 [string first {">} $subline]
incr i3 2;#to get past the quote
incr i4 -1;#to stop just before the closing quote (string range is inclusive)
set subline2 [string range $subline $i3 $i4]
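
Putting those pieces together, the whole approach could be sketched roughly as below. The sample string and the variable name links are made up for illustration, and the guards that skip lines without a complete tag are an addition, not part of the steps above.

```tcl
# Sample input shaped like the example string (hypothetical data).
set a {<a href="abc.123.bbb all 4"> lkjlkj;<a href="def.456.ccc"> aii3;<a href="ghi.789"> tail}

# Put a newline after every opening anchor tag, so each line holds at most one link.
regsub -all {(a href=")([^"]*)(\">)} $a "&\n" b

set links {}
foreach line [split $b \n] {
    set i1 [string first < $line]
    set i2 [string first > $line]
    # Skip lines that hold no complete tag (such as trailing text).
    if {$i1 < 0 || $i2 < $i1} continue
    set subline [string range $line $i1 $i2]

    set i3 [string first {="} $subline]
    set i4 [string first {">} $subline]
    if {$i3 < 0 || $i4 < 0} continue
    incr i3 2    ;# first character after the opening quote
    incr i4 -1   ;# last character before the closing quote
    lappend links [string range $subline $i3 $i4]
}
puts [join $links \n]
```

Here links ends up as a proper Tcl list, so a link containing spaces stays one element.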




_________________
Bob Rashkin
 
If you have egrep available you can also try something like this:
Code:
egrep -o 'href="[^"]*"' /path/to/linkfile
and see if it does what you want.

HTH
 
This may be fewer steps (maybe not).
1. strip off anything before the first link
2. build the regular expression to find the links (between the quotes)
3. using regsub, get those links only, separated by "|" (I can't get it to insert the linefeed).
4. split that string on "|"
5. join that list with "\n"

starting with a string, a:
aaa <a href="xyz.abc.com/fgh ?abc">1a2b 3c4d</a> g6j7k9aaa <a href="xyz.abc.com/
fgh ?abc">1a2b 3c4d</a> g6j7k9aaa <a href="xyz.abc.com/fgh ?abc">1a2b 3c4d</a> g
6j7k9aaa <a href="xyz.abc.com/fgh ?abc">1a2b 3c4d</a> g6j7k9
Code:
regsub {[^<]*} $a {} b; #1
set re {(<a href=\")([^\"]*)(\">[^<]*</a>[^<]*)}; #2
regsub -all $re $b {\2 |} c; #3
set lstc [split $c |]; #4
set strc [join $lstc \n]; #5

after #1, b is now:
<a href="xyz.abc.com/fgh ?abc">1a2b 3c4d</a> g6j7k9aaa <a href="xyz.abc.com/fgh
?abc">1a2b 3c4d</a> g6j7k9aaa <a href="xyz.abc.com/fgh ?abc">1a2b 3c4d</a> g6j7k
9aaa <a href="xyz.abc.com/fgh ?abc">1a2b 3c4d</a> g6j7k9

after #3, c is now:
xyz.abc.com/fgh ?abc |xyz.abc.com/fgh ?abc |xyz.abc.com/fgh ?abc |xyz.abc.com/fg
h ?abc |

after #4, lstc is now:
{xyz.abc.com/fgh ?abc } {xyz.abc.com/fgh ?abc } {xyz.abc.com/fgh ?abc } {xyz.abc
.com/fgh ?abc } {}

after #5, strc is now:
xyz.abc.com/fgh ?abc
xyz.abc.com/fgh ?abc
xyz.abc.com/fgh ?abc
xyz.abc.com/fgh ?abc
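
On the linefeed in step 3: one possible cause (a guess, since the failed attempt isn't shown) is that a substitution spec written in braces, like {\2\n}, never has its \n turned into a newline by Tcl. A minimal sketch with the spec in double quotes instead, using a made-up sample string:

```tcl
# Sample string shaped like the example above (hypothetical data).
set a {aaa <a href="xyz.abc.com/fgh ?abc">1a2b 3c4d</a> g6j7k9aaa <a href="xyz.abc.com/ijk">5e6f</a> g6j7k9}

# Step 1: strip anything before the first link.
regsub {[^<]*} $a {} b

# Step 2: same regular expression as above.
set re {(<a href=\")([^\"]*)(\">[^<]*</a>[^<]*)}

# Step 3: inside double quotes Tcl converts \n to a newline,
# while \\2 reaches regsub as \2 (the second capture group).
regsub -all $re $b "\\2\n" c

# Drop the newline left after the last link.
set strc [string trimright $c \n]
puts $strc
```

With this variant the split and join on "|" would no longer be needed.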

_________________
Bob Rashkin
 
Obviously steps 4 and 5 can be combined:
set strc [join [split $c |] \n]

_________________
Bob Rashkin
 
This does what you seem to be looking for
Code:
set mylinks [regexp -all -inline {<a .+?>(.*?)</a>} $line]
foreach {tag cont} $mylinks {
  # string first returns -1 when 'craig' is absent
  if {[string first craig $tag] >= 0} {
    # ... process $cont here ...
  }
}
That is, it returns the content of each hyperlink that contains 'craig' anywhere in the whole hyperlink. You first need to extract all the links, then filter out those that do not contain the searched string; I can't see a way of directly obtaining only those.
Note that the list mylinks will inevitably also contain the full matched expressions, so the contents of the hyperlinks occupy every 2nd item in the list (something to remember when parsing the list, but also useful for filtering).
Note that your string seems to be malformed: there is at least one stray </a>.
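
If the URL itself is wanted along with the link text, a variation on the same idea might look like the sketch below. It assumes the href values are double-quoted; the sample line, its URLs and the variable names matches and result are invented for illustration. All quantifiers are kept non-greedy on purpose, since mixing greedy and non-greedy quantifiers in one Tcl regular expression can give surprising matches.

```tcl
# Hypothetical sample line; URLs and link text are made up.
set line {<a href="http://www.google.com/help">Help</a> <a href="http://sfbay.craigslist.org/sby/cpg/1.html" class=l>need help</a> <a href="http://example.com/x">Other</a>}

# Capture the href value and the link text of every anchor.
set matches [regexp -all -inline {<a [^>]*?href="([^"]*?)"[^>]*?>(.*?)</a>} $line]

set result {}
# -inline returns a flat list: the full match, then one element per capture group.
foreach {tag url cont} $matches {
    # Keep only links whose URL contains 'craig'.
    if {[string first craig $url] >= 0} {
        lappend result [list $url $cont]
    }
}
puts $result
```

Each element of result is a two-item list (URL, then link text), so it can be walked with foreach and lassign or lindex.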

Franco
 