Complicated regex's

Status
Not open for further replies.

vcherubini

Programmer
May 29, 2000
527
US
Hello:

I have a question about a complicated regex that I cannot seem to get working.

Here is the code:

[tt]
#!usr/bin/perl

open(HTML, "link.html") || die ("can't open file: $!");
@links = <HTML>;
close(FILE);

@links = m/<A[^>]+?HREF\s*=\s*["']?([^'" >]+?)[ '"]?>/sig;

foreach $link (@links) {

print "$link\n";

}
[/tt]

And here is the HTML file that it reads from:

[tt]

<html>
<head>
<title>this is some links</title>
</head>
<body>
<A HREF="link1.html">blah</A>
<A HREF="link1.html">blah2</A>
<A HREF="link1.html">blah3</A>
<A HREF="link1.html">blah4</A>
<A HREF="link1.html">blah5</A>
</body>
</html>
[/tt]

The regex should parse out all of the A HREFs in the HTML, but whenever I run it, I don't get anything, just a blank screen.

Any help on how this works is appreciated.

Thank you.


-Vic
 
The main problems I see with your code are:
1 - The match statement lacks a tilde (~) after the equal sign. Consequently, the match is not bound to your data. It runs against $_ (which holds nothing here), finds nothing, and that empty result is assigned to @links, overwriting the lines you just read.

2 - I think the match needs to go inside your loop, or you could change the match to a replacement: $str =~ s/find_this/replace_with_this/gis;
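
To make point 1 concrete, here is a minimal sketch of the difference between "=" and "=~". The sample lines and the variable names (@lines, $pattern, @wrong, @right) are made up for illustration and the pattern is a simplified one, not the exact regex from the question. $_ is set to an empty string only to keep the warnings quiet.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the lines read from link.html.
my @lines = (
    '<A HREF="link1.html">blah</A>',
    '<A HREF="link2.html">blah2</A>',
);
my $pattern = qr/<A\s+HREF\s*=\s*["']?([^'" >]+)/i;

# With a plain "=", m// is matched against $_ (not against @lines) and
# whatever it captures is assigned to the array. $_ is empty here, so
# nothing matches and the array ends up empty.
local $_ = '';
my @wrong = m/$pattern/g;

# With "=~", the match is bound to the string on the left-hand side.
my @right;
for my $line (@lines) {
    push @right, $1 if $line =~ $pattern;
}

print scalar(@wrong), " link(s) without =~\n";
print scalar(@right), " link(s) with =~: @right\n";
```

The first print reports 0 links and the second reports both file names.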

This works,
Code:
#!/usr/local/bin/perl -w
open(HTML, "link.html") || die "can't open file: $!";
@links = <HTML>;
close(HTML);

foreach $link (@links) 
	{
	# test the match itself, so $1 and $2 are only used when it succeeded
	if ($link =~ /<a href\s*=\s*"(.*?)">(.*?)<\/a>/i)
		{ print "LINK - $1\n\tLabel - $2\n\n"; }
	}

'hope this helps.
 
One more thing: the way I wrote that, you will only match one link in each line from your input file. If that needs to be addressed, we can get a little trickier.
 
goBoating:

Thanks a ton for that. You definitely get my vote for TipMaster. Thanks a lot.

Now, like you said, that will only address one line of the HTML code from the page. I know it will be extremely tricky to do it to all lines, but if you have the time, could you please tell me how to do so?

Thanks a bunch for your help.

-Vic
 
Humbly submitted,
Remembering that there's always more than one way to do it......

My favorite way to catch all occurrences of a pattern in a stream is to use an evaluating replacement....huh?

In a simple match, you have a string and a pattern you want to match.... like this....
Code:
$str = 'some words that might contain a pattern to find';
if ($str =~ /a pattern/) { print "Found a pattern\n"; }

To do a replacement......
Code:
$str = 'some words that might contain a pattern1 to find';
$str =~ s/pattern1/pattern2/;
# now the word 'pattern1' has been changed to 'pattern2'.
# as written, the replacement above finds and replaces only the first occurrence.

$str =~ s/pattern1/pattern2/gs;
# the 'g' says do it globally and the 's' lets '.' match newlines, so a pattern can span line boundaries.
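
For anyone who wants to see the flags side by side, here is a small sketch. The sample text and the variable names ($text, $first_only, $all, $count, $spans_lines, $single_line) are invented for the example.

```perl
use strict;
use warnings;

my $text = "pattern1 here\nand pattern1 there";

# Without /g only the first occurrence is replaced.
(my $first_only = $text) =~ s/pattern1/pattern2/;

# With /g every occurrence is replaced; s/// returns the number of replacements.
my $all   = $text;
my $count = ($all =~ s/pattern1/pattern2/g);

# /s only changes what "." matches: with it, ".*" can run across the newline.
my $spans_lines = ($text =~ /here.*there/s) ? 1 : 0;
my $single_line = ($text =~ /here.*there/)  ? 1 : 0;

print "$first_only\n---\n$all\n---\n";
print "replacements: $count, with /s: $spans_lines, without /s: $single_line\n";
```

Note that /s makes no difference to a pattern that contains no "." at all, so it only matters for patterns like the (.*?) captures used elsewhere in this thread.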

A little further......
you can do a replacement that evaluates the right side and uses a subroutine to supply the replacement text....
Code:
# read the entire file into a var
open(HTML,"<link.html") or die "$!";
while (<HTML>) { $str .= $_; }
close HTML;

$str =~ s/(pattern1)/&getNewString($1)/egs;
# the 'e' says evaluate the replacement, then use it.

sub getNewString 
{
my $var = $_[0];
# $var is now 'pattern1'
print $var; # change this to print to some previously opened output file.
return('pattern2');
}

more to the point....
Code:
# read the entire file into a var
open(HTML,"<link.html") or die "$!";
while (<HTML>) { $str .= $_; }
close HTML;

$str =~ s/<A HREF=["'](.*?)['"]>(.*?)<\/A>/&catchParts($1,$2)/egis;

sub catchParts
{
my $link = $_[0];
my $label = $_[1];
print "LINK - $link and LABEL - $label\n";
return('replace_string_is_not_important_here');
}


So, you can see that we can pass each occurrence of a pattern into the subroutine. This is probably overkill for some situations, but after playing with this trick a little, I find myself using it more and more. In your situation, we really don't care about what we pass back to the replace statement, only that the subroutine receives each occurrence of the wanted pattern. At that point, we can do anything we want with it. I'm sure there must be a more concise way to do this trick, but I find so much utility in this approach that I keep going back to it. If any of this does not make sense, please ask.
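
One possibly more concise variant, offered only as a sketch: a while loop around a /g match in scalar context visits every occurrence and leaves the string unmodified. The sample HTML below is a made-up stand-in for the contents of link.html, @found is a name invented for the example, and the pattern is slightly loosened to allow extra whitespace. Like any regex approach, it assumes simple, well-behaved anchor tags.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the contents of link.html read into one string.
my $str = <<'HTML';
<body>
<A HREF="link1.html">blah</A> <A HREF='link2.html'>blah2</A>
<a href="link3.html">blah3</a>
</body>
HTML

my @found;
# In scalar context, each pass of a /g match resumes where the last one ended,
# so several links on one line are all picked up.
while ($str =~ /<A\s+HREF\s*=\s*["'](.*?)["']\s*>(.*?)<\/A>/gis) {
    push @found, [$1, $2];
    print "LINK - $1 and LABEL - $2\n";
}
```

Each entry in @found holds one link and its label, so the results can be processed later instead of inside the loop.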

'hope this helps....
 
goBoating:

Thank you so much for the help. You don't know how much it does help, really.

Yeah, as always, we have TMTOWTDI in the wonderful language of Perl.

Thanks for all the help.

-Vic
 
Also:

I looked at that in school during class, so I didn't have time to look at it in a lot of detail, but as soon as I get home, I will.

Thanks again for the help.
-Vic
 