Boolean operators in PERL regexp?

Status
Not open for further replies.

lenrobert

Programmer
Feb 24, 2005
13
GB
I am aware of OR ( | ), but is there a logical NOT in the Perl regex syntax?

The task is the following: extract the relative links (i.e. the href attribute of the "a" element) from an HTML file, even if the value is not enclosed in quotation marks. This means I don't want to retrieve hyperlinks beginning with /, # or javascript:

I would like to express the following pattern and capture (or extract) the content of the second parenthesis.

( <a href=" OR <a href=) THEN NOT(/ OR # OR javascript: OR \s OR " ) THEN ( \s OR " )

The best regexp I could do is this, but it does not handle the case of / # javascript: etc.

/(<a href="|<a href=)([^"]*?)(\s|")/gi

Does anyone know the answer and can help me? Thanks in advance,

Robert
 
Please post an example of your input and your expected output.

 
You can read up on `negative lookaheads' in perlretut (google for it or type `perldoc perlretut' on your command line). Here's a simplified example of its use:
Code:
my $string = '<a href="blah.html">';
if ( $string =~ /<a href="(?!#|\/|javascript|\s)([^"]+)">/ ) {
   print "Matched $1\n";
}
else {
   print "No match\n";
}
 
Ishnid, thank you for your answer. It works, but it doesn't handle the case where there are no quotation marks, which is often the case on Google pages, for example. I have read about lookahead and lookbehind assertions; the only very big problem is that they are non-capturing.

Mikevh, here is some code. The expected output is specified in the comments.

The / and # cases are OK, but if I want to keep the elegance of a single regexp, I have to sacrifice either the javascript: case or the case where there are no quotation marks.

Code:
my @strings;

# These should not match
$strings[0] = '<a href="/slash.htm">';
$strings[1] = '<a href="#name">';
$strings[2] = '<a href="javascript:func()">';

# These should match
$strings[3] = '<a href=noquotemark.htm>';
$strings[4] = '<a href="quotemark.htm">';

my $i = 0;

# This is just a title row for the output
printf "%-6s %-13s %-20s %s", "ROW", "Y/N", "\$2", "\$3\n\n"; 

for my $element (@strings)
  {
    # Test the match result directly instead of inspecting $2 afterwards.
    if ( $element =~ /(<a href="|<a href=)([^"\/#]+)(\s|"|>)/i )
      {
       printf "%-20s %-20s %-20s %s", "Row #$i Match", $2, $3, "\n";
      }
    else 
      {
        print "Row #$i No match\n";
      }
  $i++;
  }
 
I am sure this could be done with a single regexp, but with two it is pretty easy to accomplish:

Code:
# These should not match
$strings[0] = '<a href="/slash.htm">';
$strings[1] = '<a href="#name">';
$strings[2] = '<a href="javascript:func()">';

# These should match
$strings[3] = '<a href=noquotemark.htm>';
$strings[4] = '<a href="quotemark.htm">';
$strings[5] = '<a href="">';# <--added an unusual string for testing

my $i = 0;

# This is just a title row for the output
printf "%-6s %-13s %-20s %s", "ROW", "Y/N", "\$2", "\$3\n\n";

for my $element (@strings)
  {
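    # Test the match itself, then reject hrefs that *start* with #, / or javascript:
    # (the alternatives are grouped so that ^ applies to all of them).
    if ($element =~ m/<a href="*([^"].+?)"*>/i
        && $1 !~ m!^(?:#|/|javascript:)!i)
      {
       printf "%-20s %s\n", "Row #$i Match", $1;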
      }
    else
      {
        print "Row #$i No match\n";
      }
  $i++;
  }

You may never get this to work 100% without further checking though.
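
For comparison, the single-regexp idea mentioned above could be sketched roughly as follows. This is only an illustration under assumptions: the helper name relative_href is made up, the tag is assumed to be written exactly as `<a href=...` with optional double quotes, and real-world HTML will have plenty of cases it misses.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the href value if it looks relative, else undef.
sub relative_href {
    my ($tag) = @_;
    # ("?)          optional opening quote, remembered as \1
    # (?!...)       the value must not start with ", #, /, whitespace, > or javascript:
    # ([^"\s>]+)    the value itself, captured in $2
    # \1            the same quote (or nothing) that opened the value
    return $tag =~ m{<a\s+href=("?)(?!["#/\s>]|javascript:)([^"\s>]+)\1}i ? $2 : undef;
}

for my $tag (
    '<a href="/slash.htm">',
    '<a href="#name">',
    '<a href="javascript:func()">',
    '<a href=noquotemark.htm>',
    '<a href="quotemark.htm">',
) {
    my $href = relative_href($tag);
    printf "%-32s %s\n", $tag, defined $href ? "match: $href" : 'no match';
}
```

With the five test strings from this thread, the first three report no match and the last two report noquotemark.htm and quotemark.htm.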
 
The regexp I posted was only to demonstrate the negative lookahead - it wasn't intended to do everything you're trying to do, as I felt that would over-complicate the example.

WRT capturing, because lookaheads are zero-width, there's technically nothing to capture. They're effectively an assertion that a particular condition exists at a certain point. In the example I posted previously, the lookahead is just checking that it's not followed by `#', `/', `javascript' or a space. Once that condition is true, the regexp continues matching and captures everything between the quotes into $1. If you were to change that to a positive lookahead (i.e. change ?! to ?=), again it's just checking to make sure it *is* followed by `#', `javascript' etc. If that's true, the entire string between the quotes will still be captured in $1, including whatever you're checking for.

I hope that all made sense.
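
To make the zero-width point concrete, here is a small sketch that runs the same pattern once with a negative and once with a positive lookahead. The helper capture_with and the second test string are just made up for the illustration; the pattern is the simplified one from the earlier example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns what ends up in $1, or undef if there is no match.
sub capture_with {
    my ($regex, $string) = @_;
    return $string =~ $regex ? $1 : undef;
}

# Same pattern twice; only the lookahead type differs.
my $negative = qr/<a href="(?!#|\/|javascript|\s)([^"]+)">/;
my $positive = qr/<a href="(?=#|\/|javascript|\s)([^"]+)">/;

for my $string ('<a href="blah.html">', '<a href="#top">') {
    for my $pair (['negative', $negative], ['positive', $positive]) {
        my ($name, $regex) = @$pair;
        my $got = capture_with($regex, $string);
        printf "%-22s %-8s %s\n", $string, $name,
            defined $got ? "\$1 = $got" : 'no match';
    }
}
```

The lookahead itself consumes nothing, so when the positive version matches '<a href="#top">', $1 holds the whole value "#top", including the character that was checked for.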

This type of problem is why I normally recommend using a proper HTML parser such as HTML::TokeParser::Simple.
A proper parser will extract the href="whatever" bit for you (taking into account the type of quotes used, or the lack of them). Then it's just a case of testing what's in the attribute to see if it's what you're looking for.
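
A rough sketch of that parser approach, assuming HTML::TokeParser::Simple is installed from CPAN (it is not part of core Perl). The helper name relative_links and the test for what counts as "relative" are assumptions made for the example, not anything the module provides.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;   # CPAN module, assumed to be installed

# Hypothetical helper: returns the relative href values found in an HTML string.
sub relative_links {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new(\$html);   # parse from a string reference
    my @links;
    while ( my $token = $parser->get_token ) {
        next unless $token->is_start_tag('a');
        my $href = $token->get_attr('href');              # quoting is handled by the parser
        next unless defined $href && length $href;
        next if $href =~ m{^(?:[#/]|javascript:)}i;       # skip anchors, absolute paths, scripts
        push @links, $href;
    }
    return @links;
}

my $html = join "\n",
    '<a href="/slash.htm">',
    '<a href="#name">',
    '<a href="javascript:func()">',
    '<a href=noquotemark.htm>',
    '<a href="quotemark.htm">';

print "$_\n" for relative_links($html);
```

The regexp left at the end only has to classify a value the parser has already extracted, so it no longer needs to care about quotes at all.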
 
KevinADC, Ishnid, thank you for your useful responses.
 