Boolean operators in PERL regexp?

lenrobert · Feb 24, 2005

I am aware of OR ( | ), but is there a logical NOT in the PERL regex syntax?

The task is the following: extract the relative links (i.e. the href property of the "a" element) from an HTML file, even if the value is not enclosed in quotation marks. This means I don't want to retrieve hyperlinks beginning with /, # or javascript:

I would like to express the following pattern and capture (or extract) the content of the second parenthesis.

( <a href=" OR <a href=) THEN NOT(/ OR # OR javascript: OR \s OR " ) THEN ( \s OR " )

The best regexp I could do is this, but it does not handle the case of / # javascript: etc.

/(<a href="|<a href=)([^"]*?)(\s|")/gi

Does anyone know the answer and can help me? Thanks in advance,

Robert

mikevh · Feb 24, 2005

Please post an example of your input and your expected output.

mlibeson · Feb 24, 2005

if (! (0 == 1)) { print "True\n"; }

Michael Libeson

ishnid · Feb 25, 2005

You can read up on `negative lookaheads' in perlretut (google for it or type `perldoc perlretut' on your command line). Here's a simplified example of its use:

Code:

my $string = '<a href="blah.html">';
if ( $string =~ /<a href="(?!#|\/|javascript|\s)([^"]+)">/ ) {
   print "Matched $1\n";
}
else {
   print "No match\n";
}

lenrobert · Feb 25, 2005

Ishnid, thank you for your answer. It works, but it doesn't handle the case where there are no quotation marks, which is often the case on Google pages, for example. I have read about lookahead and lookbehind assertions; the one very big problem is that they are non-capturing.

Mikevh, here is some code. The expected output is specified in the comments.

The / and # cases are OK, but if I want to keep the elegance of a single regexp, I have to sacrifice either javascript: or the case where there are no quotation marks.

Code:

my @strings;

# These should not match
$strings[0] = '<a href="/slash.htm">';
$strings[1] = '<a href="#name">';
$strings[2] = '<a href="javascript:func()">';

# These should match
$strings[3] = '<a href=noquotemark.htm>';
$strings[4] = '<a href="quotemark.htm">';

my $i = 0;

# This is just a title row for the output
printf "%-6s %-13s %-20s %s", "ROW", "Y/N", "\$2", "\$3\n\n"; 

for my $element (@strings)
  {
    # Test the match itself rather than $2, so stale or undefined captures are never read
    if ( $element =~ /(<a href="|<a href=)([^"\/#]+)(\s|"|>)/i )
      {
       printf "%-20s %-20s %-20s %s", "Row #$i Match", $2, $3, "\n";
      }
    else 
      {
        print "Row #$i No match\n";
      }
  $i++;
  }

KevinADC · Feb 25, 2005

I am sure this could be done with a single regexp, but with two it is pretty easy to accomplish:

Code:

# These should not match
$strings[0] = '<a href="/slash.htm">';
$strings[1] = '<a href="#name">';
$strings[2] = '<a href="javascript:func()">';

# These should match
$strings[3] = '<a href=noquotemark.htm>';
$strings[4] = '<a href="quotemark.htm">';
$strings[5] = '<a href="">';# <--added an unusual string for testing

my $i = 0;

# This is just a title row for the output
printf "%-6s %-13s %-20s %s", "ROW", "Y/N", "\$2", "\$3\n\n";

for my $element (@strings)
  {
    # Group the alternatives so ^ anchors all of them; otherwise any href containing a / is rejected
    if ($element =~ m/<a href="*([^"].+?)"*>/i && $1 !~ m!^(?:#|/|javascript:)!i)
      {
       printf "%-20s %-20s %s", "Row #$i Match", $1, "\n";
      }
    else
      {
        print "Row #$i No match\n";
      }
  $i++;
  }

You may never get this to work 100% without further checking though.
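
For comparison, the two checks can probably be folded into a single regexp by putting a negative lookahead right after an optional opening quote. This is only a rough sketch: `relative_href` is a made-up helper name, and it assumes that href is the first attribute of the tag and that the value is either double-quoted or unquoted. The quote is listed inside the lookahead so that the regexp cannot backtrack past the opening quote and match a rejected value anyway.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the href value if it looks like a relative
# link, or undef otherwise. Only double-quoted and unquoted values are handled.
sub relative_href {
    my ($tag) = @_;
    return $tag =~ m{
        <a \s+ href \s* = \s*
        "?                           # optional opening quote
        (?! [\#/"] | javascript: )   # NOT an anchor, a root path, a quote or javascript:
        ( [^"\s>]+ )                 # capture up to a quote, whitespace or closing bracket
    }xi ? $1 : undef;
}

my @strings = (
    '<a href="/slash.htm">',
    '<a href="#name">',
    '<a href="javascript:func()">',
    '<a href=noquotemark.htm>',
    '<a href="quotemark.htm">',
    '<a href="">',
);

for my $element (@strings) {
    my $href = relative_href($element);
    printf "%-32s %s\n", $element, defined $href ? "Match: $href" : 'No match';
}
```

As with the two-regexp version, this will not cope with single quotes, other attributes before href, or tags split across lines.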

ishnid · Feb 26, 2005

The regexp I posted was only to demonstrate the negative lookahead - it wasn't intended to do everything you're trying to do, as I felt that would over-complicate the example.

WRT capturing, because lookaheads are zero-width, there's technically nothing to capture. They're effectively an assertion that a particular condition exists at a certain point. In the example I posted previously, the lookahead is just checking that it's not followed by `#', `/', `javascript' or a space. Once that condition is true, the regexp continues matching and captures everything between the quotes into $1. If you were to change that to a positive lookahead (i.e. change ?! to ?=), again it's just checking to make sure it *is* followed by `#', `javascript' etc. If that's true, the entire string between the quotes will still be captured in $1, including whatever you're checking for.
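
A small sketch of that point, using the same style of pattern as my earlier example (the test string is just one I picked for illustration). The lookahead only checks what comes next; the capture group after it does the capturing, and it starts at the same position.

```perl
use strict;
use warnings;

my $string = '<a href="javascript:func()">';

# Positive lookahead: asserts that "javascript" follows the quote but
# consumes nothing, so the capture group still starts right after the quote.
my ($positive) = $string =~ /<a href="(?=javascript)([^"]+)">/;

# Negative lookahead: the same string is rejected, so nothing is captured.
my ($negative) = $string =~ /<a href="(?!javascript)([^"]+)">/;

print "positive: ", $positive // '(no match)', "\n";
print "negative: ", $negative // '(no match)', "\n";
```

With the positive lookahead, the capture holds the whole attribute value, including the "javascript" part that the lookahead checked for.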

I hope that all made sense.

This type of problem is why I normally recommend using a proper HTML parser such as HTML::TokeParser::Simple.
A proper parser will extract the href="whatever" bit for you (taking into account the type/lack of quotes used). Then it's just a case of testing what's in the attribute to see if it's what you're looking for.

lenrobert · Feb 27, 2005

KevinADC, Ishnid, thank you for your useful responses.
