Regex not matching basic word? :'(

Status
Not open for further replies.

youradds

Programmer
Jun 27, 2001
817
GB
I've got a bit of a weird problem here. Basically, I'm trying to grab the contents of a page (in the example below, it's google.com). I then try to find ALL of the words in the @RULES array. The problem is that it only matches 2 of the words (job and help). The code is:


--------------------------------------------------------------------------------
Code
--------------------------------------------------------------------------------

#!/usr/bin/perl

use strict;
use CGI::Carp qw(fatalsToBrowser);

print "Content-type: text/html \n\n";

my @RULES = qw(job help tools);

use LWP::Simple;
my $html = get("http://www.google.com");
$html =~ s/\n//g;

if ($html =~ /tools/sig) { print "yay!"; } else { print "damn"; }

# do counting of rules...needed later.
my $c_rules_cnt = $#RULES; # get the number of entries..

# now we have the count, let's do some checking to see if we have all
# the required words...if not, then we need to skip this URL, and pass
# back false value, so we can report that it wasn't added...
my $count = 0;
foreach (@RULES) {
    $_ =~ s/[\t\n]/g/;
    print "Word: $_ <BR>";
    if ($html =~ /$_/sig) {
        $count++;
        print "<font color=blue>Match good word.. $_ </font><BR>";
    }
} # end 'foreach' for @c_rules

# if we didn't get *ALL* the rules, then we need to skip this one...



if ($count == $c_rules_cnt) { print "<BR>Bad...<BR>"; } else { print "<BR>Good..<BR>"; }

--------------------------------------------------------------------------------


If I add something like this;


--------------------------------------------------------------------------------
Code
--------------------------------------------------------------------------------

if ($html =~ /tools/i) { print "yay!"; } else { print "damn"; }

--------------------------------------------------------------------------------


...it shows "yay!" fine. Can anyone see why my code would do this? Is it a problem with my code, or a problem with the method I'm using?

Any ideas?

Cheers

Andy
 
Hi,

foreach (@RULES) {
$_ =~ s/[\t\n]/g/;
print "Word: $_ <BR>";
...

What did you intend to do with "$_ =~ s/[\t\n]/g/;" ? I think it's useless, besides that it's wrong :p -- because the option s can't use character classes []. Only tr and y can!

Maybe just try it without this line, though I'm not sure this is really the cause.

Greetings

--
Smash your head on keyboard to continue...
 
That should have been;

$_ =~ s/[\t\n]//g;

...but now I've simply changed it to;

$_ =~ s/\t//g;
$_ =~ s/\n//g;

...either way, the problem is still there :(

Cheers

Andy


 
Hi,

I tried your script (but maybe what I'm about to say is totally shi**)

foreach (@RULES) {
print "Word: $_ <BR>";
if ($html =~ /$_/sig) {
...

maybe you should just use "if ($html =~ /$_/i)".

Greetings

--
Smash your head on keyboard to continue...
 
liuwt, you said:
What did you intend to do with "$_ =~ s/[\t\n]/g/;" ? I think it's useless, besides that it's wrong -- because the option s can't use character classes []. Only tr and y can!

While it does appear to be strange code (and later explained to be a typo), it appears you are saying that substitution regexes ("the option s" ???) can't make use of character classes defined by [], and you go on to say that only tr and y CAN! This is wrong. Completely inverted from the truth. Please read up on regexes.
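
For anyone following along, here is a minimal sketch of that point. It assumes nothing beyond core Perl; the sub names strip_with_s and strip_with_tr are made up for illustration. s/// takes a regex, so a bracketed character class works there, while tr/// takes a plain list of characters (brackets in a tr list would be treated as literal characters).

```perl
use strict;
use warnings;

# s/// takes a regex, so a bracketed character class works as usual.
sub strip_with_s {
    my ($text) = @_;
    $text =~ s/[\t\n]//g;    # delete every tab and newline
    return $text;
}

# tr/// takes a plain character list, not a regex; no brackets needed.
sub strip_with_tr {
    my ($text) = @_;
    $text =~ tr/\t\n//d;     # /d deletes the listed characters
    return $text;
}

my $sample = "job\thelp\ntools\n";
print strip_with_s($sample),  "\n";
print strip_with_tr($sample), "\n";
```

Both calls should print the same string with the tabs and newlines removed.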

youradds:

Code:
#!/usr/bin/perl
 
use strict;
use CGI qw/header/;
use CGI::Carp qw(fatalsToBrowser);    
use LWP::Simple;
 
print header();
 
# my @RULES = qw(job help tools);
# Using all caps is a no no in most cases
# perl convention reserves all caps variables
# and function names to infer that they are used
# by perl automatically (like ENV, BEGIN, END, and so on)
# instead, say:

my @rules = qw(job help tools);

my $html = get("http://www.google.com");

$html =~ s/\n//g;
 
if ($html =~ /tools/sig){
    print "yay!";
    }
else {
    print "damn";
    }
 
# do counting of rules...needed later.

# my $c_rules_cnt = $#RULES; 
# No no no... the $#array symbol DOES NOT return the number
# of elements in an array.  It returns the INDEX of the last
# element in the array.  So for an array with 3 elements
# (like yours) this $#RULES would return 2, that's for index
# values 0, 1, and 2.  Instead, read up on the difference
# between list and scalar context, and leverage that 
# knowledge to your benefit.

my $rules_count = @rules;
 
# now we have the count, let's do some checking to see 
# if we have all the required words...if not, then we 
# need to skip this URL, and pass back a false value, 
# so we can report that it wasn't added...
 
my $count = 0;

foreach (@rules){

  # $_ =~ s/[\t\n]/g/;
  # learn to use $_ as the default value where it can 
  # be used.  In general, if you aren't using it as a
  # default value, you should not use it at all.
  # Instead use a more descriptive variable.  Like:
  # $line, $row, or in this case $rule
  # I'll go ahead and leave it as $_, but, against
  # better judgement.  There is only one real opportunity
  # to use it as a default value, and several times where
  # it's "spelled out".  If you are spelling it out more
  # than you aren't.... you should just use a meaningful
  # variable name, not $_.  Also, $var, $x, and $foo are
  # not generally acceptable in maintainable code.

  s/[\t\n]//g;  # $_ is the default variable for regexes.
                # but since you define the @rules above,
                # there should never be an instance where
                # this will make a difference.

  print "Word: $_ <BR>";  # this is OK.

  if ($html =~ /$_/sig){
    # in the regex above:
    # :the 's' modifier is a little useless, because you 
    # strip out all the newlines from $html earlier in 
    # your script.  But it doesn't hurt.
    # :the 'g' modifier is used to match multiple times,
    # but since you're not looping, and you're not catching
    # the list, it too is useless.
    # Read up on regexes - they are awesome.

    $count++;
    
    # 'heredocs' are cool.  use them when appropriate.
    print <<HTML;
      <font color=blue>
      Match good word.. $_ 
      </font><BR>
HTML
  }
 } # end 'foreach' for @rules
 
# if we didn't get *ALL* the rules, then we need to skip this one...
  
 
 
if($count == $rules_count){
  print "<BR>Bad...<BR>";
  }
else{
  print "<BR>Good..<BR>";
  }

I believe the above is valid, if aesthetically wanting. If you were seeing the matches, but still getting "Bad..." at the end, I think it was due to the misuse of $#array.
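
A small sketch of the $#array point, assuming the same three-word list used in this thread (the variable names are only illustrative): $#rules gives the index of the last element, while the array in scalar context gives the number of elements.

```perl
use strict;
use warnings;

my @rules = qw(job help tools);

my $last_index = $#rules;          # index of the last element
my $count      = @rules;           # array in scalar context: element count
my $explicit   = scalar(@rules);   # same count, spelled out

print "last index: $last_index\n";
print "count: $count\n";
print "explicit count: $explicit\n";
```

For a three-element array the last index is 2 and the count is 3, so comparing a match counter against $#rules is off by one.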

Hope this helps,

--jim
 
I managed to fix it. Turns out that 'g' does not work the way I expected within an 'if' statement :(

Anyway, thanks everyone for your help.

Cheers

Andy
 
I can't seem to figure out why this happens...
I read everything on the match operator in perlop, but I'm either not understanding everything, or the explanation isn't there.

As far as I can tell, when using the global match option, the search start position is being set to just after the first match within the foreach loop. Is this how it's supposed to work?


simplified version:
Code:
my @rules = qw(tools jobs);

my $html = "tools jobs";  #works with this string
#my $html = "jobs tools"; #doesn't work with this one

for my $rule (@rules){
    if ($html =~ /$rule/g) {
        print "found $rule\n";
    }else{
        print "didn't find $rule\n";
    }
}
 
@coderifous:

Ohh...yes, you are right...shame on me, what I said was wrong...Soorrryy!!!

:-(

--
Smash your head on keyboard to continue...
 
Chazoid: take off the /g modifier. You'll be fit to go.
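
A minimal sketch of what that looks like for the word-counting case, assuming the two test strings from the simplified example. The sub name count_matching_rules is hypothetical, and the \Q...\E quoting is an extra precaution that was not in the original code.

```perl
use strict;
use warnings;

# Count how many of the given words occur in $text (case-insensitive).
# No /g here, so every match starts at the beginning of the string.
sub count_matching_rules {
    my ($text, @rules) = @_;
    my $count = 0;
    for my $rule (@rules) {
        # \Q...\E quotes any regex metacharacters in the word
        $count++ if $text =~ /\Q$rule\E/i;
    }
    return $count;
}

my @rules = qw(tools jobs);
for my $html ("tools jobs", "jobs tools") {
    my $found = count_matching_rules($html, @rules);
    print "'$html': matched $found of ", scalar(@rules), " rules\n";
}
```

Without /g the order of the words in the string should no longer matter.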

liuwt: No shame since you were honorable in admitting your error. Most individuals would not do this.

--jim
 
Yeah, it was the 'g' in the 'if' that caused my problem :( Is there actually a reason why this happens, or is it just a limitation of Perl?

Cheers

Andy
 
The /g modifier changes how the match behaves. In scalar context (such as the condition of an 'if'), each /g match against a string begins where the last successful match on that string left off, and that position is only reset when a match fails. So in your example (the "jobs tools" string), it finds the first search term at the END of the target text. Then, on the second iteration, since you are using the /g modifier, the regex begins searching at the end of the string (where the last match left off). Since there is nothing there, the match fails.

To see this behavior, do this:

Code:
my @rules = qw(tools jobs);

my $html = "tools jobs";  #works with this string
#my $html = "jobs tools"; #doesn't work with this one

for my $rule (@rules){

    print "REGEX POSITION: ", pos($html), "\n";

    if ($html =~ /$rule/g) {
        print "found $rule\n";
    }else{
        print "didn't find $rule\n";
    }
}

As you can see with the "jobs tools" string, on the second pass the position has a value indicating the end of the string (on the first pass it is undefined, so nothing is printed). What's cool about pos() is that it can be used as an lvalue. You can, if you want, tell perl to begin searching at a certain location in the string using a combo of the /g modifier and an assignment to pos():

Code:
my @rules = qw(tools jobs);

my $html = "tools jobs";  #works with this string
#my $html = "jobs tools"; #doesn't work with this one

for my $rule (@rules){
    pos($html) = 0;
    if ($html =~ /$rule/g) {
        print "found $rule\n";
    }else{
        print "didn't find $rule\n";
    }
}

As you can see, the regex performs as you had imagined it would. This basically defeats the purpose of /g, though, and it makes more sense to just leave it off than to assign 0 to pos().
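
For contrast, here is a rough sketch of where /g in scalar context is useful: walking through every occurrence of a word with a while loop. The sub name match_end_offsets and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Return the end offset of every occurrence of $word in $text.
sub match_end_offsets {
    my ($text, $word) = @_;
    my @offsets;
    # In scalar context each /g match resumes at pos($text);
    # when the match finally fails, pos is reset and the loop ends.
    while ($text =~ /\Q$word\E/g) {
        push @offsets, pos($text);
    }
    return @offsets;
}

my @ends = match_end_offsets("jobs tools jobs", "jobs");
print "matches end at: @ends\n";
```

Here the remembered position is what lets the loop move from one match to the next, which is the behavior that gets in the way of a plain yes/no test.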

--jim
 