Need help constructing a regexp

Status
Not open for further replies.

whn

Programmer
Oct 14, 2007
265
US
I need to find matching strings on a Microsoft web page. The page can contain one or more strings that look like this:

<a class="download" onclick="return false;" href="confirmation.aspx?id=36888" bi:fileurl="8/D/5/8D5F90F3-AC24-4A15-9716-BAE10533977A/Windows6.0-KB2809289-x86.msu"

var downloadFileUrl = "8/D/5/8D5F90F3-AC24-4A15-9716-BAE10533977A/Windows6.0-KB2809289-x86.msu" ;

$("#ctl00_ctl21_ColumnRepeater_ctl00_RowRepeater_ctl01_CellRepeater_ctl00_ctl01").details({ "downloadUrl": "8/D/5/8D5F90F3-AC24-4A15-9716-BAE10533977A/Windows6.0-KB2809289-x86.msu", "enableAtlasActionTag": true, "atlasActionTag": ""

The regexp I came up with looks something like this:
Code:
my $kb = 'KB2809289';
my $pattern = qq/http:\/\/download\.microsoft\.com\/download([\\d+\\D+\\w+\W+]+)+$kb\.msu/;

I then have two implementations:

Implementation I:
Code:
  my $pattern = qq/http:\/\/download\.microsoft\.com\/download([\\d+\\D+\\w+\W+]+)+$kbName\.msu/;
  my $contents = `cat $srce`; # The Microsoft page has been saved as a local file
  my $i = 1;
  while($contents =~ /($pattern)/g) {
    my $match = $&;
    print ("\$i = $i, $match\n");
    $i++;
  }
The error I got is like this:
Code:
Complex regular subexpression recursion limit (32766) exceeded at <file name> line 3348.

Implementation II:
Code:
  if(open(FH, $srce)) {
    my $i = 1;
    while(my $line = <FH>) {
      if($line =~ /($pattern)/g) {
        my $match = $1;
        print ("\$i = $i, $match\n");
      }
      $i++;
    }
    close(FH);
  }
No match is found. So I know the regexp is incorrect.

Please help me in two areas:
1) fix my regexp;
2) with a correct regexp, would I still get this error - Complex regular subexpression recursion limit (32766) exceeded?

Many thanks!!
 
Sorry for not making myself clear.

I need to extract "http://download.microsoft.com/download/8/D/5/8D5F90F3-AC24-4A15-9716-BAE10533977A/Windows6.0-KB2809289-x86.msu" from an HTML file. The known parts are the http://download.microsoft.com/download prefix, the KB number (KB2809289), and the .msu extension. A few sample lines from that HTML file are listed in my original post.


Thanks.
 
Hi

Oops. I had the impression you wanted only certain parts of the URLs.

Then I would do it like this:
Perl:
my $kb = 'KB2809289';
my $pattern1 = qq{http://download.microsoft.com/download.+?-$kb-\\w+.msu};    # string
my $pattern2 = qr{http://download\.microsoft\.com/download/.+?-$kb-\w+\.msu}; # regular expression

my $contents = do { local $/; <DATA> };

my $i = 1;
while ($contents =~ /$pattern1/g) {
  my $match = $&;
  print "\$i = $i, $match\n";
  $i++;
}

__DATA__
<a class="download" onclick="return false;" href="confirmation.aspx?id=36888" bi:fileurl="http://download.microsoft.com/download/8/D/5/8D5F90F3-AC24-4A15-9716-BAE10533977A/Windows6.0-KB2809289-x86.msu"

var downloadFileUrl = "http://download.microsoft.com/download/8/D/5/8D5F90F3-AC24-4A15-9716-BAE10533977A/Windows6.0-KB2809289-x86.msu" ;

$("#ctl00_ctl21_ColumnRepeater_ctl00_RowRepeater_ctl01_CellRepeater_ctl00_ctl01").details({ "downloadUrl": "http://download.microsoft.com/download/8/D/5/8D5F90F3-AC24-4A15-9716-BAE10533977A/Windows6.0-KB2809289-x86.msu", "enableAtlasActionTag": true, "atlasActionTag": ""

Note that using the $pattern1 string or the $pattern2 regular expression gives the same result. I included both because I see you mixed up their syntax a bit.
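
To make the string-versus-regexp syntax difference concrete, here is a minimal sketch. The names $as_string, $as_regex and $url are made up for illustration, and the sample value is shortened from the data above. In a qq string Perl processes backslashes before the regex engine sees them, so a doubled backslash is needed, while qr passes the pattern through as written:

```perl
use strict;
use warnings;

my $kb = 'KB2809289';

# qq{} builds a plain string: backslash escapes are processed first,
# so "\\w" is needed to leave a literal \w for the regex engine.
my $as_string = qq{-$kb-\\w+\\.msu};

# qr{} compiles a regex: backslashes reach the regex engine as written.
my $as_regex = qr{-$kb-\w+\.msu};

# Hypothetical sample value, shortened from the thread's data.
my $url = 'Windows6.0-KB2809289-x86.msu';

my $string_hit = $url =~ /$as_string/ ? 1 : 0;
my $regex_hit  = $url =~ $as_regex    ? 1 : 0;

print "string form: $as_string\n";
print "string matched: $string_hit, regex matched: $regex_hit\n";
```

Printing the string form is a quick way to check what the regex engine will actually receive from a qq pattern.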

Feherke.
feherke.github.com/
 
Thank you so much, Feherke! You are the man!!
 
Hi Feherke,

I have a follow-up question.

I modified your code a bit, and I modified the input data, too. My changes are the $cpuType variables, the $p pattern selector, the /gi flags on the match, and the upper-case X86 in the second data line.

I noticed that when a regexp (qr) is used, the case-insensitive match does not work. Is this how it is supposed to be?

Again, thank you so much for your help.

Code:
my $kb = 'KB2809289';
my $cpuType = 'x86'; # passed in, could be in upper case, too
my $cpuTypeInL = lc($cpuType);
my $cpuTypeInU = uc($cpuType);

# String
my $pattern1 = qq{http://download.microsoft.com/download.+?-$kb-$cpuTypeInL.*?.msu};
my $pattern2 = qq{http://download.microsoft.com/download.+?-$kb-$cpuTypeInU.*?.msu};

# regular expression
my $pattern3 = qr{http://download\.microsoft\.com/download/.+?-$kb-$cpuTypeInL.*?.msu};
my $pattern4 = qr{http://download\.microsoft\.com/download/.+?-$kb-$cpuTypeInU.*?.msu};

my $p;
#$p = $pattern1; # string match - match all 3
#$p = $pattern2; # string match - match all 3
#$p = $pattern3; # regexp match - only match 2
$p = $pattern4;  # regexp match - only match 1

my $contents = do { local $/; <DATA> };
if($p =~ /x86/) {
  print "Lower Case Pattern: $p\n";
}
else {
  print "Upper Case Pattern: $p\n";
}

my $i = 1;
while ($contents =~ /$p/gi) { # make it case-insensitive match
  my $match = $&;
  print "\$i = $i, $match\n";
  $i++;
}
__DATA__
<a class="download" onclick="return false;" href="confirmation.aspx?id=36888" bi:fileurl="http://download.microsoft.com/download/8/D/5/8D5F90F3-AC24-4A15-9716-BAE10533977A/Windows6.0-KB2809289-x86.msu"

var downloadFileUrl = "http://download.microsoft.com/download/8/D/5/8D5F90F3-AC24-4A15-9716-BAE10533977A/Windows6.0-KB2809289-X86.msu" ; // It's upper case!!

$("#ctl00_ctl21_ColumnRepeater_ctl00_RowRepeater_ctl01_CellRepeater_ctl00_ctl01").details({ "downloadUrl": "http://download.microsoft.com/download/8/D/5/8D5F90F3-AC24-4A15-9716-BAE10533977A/Windows6.0-KB2809289-x86.msu", "enableAtlasActionTag": true, "atlasActionTag": ""
 
Hi

That is because a regular expression stored in a variable with qr also carries its own flags:
Code:
  DB<1> print qr{foo}
(?^:foo)
  DB<2> print qr{foo}i
(?^i:foo)

There the ?^ resets the flags to their defaults inside the group, so a flag given later at the match does not reach the pattern inside it. So you have to specify the case-insensitive flag on the qr itself:
Code:
# regular expression
my $pattern3 = qr{http://download\.microsoft\.com/download/.+?-$kb-$cpuTypeInL.*?.msu}i;
my $pattern4 = qr{http://download\.microsoft\.com/download/.+?-$kb-$cpuTypeInU.*?.msu}i;
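
A small self-contained sketch of the same effect, assuming a shortened sample value and illustrative names ($text, $plain, $nocase, $outer_flag, $inner_flag are not from the code above). It compares a flag given at the match with a flag given at the qr:

```perl
use strict;
use warnings;

# Sample value with an upper-case CPU type, as in the modified data.
my $text = 'Windows6.0-KB2809289-X86.msu';

my $plain  = qr{-x86\.msu};   # compiled without /i
my $nocase = qr{-x86\.msu}i;  # /i fixed when the regex is compiled

# The /i here is outside the (?^:...) group that $plain expands to,
# so it does not make the precompiled part case-insensitive.
my $outer_flag = $text =~ /$plain/i ? 1 : 0;

# The flag is part of $nocase itself, so no flag is needed at the match.
my $inner_flag = $text =~ /$nocase/ ? 1 : 0;

print "plain:  $plain\n";
print "nocase: $nocase\n";
print "outer /i matched: $outer_flag, qr//i matched: $inner_flag\n";
```

This is also why the qq string patterns matched all three lines: a plain string has no flags of its own, so the /gi given at the match applies to the whole pattern.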

Feherke.
feherke.github.com/
 
Excellent!! Thank you, Feherke.
 