Small regex to parse HTML

spydermonkey · Jun 30, 2004

Hey everyone. I am having a little trouble parsing a website with a small regex. I believe it has something to do with the greediness of .* but I don't know of another way around it.

Code:

#!/usr/bin/perl

use warnings;
use strict;

use LWP::Simple;

##############################
# Configuration section
##############################

my $url = "http://rentacoder.com/RentACoder/misc/BidRequests/ShowBidRequests.asp?lngBidRequestListType=3&optSortTitle=2&lngBidRequestCategoryId=-1&txtMaxNumberOfEntriesPerPage=10&optBidRequestPhase=2&lngSortColumn=-6&blnModeVerbose=True&optBiddingExpiration=1";

my $file = "rac.txt";

##############################
# Do not edit below this line
##############################

open (LOG, "$file") or die "Cannot open file $file: $!";
my $saved = <LOG>;
if (!defined $saved)
{
  $saved = "0";
}
close(LOG);

my $content = get("$url");

print $content;

$content =~ m!<td colspan="2"><font size="1" color="white"><b>Title(.*)<b>Search Results: </b>!;

$content = $1;

print "content: $1";

if ($saved ne "0" && $saved ne $content)
{
   print "\a\a\a";
  
   open(LOG, "> $file") or die "Cannot open file $file: $!";
   print LOG $content;
   close(LOG); 
}
elsif ($saved eq "0")
{
   open(LOG, "> $file") or die "Cannot open file $file: $!";
   print LOG $content;
   close(LOG); 
}

I test-print $content prior to the regex and it holds the entire page source, as expected. After the regex, using $1 keeps producing an "uninitialized value" warning. I am trying to match EVERYTHING between

Code:

<td colspan="2"><font size="1" color="white"><b>Title

and

Code:

<b>Search Results: </b>

Does anyone have any ideas?

Thanks for your help!

duncdude · Jun 30, 2004

Code:

#!/usr/bin/perl

$html = '<td colspan="2"><font size="1" color="white"><b>Title-XYZ<b>Search Results: </b>';

$html =~ m|<[^>]+><[^>]+><b>Title([^<]+)<b>Search Results: </b>|;

print $1;

Kind Regards
Duncan

spydermonkey · Jun 30, 2004

Thanks.

I tried that regex and it provides the same results
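One likely explanation, assuming the two markers sit on different lines of the real page (the thread never shows the page source): by default `.` does not match a newline, so `(.*)` cannot span lines, and `[^<]+` stops at the first tag it meets. The sketch below uses a made-up `$html` sample to show the `/s` modifier with a non-greedy `.*?`, and it checks that the match succeeded before touching `$1`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: the real page is not shown in the thread.
my $html = <<'HTML';
<td colspan="2"><font size="1" color="white"><b>Title</b></font></td>
<tr><td>First request</td></tr>
<tr><td>Second request</td></tr>
<b>Search Results: </b>
HTML

# quotemeta escapes the literal markers so they are safe inside a pattern.
my $start = quotemeta '<td colspan="2"><font size="1" color="white"><b>Title';
my $end   = quotemeta '<b>Search Results: </b>';

# Without /s, "." never matches "\n", so this fails on multi-line input.
my $matched_without_s = ($html =~ m!$start(.*?)$end!) ? 1 : 0;

# With /s, "." also matches "\n"; ".*?" stops at the first end marker.
my $captured;
if ($html =~ m!$start(.*?)$end!s) {
    $captured = $1;    # only read $1 after a successful match
}

print "without /s: ", ($matched_without_s ? "match" : "no match"), "\n";
print defined($captured) ? "captured:$captured" : "no match with /s\n";
```

Reading `$1` only inside the `if` avoids the "uninitialized value" warning when the page layout changes and the pattern stops matching.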

PaulTEG · Jun 30, 2004

Why not just use a split (or two) to get to the part you want, and then use substitution to replace the table tags with newlines?
--Paul

It's important in life to always strike a happy medium, so if you see someone with a crystal ball, and a smile on their face ...
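The split-and-substitute idea above might look something like this. It is only a sketch: the `$html` sample is invented, and the variable names are placeholders rather than anything from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: the real page is not shown in the thread.
my $html = <<'HTML';
<td colspan="2"><font size="1" color="white"><b>Title</b></font></td>
<tr><td>First request</td></tr>
<tr><td>Second request</td></tr>
<b>Search Results: </b>
HTML

my $start = '<td colspan="2"><font size="1" color="white"><b>Title';
my $end   = '<b>Search Results: </b>';

# split takes a pattern, so \Q...\E quotes the literal markers.
# A limit of 2 keeps everything after the first marker in one piece.
my (undef, $after) = split /\Q$start\E/, $html, 2;

my @lines;
if (defined $after) {
    # Note: if the end marker is missing, this keeps everything after $start.
    my ($between) = split /\Q$end\E/, $after, 2;

    # Replace every tag with a newline, then drop the blank lines.
    (my $text = $between) =~ s/<[^>]+>/\n/g;
    @lines = grep { /\S/ } split /\n/, $text;
}

print "$_\n" for @lines;
```

Because `split` works on the whole string, newlines between the two markers are not a problem here.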

duncdude · Jun 30, 2004

spydermonkey

It does break up the string when I test it:

<td colspan="2"><font size="1" color="white"><b>Title-XYZ<b>Search Results: </b>

$1 captures -XYZ

Maybe you haven't explained the exact structure of the HTML, or I haven't understood it quite right.

Please give a couple of examples and specify what you would like to extract.

Kind Regards
Duncan

ishnid · Jul 2, 2004

As a general rule, it's usually a bad idea to try to parse HTML using regexps. You're almost always better off using a proper tag-aware parser (HTML::TokeParser::Simple is my personal favourite).
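A rough sketch of the parser approach, assuming HTML::TokeParser::Simple is installed from CPAN (it is not a core module) and using an invented `$html` sample, since the real page is not shown in the thread. It walks the token stream and collects the text that appears between the "Title" heading and the "Search Results:" label.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;    # CPAN module, not part of core Perl

# Hypothetical sample: the real page is not shown in the thread.
my $html = <<'HTML';
<td colspan="2"><font size="1" color="white"><b>Title</b></font></td>
<tr><td>First request</td></tr>
<tr><td>Second request</td></tr>
<b>Search Results: </b>
HTML

# A scalar reference tells the parser to read from the string itself.
my $parser = HTML::TokeParser::Simple->new(\$html);

my @texts;
my $inside = 0;
while (my $token = $parser->get_token) {
    next unless $token->is_text;          # skip tags, comments, etc.
    my $text = $token->as_is;

    if ($text =~ /^\s*Title\s*$/)       { $inside = 1; next }
    if ($text =~ /^\s*Search Results:/) { last }

    # Keep non-blank text seen between the two landmarks.
    push @texts, $text if $inside && $text =~ /\S/;
}

print "$_\n" for @texts;
```

The landmarks here are matched on visible text rather than on exact tag attributes, so small markup changes such as a different font colour should not break the extraction.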
