Small regex to parse HTML

Status
Not open for further replies.

spydermonkey

Programmer
May 24, 2004
31
0
0
US
Hey everyone. I am having a little trouble parsing a website with a small regex. I believe it has something to do with the greediness of .* but I don't know of another way around it.

Code:
#!/usr/bin/perl

use warnings;
use strict;

use LWP::Simple;

##############################
# Configuration section
##############################

my $url = "http://rentacoder.com/RentACoder/misc/BidRequests/ShowBidRequests.asp?lngBidRequestListType=3&optSortTitle=2&lngBidRequestCategoryId=-1&txtMaxNumberOfEntriesPerPage=10&optBidRequestPhase=2&lngSortColumn=-6&blnModeVerbose=True&optBiddingExpiration=1";

my $file = "rac.txt";

##############################
# Do not edit below this line
##############################

open (LOG, "$file") or die "Cannot open file $file: $!";
my $saved = <LOG>;
if (!defined $saved)
{
  $saved = "0";
}
close(LOG);

my $content = get("$url");

print $content;

$content =~ m!<td colspan="2"><font size="1" color="white"><b>Title(.*)<b>Search Results: </b>!;

$content = $1;

print "content: $1";

if ($saved ne "0" && $saved ne $content)
{
   print "\a\a\a";
  
   open(LOG, "> $file") or die "Cannot open file $file: $!";
   print LOG $content;
   close(LOG); 
}
elsif ($saved eq "0")
{
   open(LOG, "> $file") or die "Cannot open file $file: $!";
   print LOG $content;
   close(LOG); 
}

I print $content before the regex and it holds the entire page source, as expected. After the regex, $1 keeps triggering an "uninitialized value" warning. I am trying to match EVERYTHING between

Code:
<td colspan="2"><font size="1" color="white"><b>Title

and
Code:
<b>Search Results: </b>

Does anyone have any ideas?

Thanks for your help!
 
Code:
#!/usr/bin/perl

$html = '<td colspan="2"><font size="1" color="white"><b>Title-XYZ<b>Search Results: </b>';

$html =~ m|<[^>]+><[^>]+><b>Title([^<]+)<b>Search Results: </b>|;

print $1;


Kind Regards
Duncan
 
Thanks.

I tried that regex and it provides the same results :(
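
One possible explanation, offered as a guess since the real page source is not shown in the thread: if the two markers sit on different lines, neither pattern can match, because in Perl "." does not match a newline unless the /s modifier is set, and [^<]+ stops at the first tag in between. The sketch below uses a made-up, multi-line stand-in for the page ($content, $start and $end here are illustrative names, not the real markup) to show the difference, and uses the non-greedy .*? so the capture stops at the first end marker.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the downloaded page. That the two markers
# sit on different lines is an assumption about the real markup.
my $content = join "\n",
    '<td colspan="2"><font size="1" color="white"><b>Title</b></font></td>',
    '<td>First request</td>',
    '<td>Second request</td>',
    '<b>Search Results: </b> 2 found',
    '<b>Search Results: </b> footer copy';

# quotemeta escapes the markers so they are matched literally.
my $start = quotemeta '<td colspan="2"><font size="1" color="white"><b>Title';
my $end   = quotemeta '<b>Search Results: </b>';

# Without /s, "." stops at a newline, so this match fails.
my $without_s = $content =~ m/$start(.*)$end/ ? 1 : 0;

# With /s, "." also matches newlines; ".*?" stops at the first end marker.
my ($between) = $content =~ m/$start(.*?)$end/s;

print "matched without /s: $without_s\n";
print defined $between ? "captured:\n$between\n" : "no match\n";
```

Checking whether the match succeeded before reading the capture, as the last line does, also avoids the uninitialized value warning when the pattern fails.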
 
Why not just use a split (or two) to get to the part you want, and then use substitution to replace the table tags with newlines?
--Paul

It's important in life to always strike a happy medium, so if you see someone with a crystal ball, and a smile on their face ...
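
A rough sketch of the split-and-substitute idea above, under the assumption that the page looks something like the made-up sample in $content (the marker strings are shortened for the example, and the variable names are illustrative only):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample; the real page layout is an assumption.
my $content = "<html>header<b>Title</b>\n"
            . "<tr><td>Job A</td></tr>\n"
            . "<tr><td>Job B</td></tr>\n"
            . "<b>Search Results: </b>footer</html>";

my $start = quotemeta '<b>Title';
my $end   = quotemeta '<b>Search Results: </b>';

# First split: keep what follows the start marker.
my (undef, $after) = split /$start/, $content, 2;
die "start marker not found\n" unless defined $after;

# Second split: keep what precedes the end marker.
my ($section) = split /$end/, $after, 2;

# Replace table tags with newlines, then drop any other leftover tags.
$section =~ s/<\/?(?:table|tr|td|th)[^>]*>/\n/gi;
$section =~ s/<[^>]+>//g;

# Keep only the non-blank lines.
my @lines = grep { /\S/ } split /\n/, $section;
print "$_\n" for @lines;
```

Because split works on the whole string, it does not matter whether the markers are on the same line or not.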
 
spydermonkey

It does break up the string as I tested it:-

<td colspan="2"><font size="1" color="white"><b>Title-XYZ<b>Search Results: </b>

$1 captures -XYZ

maybe you haven't explained the exact structure of the HTML - or I haven't understood it quite right

please give a couple of examples and specify what you would like to extract


Kind Regards
Duncan
 
As a general rule, it's usually a bad idea to try to parse HTML using regexps. You're almost always better off using a proper tag-aware parser (HTML::TokeParser::Simple is my personal favourite).
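
A minimal sketch of what the parser route might look like, assuming HTML::TokeParser::Simple is installed from CPAN (it is not a core module) and that the text of interest lives in <td> cells. The sample markup in $html is invented for illustration and is not the real page.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;   # CPAN module, not part of core Perl

# Hypothetical sample markup; the real page structure is an assumption.
my $html = '<table><tr><td colspan="2"><font size="1" color="white">'
         . '<b>Title</b></font></td></tr>' . "\n"
         . '<tr><td>Job A</td><td>Job B</td></tr></table>'
         . '<b>Search Results: </b>';

# Passing a scalar reference makes the parser read from the string.
my $parser = HTML::TokeParser::Simple->new(\$html);

my @cells;
my $in_cell = 0;
while (my $token = $parser->get_token) {
    if ($token->is_start_tag('td')) {
        $in_cell = 1;
        push @cells, '';                 # start collecting a new cell
    }
    elsif ($token->is_end_tag('td')) {
        $in_cell = 0;
    }
    elsif ($in_cell && $token->is_text) {
        $cells[-1] .= $token->as_is;     # append text found inside the cell
    }
}

print "$_\n" for @cells;
```

The parser tracks tags rather than characters, so line breaks and attribute changes in the page do not affect what gets collected.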
 