Tek-Tips is the largest IT community on the Internet today!

difficulty with regular expressions

Status
Not open for further replies.

Vormav

Programmer
Jun 21, 2002
12
US
Just one little issue I'm trying to figure out with regular expressions. I've used them plenty of times before, but I've always had trouble with expressions that work on URLs.
In this case, I'm using Python to grab information from a PHP page. The page contents include something like this:
<a href="display.php?f=2359">Random text</a>
<a href="display.php?f=3256">More random text</a>

...and so forth.

So, I have this:
Code:
import urllib2
import re

# path is the URL of the PHP page (defined elsewhere)
p = re.compile(r'(display\.php\?)')
for line in urllib2.urlopen(path):
   for matched in p.findall(line):
      print matched

That works without a problem for matching every occurrence of display.php?, but if I try to change it even a little, it falls apart:
Code:
p = re.compile(r'(display\.php\?f)')
With this new expression I am, of course, hoping to match display.php?f, which most definitely IS in the HTML file, but it doesn't match anything.
I just don't understand why that expression doesn't work. I've also tried changing it to this:
Code:
p = re.compile(r'(display\.php\?)(f)')
...which also doesn't work.

So I'm guessing there has to be something important about the syntax of regular expressions that I've been completely missing here. Would anyone mind filling me in?
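As a quick offline check, here is a minimal sketch that runs both patterns against the two sample lines quoted above. It is written for Python 3, and it assumes those sample lines are representative of what the server actually sends, which the thread does not confirm. If both patterns match here, the regex syntax is probably not the culprit.

```python
import re

# Sample lines copied from the post; the real page may differ.
sample_lines = [
    '<a href="display.php?f=2359">Random text</a>',
    '<a href="display.php?f=3256">More random text</a>',
]

# Raw strings keep the backslashes intact for the regex engine.
patterns = {
    "one group": re.compile(r'(display\.php\?f)'),
    "two groups": re.compile(r'(display\.php\?)(f)'),
}

results = {}
for label, pattern in patterns.items():
    # findall returns strings for one group and tuples for several groups
    results[label] = [m for line in sample_lines for m in pattern.findall(line)]
    print(label, results[label])
```

Both patterns find a match on each sample line, so on this input the extra f does not break anything.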

The final expression I'm looking for has one group that matches display.php?f= followed by any number of digits followed by ">, and then a second group of any number of alphanumeric characters. Which, to me, seems like it would be...
Code:
p = re.compile(r'(display\.php\?f=\d*">)(\w*)')
(I don't need to escape the quotation mark in this case, do I?)
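A small sketch of the two-group expression, again in Python 3 and run against one sample line from the post. The quotation mark needs no escape here, because the string literal is single-quoted and " is not a regex metacharacter. Note that \w does not match a space, so the second group stops after the first word. The second pattern is a hypothetical variant, assuming the goal is the numeric id plus the full link text.

```python
import re

sample = '<a href="display.php?f=2359">Random text</a>'

# Two capture groups: the link prefix through the closing quote and bracket,
# then a run of word characters (letters, digits, underscore).
word_pattern = re.compile(r'(display\.php\?f=\d*">)(\w*)')
word_matches = word_pattern.findall(sample)
print(word_matches)

# Hypothetical variant: capture only the digits, then everything up to the next "<".
text_pattern = re.compile(r'display\.php\?f=(\d+)">([^<]*)')
text_matches = text_pattern.findall(sample)
print(text_matches)
```

With the first pattern the captured text is Random rather than Random text; the variant keeps the whole phrase.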
 
Okay, so after further experimentation, I've decided that it has to be some sort of encoding issue: if I save my HTML page as a .txt file and load that locally instead of opening the URL, things start working perfectly.
Which raises the question: does the urllib2 library (or any other library) come with any methods for dealing with this? Right now I don't know of any way to handle it, because by the time I can view what re's findall() method is seeing (say, by printing the output to the console), it appears to be in a normal format.
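One way to see exactly what findall() receives is to print repr() of each line instead of the line itself, since repr() shows characters a console usually hides (NUL bytes, carriage returns and so on). The sketch below is Python 3, where urllib.request replaces urllib2. The fetch_lines and report helpers and the UTF-16 sample bytes are made up for illustration; whether encoding is really the cause in this case is not established.

```python
import re
import urllib.request

PATTERN = re.compile(r'display\.php\?f')

def fetch_lines(url, fallback_charset="utf-8"):
    """Hypothetical helper: download url and return decoded text lines."""
    with urllib.request.urlopen(url) as response:
        # Use the charset from the Content-Type header when the server sends one
        charset = response.headers.get_content_charset() or fallback_charset
        return response.read().decode(charset, errors="replace").splitlines()

def report(lines):
    """Print the repr of each line next to whether the pattern matched."""
    hits = []
    for line in lines:
        matched = bool(PATTERN.search(line))
        print(matched, repr(line))
        hits.append(matched)
    return hits

# Made-up bytes imitating a UTF-16 response, where a NUL byte follows each character
raw = '<a href="display.php?f=2359">Random text</a>'.encode("utf-16-le")
hits_undecoded = report([raw.decode("latin-1")])   # NULs left in: no match
hits_decoded = report([raw.decode("utf-16-le")])   # decoded properly: match
```

Against the real page this would be called as report(fetch_lines(path)), with path being the URL from the code above.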

Also, before anyone makes the suggestion: I've found some pages that advise against using regular expressions on HTML content and suggest an HTML or XML parser instead. But that isn't really an option for me.
 