
How to Escape a variable Regex

Status
Not open for further replies.

zackiv31

Programmer
May 25, 2006
148
US
I have a function that parses an HTML page and extracts phone numbers. The problem is that it also matches runs of digits that are part of HTML tags, which I want to exclude. Here is how I get around it:

Code:
while($html=~/((\d{3})[\s-\(\)\.]*(\d{3})[\s-\(\)\.]*(\d{4}))/g){
	if ($html !~ /\<.+$1.+\>/){

The problem with the second line is that $1 may contain characters that should be escaped but aren't, which causes my application to crash.

Is there a Perl function like escape($1), or another way to do this?
 
Maybe using \Q..\E will work:

Code:
if ($html !~ /\<.+\Q$1\E.+\>/){

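\Q backslash-escapes every character that is not a word character (letter, digit or underscore). Variables such as $1 are interpolated first, which is why $ and @ are not escaped in the pattern text itself, and the interpolated value is then escaped. \E just tells the regexp where to stop escaping.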
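
As a minimal sketch of the escaping described above: the built-in quotemeta function does the same job as \Q..\E outside a pattern. The captured text and the HTML string below are hypothetical samples, chosen because an unbalanced ")" in an interpolated pattern is one way to get a fatal regexp error.

```perl
use strict;
use warnings;

# Hypothetical captured text: the unbalanced ")" would be a fatal
# "Unmatched )" error if interpolated into a pattern unescaped.
my $found = '555) 123-4567';

# quotemeta backslash-escapes every ASCII non-word character.
my $escaped = quotemeta $found;
print "$escaped\n";    # 555\)\ 123\-4567

# Hypothetical tag that happens to contain the same text.
my $html = '<a title="call 555) 123-4567 now">';

# \Q..\E applies the same escaping inline, after interpolation.
my $inside_tag = $html =~ /<.+\Q$found\E.+>/ ? 1 : 0;
print "inside a tag: $inside_tag\n";    # inside a tag: 1
```

Either form should do here; \Q..\E is simply shorter when the value is only used inside one pattern.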


But you have other problems:

Code:
[\s-\(\)\.]

You need to escape the '-' in the above character class because Perl will try to read \s-\( as a range, and there is no such range. If you had warnings turned on (which you should) you would have caught that.



------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Hi,
"it matches lines of numbers that are part of HTML tags"
I couldn't follow that. Can you post some sample data?

--------------------------------------------------------------------------
I never set a goal because u never know whats going to happen tommorow.
 
spookie, this would match (but I don't want it to):

Code:
<td align="center"><img src="http://images.craigslist.org/01010001040501030620080121c3388f316338f3b7ba007cc2.jpg"></td>
 
zackiv31,

Try this

Code:
while($html=~/((\d{3})[\s-\(\)\.]*(\d{3})[\s-\(\)\.]*(\d{4})).*?\</g){
# no need for the if condition here

--------------------------------------------------------------------------
I never set a goal because u never know whats going to happen tommorow.
 
Also, as Kevin pointed out, it's better to escape the '-' character, i.e. \-



--------------------------------------------------------------------------
I never set a goal because u never know whats going to happen tommorow.
 
This post is more or less a continuation of another thread: please keep to the same thread when the topic is the same.
You need to better define what a phone number can look like for your script to catch it.
For example, you might decide that only a sequence of exactly 3+3+4=10 digits, with an optional separator after the third and the sixth digit (the separator being a single non-digit character), is a valid one.
In that case you should do:
Code:
while($html=~/\D((\d{3})\D?(\d{3})\D?(\d{4}))\D/g){
Note however that this won't match two phone numbers that are strictly consecutive.
Your example above won't match with this, but if you still get a match within HTML tags, your second line may be a solution, though not a very efficient one. In that case you need a '?' after the '+' quantifiers, and I would prefer a '*' quantifier, to avoid matching a number strictly within brackets (though without HTML meaning).
Concerning escape(): Perl has no function by that name, though quotemeta and \Q..\E do that job; please give some examples of what you mean.
Also consider that in HTML you can have named character entities and numbered character entities like &#048; : your task is not a simple one.
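
To illustrate the caveat about consecutive numbers, here is a small sketch. The sample text is hypothetical, and the lookaround variant is an assumed alternative that is not part of the post above: it checks the neighbouring characters without consuming them.

```perl
use strict;
use warnings;

# Hypothetical sample: two numbers separated by a single space.
my $text = ' 555-123-4567 5551234567 ';

# Pattern from the post: the trailing \D consumes the space, so the
# second number has no leading non-digit left to match.
my @consuming;
while ($text =~ /\D((\d{3})\D?(\d{3})\D?(\d{4}))\D/g) {
    push @consuming, $1;
}

# Assumed alternative: zero-width lookarounds test the neighbours
# without consuming them, so both numbers are found.
my @lookaround;
while ($text =~ /(?<!\d)((\d{3})\D?(\d{3})\D?(\d{4}))(?!\d)/g) {
    push @lookaround, $1;
}

print "consuming:  @consuming\n";     # consuming:  555-123-4567
print "lookaround: @lookaround\n";    # lookaround: 555-123-4567 5551234567
```

The lookaround form also matches a number at the very start or end of the string, where no \D is available.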

Franco
: Online tools for structural design
: Magnetic brakes for fun rides
: Air bearing pads
 
It should be
Code:
while($html=~/((\d{3})[\s-\(\)\.]*(\d{3})[\s-\(\)\.]*(\d{4})).*?[^\>]\</g){
# no need for the if condition here

--------------------------------------------------------------------------
I never set a goal because u never know whats going to happen tommorow.
 
No, it should not be:

[\s-\(\)\.]

it should be:

[\s\-\(\)\.]

or better written as:

[\s().-]

which is easier to read. The dot does not need escaping in a character class, nor do the parentheses. Most characters have no meta meaning inside a character class; the only ones that do are:

-]\^$

and the pattern delimiter of your regexp. '^' and '-' are only meta characters depending on where they appear in the character class; it is safer just to escape them if you aren't sure.
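
The positional rules for '-' and '^' can be checked with a short sketch like this one; the sample characters are arbitrary and only meant to show the behaviour described above.

```perl
use strict;
use warnings;

# '-' placed last is a literal hyphen; '(', ')' and '.' need no backslash.
my $sep = qr/[\s().-]/;
my @matched = grep { /$sep/ } ('(', ')', '.', '-', ' ', 'x', '5');
print scalar(@matched), " of 7 samples are separators\n";    # 5 of 7

# Between two characters, '-' forms a range instead.
my $range   = 'b' =~ /[a-c]/ ? 1 : 0;    # 1: 'b' lies in a..c
my $literal = 'b' =~ /[ac-]/ ? 1 : 0;    # 0: only 'a', 'c' or '-'

# '^' negates only as the first character of the class.
my $caret   = '^' =~ /[a^]/ ? 1 : 0;     # 1: literal caret
my $negated = 'a' =~ /[^a]/ ? 1 : 0;     # 0: anything except 'a'

print "$range $literal $caret $negated\n";    # 1 0 1 0
```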

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
spookie, that pattern doesn't work for my documents (I don't know much about Perl greediness, but I think that's why), in particular this one:

Code:
r"><img src="http://images.craigslist.org/010108010209010307200801214f948f79c7e6aff4fa00c89b.jpg"></td>
 
As of now this seems to work for me... and logically it makes sense.

Code:
while($html=~/(\>[^\<]+?(\d{3})[\s\-\(\)\.]*(\d{3})[\s\-\(\)\.]*(\d{4}))/g){
 
You could benefit from understanding character classes: how to write them more clearly and how they differ from other patterns used in regular expressions. All those backslashes make your regexp nearly impossible to read and make errors easy. Use the x modifier to comment your regular expressions so they stay understandable for future reference:

Code:
while (
$html =~ /
(          # start a capturing group $1
>          # match a closing bracket >
[^<]+?     # match one or more of anything except < 
(\d{3})    # match 3 digits and store in $2
[\s().-]*  # match zero or more of these in any order
(\d{3})    # match 3 digits and store in $3
[\s().-]*  # match zero or more of these in any order
(\d{4})    # match 4 digits and store in $4
)          # end capturing group $1
/gx )
{
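
For reference, a complete runnable version of the same idea might look like the sketch below. The page fragment is a hypothetical sample, and the capture groups are renumbered because the outer group is dropped here for brevity.

```perl
use strict;
use warnings;

# Hypothetical page fragment: one number in text, one inside a tag.
my $html = '<td>Call (555) 123-4567</td><img src="5551234567.jpg">';

my @found;
while (
    $html =~ /
        >           # a closing bracket, so the match starts outside a tag
        [^<]+?      # text before the number, never crossing an opening bracket
        (\d{3})     # first three digits
        [\s().-]*   # optional separators
        (\d{3})     # next three digits
        [\s().-]*   # optional separators
        (\d{4})     # last four digits
    /gx
) {
    push @found, "$1-$2-$3";
}

print "@found\n";    # 555-123-4567
```

The digits in the src attribute are skipped because no '>' precedes them without a '<' in between.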

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
lol.. i don't even know what a character class is... i'm no perl programmer...

that is a lot more readable... but it makes my code so long!
 
OK, no problem.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
zackiv31,
your regex is still not OK.
It won't match this: >1234567890 ; for it to be matched, replace the '+' with a '*'.
With this: >123-456-7890xx098-765-4321 it matches only the first number (though this is possibly what you want).
With this: >12345678900987654321 it matches the first 10 digits, though this is not really a phone number.
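
The three cases can be tried with a sketch along these lines. The helper name find_numbers is made up for illustration, and the patterns use the shorter character class suggested earlier in the thread; the second and third cases are run with the '*' form.

```perl
use strict;
use warnings;

# The pattern as posted ('+?') and with the suggested change ('*?').
my $plus = qr/>[^<]+?(\d{3})[\s().-]*(\d{3})[\s().-]*(\d{4})/;
my $star = qr/>[^<]*?(\d{3})[\s().-]*(\d{3})[\s().-]*(\d{4})/;

# Hypothetical helper: returns the digits of every match in $text.
sub find_numbers {
    my ($re, $text) = @_;
    my @found;
    while ($text =~ /$re/g) {
        push @found, "$1$2$3";
    }
    return @found;
}

my @a = find_numbers($plus, '>1234567890');                   # no match
my @b = find_numbers($star, '>1234567890');                   # one match
my @c = find_numbers($star, '>123-456-7890xx098-765-4321');   # first number only
my @d = find_numbers($star, '>12345678900987654321');         # first 10 digits

print scalar(@a), " [@b] [@c] [@d]\n";    # 0 [1234567890] [1234567890] [1234567890]
```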

Franco
: Online tools for structural design
: Magnetic brakes for fun rides
: Air bearing pads
 
That "+" -> "*" was a good insight... thank you.

The other ones are merely tradeoffs in how robust the application needs to be.

With that one change it seems to be doing what I want.
 