How to Escape a variable Regex

zackiv31 · Jan 22, 2008

I have a function that parses an HTML page and extracts phone numbers. The problem is that it matches lines of numbers that are part of HTML tags, which I want to exclude. Here is how I get around it:

Code:

while($html=~/((\d{3})[\s-\(\)\.]*(\d{3})[\s-\(\)\.]*(\d{4}))/g){
	if ($html !~ /\<.+$1.+\>/){

The problem with the second line is that if $1 contains characters that should be escaped but aren't, my application crashes.

Is there a Perl function like escape($1), or another way to do this?

KevinADC · Jan 22, 2008

maybe using \Q..\E will work:

Code:

if ($html !~ /\<.+\Q$1\E.+\>/){

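\Q backslash-escapes every ASCII character that is not a letter, digit, or underscore. $ and @ are the exception, because variables are interpolated first and the escaping is then applied to their values. \E just tells the regexp where to stop escaping. The same escaping is available as the built-in function quotemeta.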

http://perldoc.perl.org/perlretut.html
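As a rough illustration of why the unescaped $1 crashes and what \Q...\E changes, here is a minimal sketch. The sample strings and variable names are made up for the example, and the captured text is assumed to contain an unbalanced ")".

```perl
use strict;
use warnings;

# Made-up sample data: the captured text contains an unbalanced ")".
my $captured = '555) 555-1234';
my $html     = '<p>Call (555) 555-1234 today</p>';

# Interpolated as-is, the ")" closes a group that was never opened,
# so compiling the pattern dies; eval traps the error.
my $raw_compiles = eval { my $re = qr/$captured/; 1 } ? 1 : 0;

# \Q...\E quotes the interpolated value so it is matched literally.
my $quoted_matches = ($html =~ /\Q$captured\E/) ? 1 : 0;

# quotemeta() returns the same quoted text as a plain string.
my $escaped = quotemeta($captured);

print "raw pattern compiles:   $raw_compiles\n";
print "quoted pattern matches: $quoted_matches\n";
print "quotemeta result:       $escaped\n";
```

The raw pattern fails to compile, while the quoted one matches the text literally.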

But you have other problems:

Code:

[\s-\(\)\.]

You need to escape the '-' in the above character class, because Perl will try to interpret \s-\( as a range and there is no such range. (Perl then falls back to treating the '-' as a literal hyphen, so the class happens to work.) If you had warnings turned on (which you should), you would have caught that.
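A small sketch of what warnings report for that character class. The pattern is compiled from a string at run time so that the warning can be captured in a handler; the handler and variable names are only for illustration, and the exact wording of the message may differ between Perl versions.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them (illustrative handler).
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# The character class from the original post, compiled at run time.
my $class = '[\s-\(\)\.]';
my $re    = qr/$class/;

# Perl cannot build a range from \s, so it keeps "-" as a literal hyphen.
my $hyphen_matches = ('-' =~ $re) ? 1 : 0;

print "warnings seen: ", scalar(@warnings), "\n";
print "first warning: $warnings[0]" if @warnings;
print "hyphen matches: $hyphen_matches\n";
```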

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]

spookie · Jan 22, 2008

Hi,

"is it matches lines of numbers that are part of HTML tags"

I couldn't get that. Can you post some sample data?

--------------------------------------------------------------------------
I never set a goal because u never know whats going to happen tommorow.

zackiv31 · Jan 22, 2008

spookie, this would match (but I don't want it to):

Code:

<td align="center"><img src="http://images.craigslist.org/01010001040501030620080121c3388f316338f3b7ba007cc2.jpg"></td>

spookie · Jan 22, 2008

zackiv31,

Try this

Code:

while($html=~/((\d{3})[\s-\(\)\.]*(\d{3})[\s-\(\)\.]*(\d{4})).*?\</g){
# no need for the if condition here


spookie · Jan 22, 2008

Also, as Kevin pointed out, it's better to escape the '-' character, i.e. \-


prex1 · Jan 22, 2008

This post is more or less a continuation of another thread: please keep to the same thread when the topic is the same.
You need to define more precisely what a phone number can look like for your script to catch it.
For example, you might decide that only a sequence of exactly 3+3+4=10 digits, with an optional separator after the third and the sixth digit (the separator being a single non-digit character), is a valid one.
In that case you should do:

Code:

while($html=~/\D((\d{3})\D?(\d{3})\D?(\d{4}))\D/g){

Note, however, that this won't match two phone numbers that are strictly consecutive.
Your example above won't match with this. If you can still get a match inside HTML tags, your second line may be a solution, though not a very efficient one. In that case you need a ? after each + quantifier, and I would prefer a * quantifier there, so that a number sitting directly between the angle brackets (though that has no HTML meaning) is also excluded.
Concerning escape(), Perl has no function by that name; the built-in for this job is quotemeta, which is what \Q...\E applies. If that is not what you need, please show with a few examples what you mean.
Also consider that HTML can contain named character entities and numbered character entities like &#48; (the digit 0): your task is not a simple one.
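The behaviour described above can be checked with a short sketch. The helper name and the sample strings are invented; the lookaround pattern is an assumed alternative that is not part of the post, shown only because it tests the neighbouring characters without consuming them.

```perl
use strict;
use warnings;

# Collect every $1 for a pattern with one outer capture group.
sub numbers_in {
    my ($re, $text) = @_;
    my @found;
    while ($text =~ /$re/g) {
        push @found, $1;
    }
    return @found;
}

# The pattern from the post: \D on both sides consumes a character.
my $consuming = qr/\D((\d{3})\D?(\d{3})\D?(\d{4}))\D/;

# Assumed alternative: lookarounds check the neighbours without consuming them.
my $lookaround = qr/(?<!\d)((\d{3})\D?(\d{3})\D?(\d{4}))(?!\d)/;

my $url  = 'src="http://images.craigslist.org/01010001040501030620080121c3388f316338f3b7ba007cc2.jpg"';
my $pair = 'x555-123-4567 555-987-6543y';

my @in_url      = numbers_in($consuming,  $url);
my @consecutive = numbers_in($consuming,  $pair);
my @both        = numbers_in($lookaround, $pair);

print "in URL: ", scalar(@in_url), "\n";
print "consecutive, consuming: @consecutive\n";
print "consecutive, lookaround: @both\n";
```

With a single space between the two numbers, the consuming pattern finds only the first one, because the space was already used as the trailing \D of the first match.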

Franco

http://www.xcalcs.com : Online tools for structural design
http://www.megamag.it : Magnetic brakes for fun rides
http://www.levitans.com : Air bearing pads

spookie · Jan 22, 2008

It should be

Code:

while($html=~/((\d{3})[\s-\(\)\.]*(\d{3})[\s-\(\)\.]*(\d{4})).*?[^\>]\</g){
# no need for the if condition here
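The idea behind this suggestion is to skip a number whenever the next angle bracket after it is a closing >, which means the number sits inside a tag. One way to sketch that idea is with a negative lookahead. This is an assumed alternative, not the pattern posted above, and the sample HTML is invented. It would be fooled by a stray > in ordinary text or inside an attribute value.

```perl
use strict;
use warnings;

my $outside_tag = qr/
    (\d{3}) [\s().-]* (\d{3}) [\s().-]* (\d{4})   # the three digit groups
    (?! [^<>]* > )                                # not followed by ">" before any "<"
/x;

# Invented sample: one digit run inside a tag, one phone number in text.
my $html = '<td><img src="http://example.com/0101000104050103.jpg"></td>'
         . '<td>Call (555) 123-4567 now</td>';

my @found;
while ($html =~ /$outside_tag/g) {
    push @found, "$1-$2-$3";
}
print "@found\n";
```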


KevinADC · Jan 22, 2008

No, it should not be:

[\s-\.]

it should be:

[\s\-\.]

or better written as:

[\s().-]

which is easier to read. The dot does not need escaping in a character class, nor do the parentheses. Most characters have no meta meaning inside a character class; the only ones that do are:

-]\^$

and the pattern delimiter of your regexp. And '^' and '-' only have meta context depending on where they are used in the character class. Safer just to escape them if you don't know where that is.
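A quick way to see these rules in action is the sketch below. The candidate characters and variable names are chosen only for illustration.

```perl
use strict;
use warnings;

# Hyphen last: nothing follows it, so it cannot start a range.
my $separator = qr/[\s().-]/;

my @candidates = ('-', '.', '(', ')', ' ', 'x', '5');
my @accepted   = grep { $_ =~ $separator } @candidates;

# Position matters: between two characters "-" forms a range,
# escaped it is a literal hyphen.
my $range_has_m   = ('m' =~ /[a-z]/)  ? 1 : 0;
my $literal_has_m = ('m' =~ /[a\-z]/) ? 1 : 0;

print join('|', @accepted), "\n";
print "range matches m: $range_has_m, escaped hyphen matches m: $literal_has_m\n";
```

The unescaped dot and parentheses match only themselves here, and the letter and digit are rejected.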


zackiv31 · Jan 23, 2008

Spookie, that pattern doesn't work for my documents (I don't know much about Perl's greediness, but I think that's why), in particular this one:

Code:

r"><img src="http://images.craigslist.org/010108010209010307200801214f948f79c7e6aff4fa00c89b.jpg"></td>

zackiv31 · Jan 23, 2008

As of now this seems to work for me... and logically it makes sense.

Code:

while($html=~/(\>[^\<]+?(\d{3})[\s\-\(\)\.]*(\d{3})[\s\-\(\)\.]*(\d{4}))/g){

KevinADC · Jan 23, 2008

You would benefit from understanding character classes: how to write them more clearly and how they differ from other patterns used in regular expressions. All those backslashes make your regexp nearly impossible to read and make errors easy to introduce. Use the x modifier to comment your regular expressions so they stay understandable for future reference:

Code:

while (
$html =~ /
(          # start a capturing group $1
>          # match a closing bracket >
[^<]+?     # match one or more of anything except < 
(\d{3})    # match 3 digits and store in $2
[\s().-]*  # match zero or more of these in any order
(\d{3})    # match 3 digits and store in $3
[\s().-]*  # match zero or more of these in any order
(\d{4})    # match 4 digits and store in $4
)          # end capturing group $1
/gx )
{
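For completeness, here is a self-contained sketch that runs a commented pattern of this shape against some text. The sample HTML and the loop body are invented for illustration; the comments say "group 1" rather than naming the match variables, purely as a cautious habit.

```perl
use strict;
use warnings;

# Invented sample: one digit run inside a tag, one phone number in text.
my $html = '<td><img src="http://example.com/0101000104050103.jpg"></td>'
         . '<td>Call (555) 123-4567 now</td>';

my @found;
while (
    $html =~ /
        (          # start capturing group 1
        >          # a closing bracket
        [^<]+?     # one or more characters other than <
        (\d{3})    # 3 digits, stored in group 2
        [\s().-]*  # zero or more separators
        (\d{3})    # 3 digits, stored in group 3
        [\s().-]*  # zero or more separators
        (\d{4})    # 4 digits, stored in group 4
        )          # end capturing group 1
    /gx
  )
{
    push @found, "$2-$3-$4";
}
print "@found\n";
```

The digits in the image URL are skipped because no > directly precedes them without a < in between.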


zackiv31 · Jan 23, 2008

lol, I don't even know what a character class is... I'm no Perl programmer.

That is a lot more readable, but it makes my code so long!

KevinADC · Jan 23, 2008

OK, no problem.


prex1 · Jan 23, 2008

zackiv31,
your regex is still not OK.
It won't match this: >1234567890. For it to be matched, replace the + with a *.
With that change, >123-456-7890xx098-765-4321 yields only the first number (though this is possibly what you want).
With that change, >12345678900987654321 yields a match on the first 10 digits, though this is not really a phone number.
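These cases are easy to check side by side. The sketch below compares the pattern with + against the same pattern with *; the helper name is invented, and the character classes are written in the shorter form suggested earlier in the thread.

```perl
use strict;
use warnings;

# The thread's pattern with "+" and with the suggested "*".
my $plus = qr/>[^<]+?(\d{3})[\s().-]*(\d{3})[\s().-]*(\d{4})/;
my $star = qr/>[^<]*?(\d{3})[\s().-]*(\d{3})[\s().-]*(\d{4})/;

# Return every match as "ddd-ddd-dddd".
sub matches {
    my ($re, $text) = @_;
    my @found;
    while ($text =~ /$re/g) {
        push @found, "$1-$2-$3";
    }
    return @found;
}

my @plus_adjacent = matches($plus, '>1234567890');
my @star_adjacent = matches($star, '>1234567890');
my @star_two      = matches($star, '>123-456-7890xx098-765-4321');
my @star_long     = matches($star, '>12345678900987654321');

print "plus, number right after '>': ", scalar(@plus_adjacent), " match(es)\n";
print "star, number right after '>': @star_adjacent\n";
print "star, two numbers:            @star_two\n";
print "star, 20 digits:              @star_long\n";
```

The second and third inputs each produce a single match because the pattern needs a > before every number.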

Franco


zackiv31 · Jan 24, 2008

That "+" -> "*" was a good insight... thank you.

The other ones are merely tradeoffs on how robust the application needs to be.

With that one change it seems to be doing what I want.
