php regex match for email

j4606 · May 26, 2009

Guys,

I'm trying to match two instances of an email within a web scraping script.

The first sample looks like this:

Code:

mailto:pers-zeawm-1181125745@mywebsite.org?subject=Aleman%20busca%20mujer%20de%20panama">pers-zeawm-1181125745@mywebsite.org</a> <sup>[<a href="http://www.mywebsite.org/about/help/replying_to_posts" target="_blank">

Within that sample I'm trying to match the following email "pers-zeawm-1181125745@mywebsite.org"

There is also a second match I'd like to make. This time the sample is as follows:

Code:

Reply to: <a href="mailto:&#112;&#101&#103;&#115;&#108;&#105;&#115;&#116;&#46;&#111;&#114;&#103;?subject=">&#112;&#101;&#114;&#115;&#;</a> <sup>[<a href="http://www.mywebsite.org/about/help/replying_to_posts" target="_blank">

As you can see, the email is encoded after the mailto: and I need to extract the same email information.

thanks!

feherke · May 27, 2009

Hi

Supposing your first sample data is stored in $str, this will put the e-mail-looking parts into $match:

Code:

preg_match_all('/[[:alnum:]._-]+@[[:alnum:]._-]+/', $str, $match);

If you want a more precise solution, it is enough to search the web for another e-mail address regular expression and replace the first parameter.
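As a rough illustration of the call above, here is a minimal sketch run against the first sample from the question. The de-duplication step and the $emails variable are additions for illustration and were not part of the original suggestion.

```php
<?php
// First sample from the question (trimmed after the closing </a>).
$str = 'mailto:pers-zeawm-1181125745@mywebsite.org?subject=Aleman%20busca%20mujer%20de%20panama">pers-zeawm-1181125745@mywebsite.org</a>';

// Collect every e-mail-looking substring; full matches land in $match[0].
preg_match_all('/[[:alnum:]._-]+@[[:alnum:]._-]+/', $str, $match);

// The address occurs twice (once in the href, once as the link text),
// so drop duplicates and reindex. $emails is a name chosen for this sketch.
$emails = array_values(array_unique($match[0]));

print_r($emails);
```

The pattern is deliberately loose: it accepts anything made of letters, digits, dots, underscores and hyphens around an @, so it may also pick up strings that are not valid addresses.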

Supposing the second sample data is stored in $str, this will replace it with its plain version, without character entities:

Code:

$str = html_entity_decode($str);

After that, you could use the first code. However, it will return nothing, because the second sample contains no e-mail address.

Note that html_entity_decode() expects valid HTML. Your second sample is not valid, because the second entity is missing its closing semicolon.
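If the scraped pages really do contain numeric entities with a missing semicolon, one possible workaround is to repair them before decoding. The sketch below is only an assumption about how that could be done: the input is a made-up encoding of user@example.org (the sample in the question contains no complete address), and the repair regex is not something proposed in this thread.

```php
<?php
// Hypothetical input: "user@example.org" as numeric entities, with the
// semicolon deliberately left off the second entity (&#115).
$str = 'mailto:&#117;&#115&#101;&#114;&#64;&#101;&#120;&#97;&#109;&#112;&#108;&#101;&#46;&#111;&#114;&#103;?subject=';

// Append a semicolon to any "&#<digits>" that is not already followed by
// one. The lookahead also rejects a following digit, so a well-formed
// entity is never split in the middle.
$repaired = preg_replace('/&#(\d+)(?![\d;])/', '&#$1;', $str);

// Decode the now well-formed entities to plain characters.
$plain = html_entity_decode($repaired, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Reuse the loose pattern from above on the decoded text.
preg_match_all('/[[:alnum:]._-]+@[[:alnum:]._-]+/', $plain, $match);

print_r($match[0]);
```

This only handles decimal entities. Hexadecimal ones such as &#x40 would need a separate rule, and an empty entity like the "&#;" in the sample cannot be recovered at all.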

Feherke.

http://rootshell.be/~feherke/

jpadie · May 27, 2009

this pattern comes from regular-expressions.info

Code:

$pattern = '/[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b/imsx';
You may need to add to the list of gTLDs to take account of the new ones (if there have been any).

feherke · May 27, 2009

Hi

Let us highlight your string(s) :

jpadie said:

Code:

$pattern = [highlight pink]'/[a-z0-9!#$%&'[/highlight]*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&[highlight pink]'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b/imsx'[/highlight];

Sorry for being picky. ;-)

Feherke.

http://rootshell.be/~feherke/

jpadie · May 27, 2009

Sorry about that. I cut and pasted the pattern and was then too sloppy to escape it as required. I'll leave that as an exercise for the OP. Note also that the forward slashes in the pattern must be escaped, or the pattern delimiter changed.
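For reference, here is one way the two fixes mentioned above might look: the single quotes inside the PHP string are escaped as \', and the forward slashes inside the character classes as \/. This is a sketch of one possible repair, not necessarily what jpadie had in mind. The m, s and x modifiers are dropped here on the assumption that they have no effect on this particular pattern, and [A-Z]{2} is written as [a-z]{2}, which is equivalent under the i modifier.

```php
<?php
// Pattern from regular-expressions.info as quoted above, with the quotes
// and slashes escaped so that it is a valid PHP string and a valid
// "/"-delimited pattern.
$pattern = '/[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+)*'
         . '@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+'
         . '(?:[a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b/i';

// First sample from the question (trimmed after the closing </a>).
$str = 'mailto:pers-zeawm-1181125745@mywebsite.org?subject=Aleman%20busca%20mujer%20de%20panama">pers-zeawm-1181125745@mywebsite.org</a>';

preg_match_all($pattern, $str, $match);

// The same address is found in the href and in the link text.
$emails = array_values(array_unique($match[0]));

print_r($emails);
```

As noted earlier in the thread, the fixed TLD list would still reject addresses under newer gTLDs unless it is extended.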
