php regex match for email

Status
Not open for further replies.

j4606

MIS
Nov 28, 2005
349
US
Guys,

I'm trying to match two instances of an email within a web scraping script.

The first sample looks like this:
Code:
mailto:pers-zeawm-1181125745@mywebsite.org?subject=Aleman%20busca%20mujer%20de%20panama">pers-zeawm-1181125745@mywebsite.org</a> <sup>[<a href="http://www.mywebsite.org/about/help/replying_to_posts" target="_blank">

Within that sample I'm trying to match the following email "pers-zeawm-1181125745@mywebsite.org"

There is also a second match I'd like to make. This time the sample is as follows:

Code:
Reply to: <a href="mailto:&#112;&#101&#103;&#115;&#108;&#105;&#115;&#116;&#46;&#111;&#114;&#103;?subject=">&#112;&#101;&#114;&#115;&#;</a> <sup>[<a href="http://www.mywebsite.org/about/help/replying_to_posts" target="_blank">

As you can see the email is encoded after the mailto: and I need to extract the same email information.

thanks!
 
Hi

Assuming your first sample is stored in $str, this puts every substring that looks roughly like an e-mail address into $match:
Code:
preg_match_all('/[[:alnum:]._-]+@[[:alnum:]._-]+/',$str,$match);
If you want a more precise solution, search the web for a stricter e-mail address regular expression and use it as the first parameter.
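
For reference, here is a minimal sketch of that call run against a shortened copy of the first sample. The shortened string and the de-duplication step are my own assumptions, not part of the original script; the address shows up twice (once in the href, once as the link text), so the duplicates are dropped.

```php
<?php
// Shortened, hypothetical copy of the first sample from the question.
$str = 'mailto:pers-zeawm-1181125745@mywebsite.org?subject=Aleman%20busca">'
     . 'pers-zeawm-1181125745@mywebsite.org</a>';

// Same loose pattern as above: "word-ish" characters, an @, more of the same.
preg_match_all('/[[:alnum:]._-]+@[[:alnum:]._-]+/', $str, $match);

// $match[0] holds the full matches; remove repeats and reindex.
$emails = array_values(array_unique($match[0]));
print_r($emails);
```
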

Assuming the second sample is stored in $str, this replaces it with its plain version, without character entities:
Code:
$str=html_entity_decode($str);
After that, you could run the first code on the result. However, it will return nothing, because the second sample contains no e-mail address.

Note that html_entity_decode() expects well-formed entities. Your second sample is not valid HTML, because the second entity (&#101) is missing its closing semicolon.
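
To illustrate the two steps together, here is a rough sketch that decodes first and matches afterwards. The input is a hypothetical, well-formed variant of the second sample (every entity ends in a semicolon and the encoded text is "me@example.org"), since the posted sample is malformed and contains no address.

```php
<?php
// Hypothetical encoded address: the numeric entities spell "me@example.org".
$encoded = '&#109;&#101;&#64;&#101;&#120;&#97;&#109;&#112;&#108;&#101;'
         . '&#46;&#111;&#114;&#103;';
$str = 'Reply to: <a href="mailto:' . $encoded . '?subject=">' . $encoded . '</a>';

// Step 1: turn the numeric character references back into plain characters.
$plain = html_entity_decode($str, ENT_QUOTES, 'UTF-8');

// Step 2: run the same loose pattern on the decoded text.
preg_match_all('/[[:alnum:]._-]+@[[:alnum:]._-]+/', $plain, $match);
$emails = array_values(array_unique($match[0]));
print_r($emails);
```
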

Feherke.
 
This pattern comes from regular-expressions.info:

Code:
$pattern = '/[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b/imsx';
You may need to extend the list of gTLDs to account for any that have been added since.
 
Hi

Let us highlight your string(s) :
jpadie said:
Code:
$pattern = [highlight pink]'/[a-z0-9!#$%&'[/highlight]*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&[highlight pink]'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b/imsx'[/highlight];
Sorry for being picky. ;-)

Feherke.
 
Sorry about that. I cut and pasted, and was then too sloppy to escape the quotes as required. I'll leave that as an exercise for the OP. Note also that the forward slashes in the pattern must be escaped, or the pattern delimiter changed.
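
For anyone who wants to skip the exercise, one possible way to do the escaping is sketched below: each single quote becomes \' so the PHP string does not end early, and each forward slash becomes \/ so it is not read as the closing delimiter. The test string is a shortened, hypothetical copy of the first sample, and the gTLD list is unchanged from the post, so it may well be out of date.

```php
<?php
// Same pattern as posted, with ' written as \' and / written as \/.
$pattern = '/[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+)*'
         . '@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+'
         . '(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b/imsx';

// Shortened, hypothetical copy of the first sample from the question.
$str = 'mailto:pers-zeawm-1181125745@mywebsite.org?subject=x">'
     . 'pers-zeawm-1181125745@mywebsite.org</a>';

// preg_match() stops at the first hit; $m[0] is the matched address.
$found = preg_match($pattern, $str, $m);
if ($found === 1) {
    echo $m[0], PHP_EOL;
}
```
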
 