
Parsing 'From:' header regexp problem

Status
Not open for further replies.

georgeocrawford

Technical User
Aug 12, 2002
111
GB
Hi all,

This isn't quite the usual email question. I don't want to validate an email address; instead, I want to parse a 'From:' header into its constituent parts.

I have a complex web application which imports emails into a phpBB forum. Part of that process is to see if the email is from a registered member of the forum. If it is, a new post is made on that member's behalf. If not, the post is made by the 'Guest' user.

There is a configurable option which can be set to determine how guest posts are entered into the forum. The 'display name' which is entered as part of the posting process can be set to any one of the following:

1. username at example.com
2. Full Name
3. Full Name (username at example.com)

If any address does not specify a 'Full Name' part, I have decided to use the 'username' part (i.e. the first part of the email address up to the @) as the full name.


Example 1: if the contents of the 'From: ' field of the email were as follows:

Code:
"George Crawford" <myaddress@domain.co.uk>
I could choose one of these to use as the display name:

1. myaddress at domain.co.uk
2. George Crawford
3. George Crawford (myaddress at domain.co.uk)


Example 2: if the contents of the 'From: ' field of the email were as follows:

Code:
myaddress@domain.co.uk
I could choose one of these to use as the display name:

1. myaddress at domain.co.uk
2. myaddress
3. myaddress (myaddress at domain.co.uk)
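
The three display-name options, together with the fallback to the 'username' part, can be sketched roughly as follows. This is a minimal illustration in Python rather than the application's own code; the function name `display_name` and the numeric `mode` argument are hypothetical, chosen only to mirror the numbered list above.

```python
def display_name(full_name, address, mode):
    """Build a guest display name from an already-parsed 'From:' header.

    mode 1: username at example.com
    mode 2: Full Name
    mode 3: Full Name (username at example.com)
    """
    # Replace the '@' with ' at ' so the address is not shown verbatim.
    obscured = address.replace("@", " at ", 1)

    # Fall back to the part before the '@' when no full name was supplied.
    name = (full_name or "").strip()
    if not name:
        name = address.split("@", 1)[0]

    if mode == 1:
        return obscured
    if mode == 2:
        return name
    if mode == 3:
        return f"{name} ({obscured})"
    raise ValueError(f"unknown display mode: {mode!r}")
```

With an empty `full_name`, modes 2 and 3 use `myaddress`, which is the behaviour described in Example 2.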


So that's the challenge. For a while, this has been set up and seemingly working OK. However, we're starting to find bugs in the regexp pattern, and I need some help.

Firstly, to find the actual email address, I'm using this regexp:

Code:
<?([^\x00-\x20()<>@,;:\\".[\]\x7f-\xff]+(?:\.[^\x00-\x20()<>@,;:\\".[\]\x7f-\xff]+)*\@[^\x00-\x20()<>@,;:\\".[\]\x7f-\xff]+(?:\.[^\x00-\x20()<>@,;:\\".[\]\x7f-\xff]+)+)>?

I don't want to get into an RFC822 argument here, but I suppose some people will complain that this is no good already. Never mind!

This finds (or so I hope) a correctly-formatted email address, optionally enclosed with angle brackets.
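
One way to check that hope is to run the pattern against the two example headers. The sketch below uses Python's `re` module instead of PHP, and the names `ATOM`, `ADDRESS` and `find_address` are made up for illustration; the pattern is the one above, with the repeated character class factored out for readability.

```python
import re

# One run of characters allowed between dots: the character class that is
# repeated four times in the pattern above.
ATOM = r'[^\x00-\x20()<>@,;:\\".[\]\x7f-\xff]+'

# local part, '@', then a domain with at least one dot, optionally in <...>.
ADDRESS = re.compile(rf'<?({ATOM}(?:\.{ATOM})*\@{ATOM}(?:\.{ATOM})+)>?')


def find_address(header):
    """Return the first address found in the header, or None."""
    match = ADDRESS.search(header)
    return match.group(1) if match else None
```

Both example headers yield `myaddress@domain.co.uk`. Because the final group must occur at least once, the domain needs at least one dot, so something like `user@localhost` is not matched.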

I need to use this within a larger regexp pattern to split my 'From:' field into its constituent parts.

Now, as far as I know, the 'Full Name' part can either be enclosed in double quotes or not - both are valid (question: are SINGLE quotes valid too?). There can be a space between the 'Full Name' and the email address, but it is not required. Whether the email address MUST be enclosed by angle brackets, I don't know. Additionally, I believe there can be multiple addresses in the From header, separated by a comma. Is this correct? If so, I only need to return the first name/address combination (i.e. up to the first comma).


So, I need a regexp which finds the 'Full Name' and the 'username@address.com' parts in a 'From:' header. I would love some advice on this - I'm thinking that I should make it as 'liberal' as possible to give the greatest chance of finding the parts, even if the 'From:' header isn't 100% RFC822 compliant.


After lots of trial and error, I have got this far:
Code:
(?:(?:(')|("))?(?(1)([^<>',]+)|(?(2)([^<>,"]+)|([^<>,])))(?(1)\1|(?(2)\2))\s+)?<?([^\x00-\x20()<>@,;:\\".[\]\x7f-\xff]+(?:\.[^\x00-\x20()<>@,;:\\".[\]\x7f-\xff]+)*\@[^\x00-\x20()<>@,;:\\".[\]\x7f-\xff]+(?:\.[^\x00-\x20()<>@,;:\\".[\]\x7f-\xff]+)+)>?

The idea here is to match an optional ' or " mark, followed by anything which isn't one of (an angle bracket, a comma or whatever the first match was - i.e. a ' or a "), followed by whatever the first match was, followed by any amount of space, followed by the email address.

This isn't quite right. I haven't yet worked out where it works and where it doesn't. Certainly it matches
Code:
"George Crawford" <myaddress@domain.co.uk>
OK, but not other variants.
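
As a point of comparison, a looser pattern could be sketched along the following lines. This is only a sketch in Python's `re` syntax, not a corrected version of the pattern above: the names `FROM_PART` and `parse_from` are hypothetical, the address part is deliberately cruder than the address regexp used earlier, and it assumes that only the first name/address pair is wanted. Unlike the attempt above, where the unquoted branch `([^<>,])` has no quantifier and `\s+` demands at least one space, it lets an unquoted name run to any length and does not require whitespace before the address.

```python
import re

FROM_PART = re.compile(r'''
    ^\s*
    (?:
        "(?P<dq>[^"]*)"          # name in double quotes (may contain commas)
      | '(?P<sq>[^']*)'          # name in single quotes
      | (?P<bare>[^<>,"]*?)      # unquoted name, possibly empty
    )
    \s*                          # optional space before the address
    <?(?P<address>[^\s<>@,;"()]+@[^\s<>@,;"()]+)>?
''', re.VERBOSE)


def parse_from(header):
    """Return (full_name, address) for the first mailbox, or None."""
    match = FROM_PART.match(header)
    if match is None:
        return None
    # Exactly one of the three name groups can have matched.
    name = match.group('dq') or match.group('sq') or match.group('bare') or ''
    return name.strip(), match.group('address')
```

An empty name in the result signals that the caller should fall back to the 'username' part of the address.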



Sorry for the lengthy post, but I'd very much appreciate some advice on this one. Especially my questions - can the From header have multiple addresses (if so, are they always comma-separated), and is the 'Full Name' part enclosed in double quotes, single, nothing, or any combination of these?

And hey - if there's a class which can do this for me, TELL ME!
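
For illustration only, and in Python rather than PHP: Python's standard `email.utils` module already does this kind of parsing, which suggests that a ready-made parser is a realistic alternative to a hand-written regexp. The wrapper name `first_mailbox` below is hypothetical; `getaddresses` is the standard-library function.

```python
from email.utils import getaddresses


def first_mailbox(from_header):
    """Return (full_name, address) for the first mailbox in a From: value."""
    # getaddresses() takes a list of header values and returns a list of
    # (realname, email_address) pairs, one per comma-separated mailbox.
    mailboxes = getaddresses([from_header])
    return mailboxes[0] if mailboxes else ("", "")
```

Under RFC 5322 the From field is a comma-separated list of mailboxes, so it can hold more than one address; this sketch simply keeps the first. The name comes back empty when the header holds a bare address.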


Many thanks!

______________________

George
 