Architectural conundrum

jpadie · Dec 8, 2009

I'm looking for some help from experienced coders. it may be better to post in a database forum but I am sure you guys have come across this problem before and have solutions.

situation
I have designed an email management module for a professional practice management system. sort of an enhanced CRM system.
All emails are ingested by the module and need to be 'filed' against the right project.
Recognition on sender name or receiver name is not possible as there is a many to many relationship between projects and both name criteria.
Instead recognition is done on an hierarchical basis:
if there is a unique project identifier in the email body then recognise on that.
otherwise recognise on keywords.
keywords are stored in a database against each project.
the first stage of the processing is to scan the email for the project ID. if that is found then no problem.
There are approximately 500 active projects at any one time. Each project has about 5 (unique) keywords. There are about 250 emails per day across the projects.

Question
My question is, what is the best way to process the second limb? Examples are:

* multiple strpos() calls on an iteration of the recordset of all keywords on all projects
* some db query (note that the email body actually does not live in a database until later in its life cycle. But a temp table can be easily created for this kind of manipulation).
* something else.

whichever way I look at it, i get a hellish looking piece of processing that i will definitely want to hand off to a some forked process. I can't help thinking that there is a way to skin the cat though.

Your sage advice is appreciated.

thanks
Justin

ingresman · Dec 8, 2009

My initial thoughts would be to use a MYSQL text query of the data rather than strpos. If the data ends up in a table later on then that would certainly look to be the way to go.
Alternativly you could do a lexical analysis of the email body, build a symbol table as you go and search that table when you've completed the lex.
In view of the low voumes I'd get MYSQL to do the heavy lifting for you.

feherke · Dec 8, 2009

Hi

I would say, 500*5=2500 possible keywords are more than the number of words in an average e-mail. So instead of searching the mail for keywords, I would search the keyword table for mail words :

Code:

[COLOR=darkgoldenrod]preg_match_all[/color][teal]([/teal][green][i]"/\w{3,}/"[/i][/green][teal],[/teal][COLOR=darkgoldenrod]strtolower[/color][teal]([/teal][navy]$mail[/navy][teal]),[/teal][navy]$word[/navy][teal]);[/teal]

[navy]$query[/navy][teal]=[/teal][green][i]"select * from $keywordtable where $keywordfield in ('"[/i][/green][teal].[/teal][COLOR=darkgoldenrod]implode[/color][teal]([/teal][green][i]"','"[/i][/green][teal],[/teal][COLOR=darkgoldenrod]array_unique[/color][teal]([/teal][navy]$word[/navy][teal][[/teal][purple]0[/purple][teal]])).[/teal][green][i]"')"[/i][/green][teal];[/teal]

Justin said:
i will definitely want to hand off to a some forked process

Like the spam filters does ? They scan the words in the mail, calculate the mail's spam likeliness and insert that value in the mail using an additional header line. Only that your processor will add the project id in the extra header line. That sounds like a good plan.

Feherke.

http://free.rootshell.be/~feherke/

jpadie · Dec 8, 2009

thanks both of you.

feherke's solution fits better within my skill set and avoids the need to write multiple variants to cater for db portability.

But before I give up entirely on a db solution, could you give me a further hint on the text search that you are pointing toward? do you mean something like this?

Code:

select * from posts where postID=xxx and match (postSubject,postBody) against 
(select keyword from keywords, IN BOOLEAN MODE)

(note: i don't know whether AGAINST can take a sub-select as an argument - this is pseudo sql only)

with the above sql i can only see a way of getting a positive/negative on keywords matching. not the matched keyword or the projectID. i have tried thinking of how a join could be used but have drawn a blank.

or would it be the other way around?

Code:

$query = 
"select projectID from keywords where match (keyword) against ('". implode ("','" . array_map('mysql_real_escape_string', explode (' ', $postSubject . ' ' .  $postBody))) . "' IN BOOLEAN MODE)";

which essentially follows feherke's reasoning of taking the email post as the needle. I would obviously use the type of preg_match that feherke suggested as well, rather than a simple split on horizontal white space.

jpadie · Dec 10, 2009

bump @ingresman

ingresman · Dec 10, 2009

Sorry, should have come back earlier, it's been a long and difficult day !
Yep, just that, the match against syntax with a full text index..
I must admit I never think in terms of regular expressions, perhaps I should get into them a bit more

jpadie · Dec 10, 2009

hi ingresman,

thanks for the feedback. I am still a bit in the dark about whether you are for query one or two? If the first, how would you go about identifying that actual keywords that are matched? is there some whacky join that could be used?

thanks
Justin

ingresman · Dec 11, 2009

I'll have a look over the weekend, I've never used full text in mysql.
I'll read through your post in a bit more detail.

ingresman · Dec 14, 2009

I had a bit of a play with the indexs and didn't have much joy to be honest. It's really suited to single word/phrase seaches
I think Feherke idea is probably what will get you under way as you will be able to identify which of the key words you are searching for is actualy in the email (if you see what (I mean?)

jpadie · Feb 9, 2010

i have seen feherke's link to this post. so I am posting back my thinking on this topic.

brutally I am still struggling to find a sensible solution. it's made complicated by the requirement that some keywords are actually keyphrases. and there is nothing to stop a word of a keyphrase being a keyword in another instance. so exploding the whole incoming mail message by word boundaries is not a 100% solution.

i have not yet implemented anything because i have not been able to come up with a good answer to the holistic problem. However now that I come to look at this I wonder whether both feherke's and ingresman's solutions have applicable merit if added together.

so in pseudologic i would do this

1. explode the incoming mail by word boundary.
2. perform a match or regex or something of all of the unique words against the keyword database (note that this has to apply against the full text of the keywords.
3. this gives me a recordset of potential matches.
4. then with this smaller subset of matches apply the reverse logic and str_pos iteratively each keyphrase against the mail. or even use groupconcat or something to create an alternating regex to apply against the mail (ultimately a performance question).
5. that will either provide a hit (which should be unique) or no hits (in which case a manual process is required).

what thinkest ye?

ingresman · Feb 10, 2010

That sounds like a fair way forward, but I wonder if it would be complex to implement and difficult to extend in the future?, it looks like you trying to almost do some natural language processing here.
Have you had a look at tools like

http://www.sybase.co.uk/products/mobileenterprise/answersanywhere

I've seen it in action and it is pretty cool, Sybase offer a 90 day trail (don't know how much it is after that), but at the very least it may give you some food for thought.
Or completly off the wall (and maybe not relevant or possible could you use something like the Bing API to the had work for you ?

IPGuru · Feb 11, 2010

Perhaps the real solution is to train the end useres/customners & insist they include the project ref in the e-mail subject.

you could then either move mails without a progct code to a seperate location for manual intervention or have them bounced back requesting the mail is formated correctly.

It is a bit unpleasant at first but after a few weeks your users will soon get used to it

jpadie · Feb 11, 2010

@IPGuru
good idea. but this is for a legal practice. so the people that we are corresponding with are people that we are suing on behalf of our clients. To be honest, I think it would be an uphill struggle to mandate what information they put in an email. In fact they rarely even get the court documents right...

it's ok for replies etc as my software will insert a unique identifier which will fingerprint the conversation. it's situations where the fingerprint has been removed or the email is an origination that the problem exists.

@Ingresman
I have a feeling that it is neither difficult to implement nor extend. but may be computationally very heavy. i will try to take it from the realms of logic to reality in the next couple of weeks. Regrettably sybase is not an option due to the site being built on a wordpress backbone (and WP supports only mysql)

IPGuru · Feb 11, 2010

@IPGuru
good idea. but this is for a legal practice. so the people that we are corresponding with are people that we are suing on behalf of our clients. To be honest, I think it would be an uphill struggle to mandate what information they put in an email. In fact they rarely even get the court documents right...

Certainly if there is no incentive for the people to be assisting you then this aproach won't work.

I guess it is back to your original aproach of trying to detect the correct infomation & failing that move them to a manual queue for allocation, sorry.

Click to expand...

ingresman · Feb 11, 2010

@JP - in this case Sybase is only the manufacturer, they do more than data bases these days.
Could you give a little example just to crystalise?
I got to thinking that you might need a precidence for each phrase and word, much in the same way as parsing in a language For example:
If you had some text such as
"One sunny morning the cat was feeling good. The cat sat on the mat but the cat was not aware that the mat was not to be sat on"
Your business context might be aware of the following phrases and words in order of priority.
.
"The cat sat on the mat"
"the mat"
"the cat"
.
In my example the phrase "the cat" occurs first in the text, but it is followed by "the cat sat on the mat" which according to my rules has a higer precidence than "the cat" so it would be the defining phrase in the text.
.
If I'm any where near being correct here you might be able to use somethinmg like Lex and Yacc to define a grammer for the types of legalese you might get. Or simply use the full text index in mysql.
I assumue you might get "the case of mr vs mrs jones" as a phrase.
Bear in mind it's late and I'm going home !
If it is computationaly heavy it might be interesting to code the eventual soloution as an extention.

jaxtell · Feb 11, 2010

I might be off, but would something like this work?

Code:

SELECT project_id FROM `keywords` WHERE 'this is the text of the email' like concat('%',keyword,'%')

-----------------------------------------
I cannot be bought. Find leasing information at

http://www.joshaxtell.com/

jpadie · Feb 11, 2010

okey dokey

here is some crude exemplars and a first approach using PCRE below.

Code:

<?php
$emails = array(
<<<HTML
This is an email about a case between client Limited and My Opponent Limited.

HTML
, <<<HTML
This is another email.  It is actually about a case between client Limited and Another Opponent Limited.

HTML
, <<<HTML
And wouldn't you know it but I have another one of these cases between client Limited and an opponent called Yet Another Opponent Limited
HTML
, <<<HTML
and lastly an email with a unique recogniser in Iterator
-AACUNIQUEID~~1231231231231321311~~AACUNIQUEID-
HTML
);

$keywords = array(
			'file1'=>'My Opponent Limited',
			'file3'=>'Yet Another Opponent Limited',
			'file2'=>'Another Opponent Limited'
			);

//approach 1
$keywords = array_map('strtolower', $keywords);
$p = '/(' . implode (')|(', $keywords) . ')/i';
foreach ($emails as $key=>$email){
	preg_match($p, strtolower($email), $match);
	if (empty($match[0] )){
		$results['email ' . $key] = 'No match';
	} else {
		$mKey = array_search($match[0], $keywords);
		$results['email '. $key] = 'matches ' . $mKey;
	}
}
echo '<hr/><pre>';
print_r($results);
?>

Code:

Array
(
    [email 0] => matches file1
    [email 1] => matches file2
    [email 2] => matches file3
    [email 3] => No match
)

the business rules are more complex than this shows. but I think it is a good first fist.
the execution time across all emails is 0.0007 seconds on a (not very fast) test rig.

jpadie · Feb 11, 2010

@jaxtell
I'm not sure whether that would work. Looks ok though. I'm sure I have thought of a single query solution before and came up with reasons why it would not work.
Thanks for posting. I'm too tired to give this more thought now but will do so tomorrow.

Justin

jaxtell · Feb 11, 2010

It seems to work for me. I added a weight column, to make sure the more specific/longer keywords end up on top. This would also allow matches on multiple keywords, though only one match per row.

Code:

SELECT project_id, sum(weight) as weighted
FROM `keywords` 
WHERE  'This is another email.  It is actually about a case between client Limited and Yet Another Opponent Limited.' LIKE CONCAT(  '%', keyword,  '%' ) 
group by   project_id
ORDER BY sum(weight) DESC

-----------------------------------------
I cannot be bought. Find leasing information at

http://www.joshaxtell.com/

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Architectural conundrum

Technical User

IS-IT--Management

Programmer

Technical User

Technical User

IS-IT--Management

Technical User

IS-IT--Management

IS-IT--Management

Technical User

IS-IT--Management

Vendor

Technical User

Vendor

IS-IT--Management

Programmer

Technical User

Technical User

Programmer

Log in

Part and Inventory Search

Sponsor