Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations TouchToneTommy on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Summarising duplicated lists

Status
Not open for further replies.

Griffyn

Programmer
Jul 11, 2002
1,077
AU
Hi,

The project I'm working on is a web service that provides a page that can be used to send an e-mail to our customer base. The e-mail addresses are gathered from 6 different sources in our system, and each source has 6 different categories. The user will choose a category, and then any combination of sources. I need to calculate and display the number of unique e-mail addresses the user's combination choice will result in. E-mail addresses may be duplicated across multiple sources, and even within a source's categories.

I'm trying to find an efficient way to provide summary data to the webpage that doesn't involve sending the 36 lists of up to 2000 e-mail addresses each and doing the de-duplication in javascript whenever the user changes their choice.

I can't just provide a count of e-mail addresses for each category in each source, because if the user selects multiple categories, and I add all the selected category counts together, the total will be too high due to it including duplicates.

One idea I had was to generate a master de-duplicated index of e-mails, then send 36 bit-strings where each bit is set if the e-mail address at that bit's position in the master list is present in that source/category list. So if the master de-duplicated list had 5000 e-mails in it, each bit-string would have 5000 characters in it, and character eg. 401 would be '0' if the e-mail in masterList[401] was present in that source/category. '1' if it was. That would cut down on the data sent, but still be a computational task in javascript.

Then I figured it would be simpler to just calculate every combination of category for each source. 6 sources and 6 categories = 384 summaries (6 * 2^6). What I don't like about this is that it has the potential to skyrocket if more categories are added (each category doubles the number of summaries that must be generated).

I reckon there's got to be a smart shortcut to all this that involves sending less data, and a faster algorithm to calculate the totals I'm after.

Any thoughts are appreciated.

Thanks.
 
As a database developer I see this as the task of a database.

Your idea of building a master index of all emails is what came to my mind at first, too. You just pull together the 6 sources into one new "data mart", doesn't have to be complicated, all you seem to need is three tables, since one email might be in several categories.

emails (ID, mailaddress)
categories (ID, categoryname)
categoriesofemails (mailID, categoryID)

When staging data here you already dedupe the emails and connect them with their categories. 5000 records are nothing even for MySQL, so you can easily get a count of any combination of categories (for example) by

Code:
SELECT Count(DISTINCT mailaddress) mailsummarycount FROM emails E inner join categoriesofemails CoE on E.id = CoE.mailID And CoE.categoryID IN (1,3,5)

The site would display the list of categories and offer a multi select choice, which in the end would result in the list of category IDs at the end of the query and that's it.

So in summary:
1. Setting up this small data mart with the three tables
2. Staging of data of the 6 sources into a data mart. This shouldn't take long and could be done multiple times a day, for example each hour.
3. Creating a script usable from the site via Ajax to return the count of mail adresses, main ingredient is the query given above.

You'll never need to show all mail addresses at the site, do you? If the choice is made you send out a newsletter, advert or any other type of mail at the server side. You get the mail addresses by the query without the Count, simply SELECT DISTINCT mailaddress...

Bye, Olaf.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top