Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Find Nearest Match Without Replacement from 2 tables

Status
Not open for further replies.

waubain

Technical User
Dec 13, 2011
200
US
I have 2 tables. A treatment table that has 200 subjects and a control table with 4000 subjects (these are mutually exclusive on ID). Each table has an ID column and a Probability Score (0-1). I need to match subjects from the treatment table with a subject from the control table on the probability score without replacement. The output matched table(200 rows) has 2 columns: the IDtreatment and their matched IDcontrol.

There are basically 2 algorithms.
1) The 'greedy algorithm' randomly sorts the treatment group and then picks the closes match on Probability Score without replacemtent. This is an acceptable practice, but does not necessarily minimize the sum of the distances between probabilities.
2) The global algorithm evaluates all the distances between probabilities and somehow selects pairs that minimize the cummulative sum of the distances of the pairs, again without replacement.

This is a one time query. This is beyond my knowledge base so I do not have any SQL tries to present, but I am expecting this may be a time when cursors or a while loop is of value, since I am unsure how this could be done as a set(my old brain).

Consider this simulated data. In my data [randomnumber] is assigned as a set not individually.

Code:
--drop table #treatment
CREATE TABLE #treatment(SubjectID int, ProbScore float, RandomNumber bigint)
INSERT INTO #treatment VALUES (1, 0.10, ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))
                            ,(2, 0.20, ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))

--drop table #control
CREATE TABLE #control(SubjectID int, ProbScore float, RandomNumber bigint)
INSERT INTO #control VALUES (3, 0.07, ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))
                           ,(4, 0.08, ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))
                           ,(5, 0.11, ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))
                           ,(6, 0.45, ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))
                           ,(7, 0.47, ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))

Display all probabilities between tables
Code:
SELECT
     t.SubjectID AS [tID]
     ,ABS(t.ProbScore-c.ProbScore) AS PDiff
     ,c.SubjectID AS [cID]
FROM #treatment as t 
CROSS JOIN #control c
ORDER BY [tID], [PDiff]

In the greedy algorithm the matches would be (1,5) and (2, 4). The cummulative summation of distances is 0.13
In the global algorithm the matches would be (1,4) and (2, 5). The cummulative summation of distances is 0.11

Had the order of #treatment.[randomnumber] been different, the the greedy pairs would be different, but I am believing the global would end up being the same.

I would appreciate any help getting started, and I am ok with the greedy algorithm approach. Thank you in advance.

You don't know what you don't know...
 
I wasn't familiar with the term "without replacement" in math/probability, but got it.

Each #treatment subject can be paired with a #control subject, but each subject can only be used once.
Every #treatment subject needs to be paired with a #control subject, so unpaired #control subjects will remain,
but no #control subject may be paired with two #treatment subjects.

What I once did was a search for similar formulas (recipes), that meant best fit of several numbers (percentages of ingredients) per entity, not just one number (probScore).

In that aspect I did something much harder already, but the solution was and is no single query nor cursor loop, nor recursion. This needed some queries narrowing the result with each ingredient given. I can't publish that solution here, not only because it would be too much code.

I played a bit around with the idea of once again using the new Window functions LEAD and LAG, as the nearest matches of any subject must be either LEAD or LAG of the records ordered by ProbScore. Nevertheless, this soon get's too convoluted and yet not addressing the "without replacement" catch to never use a #control subject twice.

I just hope these thoughts give someone else an idea of how to best approach this.

Bye, Olaf.
 
Olaf,

Just to clarify your statement:
1) Each #treatment subject can be paired with a #control subject, but each #control subject can only be used once.
2) Every #treatment subject needs to be paired with a #control subject, so 3800 unpaired #control subjects will remain,
but no #control subject may be paired with two #treatment subjects. yes, same as #1

Thanks for the suggestions.
Bob



You don't know what you don't know...
 
Well, also the #treatment subjects only occur in one pair. It's maybe more obvious, because you describe the greedy algorithm iterating all those subjects and therefore certainly only using each once. If you go for any set based solution you might join the one with the other table in ways you'd also double use those, which of course also needs to be prevented.

Bye, Olaf.
 
Am I correct in saying that your global algorithm is basically the one normally called optimal matching? if so are you aware that Gu and Rosenbaum (1993) compared greedy and optimal matching and found that optimal matching did no better than greedy matching in producing balanced matched samples?

Just saying this in case it helps for whatever you are doing.

unfortunately I don't have time at the moment to dig into this - its a good challenge and would make me happy to give a good solution for it.

Regards

Frederico Fonseca
SysSoft Integrated Ltd

FAQ219-2884
FAQ181-2886
 
Frederico, I would say you are correct. Both Global and Optimal seemed to be used interchangeably, and I just picked one. I was not aware of the article you referenced, but that fact is good to know and add to our reference list for this project.

I am willing to accept a greedy matching solution, as conceptually, it makes the most sense.

I appreciate your insights and comments. Thanks.
-Bob

You don't know what you don't know...
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top