I have 2 tables. A treatment table that has 200 subjects and a control table with 4000 subjects (these are mutually exclusive on ID). Each table has an ID column and a Probability Score (0-1). I need to match subjects from the treatment table with a subject from the control table on the probability score without replacement. The output matched table(200 rows) has 2 columns: the IDtreatment and their matched IDcontrol.
There are basically 2 algorithms.
1) The 'greedy algorithm' randomly sorts the treatment group and then picks the closes match on Probability Score without replacemtent. This is an acceptable practice, but does not necessarily minimize the sum of the distances between probabilities.
2) The global algorithm evaluates all the distances between probabilities and somehow selects pairs that minimize the cummulative sum of the distances of the pairs, again without replacement.
This is a one time query. This is beyond my knowledge base so I do not have any SQL tries to present, but I am expecting this may be a time when cursors or a while loop is of value, since I am unsure how this could be done as a set(my old brain).
Consider this simulated data. In my data [randomnumber] is assigned as a set not individually.
Display all probabilities between tables
In the greedy algorithm the matches would be (1,5) and (2, 4). The cummulative summation of distances is 0.13
In the global algorithm the matches would be (1,4) and (2, 5). The cummulative summation of distances is 0.11
Had the order of #treatment.[randomnumber] been different, the the greedy pairs would be different, but I am believing the global would end up being the same.
I would appreciate any help getting started, and I am ok with the greedy algorithm approach. Thank you in advance.
You don't know what you don't know...
There are basically 2 algorithms.
1) The 'greedy algorithm' randomly sorts the treatment group and then picks the closes match on Probability Score without replacemtent. This is an acceptable practice, but does not necessarily minimize the sum of the distances between probabilities.
2) The global algorithm evaluates all the distances between probabilities and somehow selects pairs that minimize the cummulative sum of the distances of the pairs, again without replacement.
This is a one time query. This is beyond my knowledge base so I do not have any SQL tries to present, but I am expecting this may be a time when cursors or a while loop is of value, since I am unsure how this could be done as a set(my old brain).
Consider this simulated data. In my data [randomnumber] is assigned as a set not individually.
Code:
--drop table #treatment
CREATE TABLE #treatment(SubjectID int, ProbScore float, RandomNumber bigint)
INSERT INTO #treatment VALUES (1, 0.10, ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))
,(2, 0.20, ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))
--drop table #control
CREATE TABLE #control(SubjectID int, ProbScore float, RandomNumber bigint)
INSERT INTO #control VALUES (3, 0.07, ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))
,(4, 0.08, ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))
,(5, 0.11, ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))
,(6, 0.45, ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))
,(7, 0.47, ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)))
Display all probabilities between tables
Code:
SELECT
t.SubjectID AS [tID]
,ABS(t.ProbScore-c.ProbScore) AS PDiff
,c.SubjectID AS [cID]
FROM #treatment as t
CROSS JOIN #control c
ORDER BY [tID], [PDiff]
In the greedy algorithm the matches would be (1,5) and (2, 4). The cummulative summation of distances is 0.13
In the global algorithm the matches would be (1,4) and (2, 5). The cummulative summation of distances is 0.11
Had the order of #treatment.[randomnumber] been different, the the greedy pairs would be different, but I am believing the global would end up being the same.
I would appreciate any help getting started, and I am ok with the greedy algorithm approach. Thank you in advance.
You don't know what you don't know...