Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Chris Miller on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Calculating a Median without use of cursors and dynamic SQL

T-SQL Hints and Tips

Calculating a Median without use of cursors and dynamic SQL

by  vongrunt  Posted    (Edited  )
Yup. Same subject, 3rd FAQ :p

Tricks and methods found in this FAQ are recycled from thread183-1193016. First we'll explain yet another way how to calculate single median value, then how to calculate median per group without cursors and dynamic SQL - and finally how to exploit SQL2005 ranking functions.


Sample data

Code:
CREATE TABLE sampleData ( groupID int, numValue int )
INSERT INTO sampleData VALUES ( 1,   1 )
INSERT INTO sampleData VALUES ( 1,   2 )
INSERT INTO sampleData VALUES ( 1,   6 )
INSERT INTO sampleData VALUES ( 1,  16 )
INSERT INTO sampleData VALUES ( 1,   7 )
INSERT INTO sampleData VALUES ( 2,   5 )
INSERT INTO sampleData VALUES ( 2,   5 )
INSERT INTO sampleData VALUES ( 2,   5 )
INSERT INTO sampleData VALUES ( 2,  11 )
INSERT INTO sampleData VALUES ( 3,  10 )
INSERT INTO sampleData VALUES ( 3,  17 )
INSERT INTO sampleData VALUES ( 3,  52 )
INSERT INTO sampleData VALUES ( 3,  66 )
INSERT INTO sampleData VALUES ( 4,  18 )
INSERT INTO sampleData VALUES ( 5,   0 )
INSERT INTO sampleData VALUES ( 5,   0 )
Five groups of rows = five median values to calculate at once. For purpose of this testing we assumed numValue cannot be NULL.


Single median value

Let's take 1st group (groupID = 1). To calculate median value we need last value from first half and first value from second half, while taking care about odd count of elements. Looks like TOP 50 PERCENT works exactly as desired:

Code:
DECLARE @groupID int; SET @groupID = 1
DECLARE @M1 int, @M2 int

SELECT TOP 50 PERCENT @M1 = numValue FROM sampleData WHERE groupID = @groupID ORDER BY numValue ASC
SELECT TOP 50 PERCENT @M2 = numValue FROM sampleData WHERE groupID = @groupID ORDER BY numValue DESC

SELECT (@M1+@M2)/2.0
Credits go to donutman who posted that trick on TT long time ago.


Median values per group (SQL Server 2000 and above)

Simple: wrap above code into user-defined function:
Code:
CREATE FUNCTION fn_getMedian( @groupID int )
RETURNS decimal(5, 3)
AS
BEGIN
    DECLARE @m1 int, @m2 int, @ret decimal(5, 3)
    SELECT TOP 50 PERCENT @m1= numValue FROM sampleData WHERE groupID=@groupID ORDER BY numValue ASC
    SELECT TOP 50 PERCENT @m2= numValue FROM sampleData WHERE groupID=@groupID ORDER BY numValue DESC

    SET @ret = (@m1+@m2)/2.0

    RETURN @ret
END
GO
... and then use it as custom aggregate:
Code:
SELECT groupID, dbo.fn_getMedian( groupID)
FROM sampleData
GROUP BY groupID
ORDER BY groupID
Although this code looks fairly elegant, scanning rows ASC/DESC combined with sequential nature of UDFs (called for each new groupID) may cause excessive amount of I/O reads (aka: query will run a bit slower on larger data set).

Another way is to select all rows into identity table and then use identity "row numbers" to get "middle" rows:
Code:
DECLARE @temp TABLE( rownum int identity(1, 1), GroupID int, numValue int )

INSERT INTO @temp( groupID, numValue )
SELECT groupID, numValue
FROM SampleData
ORDER BY groupID, numValue

SELECT A.groupID, AVG(1.0*A.NumValue) AS medianValue
FROM @temp A
INNER JOIN
(    SELECT groupID, FLOOR(AVG(1.0*rownum)) AS m1, CEILING(AVG(1.0*rownum)) AS m2
    FROM @temp
    GROUP BY groupID
) B
ON A.groupID = B.groupID AND A.rownum BETWEEN B.m1 AND B.m2
GROUP BY A.groupID
Alternatively you can use temp #table instead of table @variable. Weird as it may sound, this works faster on larger sets of data - at least in lab conditions where tempdb is not tortured by concurrent users/processes.


Median values per group (SQL Server 2005 only)

Yukon provides several new ways to calculate median. Arguably the most obvious one is with ROW_NUMBER() function:
Code:
SELECT R.groupID, AVG(1.0*R.numValue) AS medianValue
FROM
(    SELECT GroupID, numValue, ROW_NUMBER() OVER(PARTITION BY groupID ORDER BY NumValue) AS rowno
    FROM sampleData
) R
INNER JOIN
(    SELECT GroupID, 1+count(*) as N
    FROM sampleData
    GROUP BY GroupID
) G
ON R.GroupID = G.GroupID AND R.rowNo BETWEEN N/2 AND N/2+N%2
GROUP BY R.groupID
Note that RANK() and DENSE_RANK() cannot be used for this purpose due to possible duplicate numValues in group. But NTILE() can be used, though code is not so elegant:
Code:
SET ANSI_WARNINGS OFF
SELECT groupID, CASE WHEN cnt%2=1 THEN t1max ELSE (t1max+t2min)/2.0 END AS medianValue
FROM
(	SELECT groupID, COUNT(numValue) AS cnt,
		MAX(CASE WHEN tile=1 THEN numValue END) AS t1max, 
		MIN(CASE WHEN tile=2 THEN numValue END) AS t2min
	FROM
	(	SELECT groupID, numValue, NTILE(2) OVER( PARTITION BY groupID ORDER BY numValue ) AS tile
		FROM sampleData
	) T
	GROUP BY T.groupID
) G
ORDER BY groupID

FYI method with ROW_NUMBER() is by far and large the fastest of all mentioned.



Revision history:
------------------
2006/02/21 - initial draft
Register to rate this FAQ  : BAD 1 2 3 4 5 6 7 8 9 10 GOOD
Please Note: 1 is Bad, 10 is Good :-)

Part and Inventory Search

Back
Top