Yesterday Joe Celko posted on microsoft.public.sqlserver.programming about how to write an elegant query for Weighted Median. He managed to get the correct results but always ended up with ugly code.
Joe had a feeling an elegant query existed but was not seeing it. Anyway, since Celko's books have helped me in the past, I thought I should help him now.
Consider this sample data
DECLARE @Foo TABLE
(
x INT NOT NULL
)
INSERT @Foo
VALUES (1),
(2),
(2),
(3),
(3),
(3)
The most common approach to calculate the median value I have seen is
SELECT AVG(1.0E * x)
FROM (
SELECT x,
ROW_NUMBER() OVER (ORDER BY x DESC) AS a,
ROW_NUMBER() OVER (ORDER BY x) AS b
FROM @Foo
) AS d
WHERE b - a BETWEEN -1 AND 1
It's a great method and it works in almost all cases. Yes, in almost all cases. There are circumstances where SQL Server will not give the correct result. The reason for me writing this blog post is that I suddenly got this result in return from inner query.
x a b a-b
1 6 1 5
2 4 2 2
2 5 3 2
3 1 4 -3
3 2 5 -3
3 3 6 -3
As you can see, the difference calculated by a-b suddenly doesn't match!
How can we overcome this behaviour? It's not that hard. See this query.
-- Peso Median
SELECT AVG(1.0E * x)
FROM (
SELECT x,
2 * ROW_NUMBER() OVER (ORDER BY x) - COUNT(*) OVER () AS y
FROM @Foo
) AS d
WHERE y BETWEEN 0 AND 2
Yes it works! And now how to do weighted median? Well, we follow the same approach and write this piece of code.
-- Peso Weighted Median
SELECT SUM(1.0E * y) / SUM(1.0E * t)
FROM (
SELECT SUM(x) OVER (PARTITION BY x) AS y,
2 * ROW_NUMBER() OVER (ORDER BY x) - COUNT(*) OVER () AS z,
COUNT(*) OVER (PARTITION BY x) AS t
FROM @Foo
) AS d
WHERE z BETWEEN 0 AND 2
There you go!