Top N Percent per Group
Here's a good question in the feedback from my post about using the T-SQL 2005 features to return the Top N per Group of a result set:
Sani writes:
What about Top n Percent per Group??? I would greatly appreciate an input on that as well.
That's a good question, and also easily solvable. One way to do this that I thought of was by using a combination of rank() and a count(*) partitioned aggregate function, which is also a new SQL Server 2005 feature.Simply calculate the rank() of each row in the group, and also the count(*) of all rows in the group. Multiply the count(*) by the percentage of rows you want returned and filter so that your rank() per group is less than that.
Here's an example. Suppose we want to return the newest 10% of all products per region, where the newest product has the latest "AddedDate" column. We can write it like this:
with ProductsByRegion as
(
select Region, Product, AddedDate,
rank() over (partition by Region order by AddedDate desc) as AddedRank,
count(*) over (partition by Region) as RegionProductsCount
from Products
)
select
Region, Product, AddedDate
from
ProductsByRegion
where
AddedRank <= (RegionProductsCount * .10)
(
select Region, Product, AddedDate,
rank() over (partition by Region order by AddedDate desc) as AddedRank,
count(*) over (partition by Region) as RegionProductsCount
from Products
)
select
Region, Product, AddedDate
from
ProductsByRegion
where
AddedRank <= (RegionProductsCount * .10)
That's really all there is to it.
Update: Some great feedback from Geoff N. Hiten in the first comment to this post shows an even easier solution: just use the NTILE() aggregate function. See his comment for an example. I definitely recommend that approach, it is much simpler and certainly more straightforward since that's pretty much what NTILE is designed to do. Thanks Geoff! (So much for my theory that Jeff's who spell their name with a "J" are always smarter than those who spell it with a "G" !)
These partitioned functions are really amazing and so useful, they are hard to live without once you get the hang of them.
Legacy Comments
Geoff N. Hiten
2008-02-22 |
re: Top N Percent per Group Rank() is good but I prefer ntile() for this purpose with ProductsByRegion as ( select Region, Product, AddedDate, ntile(100) over (partition by Region order by AddedDate desc) as AddedRank, from Products ) select Region, Product, AddedDate from ProductsByRegion where AddedRank <= 10 The nice bit is that you can adjust it for any percentage desired very easily. It also avoids the messy "ties" issue with rank() and dense_rank(). |
Jeff
2008-02-22 |
re: Top N Percent per Group Geoff -- great stuff! I did not think of that -- a much better answer than the one I gave. If you don't mind, I will update the article to point out your comment. Thanks! |
Geoff N. Hiten
2008-02-22 |
re: Top N Percent per Group Thanks. I have been doing a lot of custom data extractions lately that use the various paging and ranking functions so I got very familiar with them It gets really fun when you start putting more than one function in an query and use them to find intersecting sets. |
Aaron
2008-07-04 |
re: Top N Percent per Group Thanks for posting this code. I've been trying to convert some SAS code (Proc Rank) into SQL. This code accomplishes exactly what I needed. Perfect! |
ChrisW
2008-12-24 |
re: Top N Percent per Group Jeff Excellent post ! However I think I uncovered a problem/bug with Geoff's ntile solution becuase while testing your procedure with Geoff's using the same data, I got two different results. It happens when there are fewer than 100 rows in a partition. In my case, 26 rows were returned in a particular partition. The "AddedRank" column returned 1-26 for that partition, so using the where clause of "AddedRank <= 10" did not give me the top 10% but rather the top 38% (10/26). BTW- Enjoy the holidays with your new son Benjamin, time flys by fast and before you know it they are grownup and gone so savor the time now! |
banu
2009-08-16 |
re: Top N Percent per Group what about converting SQL into SAS (Proc Sql)? I've been using ntile function in sql. Do you know any easy way to do the ntiles in SAS, without using 10 pages of macros? Thank you all in advance. Banu |