More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions
I previously wrote about a few of the new features in SQL 2005 and how they can be used to solve some old "classic" SQL problems very easily, and I thought I'd briefly discuss a few more. None of this is earth-shattering stuff, but you may find seeing a bunch of these techniques listed in one place useful.
All examples will be using the excellent Major League Baseball database that you can download for free here. Download the Access version, and then use the Upsizing Wizard to export the data into SQL 2005. It's a great resource and lots of fun for baseball fans; overall the design is fairly good but but not perfect (yearID ?). If you enjoy baseball and you know that you should be practicing your SQL a little more, once you get a hold of this practicing SQL becomes fun and interesting. Lots of my upcoming posts will be using this database, so if you want to "play along" go ahead and download and set it up. Let me know if you have any problems getting the data into SQL Server and, if so, I will write up some instructions. If you don't have Access, there is a CSV text file version available as well, though that will take a bit more work to get into SQL Server.
Returning Top N Rows Per Group
The following will return the top 10 players who hit the most home runs per year since 1990. The key is to calculate the "Home Run Rank" of each player for each year.
HRRanks.*
from
(
Select
b.yearID, b.PlayerID, sum(b.Hr) as TotalHR,
rank() over (partition by b.yearID order by sum(b.hr) desc) as HR_Rank
from
Batting b
where
b.yearID > 1990
group by
b.yearID, b.playerID
)
HRRanks
where
HRRanks.HR_Rank <= 10
Notice the use of the derived table since we cannot directly reference the Rank() expression in our criteria. To return the player's name, simply join this to the Master table in the database.
The basic idea is this: you PARTITION by the grouping you want to return the top 1-n for, and you ORDER BY the columns that you want to use to do the ranking in that group. So, if you wanted to return the top 10 salesmen per region in terms of total sales, you would calculate RANK() OVER (PARTITION BY Region ORDER BY TotalSales DESC) for each row.
Paging Results with SQL 2005 using ROW_NUMBER()
Paging data is so much easier in SQL 2005. All of my old techniques are no longer needed, which is great because they were hard to implement.
A new function called ROW_NUMBER() works much in the same way as RANK(). Of course, we must have a clear, unique ORDER BY established, otherwise the results will not be deterministic and you will not always get the same rows returned for each page.
In this example, we'll page the "Master Table" of players, sorted by firstname, lastname and using lahmanID (the primary key of the table) as the "tie-breaker". We'll set a couple of variables that could be turned into parameters in a stored proc to indicate the starting and ending rows to return:
declare @endrow int
set @startRow = 40
set @EndRow = 70
select
MasterRowNums.*
from
(
select
m.nameLast, m.nameFirst, m.lahmanID,
ROW_NUMBER() over (order by m.nameLast, m.nameFirst, m.lahmanID) as RowNum
from
[master] m
)
MasterRowNums
where
RowNum between @startRow and @endRow
order by
nameLast, NameFirst, lahmanID
Notice that we still cannot reference the function directly in the WHERE clause and are using a derived table again, and also that we must repeat the ordering twice -- once in the ROW_NUMBER() function and once in the ORDER BY clause.
Using this basic technique, it is very easy to see how to write a stored procedure that will let you page rows in a table or SELECT, and it is much more efficient than any pre-SQL 2005 method available. I recommend experimenting with this and doing your paging server-side instead of using client-side paging, since client techniques requires that the database still process and return all of the rows even if they are not being displayed to the user.
Common Table Expressions - Easier Derived Tables
In both of the previous examples, we used derived tables which added an extra layer of complexity to our SQL statement. New in SQL 2005 are Common Table Expressions (commonly called CTEs) which are a much nicer way to work with derived tables, and also much more powerful.
I often instruct people to think one step at a time when writing SQL statements, building each piece as a separate SELECT and then putting them all together at the end. CTEs make this very easy and very intuitive. Basically, instead of writing the derived table "in-line", you can "declare" it first, at the beginning of your SELECT. The previous example would look like this using a CTE:
(
select m.nameLast, m.nameFirst, m.lahmanID,
ROW_NUMBER() over (order by m.nameLast, m.nameFirst, m.lahmanID) as RowNum
from [master] m
)
select
MasterRowNums.*
from
MasterRowNums
where
RowNum between @startRow and @endRow
order by
nameLast, NameFirst, lahmanID
I feel that this convention is much easier to read and work with, and certainly easier to test since you can focus on one part of the SELECT at a time without getting lost in mazes of indentation. This is useful when you have lots of derived tables, but even more useful because sometimes you need to reference the same derived table more than once; with standard derived tables, you would have to repeat the entire SQL twice (or create a view or temp table) but here we can declare it just once and reference it as many times as we need.
Here's the Top Rows Per Group solution given above, using a CTE:
(
Select b.yearID, b.PlayerID, sum(b.Hr) as TotalHR,
rank() over (partition by b.yearID order by sum(b.hr) desc) as HR_Rank
from Batting b
where b.yearID > 1990
group by b.yearID, b.playerID
)
select
HRRanks.*
from
HRRanks
where
HRRanks.HR_Rank <= 10
More to Come!
Stay tuned, there's lots more to come and we'll be using that baseball database quite a bit. We'll mix learning new features in SQL Server 2005 with classic SQL techniques, and along the way we'll do some fun baseball analysis as well.
Legacy Comments
Arun Kumar
2007-05-19 |
re: More with SQL Server 2005 : Top x Per Group, Paging, and Common Table Expressions Hi Sir, No words to say....... I have been searching for about three hours to combine top and group by. At last , I found this one. GREAT. Thank you very much. Regardsm Arun Kumar.A |
Kirby Zhang
2007-06-02 |
re: More with SQL Server 2005 : Top x Per Group, Paging, and Common Table Expressions Same here. I really needed this solution. Thanks a bunch! |
Yasser Abbasi
2007-07-25 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions Whenever paging is required, usually the total count of items matching the criteria (before filtering) is also needed, to know exactly how many pages the search is going to have. I have seen lots of examples showing the use of the Row_Number() function, but no examples of getting the total number of items affected by the inner query, in your example: select m.nameLast, m.nameFirst, m.lahmanID, ROW_NUMBER() over (order by m.nameLast, m.nameFirst, m.lahmanID) as RowNum from [master] m I was playing around with the Row_Number function and figured out a way to get the total count: select max(RowNum) from (select ROW_NUMBER() over (order by m.nameLast, m.nameFirst, m.lahmanID) as RowNum from [master] m) as T1 Although it produces the required result, but I'm not sure if this is the most efficient way to do it, as the entire query is repeated. I would appreciate your thought on this. Thanks |
Jeff
2007-07-25 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions Hi Yasser -- There really isn't an efficient way that I can think of to return total row count without the need to either process all rows, or process lots of them twice. One thing you can do is use the new feature in sql 2005 that let's you add an aggregate to any select with an OVER clause. For example: select * from ( select m.nameLast, m.nameFirst, m.lahmanID, ROW_NUMBER() over (order by m.nameLast, m.nameFirst, m.lahmanID) as RowNum, COUNT(*) over () as Total from [master] m ) x where RowNum between x and y Short and simple, but it does run twice as slow. I just don't know if there is an efficient way to do this. Either way, I think I'd return the results in a separate SELECT in a separate result set instead of repeating the total row count over and over for each row returned in the SELECT. |
ss
2007-09-06 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions very helpful post.. How can i do this? I have a table with itemid and description columns (duplicates allowed) Now i want to list distinct itemid and and description of latest itemid I want to perform group by on itemid but it will return multiple descriptions per group, I want to find out the single latest description per group. thnx ss |
Jeff
2007-09-06 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions ss -- You have to define what "the description of the latest itemid" means. I recommend going to the SQLTeam forums and asking your question, and be sure to provide SPECIFIC information -- table structure, some sample data, expected results -- so that we can help you out. |
ss
2007-09-07 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions its like this. there are thre columns historyid - (PK), itemid, itemdescription I want to perform groupby on itemid, but for each group i want a single description which has the maximum historyid. |
SURYABHASKAR
2007-10-05 |
Return row other than top 10 from a table Hi, I have a question, Select top 10 * from employee the above statement return top 10 row. but i want the rows from the table other than the top 10. Can any one help me to get it.. iam using SQL server 2005 Thanks Bhaskar |
Greg
2007-11-03 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions Thanks! Just used the technique. It rocked!! So tired of the pre-SQL 2005 methodologies. They were so awkward. Q: The outer/lower SELECT's ORDER BY? Is it more efficient to re-do it on the multiple fields or just the RowNum? The RowNum would work. Greg |
Andrew
2008-01-17 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions Yassar, not sure if this helps, but have you looked at @@ROWCOUNT? It's an instrinsic (?) property/field/whatever that returns the number of results from the last query. So... DECLARE @TotalCount INT SELECT * FROM Table SET @TotalCount = @@ROWCOUNT Now my problem is using it in conjunction with a CTE. The CTE only lasts to the next select statement, so if I try to SELECT @@ROWCOUNT directly after my CTE (which is getting all the results and tagging them with ROW_NUMBER()), then the CTE gets discarded before I can page. |
SQLGrl
2008-01-23 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions Thanks for the great post! I was able to apply the technique and solve my problem in minutes! |
Vivek
2008-01-28 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions Simply amazing. I was to get the top 1 row from each group. Without this article, i would'nt have achieved it. |
AP
2008-01-30 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions Thanks for the code above. We were able to use the Top and Group functionality to get top 10 sales for each level or our organization. We also used the ROW_NUMBER for another view we were creating. Thanks for the great posts, they are very helpful. Saved us from having to look for hours on end for a solution. |
Sani
2008-02-15 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions What about Top n Percent per Group??? I would greatly appreciate an input on that as well. Thanks. |
Jeff
2008-02-21 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions Sani -- great question; I have answered that here: http://weblogs.sqlteam.com/jeffs/archive/2008/02/21/Top-N-Percent-per-Group.aspx |
Roberto
2008-06-25 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions Hi Jeff. Thanks for the valuable share. I've got a question though. My sample script is copied below. I plan to change a while-loop to cte-rank to update a datawarehouse that tries to report on user activity history (not the real application but demonstrates what i am trying to achieve). If I am trying to update a big number of rows, it is worth using an indexed temp table instead of a cte right? Only with indexed temp tables, my join will work faster? Any brilliant idea on the method you'll skin this cat? Thanks. use tempdb go create table #t1 (id int, activity varchar(10), mytime smalldatetime, rk int) insert into #t1 (id, activity, mytime) select 1, 'log in', '06-10-2008 10:30:00' union select 1, 'data entry', '06-10-2008 10:40:00' union select 1, 'print', '06-10-2008 10:50:00' union select 2, 'log in', '06-10-2008 10:30:00' union select 2, 'email', '06-10-2008 10:40:00' union select 2, 'log-out', '06-10-2008 10:50:00'; with Rankedt1 as (select id, mytime, rank() over (partition by id order by mytime) as 'rank' from #t1) update #t1 set rk = RT.[rank] from #t1 T join Rankedt1 RT on T.id = RT.id and T.mytime = RT.mytime -- my app uses ssas. this it just to demo of what i want to show select activity, count(1) as ActivityCount, rk as SequenceOfEvents from #t1 group by activity,rk order by rk drop table #t1 |
Pradyumna
2008-07-18 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions Fantastic solution. Thanks a lot |
Cindy
2008-08-01 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions Thank you! This is great! |
Wieland
2008-08-22 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions Is there a way to make this work so the subquery does not have to run in its entirety? I have a query set up like this, but the subquery has to go through 6 million rows to return a result, so performance is bad (actually about the same as if I just put (SELECT TOP 1 ...) in the SELECT clause of the query). ( EXAMPLE: CREATE TABLE 6MillionRowTable (dt smalldatetime, ItemID int, Value int, PRIMARY KEY (DateTime, ItemID)) -- INSERT 6 million rows SELECT * FROM (SELECT dt, ItemID, Value, ROW_NUMBER() OVER (PARTITION BY ItemID ORDER BY dt DESC) R FROM 6MillionRowTable ) tbl WHERE tbl.R = 1 ) Since the "ORDER BY" clause of the ROW_NUMBER is on a clustered index I would think it wouldn't have to number the entire table to give me the top value. I just want the query to "look at" the first item for each partition and not scan the entire table. Does this make sense? Is it possible in either SQL 2005 or 2008? Thanks Wieland |
ss
2008-09-09 |
Thanks Thanks for the help. Its solved... |
yasir
2008-10-08 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions Hi ss, Can you please post your solution to this problem that you posted? ------------------------------ its like this. there are thre columns historyid - (PK), itemid, itemdescription I want to perform groupby on itemid, but for each group i want a single description which has the maximum historyid. |
JP
2008-12-04 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions Thanks for the information. I was wondering how would i loop the Returning Top N Rows Per Group example through a database for regions. For example, top 10 for TX and certain areas, top 10 for CA and certain areas, and so on, depending on how many are in the data set. Attached is an example of my rank and would like to loop this through a data set of regions (i.e. TX, CA, NY, etc) WITH LoanRank AS (SELECT TOP 10 LOAN.loan_desc, Sum(LOAN.loan_amt) AS Amount FROM LOAN_2008 LOAN WHERE LOAN.loan_status = 1 GROUP BY LOAN.loan_desc) SELECT *, Row_Number() OVER (ORDER BY Amount DESC) AS RowNumber FROM LoanRank thanks in advance for your help. |
Jeff S
2008-12-04 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions JP -- I describe exactly how to do that in the blog post. You need to use PARTITION BY. I gave an example. |
Lance
2009-02-08 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions I had a performance related issue with this example: SELECT * FROM ( SELECT ROW_NUMBER() OVER (ORDER BY MySortCol DESC) AS Row,* FROM MyTable WHERE FirstCol > 0 AND (SecondCol IS NULL OR ThirdCol IS NULL) ) AS temp WHERE Row >= 101 AND Row <= 201 When there were only 156 rows in total, SQL Server took over 30 seconds to process this on a table with 4million rows. SQL Express didn't have this issue, and after updating the last line of the query to WHERE Row >= 101 AND Row <=156 it took less than 1 sec. Thought I'd share this - so I always issue a count first now. There are indexes on FirstCol+SecondCol and FirstCol+ThirdCol and MySortCol+FirstCol |
Shaun
2009-02-17 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions What happens when I want to Pass Order By as a parameter. Say I want to order by different fields at different time by calling this same query. I tried to change the statements to look like Order By @SortExpresssion. It didnt work. |
Valentino Vranken
2009-10-28 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions Hi Jeff, Thanks for your example on using the RANK() function, it helped me out with getting my query working! I especially liked the last paragraph on that topic. It's a really clear explanation (dare I say better than the one in BOL? Oh well, I guess I just did). Cheers, Valentino. |
Rodrigo
2009-11-26 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions Hi Jeff, This example was really helpuful and its gonna save me a great deal of time at work. Greetings from Chile, Rodrigo |
Vampal
2010-01-06 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions I've searching it for a ling time Finding a query to get TOP n Per Grouping thanks very much |
Jeffrey
2010-04-13 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions Awesome -- Thanks so much! Top N per Group rocks |
Master B
2010-09-03 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions ALRIGHT! |
Master B
2010-09-03 |
re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions Been trying to get just TOP 1 record from a certain Ordering-Grouping condition I had.... Nothing seemed to have worked until I came accross this page! YOU THE MAN!@!!!@#@ JUST LOVE IT! |