Jeff's SQL Server Blog

Random Thoughts & Cartesian Products with Microsoft SQL Server
posts - 157, comments - 2686, trackbacks - 64

My Links

Advertisement

News

Welcome to my weblog. My name is Jeff Smith, I am software developer in Boston, MA and I was recently named a 2009 SQL Server MVP. Check in frequently for tips, tricks, commentary and ideas on SQL Server and .NET programming.


Subscribe





Archives

Post Categories

Programming

Sports

SQL

More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

I previously wrote about a few of the new features in SQL 2005 and how they can be used to solve some old "classic" SQL problems very easily, and I thought I'd briefly discuss a few more.  None of this is earth-shattering stuff, but you may find seeing a bunch of these techniques listed in one place useful.

All examples will be using the excellent Major League Baseball database that you can download for free here.  Download the Access version, and then use the Upsizing Wizard to export the data into SQL 2005.    It's a great resource and lots of fun for baseball fans; overall the design is fairly good but but not perfect (yearID ?).  If you enjoy baseball and you know that you should be practicing your SQL a little more, once you get a hold of this practicing SQL becomes fun and interesting.  Lots of my upcoming posts will be using this database, so if you want to "play along" go ahead and download and set it up. Let me know if you have any problems getting the data into SQL Server and, if so, I will write up some instructions.  If you don't have Access, there is a CSV text file version available as well, though that will take a bit more work to get into SQL Server. 

Returning Top N Rows Per Group

The following will return the top 10 players who hit the most home runs per year since 1990. The key is to calculate the "Home Run Rank" of each player for each year. 

select
  HRRanks.*
from
(
    Select
      b.yearID, b.PlayerID, sum(b.Hr) as TotalHR,
      rank() over (partition by b.yearID order by sum(b.hr) desc) as HR_Rank
    from
      Batting b
    where
      b.yearID > 1990
    group by
      b.yearID, b.playerID
)
  HRRanks
where
  HRRanks.HR_Rank <= 10

Notice the use of the derived table since we cannot directly reference the Rank() expression in our criteria.  To return the player's name, simply join this to the Master table in the database.

The basic idea is this:   you PARTITION by the grouping you want to return the top 1-n for, and you ORDER BY the columns that you want to use to do the ranking in that group. So, if you wanted to return the top 10 salesmen per region in terms of total sales, you would calculate RANK() OVER (PARTITION BY Region ORDER BY TotalSales DESC) for each row.

Paging Results with SQL 2005 using ROW_NUMBER()

Paging data is so much easier in SQL 2005.  All of my old techniques are no longer needed, which is great because they were hard to implement.

A new function called ROW_NUMBER() works much in the same way as RANK().  Of course, we must have a clear, unique ORDER BY established, otherwise the results will not be deterministic and you will not always get the same rows returned for each page. 

In this example, we'll page the "Master Table" of players, sorted by firstname, lastname and using lahmanID (the primary key of the table) as the "tie-breaker".  We'll set a couple of variables that could be turned into parameters in a stored proc to indicate the starting and ending rows to return:

declare @startrow int
declare @endrow int

set @startRow = 40
set @EndRow = 70

select
  MasterRowNums.*
from
(
  select
    m.nameLast, m.nameFirst, m.lahmanID,
    ROW_NUMBER() over (order by m.nameLast, m.nameFirst, m.lahmanID) as RowNum
  from
    [master] m
)
  MasterRowNums
where
  RowNum between @startRow and @endRow
order by
  nameLast, NameFirst, lahmanID

Notice that we still cannot reference the function directly in the WHERE clause and are using a derived table again, and also that we must repeat the ordering twice -- once in the ROW_NUMBER() function and once in the ORDER BY clause.

Using this basic technique, it is very easy to see how to write a stored procedure that will let you page rows in a table or SELECT, and it is much more efficient than any pre-SQL 2005 method available.   I recommend experimenting with this and doing your paging server-side instead of using client-side paging, since client techniques requires that the database still process and return all of the rows even if they are not being displayed to the user.

Common Table Expressions - Easier Derived Tables

In both of the previous examples, we used derived tables which added an extra layer of complexity to our SQL statement.  New in SQL 2005 are Common Table Expressions (commonly called CTEs) which are a much nicer way to work with derived tables, and also much more powerful.

I often instruct people to think one step at a time when writing SQL statements, building each piece as a separate SELECT and then putting them all together at the end.  CTEs make this very easy and very intuitive.  Basically, instead of writing the derived table "in-line", you can "declare" it first, at the beginning of your SELECT.  The previous example would look like this using a CTE:

with MasterRowNums as
(
  select m.nameLast, m.nameFirst, m.lahmanID,
        ROW_NUMBER() over (order by m.nameLast, m.nameFirst, m.lahmanID) as RowNum
  from [master] m
)
select
  MasterRowNums.*
from
  MasterRowNums
where
  RowNum between @startRow and @endRow
order by
  nameLast, NameFirst, lahmanID

I feel that this convention is much easier to read and work with, and certainly easier to test since you can focus on one part of the SELECT at a time without getting lost in mazes of indentation.  This is useful when you have lots of derived tables, but even more useful because sometimes you need to reference the same derived table more than once; with standard derived tables, you would have to repeat the entire SQL twice (or create a view or temp table) but here we can declare it just once and reference it as many times as we need.

Here's the Top Rows Per Group solution given above, using a CTE:

with HRRanks as
(
    Select b.yearID, b.PlayerID, sum(b.Hr) as TotalHR,
        rank() over (partition by b.yearID order by sum(b.hr) desc) as HR_Rank
    from Batting b
    where b.yearID > 1990
    group by b.yearID, b.playerID
   
)
select
   HRRanks.*
from
   HRRanks
where
  HRRanks.HR_Rank <= 10


More to Come!

Stay tuned, there's lots more to come and we'll be using that baseball database quite a bit.  We'll mix learning new features in SQL Server 2005 with classic SQL techniques, and along the way we'll do some fun baseball analysis as well.

Print | posted on Friday, March 30, 2007 9:21 AM | Filed Under [ T-SQL Sports Paging Data SQL Server 2005 ]

Feedback

Gravatar

# re: More with SQL Server 2005 : Top x Per Group, Paging, and Common Table Expressions

Hi Sir,

No words to say.......

I have been searching for about three hours to combine top and group by.

At last , I found this one.


GREAT.


Thank you very much.


Regardsm
Arun Kumar.A
5/19/2007 10:48 PM | Arun Kumar
Gravatar

# re: More with SQL Server 2005 : Top x Per Group, Paging, and Common Table Expressions

Same here. I really needed this solution. Thanks a bunch!
6/2/2007 5:51 AM | Kirby Zhang
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Whenever paging is required, usually the total count of items matching the criteria (before filtering) is also needed, to know exactly how many pages the search is going to have.

I have seen lots of examples showing the use of the Row_Number() function, but no examples of getting the total number of items affected by the inner query, in your example:

select
m.nameLast, m.nameFirst, m.lahmanID,
ROW_NUMBER() over (order by m.nameLast, m.nameFirst, m.lahmanID) as RowNum
from
[master] m

I was playing around with the Row_Number function and figured out a way to get the total count:

select max(RowNum) from (select ROW_NUMBER() over (order by m.nameLast, m.nameFirst, m.lahmanID) as RowNum
from [master] m) as T1

Although it produces the required result, but I'm not sure if this is the most efficient way to do it, as the entire query is repeated.

I would appreciate your thought on this.

Thanks
7/25/2007 7:41 AM | Yasser Abbasi
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Hi Yasser --

There really isn't an efficient way that I can think of to return total row count without the need to either process all rows, or process lots of them twice. One thing you can do is use the new feature in sql 2005 that let's you add an aggregate to any select with an OVER clause.

For example:

select *
from
(
select
m.nameLast, m.nameFirst, m.lahmanID,
ROW_NUMBER() over (order by m.nameLast, m.nameFirst, m.lahmanID) as RowNum,
COUNT(*) over () as Total
from
[master] m
) x
where RowNum between x and y

Short and simple, but it does run twice as slow. I just don't know if there is an efficient way to do this. Either way, I think I'd return the results in a separate SELECT in a separate result set instead of repeating the total row count over and over for each row returned in the SELECT.
7/25/2007 8:44 AM | Jeff
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

very helpful post..

How can i do this?

I have a table with itemid and description columns (duplicates allowed)
Now i want to list distinct itemid and and description of latest itemid

I want to perform group by on itemid but it will return multiple descriptions per group, I want to find out the single latest description per group.

thnx
ss
9/6/2007 9:27 AM | ss
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

ss --

You have to define what "the description of the latest itemid" means. I recommend going to the SQLTeam forums and asking your question, and be sure to provide SPECIFIC information -- table structure, some sample data, expected results -- so that we can help you out.
9/6/2007 9:54 AM | Jeff
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions



its like this.

there are thre columns historyid - (PK), itemid, itemdescription

I want to perform groupby on itemid, but for each group i want a single description which has the maximum historyid.

9/7/2007 12:02 PM | ss
Gravatar

# Return row other than top 10 from a table

Hi,
I have a question,

Select top 10 * from employee

the above statement return top 10 row.

but i want the rows from the table other than the top 10. Can any one help me to get it.. iam using SQL server 2005

Thanks
Bhaskar
10/5/2007 1:17 PM | SURYABHASKAR
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Thanks! Just used the technique. It rocked!! So tired of the pre-SQL 2005 methodologies. They were so awkward.

Q: The outer/lower SELECT's ORDER BY? Is it more efficient to re-do it on the multiple fields or just the RowNum? The RowNum would work.

Greg
11/3/2007 4:54 PM | Greg
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Yassar, not sure if this helps, but have you looked at @@ROWCOUNT? It's an instrinsic (?) property/field/whatever that returns the number of results from the last query. So...

DECLARE @TotalCount INT
SELECT * FROM Table
SET @TotalCount = @@ROWCOUNT

Now my problem is using it in conjunction with a CTE. The CTE only lasts to the next select statement, so if I try to SELECT @@ROWCOUNT directly after my CTE (which is getting all the results and tagging them with ROW_NUMBER()), then the CTE gets discarded before I can page.
1/17/2008 7:14 PM | Andrew
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Thanks for the great post! I was able to apply the technique and solve my problem in minutes!
1/23/2008 11:07 AM | SQLGrl
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Simply amazing. I was to get the top 1 row from each group. Without this article, i would'nt have achieved it.
1/28/2008 2:12 AM | Vivek
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Thanks for the code above. We were able to use the Top and Group functionality to get top 10 sales for each level or our organization. We also used the ROW_NUMBER for another view we were creating. Thanks for the great posts, they are very helpful. Saved us from having to look for hours on end for a solution.
1/30/2008 8:33 AM | AP
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

What about Top n Percent per Group??? I would greatly appreciate an input on that as well.

Thanks.
2/15/2008 5:58 AM | Sani
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Sani -- great question; I have answered that here:

http://weblogs.sqlteam.com/jeffs/archive/2008/02/21/Top-N-Percent-per-Group.aspx
2/21/2008 1:11 PM | Jeff
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Hi Jeff. Thanks for the valuable share. I've got a question though. My sample script is copied below. I plan to change a while-loop to cte-rank to update a datawarehouse that tries to report on user activity history (not the real application but demonstrates what i am trying to achieve). If I am trying to update a big number of rows, it is worth using an indexed temp table instead of a cte right? Only with indexed temp tables, my join will work faster? Any brilliant idea on the method you'll skin this cat? Thanks.

use tempdb
go

create table #t1 (id int, activity varchar(10), mytime smalldatetime, rk int)

insert into #t1 (id, activity, mytime)
select 1, 'log in', '06-10-2008 10:30:00' union
select 1, 'data entry', '06-10-2008 10:40:00' union
select 1, 'print', '06-10-2008 10:50:00' union
select 2, 'log in', '06-10-2008 10:30:00' union
select 2, 'email', '06-10-2008 10:40:00' union
select 2, 'log-out', '06-10-2008 10:50:00';

with Rankedt1 as
(select id, mytime, rank() over (partition by id order by mytime) as 'rank' from #t1)
update #t1
set rk = RT.[rank]
from #t1 T
join Rankedt1 RT
on T.id = RT.id and T.mytime = RT.mytime

-- my app uses ssas. this it just to demo of what i want to show
select activity, count(1) as ActivityCount, rk as SequenceOfEvents from #t1
group by activity,rk
order by rk

drop table #t1
6/25/2008 5:31 AM | Roberto
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Fantastic solution. Thanks a lot
7/18/2008 2:28 AM | Pradyumna
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Thank you! This is great!
8/1/2008 12:26 PM | Cindy
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Is there a way to make this work so the subquery does not have to run in its entirety? I have a query set up like this, but the subquery has to go through 6 million rows to return a result, so performance is bad (actually about the same as if I just put (SELECT TOP 1 ...) in the SELECT clause of the query).

(
EXAMPLE:
CREATE TABLE 6MillionRowTable (dt smalldatetime, ItemID int, Value int, PRIMARY KEY (DateTime, ItemID))
-- INSERT 6 million rows
SELECT *
FROM (SELECT dt, ItemID, Value, ROW_NUMBER() OVER (PARTITION BY ItemID ORDER BY dt DESC) R
FROM 6MillionRowTable
) tbl
WHERE tbl.R = 1
)
Since the "ORDER BY" clause of the ROW_NUMBER is on a clustered index I would think it wouldn't have to number the entire table to give me the top value. I just want the query to "look at" the first item for each partition and not scan the entire table. Does this make sense? Is it possible in either SQL 2005 or 2008?

Thanks
Wieland
8/22/2008 2:10 PM | Wieland
Gravatar

# Thanks

Thanks for the help.

Its solved...
9/9/2008 4:24 AM | ss
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Hi ss,
Can you please post your solution to this problem that you posted?

------------------------------

its like this.

there are thre columns historyid - (PK), itemid, itemdescription

I want to perform groupby on itemid, but for each group i want a single description which has the maximum historyid.
10/8/2008 3:08 PM | yasir
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Thanks for the information. I was wondering how would i loop the Returning Top N Rows Per Group example through a database for regions. For example, top 10 for TX and certain areas, top 10 for CA and certain areas, and so on, depending on how many are in the data set. Attached is an example of my rank and would like to loop this through a data set of regions (i.e. TX, CA, NY, etc)

WITH LoanRank AS

(SELECT TOP 10 LOAN.loan_desc,
Sum(LOAN.loan_amt) AS Amount

FROM LOAN_2008 LOAN

WHERE LOAN.loan_status = 1

GROUP BY LOAN.loan_desc)


SELECT *, Row_Number() OVER (ORDER BY Amount DESC) AS RowNumber
FROM LoanRank




thanks in advance for your help.
12/4/2008 1:03 PM | JP
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

JP -- I describe exactly how to do that in the blog post. You need to use PARTITION BY. I gave an example.
12/4/2008 1:13 PM | Jeff S
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

I had a performance related issue with this example:

SELECT *
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY MySortCol DESC) AS Row,* FROM MyTable WHERE FirstCol > 0 AND (SecondCol IS NULL OR ThirdCol IS NULL)
)
AS temp
WHERE Row >= 101 AND Row <= 201

When there were only 156 rows in total, SQL Server took over 30 seconds to process this on a table with 4million rows. SQL Express didn't have this issue, and after updating the last line of the query to

WHERE Row >= 101 AND Row <=156

it took less than 1 sec. Thought I'd share this - so I always issue a count first now. There are indexes on FirstCol+SecondCol and FirstCol+ThirdCol and MySortCol+FirstCol
2/8/2009 10:06 PM | Lance
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

What happens when I want to Pass Order By as a parameter. Say I want to order by different fields at different time by calling this same query. I tried to change the statements to look like Order By @SortExpresssion. It didnt work.
2/17/2009 7:46 PM | Shaun
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Hi Jeff,

Thanks for your example on using the RANK() function, it helped me out with getting my query working!
I especially liked the last paragraph on that topic. It's a really clear explanation (dare I say better than the one in BOL? Oh well, I guess I just did).

Cheers,
Valentino.
10/28/2009 5:09 PM | Valentino Vranken
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Hi Jeff,

This example was really helpuful and its gonna save me a great deal of time at work.

Greetings from Chile,
Rodrigo
11/26/2009 8:16 AM | Rodrigo
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

I've searching it for a ling time
Finding a query to get TOP n Per Grouping
thanks very much
1/6/2010 2:22 AM | Vampal
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Awesome -- Thanks so much! Top N per Group rocks
4/13/2010 2:00 PM | Jeffrey
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

ALRIGHT!
9/3/2010 12:32 PM | Master B
Gravatar

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Been trying to get just TOP 1 record from a certain Ordering-Grouping condition I had....
Nothing seemed to have worked until I came accross this page!

YOU THE MAN!@!!!@#@ JUST LOVE IT!
9/3/2010 12:34 PM | Master B
Comments have been closed on this topic.

Powered by:
Powered By Subtext Powered By ASP.NET