Jeff's SQL Server Blog

Random Thoughts & Cartesian Products with Microsoft SQL Server
posts - 143, comments - 1784, trackbacks - 64

My Links

SQLTeam.com Links

News

Welcome to my weblog. My name is Jeff Smith, and I am a software developer in Boston, MA. Check in frequently for tips, tricks, commentary and ideas on SQL Server and .NET programming.


Subscribe





Archives

Post Categories

Programming

Sports

SQL

More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

I previously wrote about a few of the new features in SQL 2005 and how they can be used to solve some old "classic" SQL problems very easily, and I thought I'd briefly discuss a few more.  None of this is earth-shattering stuff, but you may find seeing a bunch of these techniques listed in one place useful.

All examples will be using the excellent Major League Baseball database that you can download for free here.  Download the Access version, and then use the Upsizing Wizard to export the data into SQL 2005.    It's a great resource and lots of fun for baseball fans; overall the design is fairly good but but not perfect (yearID ?).  If you enjoy baseball and you know that you should be practicing your SQL a little more, once you get a hold of this practicing SQL becomes fun and interesting.  Lots of my upcoming posts will be using this database, so if you want to "play along" go ahead and download and set it up. Let me know if you have any problems getting the data into SQL Server and, if so, I will write up some instructions.  If you don't have Access, there is a CSV text file version available as well, though that will take a bit more work to get into SQL Server. 

Returning Top N Rows Per Group

The following will return the top 10 players who hit the most home runs per year since 1990. The key is to calculate the "Home Run Rank" of each player for each year. 

select
  HRRanks.*
from
(
    Select
      b.yearID, b.PlayerID, sum(b.Hr) as TotalHR,
      rank() over (partition by b.yearID order by sum(b.hr) desc) as HR_Rank
    from
      Batting b
    where
      b.yearID > 1990
    group by
      b.yearID, b.playerID
)
  HRRanks
where
  HRRanks.HR_Rank <= 10

Notice the use of the derived table since we cannot directly reference the Rank() expression in our criteria.  To return the player's name, simply join this to the Master table in the database.

The basic idea is this:   you PARTITION by the grouping you want to return the top 1-n for, and you ORDER BY the columns that you want to use to do the ranking in that group. So, if you wanted to return the top 10 salesmen per region in terms of total sales, you would calculate RANK() OVER (PARTITION BY Region ORDER BY TotalSales DESC) for each row.

Paging Results with SQL 2005 using ROW_NUMBER()

Paging data is so much easier in SQL 2005.  All of my old techniques are no longer needed, which is great because they were hard to implement.

A new function called ROW_NUMBER() works much in the same way as RANK().  Of course, we must have a clear, unique ORDER BY established, otherwise the results will not be deterministic and you will not always get the same rows returned for each page. 

In this example, we'll page the "Master Table" of players, sorted by firstname, lastname and using lahmanID (the primary key of the table) as the "tie-breaker".  We'll set a couple of variables that could be turned into parameters in a stored proc to indicate the starting and ending rows to return:

declare @startrow int
declare @endrow int

set @startRow = 40
set @EndRow = 70

select
  MasterRowNums.*
from
(
  select
    m.nameLast, m.nameFirst, m.lahmanID,
    ROW_NUMBER() over (order by m.nameLast, m.nameFirst, m.lahmanID) as RowNum
  from
    [master] m
)
  MasterRowNums
where
  RowNum between @startRow and @endRow
order by
  nameLast, NameFirst, lahmanID

Notice that we still cannot reference the function directly in the WHERE clause and are using a derived table again, and also that we must repeat the ordering twice -- once in the ROW_NUMBER() function and once in the ORDER BY clause.

Using this basic technique, it is very easy to see how to write a stored procedure that will let you page rows in a table or SELECT, and it is much more efficient than any pre-SQL 2005 method available.   I recommend experimenting with this and doing your paging server-side instead of using client-side paging, since client techniques requires that the database still process and return all of the rows even if they are not being displayed to the user.

Common Table Expressions - Easier Derived Tables

In both of the previous examples, we used derived tables which added an extra layer of complexity to our SQL statement.  New in SQL 2005 are Common Table Expressions (commonly called CTEs) which are a much nicer way to work with derived tables, and also much more powerful.

I often instruct people to think one step at a time when writing SQL statements, building each piece as a separate SELECT and then putting them all together at the end.  CTEs make this very easy and very intuitive.  Basically, instead of writing the derived table "in-line", you can "declare" it first, at the beginning of your SELECT.  The previous example would look like this using a CTE:

with MasterRowNums as
(
  select m.nameLast, m.nameFirst, m.lahmanID,
        ROW_NUMBER() over (order by m.nameLast, m.nameFirst, m.lahmanID) as RowNum
  from [master] m
)
select
  MasterRowNums.*
from
  MasterRowNums
where
  RowNum between @startRow and @endRow
order by
  nameLast, NameFirst, lahmanID

I feel that this convention is much easier to read and work with, and certainly easier to test since you can focus on one part of the SELECT at a time without getting lost in mazes of indentation.  This is useful when you have lots of derived tables, but even more useful because sometimes you need to reference the same derived table more than once; with standard derived tables, you would have to repeat the entire SQL twice (or create a view or temp table) but here we can declare it just once and reference it as many times as we need.

Here's the Top Rows Per Group solution given above, using a CTE:

with HRRanks as
(
    Select b.yearID, b.PlayerID, sum(b.Hr) as TotalHR,
        rank() over (partition by b.yearID order by sum(b.hr) desc) as HR_Rank
    from Batting b
    where b.yearID > 1990
    group by b.yearID, b.playerID
   
)
select
   HRRanks.*
from
   HRRanks
where
  HRRanks.HR_Rank <= 10


More to Come!

Stay tuned, there's lots more to come and we'll be using that baseball database quite a bit.  We'll mix learning new features in SQL Server 2005 with classic SQL techniques, and along the way we'll do some fun baseball analysis as well.

Print | posted on Friday, March 30, 2007 9:21 AM

Feedback

# re: More with SQL Server 2005 : Top x Per Group, Paging, and Common Table Expressions

Hi Sir,

No words to say.......

I have been searching for about three hours to combine top and group by.

At last , I found this one.


GREAT.


Thank you very much.


Regardsm
Arun Kumar.A
5/19/2007 10:48 PM | Arun Kumar

# re: More with SQL Server 2005 : Top x Per Group, Paging, and Common Table Expressions

Same here. I really needed this solution. Thanks a bunch!
6/2/2007 5:51 AM | Kirby Zhang

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Whenever paging is required, usually the total count of items matching the criteria (before filtering) is also needed, to know exactly how many pages the search is going to have.

I have seen lots of examples showing the use of the Row_Number() function, but no examples of getting the total number of items affected by the inner query, in your example:

select
m.nameLast, m.nameFirst, m.lahmanID,
ROW_NUMBER() over (order by m.nameLast, m.nameFirst, m.lahmanID) as RowNum
from
[master] m

I was playing around with the Row_Number function and figured out a way to get the total count:

select max(RowNum) from (select ROW_NUMBER() over (order by m.nameLast, m.nameFirst, m.lahmanID) as RowNum
from [master] m) as T1

Although it produces the required result, but I'm not sure if this is the most efficient way to do it, as the entire query is repeated.

I would appreciate your thought on this.

Thanks
7/25/2007 7:41 AM | Yasser Abbasi

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Hi Yasser --

There really isn't an efficient way that I can think of to return total row count without the need to either process all rows, or process lots of them twice. One thing you can do is use the new feature in sql 2005 that let's you add an aggregate to any select with an OVER clause.

For example:

select *
from
(
select
m.nameLast, m.nameFirst, m.lahmanID,
ROW_NUMBER() over (order by m.nameLast, m.nameFirst, m.lahmanID) as RowNum,
COUNT(*) over () as Total
from
[master] m
) x
where RowNum between x and y

Short and simple, but it does run twice as slow. I just don't know if there is an efficient way to do this. Either way, I think I'd return the results in a separate SELECT in a separate result set instead of repeating the total row count over and over for each row returned in the SELECT.
7/25/2007 8:44 AM | Jeff

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

very helpful post..

How can i do this?

I have a table with itemid and description columns (duplicates allowed)
Now i want to list distinct itemid and and description of latest itemid

I want to perform group by on itemid but it will return multiple descriptions per group, I want to find out the single latest description per group.

thnx
ss
9/6/2007 9:27 AM | ss

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

ss --

You have to define what "the description of the latest itemid" means. I recommend going to the SQLTeam forums and asking your question, and be sure to provide SPECIFIC information -- table structure, some sample data, expected results -- so that we can help you out.
9/6/2007 9:54 AM | Jeff

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions



its like this.

there are thre columns historyid - (PK), itemid, itemdescription

I want to perform groupby on itemid, but for each group i want a single description which has the maximum historyid.

9/7/2007 12:02 PM | ss

# Return row other than top 10 from a table

Hi,
I have a question,

Select top 10 * from employee

the above statement return top 10 row.

but i want the rows from the table other than the top 10. Can any one help me to get it.. iam using SQL server 2005

Thanks
Bhaskar
10/5/2007 1:17 PM | SURYABHASKAR

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Thanks! Just used the technique. It rocked!! So tired of the pre-SQL 2005 methodologies. They were so awkward.

Q: The outer/lower SELECT's ORDER BY? Is it more efficient to re-do it on the multiple fields or just the RowNum? The RowNum would work.

Greg
11/3/2007 4:54 PM | Greg

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Yassar, not sure if this helps, but have you looked at @@ROWCOUNT? It's an instrinsic (?) property/field/whatever that returns the number of results from the last query. So...

DECLARE @TotalCount INT
SELECT * FROM Table
SET @TotalCount = @@ROWCOUNT

Now my problem is using it in conjunction with a CTE. The CTE only lasts to the next select statement, so if I try to SELECT @@ROWCOUNT directly after my CTE (which is getting all the results and tagging them with ROW_NUMBER()), then the CTE gets discarded before I can page.
1/17/2008 7:14 PM | Andrew

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Thanks for the great post! I was able to apply the technique and solve my problem in minutes!
1/23/2008 11:07 AM | SQLGrl

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Simply amazing. I was to get the top 1 row from each group. Without this article, i would'nt have achieved it.
1/28/2008 2:12 AM | Vivek

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Thanks for the code above. We were able to use the Top and Group functionality to get top 10 sales for each level or our organization. We also used the ROW_NUMBER for another view we were creating. Thanks for the great posts, they are very helpful. Saved us from having to look for hours on end for a solution.
1/30/2008 8:33 AM | AP

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

What about Top n Percent per Group??? I would greatly appreciate an input on that as well.

Thanks.
2/15/2008 5:58 AM | Sani

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Sani -- great question; I have answered that here:

http://weblogs.sqlteam.com/jeffs/archive/2008/02/21/Top-N-Percent-per-Group.aspx
2/21/2008 1:11 PM | Jeff

# re: More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

Hi Jeff. Thanks for the valuable share. I've got a question though. My sample script is copied below. I plan to change a while-loop to cte-rank to update a datawarehouse that tries to report on user activity history (not the real application but demonstrates what i am trying to achieve). If I am trying to update a big number of rows, it is worth using an indexed temp table instead of a cte right? Only with indexed temp tables, my join will work faster? Any brilliant idea on the method you'll skin this cat? Thanks.

use tempdb
go

create table #t1 (id int, activity varchar(10), mytime smalldatetime, rk int)

insert into #t1 (id, activity, mytime)
select 1, 'log in', '06-10-2008 10:30:00' union
select 1, 'data entry', '06-10-2008 10:40:00' union
select 1, 'print', '06-10-2008 10:50:00' union
select 2, 'log in', '06-10-2008 10:30:00' union
select 2, 'email', '06-10-2008 10:40:00' union
select 2, 'log-out', '06-10-2008 10:50:00';

with Rankedt1 as
(select id, mytime, rank() over (partition by id order by mytime) as 'rank' from #t1)
update #t1
set rk = RT.[rank]
from #t1 T
join Rankedt1 RT
on T.id = RT.id and T.mytime = RT.mytime

-- my app uses ssas. this it just to demo of what i want to show
select activity, count(1) as ActivityCount, rk as SequenceOfEvents from #t1
group by activity,rk
order by rk

drop table #t1
6/25/2008 5:31 AM | Roberto

Post Comment

Title  
Name  
Email
Url
Comment   
Please add 7 and 7 and type the answer here:

Powered by: