Jeff Smith Blog

Random Thoughts & Cartesian Products with Microsoft SQL Server

SQL Server 2005: Using PARTITION and RANK in your criteria

The RANK and PARTITION features in 2005 are simply amazing.  They make so many "classic" SQL problems very easy to solve.  For example, consider the following table:

create table Batting
 (Player varchar(10), Year int, Team varchar(10), HomeRuns int, primary key(Player,Year))

insert into Batting
select 'A',2001,'Red Sox',13 union all
select 'A',2002,'Red Sox',23 union all
select 'A',2003,'Red Sox',19 union all
select 'A',2004,'Red Sox',14 union all
select 'A',2005,'Red Sox',11 union all
select 'B',2001,'Yankees',42 union all
select 'B',2002,'Yankees',39 union all
select 'B',2003,'Yankees',42 union all
select 'B',2004,'Yankees',29 union all
select 'C',2002,'Yankees',2 union all
select 'C',2003,'Yankees',3 union all
select 'C',2004,'Red Sox',6 union all
select 'C',2005,'Red Sox',9


Suppose we would like to find out which year each player hit their most home runs, and which team they played for.  As a tie-breaker, return the latest year.  

In SQL 2000, we had to do it in two steps:  First, get the MAX(HomeRuns) per player, and then join back to the Batting table to return the rest of the data:

select b.*
from
batting b
inner join
(    select player, Max(HomeRuns) as MaxHR
    from Batting
    group by player
) m
on b.Player = m.player and b.HomeRuns = m.MaxHR

Player     Year        Team       HomeRuns   
---------- ----------- ---------- -----------
A          2002        Red Sox    23
B          2001        Yankees    42
B          2003        Yankees    42
C          2005        Red Sox    9

(4 row(s) affected)


Note that for player 'B', we get two rows back since he has two years that tie for the most home runs (2001 and 2003).  How do we return just the latest year?  Guess what -- it requires *another* pass through the table:

select b.*
from
batting b
inner join
(    select player, Max(HomeRuns) as MaxHR
    from Batting
    group by player
) m
    on b.Player = m.player and b.HomeRuns = m.MaxHR
inner join
(  select player, homeRuns, Max(Year) as MaxYear
   from Batting
   group by Player, HomeRuns
) y
   on m.player= y.player and m.maxHR = y.HomeRuns and b.Year = y.MaxYear

Player     Year        Team       HomeRuns   
---------- ----------- ---------- -----------
C          2005        Red Sox    9
A          2002        Red Sox    23
B          2003        Yankees    42

(3 row(s) affected)

An alternate way to do this is to calculate the "ranking" of each home run for each player, using a correlated subquery:

select b.*,
  (select count(*) from batting b2 where b.player = b2.player and b2.HomeRuns >= b.HomeRuns) as Rank
from batting b

Player     Year        Team       HomeRuns    Rank        
---------- ----------- ---------- ----------- -----------
A          2001        Red Sox    13          4
A          2002        Red Sox    23          1
A          2003        Red Sox    19          2
A          2004        Red Sox    14          3
A          2005        Red Sox    11          5
B          2001        Yankees    42          2
B          2002        Yankees    39          3
B          2003        Yankees    42          2
B          2004        Yankees    29          4
C          2002        Yankees    2           4
C          2003        Yankees    3           3
C          2004        Red Sox    6           2
C          2005        Red Sox    9           1

However, notice that we still have not handled ties! (notice that Player "B" has no #1 ranking, just two #2 rankings!) To do that, we must make things a little more complicated:

select b.*,
  (select count(*) from batting b2 where b.player = b2.player and (b2.HomeRuns > b.HomeRuns or (b2.HomeRuns = b.HomeRuns and b2.Year >= b.Year))) as Rank
from batting b

Player     Year        Team       HomeRuns    Rank        
---------- ----------- ---------- ----------- -----------
A          2001        Red Sox    13          4
A          2002        Red Sox    23          1
A          2003        Red Sox    19          2
A          2004        Red Sox    14          3
A          2005        Red Sox    11          5
B          2001        Yankees    42          2
B          2002        Yankees    39          3
B          2003        Yankees    42          1
B          2004        Yankees    29          4
C          2002        Yankees    2           4
C          2003        Yankees    3           3
C          2004        Red Sox    6           2
C          2005        Red Sox    9           1

(13 row(s) affected)

And, with that, we can use our "ranking" formula to return only the #1 rankings to get our results by moving the subquery to the WHERE clause:

select b.*
from batting b
where (select count(*) from batting b2 where b.player = b2.player and (b2.HomeRuns > b.HomeRuns or (b2.HomeRuns = b.HomeRuns and b2.Year >= b.Year))) =1

Player     Year        Team       HomeRuns   
---------- ----------- ---------- -----------
C          2005        Red Sox    9
A          2002        Red Sox    23
B          2003        Yankees    42

(3 row(s) affected)

And that's all you need to do this in SQL 2000 ! Easy, right?   Hmmm ... well, maybe not!

The new ranking functions in SQL Server 2005 let us quickly calculate each row's ranking within a set based on a partition and an ordering. Think of the partition as almost like a GROUP BY, where you the use of the word "per" in your specifications often indicate which columns you are grouping or partitioning on.  Since we want to return the top Home Run year per Player, we partition on Player.  This means that the rows for each Player will get a value of 1-x for their ranking.  The ranking itself is based on HomeRuns, so we want to return the rank for each partition ORDER'ed BY HomeRuns from high-to-low (DESC).  

Using the new features, in SQL 2005 we can use the following SELECT to return each year's Home Run ranking for each Player:

select Player, Year, HomeRuns, Rank() over (Partition BY Player order by HomeRuns DESC) as Rank
from
Batting

Now, like before, we have to deal with ties.  But now, it is much easier -- we just add a secondary sort.  Since we want to the latest year to rank higher, we just add "Year DESC" to our ORDER BY:

select Player, Year, HomeRuns,Rank() over (Partition BY Player order by HomeRuns DESC, Year DESC) as Rank
from
Batting

Much easier than the old way!  However, we are still not done.  The above SQL returns all years with the Ranking beside it.  We are only interested in returning all rows with a Rank of 1.  However, if we try:

select Player, Year, HomeRuns
from
Batting
where
 Rank() over (Partition BY Player order by HomeRuns DESC, Year DESC) = 1

We get an error:

Windowed functions can only appear in the SELECT or ORDER BY clauses.

We can easily fix this by wrapping the SELECT in a derived table:

select * from
(
select Player, Year, HomeRuns,  Rank() over (Partition BY Player order by HomeRuns DESC, Year DESC) as Rank
from
Batting
) tmp
where Rank = 1

And there you go ... It's easy to read and edit, and only 1 reference to the table is needed!  Preliminary testing also indicates that this is much more efficient than doing things "SQL 2000-style".

Other challenging SQL statements to write in the pre-2005 world are also easily solved in SQL 2005.  For example, the classic SQL problem of returning the "top x per group" based on a sort, or returning a "row number" or sequence for an entire result set or groups within that set, are now very, very easy using the RANK() feature.  Try it out!

Legacy Comments


Rick O
2007-03-29
re: SQL Server 2005: Using PARTITION and RANK in your criteria
Just for completeness, you can also do it without the RANK/PARTITION via:

SELECT a.Player, a.Year, a.Team, a.HomeRuns
FROM Batting AS a
LEFT OUTER JOIN Batting AS b ON (a.Player = b.Player) AND (a.HomeRuns <= b.HomeRuns)
GROUP BY a.Player, a.Year, a.Team, a.HomeRuns
HAVING a.HomeRuns = MAX(b.HomeRuns)
AND ((COUNT(b.Year) = 1) OR (a.Year = MAX(b.Year)))

I guess it's a toss-up as to which someone fight find more intuitive and easier to read.

Jeff
2007-03-29
re: SQL Server 2005: Using PARTITION and RANK in your criteria
Thanks, Rick O, that's quite good, I didn't think of that one. An INNER JOIN is actually all you need there, and in fact putting criteria on outer table (in the HAVING clause) turns the OUTER JOIN into an INNER one anyway.

You can also do it with NOT EXISTS(..) and NOT IN(...) as well.

Abhineet
2007-09-13
re: SQL Server 2005: Using PARTITION and RANK in your criteria
Hi Sir,

Nice information, but if we check the performance we use without rank and partion function query will take a lot of cost.


sachin wandhare
2007-10-08
re: SQL Server 2005: Using PARTITION and RANK in your criteria
SQL Server 2005: Using PARTITION and RANK in your criteria

Ajay
2007-12-04
re: SQL Server 2005: Using PARTITION and RANK in your criteria
Hi Jeff,
Thanks a Lot.
Explanation of Rank and Partition is Great. this is what iam looking for.

Hi Ricko : thanks for the alternate query that is written with out using the rank and partition which is very use ful when we are writing queries that are compatible with sql server 2000.


Jon Whalley
2008-08-04
re: SQL Server 2005: Using PARTITION and RANK in your criteria
Another alternative could be this:


SELECT
B.*
FROM
Batting B
INNER JOIN
(
SELECT B.Player, MAX(B.[Year]) AS MaxYear
FROM Batting B
INNER JOIN
(
SELECT Player, MAX(HomeRuns) AS MaxHomeRuns FROM Batting GROUP BY Player
) R ON R.Player = B.Player AND R.MaxHomeRuns = B.HomeRuns
GROUP BY B.Player
) R ON R.Player = B.Player AND R.MaxYear = B.[Year]


This may appear more complicated than the previous two versions at first. The method to this approach is fairly straight forward:

This uses two nested selects. The deepest select determines the maximum number of home runs for each player. This information is then used to determine the highest year that each player achieved those maximum home runs. This player/year primary key composition is then used to retrieve the entire record from the table.

In my opinion when choosing which method to use, performance is a key factor to be considered.

I ran a series of tests where all three methods are executed in sequence. The execution plans indicate the relative costs of each method. 5 tests were run, initially for 13 records, and then incremented by 13 thereafter. The outcome was as follows:

A - Partition Method
B - Rick O's Method
C - My Method

Recs A B C
13 34% 18% 25%
26 26% 19% 19%
39 21% 21% 15%
52 18% 22% 13%
65 16% 23% 11%

(Note: The initial INSERT statements account for the remaining cost)

Draw your own conclusion from this.






Chamila
2008-08-13
re: SQL Server 2005: Using PARTITION and RANK in your criteria
Nice article Jeff!
Thanks a lot...

Mike Renwick
2008-08-19
re: SQL Server 2005: Using PARTITION and RANK in your criteria
I came up with something you can do with Rank/partition that is very difficult otherwise...

Consider a table that contains the columns "Entity","Rating","Date" representing the rating of an entity on a certain day
I wanted to create a table that had the columns
"Entity","Rating","From date","To date" where it summarized the rows into blocks.

The issue is that I wanted to store a single row of data for each change in rating over time, but without losing the contiguous nature of the data, in this case Rating of a company that may change from AAA to AA for a week and then go back to AAA.

If I grouped by rating, I lost the contiguous dates, but couldn't group by a date range easily, so I was left scratching my head as to what to do.

I figured that by creating two fields,
one
rank() over (partition by Entity order by date)
rank() over (partition by Entity,Rating order by date)

The first gives me sequential numbers for each of the rows for an entity
The second gives me sequential numbers for each of the rows for an entity+rating

so for this data
(1,AAA,1/1/2008)
(1,AAA,2/1/2008)
(1,AAA,3/1/2008)
(1,A,4/1/2008)
(1,A,5/1/2008)
(1,AAA,6/1/2008)
(1,AAA,7/1/2008)
(1,AAA,8/1/2008)

I would get
(1,AAA,1/1/2008,1,1)
(1,AAA,2/1/2008,2,2)
(1,AAA,3/1/2008,3,3)
(1,A,4/1/2008,4,1)
(1,A,5/1/2008,5,2)
(1,AAA,6/1/2008,6,4)
(1,AAA,7/1/2008,7,5)
(1,AAA,8/1/2008,8,6)

As you can see the last column carries on incrementing the next block of AAA where the last AAA left off (1.2.3......4.5.6)

I figured by subtracting one from the other, I get a number that stays the same as long as the rows are contiguous, hence gives me a means to aggregate to find the max and minimum

The full code follows for those interested


create table #temp (Entity varchar(10),Rating varchar(10),date smalldatetime)

insert into #temp values ('1','AAA','2008/01/02')
insert into #temp values ('1','A','2008/01/03')
insert into #temp values ('1','A','2008/01/04')
insert into #temp values ('1','AAA','2008/01/05')
insert into #temp values ('1','AAA','2008/01/06')
insert into #temp values ('1','A','2008/01/07')
insert into #temp values ('1','AAA','2008/01/08')
insert into #temp values ('1','A','2008/01/09')
insert into #temp values ('1','AAA','2008/01/10')

select Entity,Rating,min(date) as 'Valid_From',max(date) as 'Valid_To'
from
(
select
*,
rank() over (partition by Entity,Rating order by date) -
rank() over (partition by Entity order by date) as delta
from
#temp
) as x
group by
Entity,Rating,delta
order by Entity,Valid_From

I ran it against a pretty large data set, and let's just say it's VERY fast- summarizing deltas for around 400 000 records over a year in around 12 seconds without indexes...

Run the query in the brackets separately to see what the rank does. Note that this rank is only for use in grouping by Entity AND Rating.








Matt
2008-08-26
re: SQL Server 2005: Using PARTITION and RANK in your criteria
Jon Whalley wrote:

Recs A B C
13 34% 18% 25%
26 26% 19% 19%
39 21% 21% 15%
52 18% 22% 13%
65 16% 23% 11%

"Draw your own conclusion from this."
...

Sure, I'd draw my own conclusion if that table made any sense, which it doesn't. Percent of what? Costs are what? The costs could be anything from processor load to time in milliseconds to pleasure derived from each method. My conclusion is that your results don't tell me anything at all.

dasv
2008-09-12
re: SQL Server 2005: Using PARTITION and RANK in your criteria
"Sure, I'd draw my own conclusion if that table made any sense, which it doesn't. Percent of what? Costs are what? The costs could be anything from processor load to time in milliseconds to pleasure derived from each method. My conclusion is that your results don't tell me anything at all."

Erm

If you look at a SQL execution plan (hover over a step) and you will see a percentage cost of a query. John Whalley is making the legitimate observation that his approach is substantially more efficient than using rank(), partition or the "B" recommendation.

The percentage cost is the proportionate cost of a step in a query, or a query in a batch of queries within a stored procedure or other t-sql.

Your criticism is both unfounded and betrays an ignorance of SQL Server. "pleasure derived from each method"? just listen to yourself.

"My conclusion is that your results don't tell me anything at all."

My conclusion is you should read-up first before posting criticisms of things you don't understand.

mca
2009-01-30
re: SQL Server 2005: Using PARTITION and RANK in your criteria
Great article, Jeff!

A simple example with a clear explanation.
Thanks!

VK
2009-02-04
re: SQL Server 2005: Using PARTITION and RANK in your criteria
can I use aggregate functions within the rank function? for example, can i do this? Don't worry about the column names. Just wanted to know if I can use an aggregation function in the rank function :-

rank() over(partition by fc.facilityName order by (sum((sd.unitsSold * sd.salePrice) +(sd.unitsSold * sd.salePrice * sd.taxPercentage/100) ) + sum(sa.shippingCharge)) desc) as rank

Peg
2009-02-25
re: SQL Server 2005: Using PARTITION and RANK in your criteria
I found this informative. Also, people should know that with large resultsets, using RANK() and PARTITION is much faster than the alternatives.

Eric
2009-03-09
re: SQL Server 2005: Using PARTITION and RANK in your criteria
I don't think you need to include the insert in your assesment as to which query is faster or more efficient. If you do a SET STATISTICS PROFILE ON and look at the results of just the select statements, what you will see is that the RANK() performs just as good as the nested derived query test or better in some areas.

I would also state that because the RANK is using partitioning...it is creating it's own clustered index on the fly, which is most likely where all of the CPU cycles are being used. This would be less prevalent on an indexed field.

There are more executes in the derived query method and there are aggregates being performed. In a production environment you want to stay away from aggregates whenever possible...speaking from experience.

Lastly in my humble opinion, readability must come into play here. While the derived query shows enginuity and is not all that difficult to read on such a simple example, I would imagine on a more complex example with many more joins involved it will become more difficult for someone to understand, while the RANK() query maintains a degree of elegance and simplicity that lends itself to maintainability.

I give the RANK() query an A+ and the Derived an A. Nice work everyone and a good discussion.

Dave
2009-08-05
re: SQL Server 2005: Using PARTITION and RANK in your criteria
Thank you for this. It got me going with RANK() fairly quickly. Well explained!

Rosie
2009-08-13
re: SQL Server 2005: Using PARTITION and RANK in your criteria
thank you this is great!, quick question can you use RANK() in any edition of sql server 2005?

Krp
2009-09-10
Alternative
In some database like Teradata, you could have done :

select Player, Year, HomeRuns, Rank() over (Partition BY Player order by HomeRuns DESC, Year DESC) as Rank
from Batting
qualify Rank = 1

But "Qualify" doesn't exist in SQL Server !



john
2009-10-03
SQL Server 2005: Using PARTITION and RANK in your criteria
this is good article. i found this through Google.

valli
2009-10-03
SQL Server 2005: Using PARTITION and RANK in your criteria
hi this is a good article, to know about more examples on sql server 2008 rank function please check it out the following link. i hope it will help u lot. SQL Server 2008 Rank Function

Kumar Rajan
2009-10-19
re: SQL Server 2005: Using PARTITION and RANK in your criteria
Its a very nice article.

Thanks.

Can we go for this query,

select b.Player,max(b.Year) Year,b.Team,b.HomeRuns from batting b
inner join (
select a.Player,max(homeruns) as HomeRuns
from batting a group by player) t
on b.player=t.player and b.homeruns=t.homeruns
group by b.Player,b.Team,b.HomeRuns

theworm
2009-11-24
re: SQL Server 2005: Using PARTITION and RANK in your criteria
dasv wrote: [John Whalley is making the legitimate observation that his approach is substantially more efficient than using rank(), partition or the "B" recommendation.
Your criticism is both unfounded and betrays an ignorance of SQL Server. "pleasure derived from each method"? just listen to yourself.
My conclusion is you should read-up first before posting criticisms of things you don't understand.]

create function dbo.theworsefunctionever()
returns bit
as
begin
declare @result bit
select @result = 1
from sys.columns as a
cross join sys.columns as b
cross join sys.columns as c

return(@result)
end
go

--now execute
--the following shows a query cost of 100% and finishes execution in less than 1 sec
select * from sys.columns
where system_type_id = 127
and max_length = 8

--the following shows a query cost of 0% and finishes execution in more than 10 secs
select dbo.theworsefunctionever()


Based on the execution percentage cost [as dasv wrote:]the "LEGITIMATE" result is that the second query is "substantially more efficient"

smattering is dangerous, right dasv?


nommis
2009-11-25
re: SQL Server 2005: Using PARTITION and RANK in your criteria
A comparison of execution plan output is meaningless if the candidate queries are doing something completely different.

But if you compare the execution plans for two queries that produce the same result set, as John Whalley was, then it is very meaningful.

theworm
2009-11-25
re: SQL Server 2005: Using PARTITION and RANK in your criteria
nommis you could have a look at the following link:
http://stackoverflow.com/questions/564717/measuring-query-performance-execution-plan-query-cost-vs-time-taken
if you want to compare statements based on their relative cost, then it is your choice as long as you can justify it and support it.
Btw, have you ever seen any serious "post" showing comparison results in relative costs ?

rd
2009-12-30
re: SQL Server 2005: Using PARTITION and RANK in your criteria
8/19/2008 2:20 PM | Mike Renwick

..."I figured by subtracting one from the other, I get a number that stays the same as long as the rows are contiguous, hence gives me a means to aggregate to find the max and minimum"...


Mike,
Your post regarding creating the delta column was very insightful and helped solving a similar problem.

Thanks.


Luna
2010-01-19
re: SQL Server 2005: Using PARTITION and RANK in your criteria
Any ideas on using this in MS Access?

Vaibhav
2010-02-03
re: SQL Server 2005: Using PARTITION and RANK in your criteria
Hi jeff,

Great...

I was struggling with some very complicated queries and this article gave me clear understanding about ranking.
can you please send another article or example for dense_rank.

Thanks in advance.

Aaron
2010-02-10
re: SQL Server 2005: Using PARTITION and RANK in your criteria
This is brilliant!

Aaron
2010-02-10
re: SQL Server 2005: Using PARTITION and RANK in your criteria
Here's on for you. I ran the following statement dozens of times and it worked great. Then, I changed the where clause and it stopped working. Any thoughts? I'm getting an error "MSG 258, Level 15, State 1, Line 3 Cannot call methods on decimal."

SELECT * FROM
(
SELECT H.location_id. L.item_id AS po_item, H.po_no, H.order_date, H.created_by,
RANK() OVER (PARTITION BY H.location_id, L.item_id ORDER BY H.location_id, L.item_id, H.order_date DESC) AS Rank
FROM p21_view_po_hdr H INNER JOIN p21_view_po_line L
ON H.po_no = L.po_no
) tmp
WHERE Rank = 1 AND po_item = '6G-6MP'

Like I said, it worked great for about 1 dozen tries then it just stopped working.

Aaron
2010-02-10
re: SQL Server 2005: Using PARTITION and RANK in your criteria
You know, sometimes if you just read. I missed my type-o on the 1st line. Genius.

jayanti
2010-02-25
re: SQL Server 2005: Using PARTITION and RANK in your criteria
if i wan t to be in format as


player year team ..............................

A 2001
2002
2003
2004
2005
b 2001
2002
2003
2004
i am not able to do as one to many as above



Del Duncan
2010-04-01
re: SQL Server 2005: Using PARTITION and RANK in your criteria
I just now came across this older article, but what a great explanation. I'm actually using an internal query builder package and it uses the Rank/Partition method and doesn't include any rownum, top 1 type of functions, so this explanation was great and I was quickly able to build a test query outside of the query builder that is actually faster than my test original query using a derived table to return a rownum from. Great stuff!

Shane
2010-04-01
re: SQL Server 2005: Using PARTITION and RANK in your criteria
Good article.

We have a few packages that have case sensitive databases... any chance you could make my life easier the next time!

Alvin
2010-10-06
Mr
Hi Jeff,

Could you show me how to get median (by using PARTITION and RANK)?

Thanks
Alvin

Osei
2010-10-07
Finding last 3 pay dates using PARTITION and RANK in your criteria
I'm trying to create a query that displays the last 3 payment dates of our customers. I'm trying the following and I am getting an error message for an invalid identifier. Can you show me where I've made my mistake?

Thanks in advance,

Osei

SELECT
L.CUST_NUMBER AS ACCT_ID, C1.CUST_LNAME
MAX(CASE WHEN DATE_RANK = 3 THEN C.CASH_DRWR_DATE END) AS THIRD_TO_LAST_PMT_DATE,
MAX(CASE WHEN DATE_RANK = 2 THEN C.CASH_DRWR_DATE END) AS SECOND_TO_LAST_PMT_DATE,
MAX(CASE WHEN DATE_RANK = 1 THEN C.CASH_DRWR_DATE END) AS LAST_PMT_DATE
FROM
(SELECT UNIQUE
L.CUST_NUMBER AS ACCT_ID, C1.CUST_LNAME, C.CASH_DRWR_DATE,
DENSE_RANK() OVER (PARTITION BY L.CUST_NUMBER ORDER BY C.CASH_DRWR_DATE DESC) AS DATE_RANK
FROM
AMSUSER.LOAN L
LEFT OUTER JOIN AMSUSER.CUSTHIST C ON L.CUST_NUMBER = C.CUST_NUMBER
LEFT OUTER JOIN AMSUSER.CUSTOMER C1 ON L.CUST_NUMBER = C1.CUST_NUMBER
WHERE L.LOAN_TYPE <> 'IS' AND
L.LOAN_STATUS IN ( 'O')) X
GROUP BY L.CUST_NUMBER
ORDER BY C1.CUST_LNAME ASC