Jeff Smith Blog

Random Thoughts & Cartesian Products with Microsoft SQL Server

More on Runs and Streaks in SQL

That's right boys and girls, it's what you've been waiting for all weekend:  Another edition of the mailbag!

Damian writes:

Hi
I have a tricky SQL question that I have been trawling the net and
workmates to find an answer. We are accessing a real time proprietary
database that approximates classic SQL in syntax. The GUI we are using does
not allow anything to be done effectively on the front end. It all has
to happen in the query.
Trades flow in and we would like to sum them on the fly with only a
query. We need to sum based on the change in price.
eg
for this series of trades (ID is ordered but non-contiguous)
Trade ID   Price   Volume
1              4       10
3              4       20
6              5       15
7              4       20
should produce

price  volume
4      30
5      15
4      20

so it sums the volume on a change in price. but note if the same price
appears again this is not included in the total for the first instance
of that price but in the total for a new instance of that price.

It is so simple but the concensus is it is impossible in pure SQL. At
first I did not think it was but now I tend to agree. we cannot change
the tables in the database as it is all locked away and front end
processing is unworkable.
would you agree?
thanks
Damian

It is a little tricky, and it is not terribly efficient, but it can be done in pure set-based SQL if I am understanding your requirements correctly.

All you need is the technique shown here

(As far as I know, until I hear otherwise, I'll take credit for coming up with this method! )

Here the code:

-- set up sample table:

create table Trades (TradeID int primary key, Price int, Volume int)

insert into Trades
select 1,4,10 union all
select 3,4,20 union all
select 6,5,15 union all
select 7,4,20

-- your solution:

select
  min(tradeID) as StartID, max(TradeID) as EndID, Price, sum(Volume) as TotalVolume
from
(
  select t1.TradeID, t1.Price, t1.Volume,
    (select count(*) from Trades t2
     where t2.TradeID < t1.TradeID and t2.Price <> t1.Price) as RunGroup
  from Trades t1
) x
group by
  Price, RunGroup
order by
  min(tradeID)

-- results:

StartID     EndID       Price       TotalVolume
----------- ----------- ----------- -----------
1           3           4           30
6           6           5           15
7           7           4           20

(3 row(s) affected)

As I mentioned, this works, but it is not very efficient; depending on how much data you have, it might run very slowly.  The culprit is that correlated sub-query that calculates the "RunGroup".   You can also calculate the RunGroup using either of these techniques if they are more efficient for your data:

-- option 2

select *, (select min(tradeID) from  Trades t2
           where t2.TradeID > t1.TradeID and t2.Price != t1.Price) as RunGroup
from Trades t1

-- option 3

select *, (select max(TradeID) from Trades t2
           where t2.TradeID < t1.TradeID and t2.Price <> t1.Price) as RunGroup
from Trades t1

For this small sample table, the execution plan is the same for all 3.

Also, note that I am tracking the "RunGroups" for all of the data in your table, but if it can be partitioned -- for example, by Customer or Client -- then that will make it much more efficient.  To add a partition on, say, Client, you would alter the code like this:

select
  Client, min(tradeID) as StartID, max(TradeID) as EndID, Price, sum(Volume) as TotalVolume
from
(
  select t1.Client, t1.TradeID, t1.Price, t1.Volume,
    (select count(*) from Trades t2
     where t2.TradeID < t1.TradeID and t2.Price <> t1.Price
           and t1.Client = t2.Client) as RunGroup

  from Trades t1
) x
group by
  Client, Price, RunGroup
order by
  Client, min(tradeID)

I hope this helps!  if not, let me know.

Legacy Comments


Radha Mukkai
2007-05-30
re: More on Runs and Streaks in SQL
Jeff,

If we change the data to:

TradeID Price Volume
1 4 10
3 4 20
6 5 15
7 4 20
8 4 30

The results are:
StartID EndID Price Volume
1 3 4 30
6 6 5 15
7 8 4 50

The expected results based on the problem definition (volume should be summed only on a change in price)
StartID EndID Price Volume
1 3 4 30
6 6 5 15
7 7 4 20
8 8 4 30

We will have to change the correlated sub-query RunGroup to get the desired results. However, I am not sure what change needs to be made! I am looking into it, though.

Thanks,
Radha Mukkai

Jeff
2007-05-30
re: More on Runs and Streaks in SQL
Radha -- your expected results do not match your requirements. You said:

>>volume should be summed only on a change in price

yet, with your new set of sample data, there is no change in price between ID 7 and 8. Therefore, based on your stated requirements, they should be summed together into 1 row, which is what is returned in the results.

Your expected results do not show this behavior. Why should 7 and 8 *not* be aggregated together ??

In other words, why are 1 and 3 summed together, but not 7 and 8 in your expected results? Nothing in your requirements indicates why, and the data itself shows no difference between the two situations as well.

Radha Mukkai
2007-05-30
re: More on Runs and Streaks in SQL
Jeff,

TradeIDs' 1 and 3 are summed together because there was a change in price (in TradeID: 6).

However, with Trade ID's 7 and 8 there was no change in price and hence the Volumes will NOT be summed.

This is what I think the requirements are:
If there is a change in price, we sum the volume of the previous records.

For example:
We start at TradeID 1. This is the initial price.
Go to TradeID 3. No price change.
Go to TradeID 6. There is a price change (from 4 to 5). Sum all the previous records (2 records total - Sum IDs 1 and 3).
Go to TradeID 7. There is another price change (from 5 to 4). Sum all the previous records (1 record total - Sum ID 6)
Go to TradeID 8. No price change.

Was my explanation clear? Does it make sense? Let me know.

Thanks,
Radha Mukkai

Jeff
2007-05-30
re: More on Runs and Streaks in SQL
No, not really clear at all. My fear is that if I solve this latest iteration, the requirements and sample data are going to change again. Let's do this backwards: I'll try to give *you* the *requirements*, and let me know if they are correct.

"For all consecutive rows with the same Price (when ordered by tradeID), if a row exists with a higher TradeID and a different Price, group them all together and calculate the sum(Volume) for that group. If there does *not* exist a row with a higher tradeID and a different price, leave the rows ungrouped."

Is that correct? If so, then you need to add an additional "grouping" column that returns a constant if the rows should be grouped together, or the TradeID if not. this will make the query even less efficient, of course.

Something like this:

select min(tradeID) as StartID, max(TradeID) as EndID, Price, sum(Volume) as TotalVolume
from
(
select t1.TradeID, t1.Price, t1.Volume,
(select count(*) from Trades t2
where t2.TradeID < t1.TradeID and t2.Price <> t1.Price) as RunGroup,
case when exists(select * from Trades t3 where t3.TradeID > t1.tradeID and t3.Price <> t1.Price) then 0 else TradeID end as RunGroup2
from Trades t1
) x
group by Price, RunGroup, RunGroup2
order by min(tradeID)

which gives you the results that you want. the results of which don't really make any logical sense, but if that's what you need, then that's what you need I suppose.


Hugo Kornelis
2007-05-30
re: More on Runs and Streaks in SQL
Hi Jeff,

Here's a method that is lots quicker, since it uses a single pass over the data. I can only claim credit for remembering and recoonstructing the technique, I first saw it in a publication by Itzik Ben_Gan. here's the query:

WITH Groups
AS
(SELECT TradeID, Price, Volume,
ROW_NUMBER() OVER (ORDER BY TradeID) AS rn,
ROW_NUMBER() OVER (PARTITION BY Price ORDER BY TradeID) AS rnP
FROM Trades)
SELECT MIN(TradeID) AS StartID, MAX(TradeID) AS EndID, Price,
SUM(Volume) AS TotalVolume
FROM Groups
GROUP BY Price, rn-rnP;

SQL Server 2005 only, unfortunately,..

Best, Hugo

Jeff
2007-05-30
re: More on Runs and Streaks in SQL
Very interesting! I will have to investigate further. I've suspected that you can do this more efficiently with partitions, but never really got a chance to try it out.

Unfortunately, it does require 2005, and it doesn't solve the "updated" requirements (see the comments).

Thanks, Hugo!!

Damian
2007-05-31
re: More on Runs and Streaks in SQL
Thanks all for your suggestions.

Sorry for not being more clear on the specs Radha, it was intended as Jeff interpreted, but when I read your post I could see how it sounded like what you said.

Looks like RunGroups is the way to go. Unfortunately it is not currently implemented in our proprietary db, but it is something that they will consider implementing, so it could be the beginnings of our solution.

Like the elegance of partition solution Hugo, but that sort of sophistication for our vendor is a pipedream!
Regards
Damian

Hugo Kornelis
2007-05-31
re: More on Runs and Streaks in SQL
Hi Jeff,


Indeed, the SQL 2005 solution I posted was targeted at the original requirements. In fact, I believe that the "updated" requirements are caused by a misinterpretation of the original requirements (but I might be wrong).

Anyway, here's a variation on Itzik's technique that will produce the answer to the updated requirements, still using a single pass over the data (and yes, I can claim credit for coming up with this variation myself). The only thing left to do know is to find a practical use for it :-)

WITH Groups AS
(SELECT TradeID, Price, Volume,
ROW_NUMBER() OVER (ORDER BY TradeID DESC) AS rn,
ROW_NUMBER() OVER (PARTITION BY Price ORDER BY TradeID DESC) AS rnP
FROM Trades)
SELECT MIN(TradeID) AS StartID, MAX(TradeID) AS EndID, Price, SUM(Volume) AS TotalVolume
FROM Groups
GROUP BY Price, rn-rnP, COALESCE(NULLIF(rn-rnP,0),rn);

Jeff
2007-05-31
re: More on Runs and Streaks in SQL
Thanks for the clarification, Damian -- I was not sure if Radha was a co-worker of yours or something who knew more about the requirements, so I spent a more time than I should have trying to satisfy his fairly illogical (why would you ever want to return results like that???) specs.

Radha Mukkai
2007-05-31
re: More on Runs and Streaks in SQL
Jeff,

Looks like I did not understand the requirements correctly.

Quick Clarification: I am not Damien's co-worker. The wording, "it sums the volume on a change in price", threw me off a bit.

Sorry for the confusion!

Thanks,
Radha

Dave Null
2008-02-05
Runs and Streaks, grouping and ranking
Hi Jeff,

I'm looking for a set-based solution to this problem. I have a table with about 250K records on an SQL Server 2005 as follows:

CREATE TABLE grid
(
gridID int identity(1,1) not nul l-- also the primary key
,EventDate datetime int not null -- dates not contiguous
,PersonID int not null -- about 5K unique IDs
,StartPosition smallint not null -- range 1 to 100
,FinishPosition smallint not null -- range 1 to 100
)

I need top 10 people with
--the most consecutive StartPositions=1 with start and end dates
--the most consecutive FinishPositions=1 with start and end dates
--the most consecutive StartPositions<10 with start and end dates
--the most consecutive FinishPositions<10 with start and end dates

Before I embark on a cursor-based coding exercise, I wanted to see if there's a set-based approach. I've used rank and dense_rank before but don't know how to combine ranking with grouping and streaks. Thanks for any help.

Dave Null