More on Runs and Streaks in SQL
That's right boys and girls, it's what you've been waiting for all weekend: Another edition of the mailbag!
Damian writes:
Hi
I have a tricky SQL question that I have been trawling the net and
workmates to find an answer. We are accessing a real time proprietary
database that approximates classic SQL in syntax. The GUI we are using does
not allow anything to be done effectively on the front end. It all has
to happen in the query.
Trades flow in and we would like to sum them on the fly with only a
query. We need to sum based on the change in price.
eg
for this series of trades (ID is ordered but non-contiguous)
Trade ID Price Volume
1 4 10
3 4 20
6 5 15
7 4 20
should produce
price volume
4 30
5 15
4 20
so it sums the volume on a change in price. but note if the same price
appears again this is not included in the total for the first instance
of that price but in the total for a new instance of that price.
It is so simple but the concensus is it is impossible in pure SQL. At
first I did not think it was but now I tend to agree. we cannot change
the tables in the database as it is all locked away and front end
processing is unworkable.
would you agree?
thanks
Damian
I have a tricky SQL question that I have been trawling the net and
workmates to find an answer. We are accessing a real time proprietary
database that approximates classic SQL in syntax. The GUI we are using does
not allow anything to be done effectively on the front end. It all has
to happen in the query.
Trades flow in and we would like to sum them on the fly with only a
query. We need to sum based on the change in price.
eg
for this series of trades (ID is ordered but non-contiguous)
Trade ID Price Volume
1 4 10
3 4 20
6 5 15
7 4 20
should produce
price volume
4 30
5 15
4 20
so it sums the volume on a change in price. but note if the same price
appears again this is not included in the total for the first instance
of that price but in the total for a new instance of that price.
It is so simple but the concensus is it is impossible in pure SQL. At
first I did not think it was but now I tend to agree. we cannot change
the tables in the database as it is all locked away and front end
processing is unworkable.
would you agree?
thanks
Damian
It is a little tricky, and it is not terribly efficient, but it can be done in pure set-based SQL if I am understanding your requirements correctly.
All you need is the technique shown here.
(As far as I know, until I hear otherwise, I'll take credit for coming up with this method! )
Here the code:
-- set up sample table:
create table Trades (TradeID int primary key, Price int, Volume int)
insert into Trades
select 1,4,10 union all
select 3,4,20 union all
select 6,5,15 union all
select 7,4,20
-- your solution:
select
min(tradeID) as StartID, max(TradeID) as EndID, Price, sum(Volume) as TotalVolume
from
(
select t1.TradeID, t1.Price, t1.Volume,
(select count(*) from Trades t2
where t2.TradeID < t1.TradeID and t2.Price <> t1.Price) as RunGroup
from Trades t1
) x
group by
Price, RunGroup
order by
min(tradeID)
-- results:
StartID EndID Price TotalVolume
----------- ----------- ----------- -----------
1 3 4 30
6 6 5 15
7 7 4 20
(3 row(s) affected)
create table Trades (TradeID int primary key, Price int, Volume int)
insert into Trades
select 1,4,10 union all
select 3,4,20 union all
select 6,5,15 union all
select 7,4,20
-- your solution:
select
min(tradeID) as StartID, max(TradeID) as EndID, Price, sum(Volume) as TotalVolume
from
(
select t1.TradeID, t1.Price, t1.Volume,
(select count(*) from Trades t2
where t2.TradeID < t1.TradeID and t2.Price <> t1.Price) as RunGroup
from Trades t1
) x
group by
Price, RunGroup
order by
min(tradeID)
-- results:
StartID EndID Price TotalVolume
----------- ----------- ----------- -----------
1 3 4 30
6 6 5 15
7 7 4 20
(3 row(s) affected)
As I mentioned, this works, but it is not very efficient; depending on how much data you have, it might run very slowly. The culprit is that correlated sub-query that calculates the "RunGroup". You can also calculate the RunGroup using either of these techniques if they are more efficient for your data:
-- option 2
select *, (select min(tradeID) from Trades t2
where t2.TradeID > t1.TradeID and t2.Price != t1.Price) as RunGroup
from Trades t1
-- option 3
select *, (select max(TradeID) from Trades t2
where t2.TradeID < t1.TradeID and t2.Price <> t1.Price) as RunGroup
from Trades t1
select *, (select min(tradeID) from Trades t2
where t2.TradeID > t1.TradeID and t2.Price != t1.Price) as RunGroup
from Trades t1
-- option 3
select *, (select max(TradeID) from Trades t2
where t2.TradeID < t1.TradeID and t2.Price <> t1.Price) as RunGroup
from Trades t1
For this small sample table, the execution plan is the same for all 3.
Also, note that I am tracking the "RunGroups" for all of the data in your table, but if it can be partitioned -- for example, by Customer or Client -- then that will make it much more efficient. To add a partition on, say, Client, you would alter the code like this:
select
Client, min(tradeID) as StartID, max(TradeID) as EndID, Price, sum(Volume) as TotalVolume
from
(
select t1.Client, t1.TradeID, t1.Price, t1.Volume,
(select count(*) from Trades t2
where t2.TradeID < t1.TradeID and t2.Price <> t1.Price
and t1.Client = t2.Client) as RunGroup
from Trades t1
) x
group by
Client, Price, RunGroup
order by
Client, min(tradeID)
Client, min(tradeID) as StartID, max(TradeID) as EndID, Price, sum(Volume) as TotalVolume
from
(
select t1.Client, t1.TradeID, t1.Price, t1.Volume,
(select count(*) from Trades t2
where t2.TradeID < t1.TradeID and t2.Price <> t1.Price
and t1.Client = t2.Client) as RunGroup
from Trades t1
) x
group by
Client, Price, RunGroup
order by
Client, min(tradeID)
I hope this helps! if not, let me know.
Legacy Comments
Radha Mukkai
2007-05-30 |
re: More on Runs and Streaks in SQL Jeff, If we change the data to: TradeID Price Volume 1 4 10 3 4 20 6 5 15 7 4 20 8 4 30 The results are: StartID EndID Price Volume 1 3 4 30 6 6 5 15 7 8 4 50 The expected results based on the problem definition (volume should be summed only on a change in price) StartID EndID Price Volume 1 3 4 30 6 6 5 15 7 7 4 20 8 8 4 30 We will have to change the correlated sub-query RunGroup to get the desired results. However, I am not sure what change needs to be made! I am looking into it, though. Thanks, Radha Mukkai |
Jeff
2007-05-30 |
re: More on Runs and Streaks in SQL Radha -- your expected results do not match your requirements. You said: >>volume should be summed only on a change in price yet, with your new set of sample data, there is no change in price between ID 7 and 8. Therefore, based on your stated requirements, they should be summed together into 1 row, which is what is returned in the results. Your expected results do not show this behavior. Why should 7 and 8 *not* be aggregated together ?? In other words, why are 1 and 3 summed together, but not 7 and 8 in your expected results? Nothing in your requirements indicates why, and the data itself shows no difference between the two situations as well. |
Radha Mukkai
2007-05-30 |
re: More on Runs and Streaks in SQL Jeff, TradeIDs' 1 and 3 are summed together because there was a change in price (in TradeID: 6). However, with Trade ID's 7 and 8 there was no change in price and hence the Volumes will NOT be summed. This is what I think the requirements are: If there is a change in price, we sum the volume of the previous records. For example: We start at TradeID 1. This is the initial price. Go to TradeID 3. No price change. Go to TradeID 6. There is a price change (from 4 to 5). Sum all the previous records (2 records total - Sum IDs 1 and 3). Go to TradeID 7. There is another price change (from 5 to 4). Sum all the previous records (1 record total - Sum ID 6) Go to TradeID 8. No price change. Was my explanation clear? Does it make sense? Let me know. Thanks, Radha Mukkai |
Jeff
2007-05-30 |
re: More on Runs and Streaks in SQL No, not really clear at all. My fear is that if I solve this latest iteration, the requirements and sample data are going to change again. Let's do this backwards: I'll try to give *you* the *requirements*, and let me know if they are correct. "For all consecutive rows with the same Price (when ordered by tradeID), if a row exists with a higher TradeID and a different Price, group them all together and calculate the sum(Volume) for that group. If there does *not* exist a row with a higher tradeID and a different price, leave the rows ungrouped." Is that correct? If so, then you need to add an additional "grouping" column that returns a constant if the rows should be grouped together, or the TradeID if not. this will make the query even less efficient, of course. Something like this: select min(tradeID) as StartID, max(TradeID) as EndID, Price, sum(Volume) as TotalVolume from ( select t1.TradeID, t1.Price, t1.Volume, (select count(*) from Trades t2 where t2.TradeID < t1.TradeID and t2.Price <> t1.Price) as RunGroup, case when exists(select * from Trades t3 where t3.TradeID > t1.tradeID and t3.Price <> t1.Price) then 0 else TradeID end as RunGroup2 from Trades t1 ) x group by Price, RunGroup, RunGroup2 order by min(tradeID) which gives you the results that you want. the results of which don't really make any logical sense, but if that's what you need, then that's what you need I suppose. |
Hugo Kornelis
2007-05-30 |
re: More on Runs and Streaks in SQL Hi Jeff, Here's a method that is lots quicker, since it uses a single pass over the data. I can only claim credit for remembering and recoonstructing the technique, I first saw it in a publication by Itzik Ben_Gan. here's the query: WITH Groups AS (SELECT TradeID, Price, Volume, ROW_NUMBER() OVER (ORDER BY TradeID) AS rn, ROW_NUMBER() OVER (PARTITION BY Price ORDER BY TradeID) AS rnP FROM Trades) SELECT MIN(TradeID) AS StartID, MAX(TradeID) AS EndID, Price, SUM(Volume) AS TotalVolume FROM Groups GROUP BY Price, rn-rnP; SQL Server 2005 only, unfortunately,.. Best, Hugo |
Jeff
2007-05-30 |
re: More on Runs and Streaks in SQL Very interesting! I will have to investigate further. I've suspected that you can do this more efficiently with partitions, but never really got a chance to try it out. Unfortunately, it does require 2005, and it doesn't solve the "updated" requirements (see the comments). Thanks, Hugo!! |
Damian
2007-05-31 |
re: More on Runs and Streaks in SQL Thanks all for your suggestions. Sorry for not being more clear on the specs Radha, it was intended as Jeff interpreted, but when I read your post I could see how it sounded like what you said. Looks like RunGroups is the way to go. Unfortunately it is not currently implemented in our proprietary db, but it is something that they will consider implementing, so it could be the beginnings of our solution. Like the elegance of partition solution Hugo, but that sort of sophistication for our vendor is a pipedream! Regards Damian |
Hugo Kornelis
2007-05-31 |
re: More on Runs and Streaks in SQL Hi Jeff, Indeed, the SQL 2005 solution I posted was targeted at the original requirements. In fact, I believe that the "updated" requirements are caused by a misinterpretation of the original requirements (but I might be wrong). Anyway, here's a variation on Itzik's technique that will produce the answer to the updated requirements, still using a single pass over the data (and yes, I can claim credit for coming up with this variation myself). The only thing left to do know is to find a practical use for it :-) WITH Groups AS (SELECT TradeID, Price, Volume, ROW_NUMBER() OVER (ORDER BY TradeID DESC) AS rn, ROW_NUMBER() OVER (PARTITION BY Price ORDER BY TradeID DESC) AS rnP FROM Trades) SELECT MIN(TradeID) AS StartID, MAX(TradeID) AS EndID, Price, SUM(Volume) AS TotalVolume FROM Groups GROUP BY Price, rn-rnP, COALESCE(NULLIF(rn-rnP,0),rn); |
Jeff
2007-05-31 |
re: More on Runs and Streaks in SQL Thanks for the clarification, Damian -- I was not sure if Radha was a co-worker of yours or something who knew more about the requirements, so I spent a more time than I should have trying to satisfy his fairly illogical (why would you ever want to return results like that???) specs. |
Radha Mukkai
2007-05-31 |
re: More on Runs and Streaks in SQL Jeff, Looks like I did not understand the requirements correctly. Quick Clarification: I am not Damien's co-worker. The wording, "it sums the volume on a change in price", threw me off a bit. Sorry for the confusion! Thanks, Radha |
Dave Null
2008-02-05 |
Runs and Streaks, grouping and ranking Hi Jeff, I'm looking for a set-based solution to this problem. I have a table with about 250K records on an SQL Server 2005 as follows: CREATE TABLE grid ( gridID int identity(1,1) not nul l-- also the primary key ,EventDate datetime int not null -- dates not contiguous ,PersonID int not null -- about 5K unique IDs ,StartPosition smallint not null -- range 1 to 100 ,FinishPosition smallint not null -- range 1 to 100 ) I need top 10 people with --the most consecutive StartPositions=1 with start and end dates --the most consecutive FinishPositions=1 with start and end dates --the most consecutive StartPositions<10 with start and end dates --the most consecutive FinishPositions<10 with start and end dates Before I embark on a cursor-based coding exercise, I wanted to see if there's a set-based approach. I've used rank and dense_rank before but don't know how to combine ranking with grouping and streaks. Thanks for any help. Dave Null |