The Running Totals problem is as old as accounting. In SQL Server there are different ways of calculating it and the general consensus is that it is one of the few problems best handled with a cursor. I still say it’s best handled in the presentation layer though. Being the SQL geek I am I can’t accept a problem in SQL Server which has a cursor for a solution (just kidding).
Note that I didn’t put any indexes on the tables so we can’t rely on them for any kind of ordering. The base table has 10 million rows and we’re going to aggregate on a 1 million row subset. For shortness sake the test table consists of a single column that we will aggregate on. And if you really need to output running totals of 1 million rows on the fly you’re doing something wrong. :)
IF OBJECT_ID('tempdb..#baseTable') IS NOT NULL
DROP TABLE #baseTable
-- create a 10 million row base table
-- we need running totals for 1 million rows out of 10 million
SELECT TOP 10000000 -- 10 million row table
ROW_NUMBER() OVER (ORDER BY t1.Number) AS Number
INTO #baseTable
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
The ordered CTE Update solution
In SQL Server 2005 Common Table Expressions (CTE) were introduced. They work like a view inside your select statement. What is cool and not well known about them is that you can update, insert and delete their results without joining them to other tables. I used a TOP (Int Max value) trick because the TOP 100 Percent syntax is considered harmful. Because of this I didn’t want to rely on it.
The next gem is the OUTPUT clause which was also introduced in SQL Server 2005. It outputs changed data from an update, delete or an insert statement into a result set so you don’t need to do another select.
First we populate the temp table with the 1 million rows subset we wish and add an extra RT column which will hold the Running total value. Next we create an ordered CTE, update it with “the direct variable and column update at the same time” trick and directly output the updated rows into the returning result set.
IF OBJECT_ID('tempdb..#testCTE') IS NOT NULL
DROP TABLE #testCTE
DECLARE @dtStart DATETIME
SELECT @dtStart = GETDATE()
-- create our temp table to return data from
SELECT TOP 1000000 Number, CAST(0 AS BIGINT) RT
INTO #testCTE
FROM #baseTable
ORDER BY Number
-- ORDER BY NEWID() -- also works just fine
-- declare helper variable
DECLARE @RT BIGINT
SELECT @RT = 0
;WITH RunningTotals AS
(
SELECT TOP (2147483647) -- put a int max value here to get all rows
Number, RT
FROM #testCTE
ORDER BY Number
-- both of there work correctly
-- ORDER BY Number DESC
-- ORDER BY NEWID()
)
UPDATE RunningTotals
SET @RT = RT = @RT + Number
OUTPUT inserted.*
SELECT DATEDIFF(s, @dtStart, GETDATE()) AS DurationInSeconds
The query ran in 8 seconds which blew my mind. I tried various orderings of the column, putting clustered and nonclustered indexes, ordering insert into #testCTE table by newid() but I couldn’t find any example where the ordering would not be honored. Even fragmenting the index didn’t break it.
If anyone could provide an example when the order by isn’t honored I’d be very pleased. Being proved wrong is kind of fun :)
UPDATE: Problems with clustered index
After playing some more and discussing this with Jeff Moden (hoped he would drop in :)) the problem with adding a clustered index is that the clustered index is always updated in the order of the clustering key. It doesn’t matter on which column it is. We can verify this easily by doing this:
SELECT TOP 1000000 Number, CAST(0 AS BIGINT) RT
INTO #testCTE
FROM #baseTable
ORDER BY Number
CREATE CLUSTERED INDEX CI_testCTE_Number ON #testCTE(Number)
If we run the upper CTE after adding a clustered index we can see that different ordering inside the CTE makes no difference since the update runs in the order of the clustering key. For this to work the #testCTE should NOT have a clustered index on it. Non clustered indexes are ok and they make no difference since the ordering inside the CTE is respected.
And please leave the running totals for the client to do. It’s its job. :)
The Cursor and Correlated subquery solutions
For completeness sake here are the cursor and correlated subquery solutions. Cursor solution took 48 seconds and I just killed the correlated subquery after 5 minutes.
-- Cursor solution
IF OBJECT_ID('tempdb..#testCursor') IS NOT NULL
DROP TABLE #testCursor
-- create our empty temp table to return data from
DECLARE @dtStart DATETIME
SELECT @dtStart = GETDATE()
SELECT TOP 1 Number, CAST(0 AS BIGINT) RT
INTO #testCursor
FROM #baseTable
WHERE 1=0
DECLARE RTCursor CURSOR FOR
SELECT TOP 1000000 Number FROM #baseTable ORDER BY Number
DECLARE @Number INT, @RT BIGINT
SELECT @RT = 0
OPEN RTCursor
WHILE (0=0) BEGIN
FETCH NEXT FROM RTCursor INTO @Number
IF @@FETCH_STATUS <> 0 BREAK
SET @RT = @RT + @Number
INSERT INTO #testCursor(Number, RT)
VALUES(@Number, @RT )
END
CLOSE RTCursor
DEALLOCATE RTCursor
SELECT *
FROM #testCursor
ORDER BY Number
SELECT DATEDIFF(s, @dtStart, GETDATE()) AS DurationInSeconds
GO
-- Correlated subquery method
SELECT TOP 1000 Number, (SELECT SUM(Number) FROM #baseTable WHERE Number <= t1.Number) AS RT
FROM #baseTable t1
ORDER BY Number
I haven’t tried the SQL CLR solution for this so if anyone has any benchmarks I’d be glad to hear them.