Mladen Prajdić Blog

Blog about stuff and things and stuff. Mostly about SQL server and .Net

SQL Server 2005 – Fast Running Totals solution with ordered CTE update?

The Running Totals problem is as old as accounting. In SQL Server there are different ways of calculating it and the general consensus is that it is one of the few problems best handled with a cursor. I still say it’s best handled in the presentation layer though. Being the SQL geek I am I can’t accept a problem in SQL Server which has a cursor for a solution (just kidding).
Note that I didn’t put any indexes on the tables so we can’t rely on them for any kind of ordering. The base table has 10 million rows and we’re going to aggregate on a 1 million row subset. For shortness sake the test table consists of a single column that we will aggregate on. And if you really need to output running totals of 1 million rows on the fly you’re doing something wrong. :)
 
IF OBJECT_ID('tempdb..#baseTable') IS NOT NULL
    DROP TABLE #baseTable
-- create a 10 million row base table
-- we need running totals for 1 million rows out of 10 million
SELECT TOP 10000000 -- 10 million row table
       ROW_NUMBER() OVER (ORDER BY t1.Number) AS Number
INTO   #baseTable
FROM   master..spt_values t1
       CROSS JOIN master..spt_values t2
       CROSS JOIN master..spt_values t3

The ordered CTE Update solution

In SQL Server 2005 Common Table Expressions (CTE) were introduced. They work like a view inside your select statement. What is cool and not well known about them is that you can update, insert and delete their results without joining them to other tables. I used a TOP (Int Max value) trick because the TOP 100 Percent syntax is considered harmful. Because of this I didn’t want to rely on it.

The next gem is the OUTPUT clause which was also introduced in SQL Server 2005. It outputs changed data from an update, delete or an insert statement into a result set so you don’t need to do another select.

First we populate the temp table with the 1 million rows subset we wish and add an extra RT column which will hold the Running total value. Next we create an ordered CTE, update it with “the direct variable and column update at the same time” trick and directly output the updated rows into the returning result set.

IF OBJECT_ID('tempdb..#testCTE') IS NOT NULL
    DROP TABLE #testCTE

DECLARE @dtStart DATETIME SELECT @dtStart = GETDATE()

– create our temp table to return data from SELECT TOP 1000000 Number, CAST(0 AS BIGINT) RT INTO #testCTE FROM #baseTable ORDER BY Number

-- ORDER BY NEWID() -- also works just fine

-- declare helper variable
DECLARE @RT BIGINT
SELECT @RT = 0

;WITH RunningTotals AS
(
    SELECT TOP (2147483647) -- put a int max value here to get all rows
           Number, RT
    FROM   #testCTE
    ORDER BY Number
    -- both of there work correctly
    -- ORDER BY Number DESC
    -- ORDER BY NEWID()
)
UPDATE RunningTotals 
SET       @RT = RT = @RT + Number
OUTPUT inserted.*

SELECT DATEDIFF(s, @dtStart, GETDATE()) AS DurationInSeconds

 

The query ran in 8 seconds which blew my mind. I tried various orderings of the column, putting clustered and nonclustered indexes, ordering insert into #testCTE table by newid() but I couldn’t find any example where the ordering would not be honored. Even fragmenting the index didn’t break it.

If anyone could provide an example when the order by isn’t honored I’d be very pleased. Being proved wrong is kind of fun :)

UPDATE: Problems with clustered index

After playing some more and discussing this with Jeff Moden (hoped he would drop in :)) the problem with adding a clustered index is that the clustered index is always updated in the order of the clustering key. It doesn’t matter on which column it is. We can verify this easily by doing this:

SELECT TOP 1000000 Number, CAST(0 AS BIGINT) RT
INTO   #testCTE
FROM   #baseTable
ORDER BY Number

CREATE CLUSTERED INDEX CI_testCTE_Number ON #testCTE(Number)

 

If we run the upper CTE after adding a clustered index we can see that different ordering inside the CTE makes no difference since the update runs in the order of the clustering key. For this to work the #testCTE should NOT have a clustered index on it. Non clustered indexes are ok and they make no difference since the ordering inside the CTE is respected.

And please leave the running totals for the client to do. It’s its job. :)

The Cursor and Correlated subquery solutions

For completeness sake here are the cursor and correlated subquery solutions. Cursor solution took 48 seconds and I just killed the correlated subquery after 5 minutes.

-- Cursor solution
IF OBJECT_ID('tempdb..#testCursor') IS NOT NULL
    DROP TABLE #testCursor
-- create our empty temp table to return data from
DECLARE @dtStart DATETIME
SELECT  @dtStart = GETDATE()

SELECT TOP 1 Number, CAST(0 AS BIGINT) RT INTO #testCursor FROM #baseTable WHERE 1=0

DECLARE RTCursor CURSOR FOR SELECT TOP 1000000 Number FROM #baseTable ORDER BY Number

DECLARE @Number INT, @RT BIGINT SELECT @RT = 0

OPEN RTCursor WHILE (0=0) BEGIN FETCH NEXT FROM RTCursor INTO @Number

<span class="kwrd">IF</span> <span class="preproc">@@FETCH_STATUS</span> &lt;&gt; 0 <span class="kwrd">BREAK</span> 

<span class="kwrd">SET</span> @RT = @RT + @Number

INSERT <span class="kwrd">INTO</span> #testCursor(Number, RT) 
<span class="kwrd">VALUES</span>(@Number, @RT ) 

END CLOSE RTCursor DEALLOCATE RTCursor

SELECT * FROM #testCursor ORDER BY Number

SELECT DATEDIFF(s, @dtStart, GETDATE()) AS DurationInSeconds GO

– Correlated subquery method SELECT TOP 1000 Number, (SELECT SUM(Number) FROM #baseTable WHERE Number <= t1.Number) AS RT FROM #baseTable t1 ORDER BY Number

 

 

I haven’t tried the SQL CLR solution for this so if anyone has any benchmarks I’d be glad to hear them.

kick it on DotNetKicks.com

Legacy Comments


Brian Tkatch
2009-07-28
re: SQL Server 2005 – Fast Running Totals solution with ordered CTE update?
There's also the recursive CTE. But that takes forever as well.

No numbers, it's not done yet. :)

Mladen
2009-07-28
re: SQL Server 2005 – Fast Running Totals solution with ordered CTE update?
tried the recursive CTE it does take forever and it can reach the maxrecursion limit easily.
so i don't like it :)

csm
2009-07-28
re: SQL Server 2005 – Fast Running Totals solution with ordered CTE update?
Try casting the field that you use to order the CTE. Something like

;WITH RunningTotals AS
(
SELECT TOP (2147483647)
Number, RT
FROM #testCTE
ORDER BY CAST(NEWID() AS VARCHAR(50)) -- also with the field "Number"

-- both of there work correctly
-- ORDER BY Number DESC
-- ORDER BY NEWID()
)

Mladen
2009-07-28
re: SQL Server 2005 – Fast Running Totals solution with ordered CTE update?
@csm
it works just fine with ORDER BY CAST(NEWID() AS VARCHAR(50)) for me.

what works wrong for you?

csm
2009-07-28
re: SQL Server 2005 – Fast Running Totals solution with ordered CTE update?
Mmmm... something different between your computer and mine; I tried removing the cast operation (only ORDER BY NEWID()) and it doesn't works; I mean, didn't ordered the results.

In fact, the tests that I'm running now only work if I remove the ORDER BY clause (or order by number).

Well, I'm testing with SQL Server 2008, I don't know if there is any difference with 2005...

Adam Machanic
2009-07-28
re: SQL Server 2005 – Fast Running Totals solution with ordered CTE update?
Jeff Moden has some examples that make this fail, depending on the clustering key (and maybe on the SQL Server version -- there are no guarantees that this behavior will definitely work every time). The QP will sometimes, as an optimization, order the rows before doing the update in the same order as the clustered index, if you're doing a large enough update (i.e., large enough % of the rows). I don't know when or why this kicks in but Jeff's counter-examples certainly proved that last time I tried them. I suspect that we could figure out a way to &quot;trick&quot; the optimizer into thinking that only one row will be processed, thereby eliminating the sort, but I don't feel like spending the time to do that when a SQLCLR solution is even faster and is guaranteed to work properly.

Mladen
2009-07-28
re: SQL Server 2005 – Fast Running Totals solution with ordered CTE update?
@AdamMachanic:
i've been looking for his examples but couldn't find them. one article that might have them on sqlservercentral was removed. but i'm familiar with that problem and that's precisely why i didn't include any kind of index here.
because of the order by in the cte the execution plan shows the input sort so QP doesn't have to perform the sorting again. that's the whole point of it which has really surprised me. the tests i've run were all 100% correct every time. with index, without index, irrelevant of fragmentation or anythign else.

@csm:
i tried it on 2005 and 2008. same thing.
what ordering are you expecting to see exactly? if you order by ORDER BY CAST(NEWID() AS VARCHAR(50)) the result isn't supposed to come backe in 1,2,3,4,5,6, etc order of the number.

Alex Kuznetsov
2009-07-28
re: SQL Server 2005 – Fast Running Totals solution with ordered CTE update?
Mladen,
You can also consider denormalization:

http://sqlblog.com/blogs/alexander_kuznetsov/archive/2009/01/23/denormalizing-to-enforce-business-rules-running-totals.aspx


Jeff Moden
2009-07-29
re: SQL Server 2005 – Fast Running Totals solution with ordered CTE update?
For some reason, this blog won't take the code examples I wrote....

The reason I took my article down from SSC is for the very reason you blogged... if an index does happen to come into play, it will destroy the order of the ORDER BY in the UPDATE. Also, the whole purpose of doing such a thing as this is for blinding speed. If you add a clustered index in the order you want the running total to be in the target table, you'll get almost 3 times the performance of the method you took the good time to post and you won't need the CTE and you won't need an ORDER BY. In fact, the clustered index will override the ORDER BY because an UPDATE on a single table with a clustered index will always do it in the same order as the clustered index... hence the term "quirky update". So, the ORDER BY is a bit misleading because it has absolutely no effect on tables that have a clustered index. My concern here is that someone won't read your entire blog entry and might thing that an ordered UPDATE will work on any table and it just won't.

Write back if you want. Sorry my code examples wouldn't post. Guess you'll have to read the article once I republish it with caveats to consider like this one and other things like partitioned tables, etc.

Mladen
2009-07-29
re: SQL Server 2005 – Fast Running Totals solution with ordered CTE update?
I've sent a mail to Jeff asking him to send me the code so i can see if i can post it.

this is what i've mailed him concerning my code:
I'm quite familiar with all the problems about this update technique and the clustered index order update so i was a surprised to see this work.
hence the question mark in the title :)
I've tried adding the clustered and a non clustered index to the #tempCTE table and it made no difference. even tried fragmenting it.
every time the code ran perfectly correct. I mentioned that after the code.
so yes i was pretty surprised about this behaviour, hence the blog post since i knew there'd be some nice debate :)

Peso
2009-07-29
re: SQL Server 2005 – Fast Running Totals solution with ordered CTE update?
I'll try to explain it little more.

If you create the clustered index AFTER the table is populated, the records will be rearranged but not within the records.
That's what the page offset is for.

Now this is not the perfect technical explanation, but this is probably what happens.

Mladen
2009-07-29
re: SQL Server 2005 – Fast Running Totals solution with ordered CTE update?
ah i've found the point jeff was trying to make. if i add the clustered index after the table is populated then the order by is not honored.
however if i add the CI before the table is populated and the order is sometimes honored and sometimes not. i don't know why this would be so.

if there are no indexes the order is honored every time.

Peso
2009-07-30
re: SQL Server 2005 – Fast Running Totals solution with ordered CTE update?
SELECT SUM(qty) OVER (PARTITION BY EmpID ORDER BY OrderMonth ROWS BETWEEN 2147483647 PRECEDING AND CURRENT ROW) AS RT
FROM dbo.Orders
ORDER BY EmpID,
OrderMonth

Peso
2009-07-30
re: SQL Server 2005 – Fast Running Totals solution with ordered CTE update?
SELECT RowID ,Col1,
SUM(Col1) OVER(ORDER BY RowId ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT
ROW) AS Col2
FROM tablehh
ORDER BY RowId;

Arnold Fribble
2009-08-03
re: SQL Server 2005 – Fast Running Totals solution with ordered CTE update?

Hey Peter (Peso),
Do you have some secret MVP techniques for making those queries work before 2011? (and then only if we're lucky)

SQL Lion
2010-08-17
re: SQL Server 2005 – Fast Running Totals solution with ordered CTE update?
CTEs(Common Table Expressions) are one of the beautiful and the most powerful feature of SQL Server and are boon to SQL Server developers. These not only simplifies most of the complex operations in T-SQL but also do a lot more than one can imagine. Follow the link to know the details…
www.sqllion.com/.../common-table-expressions-cte/