Fast T-SQL Row Number Generator
You can expect:
This is a medium intensity post (you should not get a migraine, but I make no promises).I Expect (that you have the following experience):
- T-SQL Stored Proc experience
- Windowed Function experience
- Basic Join understanding
Row Number Generation
There are a number of ways to generate row numbers in T-SQL. One way is to create a temp table with an identity column and do inserts into that table. Typically you do it in a while loop with a counter. I've used this method in the past and it works ok. I was helping out a friend who was trying to help out a co-worker with this very problem. I started playing around with other ways to generate row numbers / IDs and came up with what I think is a rather novel way of doing it. There are several pieces to this and I'll cover each of them.Windowed Functions
SQL Server 2005 introduced some "in-place aggregate" functions. These allow you to get results you used to have to use a Group By to get, without the Group By. They are helpful in situations where you need extra data that you just can't keep in your Select when using a Group By. One very useful one is the Row_Number() function. It has tremendous usefulness and I'm only using it in a very basic manner.Fast Data Generation
I needed a way to quickly generate data. SQL Server is meant for set based processing and so things like looping and UDF are not very speedy. You should all know this and if you don't, you need to read more! On a side rant, I've seen developers treat T-SQL like it was VB (I say VB because the people who do this are typically green developers). T-SQL is NOT a programming language! It is a Database language!I decided to use SQL's set power for my purposes. For those that need a refresher, a Full Cross Join will produce every possible combination of data from 2 or more tables. My method was to create an in-memory table with 100 rows with 1 column. That column contained the numbers 1 through 100. A Full Cross Join of that table to itself would result in a Select with 100 * 100 = 10,000 records. The Select is lightning fast compared to looping 10k times. Adding 2 additional Full Joins results in the possibility of creating 100m records. If you wanted to do more, you could add more Full Cross Joins.
Execution Plan Optimization
If I only want 5,000 row numbers, how do I limit my Full Cross Joins to only produce that many? I could try to only insert the correct amount of records into my temp table so that the result is 5k. This, however, would require computing the exact number, plus I wouldn't be able to generate any possible number I wanted. I would not, for instance, be able to generate 58,391 row numbers. I again turn to the power of SQL Server's set powers. I wrapped the Full Cross Join Select inside another Select that had a Where to limit my results. One might think that the SubSelect would have to finish before the outer Select would be run, but that is not the case. The Optimizer looks at the execution plan and sees that I only want 58,391 rows and stops processing the SubSelect once that has been reached. Therefore, I don't generate 100m rows and then only return 58,391, I only generate 58,391 rows.The Query
Create Procedure GenerateRowNumbers(@NumberOfRows int, @StartNumber int)
As Begin
Declare @NumGen Table (Num int)
Declare @cnt int
Set @cnt = 1
While @cnt <= 100 Begin
Insert Into @NumGen
Select @cnt
Set @cnt = @cnt + 1
End
Select @StartNumber + RowNum
From
(
Select Row_Number() Over (Order By N1.Num) As RowNum
From @NumGen N1, @NumGen N2, @NumGen N3, @NumGen N4
) RowNums
Where RowNum <= @NumberOfRows
End
Conclusion
This method consumes much less memory than filling a temp table using a while loop. It also consumes less CPU cycles. This method is very fast and generates about 100k row numbers per second. It handles the offset so you can start at any point and generate any number of row numbers. It does have a 100m limit, but that can be overcome by adding another Full Cross Join.-Madman Out-
Legacy Comments
Saggi Neumann
2008-05-30 |
re: Fast T-SQL Row Number Generator Hi James, Few things I don't understand: 1. Why generate a table in every execution of the sproc? You can use a cross join of sys.columns with itself for lots of numbers, or simply generate a numbers table in advance. 2. I believe you want "Select @StartNumber + RowNum - 1" instead of "Select @StartNumber + RowNum" as Row_Number is 1 based, not 0 based. There are lots of methods of creating numbers in SQL, you can read about them here: http://www.projectdmx.com/tsql/tblnumbers.aspx My favorite is Itzik Ben Gan's Row_Number/CTE solution, which doesn't use any table, only constant scans: WITH Nbrs_3( n ) AS ( SELECT 1 UNION SELECT 0 ), Nbrs_2( n ) AS ( SELECT 1 FROM Nbrs_3 n1 CROSS JOIN Nbrs_3 n2 ), Nbrs_1( n ) AS ( SELECT 1 FROM Nbrs_2 n1 CROSS JOIN Nbrs_2 n2 ), Nbrs_0( n ) AS ( SELECT 1 FROM Nbrs_1 n1 CROSS JOIN Nbrs_1 n2 ), Nbrs ( n ) AS ( SELECT 1 FROM Nbrs_0 n1 CROSS JOIN Nbrs_0 n2 ) SELECT n FROM ( SELECT ROW_NUMBER() OVER (ORDER BY n) FROM Nbrs ) D ( n ) WHERE n <= 500 ; Cheers, S. Neumann |
Jeff
2008-05-30 |
re: Fast T-SQL Row Number Generator James -- where you say "FULL JOIN" a few times, I think you mean "CROSS JOIN". Also, consider writing your FROM clause stating the CROSS JOIN between your tables to make it clear what you are doing: FROM @NumGen N1 CROSS JOIN @NumGen N2 CROSS JOIN@NumGen N3 CROSS JOIN @NumGen N4 Longer, yes, but now it is crystal clear. |
Jeff
2008-05-30 |
re: Fast T-SQL Row Number Generator Also -- I agree with Saggi; if speed is a goal, make your table of numbers permanent, don't re-create it each time. One last thing -- in this case, you don't need row_number() at all, you can use math. Not sure which would be faster. i.e., something like: @StartNumber + n1.num * 10000 + n2.num * 1000 + n2.num * 100 + n1.num |
Jeff
2008-05-30 |
re: Fast T-SQL Row Number Generator (I butchered that last expression, but you get the idea; it also helps if your numbers are 0-based and not 1-based) |
Arnold Fribble
2008-05-30 |
re: Fast T-SQL Row Number Generator I like the query that Saggi posted, although I'm surprised it's a UNION rather than a UNION ALL. If you add CTEs up to Nbrs_7 in the obvious way (so you are generating 2^(2^8) rows) then my SQL Server runs out of stack space with a UNION but is fine with a UNION ALL. The execution plan gets a bit big though! |
Madman
2008-06-03 |
re: Fast T-SQL Row Number Generator Ok, I'll address all the concerns here. #1, I generate an in-memory table and not use syscolumns because I cannot guarantee how much data will be in syscolumns. With the way I created it, I can always be sure I will get enough numbers. I also am not hitting the disk in this operation so I don't have that IO overhead (not that much, but every little bit counts). #2, I want it 1 based, not 0 based. #3, The common table expression one is slightly faster, but needed a little modification from what you posted as the one posted only generates 65535 records w/ no Where. Another problem with CTEs is that they can be memory and CPU intensive due to their recursive nature. #4, You are right, it's Cross join not Full, I've corrected the post. #5, The point is not to have a permanent table with the values, but rather to be able to create them on the fly as needed. #6, Replacing the Row_Number() function does not allow the Optimizer to stop processing after I've reached my limit. |
Toni
2008-06-06 |
re: Fast T-SQL Row Number Generator What would be the reason not to use the TOP xx phrase to limit your results? I also would think having a "tally" table (of only numbers from 1 to ump-teen million rows) available for anyone to use could add overall speed to queries and provide the additional functionality without needing to recreate the table in each query. (This method is adapted from work done by Jeff Moden on another site) Also adding a clustered index on the table greatly increases overall query speed which is a tough thing to do on a table variable. Having the Tally table available as a reusable asset to the shop makes it worthwhile to create the clustered index. Once built the Tally table can be made available with public access for read. The syscolumns joined with syscolumns,,, etc can get you up to pretty large numbers very quickly. Say you think 100 million is a good top end, create the table once making sure you get up to the 100 million rows and then it is there for all to use (and is actually quite small in disk space). Recreating a large table each time would reduce the savings from using it to speed queries - no? For instance creating this 10 million row table took about 40 seconds (dedicated on my laptop). Toni /* *********************** * * This code creates a Tally table to be used for converting do loops into * set-based solutions. * * Change the Use statement to your database and change the * tablename from 'Tally' to your table name in quotes 'yourtablename'. * To set the number of rows to be created, change TOP 10000 to TOP yournumber */ Use SQLSrvC if exists (select name from sysobjects where name='Tally' and type='U') drop table Tally select TOP 10000000 IDENTITY(INT,1,1) AS N into dbo.tally from syscolumns sc1 cross join syscolumns sc2 cross join syscolumns sc3 alter table tally add constraint PK_Tally primary key clustered (n) with fillfactor=100 go sp_help tally |
Narahari
2008-06-08 |
re: Fast T-SQL Row Number Generator This is really good. |
Nazila
2011-08-17 |
re: Fast T-SQL Row Number Generator Using the Identity(Int,1,1) idea was great! Thanks for your help :) |