You can expect:
This is a medium intensity post (you should not get a migraine, but I make no promises).
I Expect (that you have the following experience):
- T-SQL Stored Proc experience
- Windowed Function experience
- Basic Join understanding
Row Number Generation
There are a number of ways to generate row numbers in T-SQL. One way is to create a temp table with an identity column and do inserts into that table. Typically you do it in a while loop with a counter. I've used this method in the past and it works ok. I was helping out a friend who was trying to help out a co-worker with this very problem. I started playing around with other ways to generate row numbers / IDs and came up with what I think is a rather novel way of doing it. There are several pieces to this and I'll cover each of them.
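For comparison, here is a minimal sketch of the loop-based approach described above. The temp table name #RowNums and the count of 1,000 are hypothetical placeholders chosen for illustration.

```sql
-- Loop-based approach: one insert per row number (slow for large counts).
Create Table #RowNums (RowNum int Identity(1, 1));

Declare @i int;
Set @i = 1;
While @i <= 1000   -- 1000 is an arbitrary example count
Begin
    -- The identity column supplies the next row number on every insert.
    Insert Into #RowNums Default Values;
    Set @i = @i + 1;
End

Select RowNum From #RowNums Order By RowNum;
Drop Table #RowNums;
```

Every pass through the loop is a separate insert, which is the cost the rest of this post tries to avoid.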
SQL Server 2005 introduced window functions, which I think of as "in-place aggregate" functions. They use an Over clause to give you results that used to require a Group By, without the Group By. They are helpful when you need extra columns that you just can't keep in your Select when using a Group By. One very useful one is the Row_Number() function. It is tremendously useful, and I'm only using it in a very basic way here.
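As a rough illustration of getting numbering and aggregate values without a Group By, consider the sketch below. The @Orders table variable and its columns are hypothetical sample data, not part of the original problem.

```sql
-- Hypothetical sample data for illustration only.
Declare @Orders Table (CustomerId int, Amount money);
Insert Into @Orders (CustomerId, Amount)
Select 1, 10.00 Union All
Select 1, 25.00 Union All
Select 2, 5.00;

-- Every detail row is kept; the numbering and the total are computed
-- per customer by the Over clause instead of by a Group By.
Select CustomerId,
       Amount,
       Row_Number() Over (Partition By CustomerId Order By Amount Desc) As RowNum,
       Sum(Amount) Over (Partition By CustomerId) As CustomerTotal
From @Orders;
```

A Group By on CustomerId would collapse the rows and drop the individual Amount values; the Over clause keeps them.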
Fast Data Generation
I needed a way to quickly generate data. SQL Server is meant for set-based processing, so things like looping and UDFs are not very speedy. You should all know this, and if you don't, you need to read more! On a side rant, I've seen developers treat T-SQL like it was VB (I say VB because the people who do this are typically green developers). T-SQL is NOT a programming language! It is a Database language!
I decided to use SQL's set power for my purposes. For those that need a refresher, a Cross Join will produce every possible combination of rows from 2 or more tables. My method was to create a table variable with 100 rows and 1 column. That column contained the numbers 1 through 100. A Cross Join of that table to itself would result in a Select with 100 * 100 = 10,000 records. The Select is lightning fast compared to looping 10k times. Adding 2 additional Cross Joins raises the ceiling to 100 * 100 * 100 * 100 = 100m records. If you wanted to do more, you could add more Cross Joins.
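The arithmetic is easy to check with a small sketch. It assumes a 100-row table variable like the one described above and simply counts the combinations produced by joining it to itself.

```sql
-- Seed table: the numbers 1 through 100.
Declare @NumGen Table (Num int);
Declare @cnt int;
Set @cnt = 1;
While @cnt <= 100
Begin
    Insert Into @NumGen (Num) Values (@cnt);
    Set @cnt = @cnt + 1;
End

-- Every row of N1 is paired with every row of N2: 100 * 100 = 10,000.
Select Count(*) As Combinations
From @NumGen N1
Cross Join @NumGen N2;
```

The loop only runs 100 times to build the seed; the join does the rest of the multiplication.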
Execution Plan Optimization
If I only want 5,000 row numbers, how do I limit my Cross Joins to only produce that many? I could try to insert just the right number of records into my seed table so that the joined result is 5k. This, however, would require computing that number up front, and it would only work for counts that factor nicely. I would not, for instance, be able to generate 58,391 row numbers. I again turn to SQL Server's set powers. I wrapped the Cross Join Select inside another Select that has a Where to limit my results. One might think that the SubSelect would have to finish before the outer Select would be run, but that is not the case. The Optimizer sees that I only want 58,391 rows and builds an execution plan that stops processing the SubSelect once that count has been reached. Therefore, I don't generate 100m rows and then only return 58,391; I only generate 58,391 rows.
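Create Procedure GenerateRowNumbers(@NumberOfRows int, @StartNumber int)
As
Begin
    -- Suppress the "rows affected" message for each seed insert
    Set NoCount On
    -- Seed table: the numbers 1 through 100
    Declare @NumGen Table (Num int)
    Declare @cnt int
    Set @cnt = 1
    While @cnt <= 100 Begin
        Insert Into @NumGen (Num) Values (@cnt)
        Set @cnt = @cnt + 1
    End
    -- Four-way cross join: up to 100^4 = 100,000,000 candidate rows.
    -- RowNum starts at 1, so the first value returned is @StartNumber + 1.
    Select @StartNumber + RowNum
    From (
        Select Row_Number() Over (Order By N1.Num) As RowNum
        From @NumGen N1, @NumGen N2, @NumGen N3, @NumGen N4
    ) As RowNums
    Where RowNum <= @NumberOfRows
End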
This method consumes much less memory than filling a temp table using a while loop. It also consumes fewer CPU cycles. This method is very fast and generates about 100k row numbers per second. It handles the offset so you can start at any point and generate any number of row numbers. It does have a 100m limit, but that can be overcome by adding another Cross Join.
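A possible way to call the procedure and keep its output is sketched below. It assumes GenerateRowNumbers has been created as shown above; the #Numbers temp table and the argument values are hypothetical.

```sql
-- Capture the generated numbers in a temp table for further use.
Create Table #Numbers (Num int);

Insert Into #Numbers (Num)
Exec GenerateRowNumbers @NumberOfRows = 58391, @StartNumber = 1000;

-- Sanity check: how many rows came back and what range they cover.
Select Count(*) As Generated, Min(Num) As FirstNum, Max(Num) As LastNum
From #Numbers;

Drop Table #Numbers;
```

Because the procedure returns @StartNumber + RowNum and RowNum starts at 1, pass one less than the first number you want.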