Lightning fast collapsed date ranges and missing date ranges
The last two days I have been involved in a rather interesting discussion.
The original poster wanted a fast way to get missing date ranges in a series of date pairs.
Naturally I posted the link to the Script Library topic http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=88422
Traditional T-SQL proved to be very inefficient. Even when using the CTE approch which proved to be second fastest and still 100 to 1,600 times slower!
I started out with creating 1,000 date pairs with following code
-- Prepare sample data
CREATE TABLE #ProcessCellAllocation
(
AllocationID INT IDENTITY(1, 1) NOT NULL,
ProcessCell VARCHAR(50) NOT NULL,
DateFrom DATETIME NOT NULL,
DateTo DATETIME,
Seq INT
)
INSERT #ProcessCellAllocation
(
ProcessCell,
DateFrom
)
SELECT TOP 10000
CHAR(65 + ABS(CHECKSUM(NEWID())) % 26),
25000 + ABS(CHECKSUM(NEWID())) % 25000
FROM syscolumns AS c1
CROSS JOIN syscolumns AS c2
UPDATE #ProcessCellAllocation
SET DateTo = DATEADD(DAY, ABS(CHECKSUM(NEWID())) % 30, DateFrom)
CREATE CLUSTERED INDEX IX_DateFrom ON #ProcessCellAllocation (ProcessCell, DateFrom, DateTo) WITH FILLFACTOR = 95
The various techniques used to produce the wanted results were very inefficient.
Ultimately I come up with this idea, which produced both collapsed date ranges and missing date ranges.
I started out with clearing a preexisting Seq columns. I can clear all records or only a certain range of ProcessCells.
UPDATE #ProcessCellAllocation
SET Seq = NULL
WHERE ProcessCell BETWEEN 'A' AND 'F'
I then initialize some variables to speed up counting
DECLARE @Seq INT,
@ProcessCell VARCHAR(50),
The original poster wanted a fast way to get missing date ranges in a series of date pairs.
Naturally I posted the link to the Script Library topic http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=88422
Traditional T-SQL proved to be very inefficient. Even when using the CTE approch which proved to be second fastest and still 100 to 1,600 times slower!
I started out with creating 1,000 date pairs with following code
-- Prepare sample data
CREATE TABLE #ProcessCellAllocation
(
AllocationID INT IDENTITY(1, 1) NOT NULL,
ProcessCell VARCHAR(50) NOT NULL,
DateFrom DATETIME NOT NULL,
DateTo DATETIME,
Seq INT
)
INSERT #ProcessCellAllocation
(
ProcessCell,
DateFrom
)
SELECT TOP 10000
CHAR(65 + ABS(CHECKSUM(NEWID())) % 26),
25000 + ABS(CHECKSUM(NEWID())) % 25000
FROM syscolumns AS c1
CROSS JOIN syscolumns AS c2
UPDATE #ProcessCellAllocation
SET DateTo = DATEADD(DAY, ABS(CHECKSUM(NEWID())) % 30, DateFrom)
CREATE CLUSTERED INDEX IX_DateFrom ON #ProcessCellAllocation (ProcessCell, DateFrom, DateTo) WITH FILLFACTOR = 95
The various techniques used to produce the wanted results were very inefficient.
Ultimately I come up with this idea, which produced both collapsed date ranges and missing date ranges.
I started out with clearing a preexisting Seq columns. I can clear all records or only a certain range of ProcessCells.
UPDATE #ProcessCellAllocation
SET Seq = NULL
WHERE ProcessCell BETWEEN 'A' AND 'F'
I then initialize some variables to speed up counting
DECLARE @Seq INT,
@ProcessCell VARCHAR(50),
@DateFrom DATETIME,
@DateTo DATETIME
SELECT TOP 1
@Seq = 0,
@ProcessCell = ProcessCell
@DateFrom = DateFrom,
@DateTo = DateTo
FROM #ProcessCellAllocation
ORDER BY ProcessCell,
DateFrom
This is only for initializing the documented trick I use, the Clustered Index Update.
The code used for the actual update looks like this
UPDATE #ProcessCellAllocation
SET @Seq = CASE
WHEN DateFrom > @DateTo THEN @Seq + 1
WHEN ProcessCell > @ProcessCell THEN @Seq + 1
ELSE @Seq
END,
@DateFrom = CASE
WHEN DateTo > @DateTo THEN DateFrom
WHEN ProcessCell > @ProcessCell THEN DateFrom
ELSE @DateFrom
END,
@DateTo = CASE
WHEN DateTo > @DateTo THEN DateTo
WHEN ProcessCell > @ProcessCell THEN DateTo
ELSE @DateTo
END,
Seq = @Seq,
@ProcessCell = ProcessCell
Now all work is done, and we can either show the collapgsed date ranges with this
-- Get the collapsed date ranges
WHEN DateFrom > @DateTo THEN @Seq + 1
WHEN ProcessCell > @ProcessCell THEN @Seq + 1
ELSE @Seq
END,
@DateFrom = CASE
WHEN DateTo > @DateTo THEN DateFrom
WHEN ProcessCell > @ProcessCell THEN DateFrom
ELSE @DateFrom
END,
@DateTo = CASE
WHEN DateTo > @DateTo THEN DateTo
WHEN ProcessCell > @ProcessCell THEN DateTo
ELSE @DateTo
END,
Seq = @Seq,
@ProcessCell = ProcessCell
Now all work is done, and we can either show the collapgsed date ranges with this
-- Get the collapsed date ranges
SELECT ProcessCell,
MIN(DateFrom) AS DateFrom,
MAX(DateTo) AS DateTo
FROM #ProcessCellAllocation
GROUP BY ProcessCell,
Seq
ORDER BY Seq
Or we can get the missing date ranges with this
-- Get the missing date ranges
MIN(DateFrom) AS DateFrom,
MAX(DateTo) AS DateTo
FROM #ProcessCellAllocation
GROUP BY ProcessCell,
Seq
ORDER BY Seq
Or we can get the missing date ranges with this
-- Get the missing date ranges
SELECT a.ProcessCell,
a.DateFrom,
b.DateTo
FROM (
SELECT ProcessCell,
Seq,
MAX(DateTo) AS DateFrom
FROM #ProcessCellAllocation
GROUP BY ProcessCell,
Seq
) AS a
a.DateFrom,
b.DateTo
FROM (
SELECT ProcessCell,
Seq,
MAX(DateTo) AS DateFrom
FROM #ProcessCellAllocation
GROUP BY ProcessCell,
Seq
) AS a
INNER JOIN (
SELECT ProcessCell,
Seq,
MIN(DateFrom) AS DateTo
FROM #ProcessCellAllocation
GROUP BY ProcessCell,
Seq
) AS b ON b.ProcessCell = a.ProcessCell
WHERE a.Seq = b.Seq - 1
ORDER BY a.Seq
SELECT ProcessCell,
Seq,
MIN(DateFrom) AS DateTo
FROM #ProcessCellAllocation
GROUP BY ProcessCell,
Seq
) AS b ON b.ProcessCell = a.ProcessCell
WHERE a.Seq = b.Seq - 1
ORDER BY a.Seq
This technique is very fast!
For 1,000 date pairs the algorithm runs in 80 ms.
For 10,000 date pairs the algorithm runs in 360 ms.
For 100,000 date pairs the algorithm runs in 900 ms.
For 1,000,000 date pairs the algorithm runs in 2250 ms.
Legacy Comments
--Jeff Moden
2008-08-12 |
re: Lightning fast collapsed date ranges and missing date ranges Nicely done, Peter... but you really need to change your tab size or convert them to spaces ;-) |
NewSQL
2009-08-08 |
re: Lightning fast collapsed date ranges and missing date ranges Is it possible to do the same to merge data ranges when they are contiguous (rather than overlapping?). For example, Seq No. Group From To 1 1 1/1/1950 12/31/1970 2 1 1/1/1971 12/31/1990 3 1 1/1/1991 12/31/2010 4 2 1/1/2005 12/31/2005 5 2 1/1/2007 12/31/2007 Output: Group From To 1 1/1/1950 12/31/2010 <-- merged first 3 rows 2 1/1/2005 12/31/2005 <-- cannot merge 2 1/1/2007 12/31/2007 <-- cannot merge Thanks |
Peso
2009-08-08 |
re: Lightning fast collapsed date ranges and missing date ranges Sure. Change ">" to ">=". |
Ben
2009-11-02 |
re: Lightning fast collapsed date ranges and missing date ranges Any thoughts on how to get missing date ranges in access.mdb SQL |
Langston
2010-12-23 |
re: Lightning fast collapsed date ranges and missing date ranges Thanks Peso!!! |
Darin
2011-03-02 |
re: Lightning fast collapsed date ranges and missing date ranges Wow. Just Wow. That, my friend, has to be one of the slickest SQL tricks I've seen in a LONG time. Took a 60+ minute query down to <30seconds. That's definitely one for the ol' toolbox! Thanks! |
Wes
2011-05-28 |
re: Lightning fast collapsed date ranges and missing date ranges I'm fairly new to SQL. I've solved this problem before using object oriented programming. I don't completely understand your method here. You didn't explain what some of these columns are or what they mean like AllocationID, ProcessCell, and Seq. Given raw data like: DECLARE @TimeFrames AS TABLE (start_date DATETIME,end_date DATETIME) INSERT @TimeFrames SELECT TOP 10 DateFrom as start_date,DateTo as end_date FROM #ProcessCellAllocation SELECT * FROM @TimeFrames t ------------ Results: ------------ start_date end_date 1968-06-23 00:00:00.000 1968-07-10 00:00:00.000 1968-08-14 00:00:00.000 1968-08-20 00:00:00.000 1968-10-08 00:00:00.000 1968-11-02 00:00:00.000 1968-12-15 00:00:00.000 1968-12-21 00:00:00.000 1969-06-29 00:00:00.000 1969-07-19 00:00:00.000 1969-09-11 00:00:00.000 1969-09-14 00:00:00.000 1969-11-21 00:00:00.000 1969-12-08 00:00:00.000 1969-12-03 00:00:00.000 1969-12-24 00:00:00.000 1970-01-26 00:00:00.000 1970-02-03 00:00:00.000 1970-03-12 00:00:00.000 1970-03-30 00:00:00.000 How would one collapse this? Thanks, Wes |