Timings of different techniques for finding missing records
I did some tests today to measure the different approaches for finding records present in one table but not in another. The results of CPU and Duration are presented below with some help from SQL Profiler.
Number of reads are equal between methods but different depending how many record in #TableB.
If there are other methods I haven't included, please let me know.
Method TableA TableB CPU Duration
Number of reads are equal between methods but different depending how many record in #TableB.
If there are other methods I haven't included, please let me know.
Method TableA TableB CPU Duration
---------- ------- ------- --- --------
GROUP BY 1000000 1000000 748 754
LEFT JOIN 1000000 1000000 328 321
NOT EXISTS 1000000 1000000 265 288
NOT IN 1000000 1000000 296 293
EXCEPT 1000000 1000000 312 288
EXCEPT 1000000 1000000 312 288
GROUP BY 1000000 500000 577 2984
LEFT JOIN 1000000 500000 328 2930
NOT EXISTS 1000000 500000 187 2991
NOT IN 1000000 500000 312 2861
EXCEPT 1000000 500000 234 2882
EXCEPT 1000000 500000 234 2882
GROUP BY 1000000 0 453 5866
LEFT JOIN 1000000 0 280 5791
NOT EXISTS 1000000 0 125 5855
NOT IN 1000000 0 250 5825
EXCEPT 1000000 0 234 5798
Here is the code for testing
CREATE TABLE #TableA
EXCEPT 1000000 0 234 5798
Here is the code for testing
CREATE TABLE #TableA
(
i INT PRIMARY KEY CLUSTERED
)
INSERT #TableA
SELECT Number
FROM dbo.F_TABLE_NUMBER_RANGE(0, 999999)
ORDER BY Number
CREATE TABLE #TableB
(
i INT PRIMARY KEY CLUSTERED
)
INSERT #TableB
SELECT Number
FROM dbo.F_TABLE_NUMBER_RANGE(0, 999999)
ORDER BY Number
DELETE f
FROM (
SELECT TOP 500000
i
FROM #TableB
ORDER BY NEWID()
) AS f
-- GROUP BY
SELECT i
FROM (
SELECT 0 AS s, i FROM #TableA UNION ALL
SELECT 1, i FROM #TableB
) AS d
GROUP BY i
HAVING MAX(s) = 0
-- LEFT JOIN
SELECT a.i
FROM #TableA AS a
LEFT JOIN #TableB AS b ON b.i = a.i
WHERE b.i IS NULL
-- NOT EXISTS
SELECT a.i
FROM #TableA AS a
WHERE NOT EXISTS (SELECT b.i FROM #TableB AS b WHERE b.i = a.i)
-- NOT IN
SELECT a.i
FROM #TableA AS a
WHERE a.i NOT IN (SELECT b.i FROM #TableB AS b)
WHERE a.i NOT IN (SELECT b.i FROM #TableB AS b)
-- EXCEPT
SELECT i FROM #TableA
EXCEPT
SELECT i FROM #TableB
EXCEPT
SELECT i FROM #TableB
DROP TABLE #TableA,
#TableB
Legacy Comments
Charles Gildawie
2009-06-12 |
re: Timings of different techniques for finding missing records EXCEPT is pretty interesting. Never used that before. Why the WHERE clause in the two sub - queries? -- NOT IN SELECT a.i FROM #TableA AS a WHERE a.i NOT IN (SELECT b.i FROM #TableB AS b WHERE b.i = a.i) is that different from -- NOT IN (no WHERE clause) SELECT a.i FROM #TableA AS a WHERE a.i NOT IN (SELECT i FROM #TableB) --------- [NB] -- meant the NOT IN rather than the exists Charlie |
Peso
2009-06-12 |
re: Timings of different techniques for finding missing records Great spot! I will have to run the NOT IN again, after the weekend. Come back beginning of next week for updated timings. |
Saggi Neumann
2009-06-13 |
re: Timings of different techniques for finding missing records Hi Peter, Interesting post. Did you run each scenario & query once or did you try to take an average? The NOT EXISTS, NOT IN (which has the same QP as a non-corelated NOT IN) and the EXCEPT queries all have a very similiar query plan so they should on average be the same, with the logical difference between NOT EXISTS and EXCEPT vs. NOT IN and LEFT JOIN - that a row with NULL in the key column is returned by the former and not by the latter. Also, the duration increases as you *return* more rows (there are actually less reads and CPU in the 1,000,000 : 0 scenario). If you return the count of missing rows, you'll get the same ratio between the GROUP BY method and the rest of the queries (around 2:1). Cheers, S. Neumann |
Peso
2009-06-13 |
re: Timings of different techniques for finding missing records The timings are the median values after 10 runs. Minimum and maximum values are deleted, so the median value are for 8 data points. |
Carl Federl
2009-06-13 |
re: Timings of different techniques for finding missing records As Saggi has posted, the query plans for NOT EXISTS, NOT IN and EXCEPT are identical, which raises the question of why the resource utilization reported are different. From runs I have made using SQLCMD and Stored Procedures, the standard deviation of CPU Ms in about 10% of the average CPU Ms. Two points of interest: The "NOT IN" solution will only be viable is the primary key is a single column as SQL Server does not support this ISO SQL (where i and j are the primary key columns) SELECT i , j from #TableA where ( i ,j ) NOT IN (select i , j from #TableB ) If the desired result is all columns of TableA and not just the primary key columns, then the solutions will be similar for the 'NOT EXISTS", 'LEFT JOIN" and "NOT IN" but for the other SQL, a JOIN back to TableA will be required, such as: SELECT A.i, A.j, A.k FROM #TableA as A JOIN ( SELECT i , j FROM #TableA EXCEPT SELECT i m j FROM #TableB ) as M on M.i = A.i and M.j = A.j |
Charles Gildawie
2009-06-16 |
re: Timings of different techniques for finding missing records So did removing the WHERE clause from the NOT IN (SELECT....) section make any difference at all? Charlie. |
Peso
2009-06-16 |
re: Timings of different techniques for finding missing records Not at all, it actually become a about 5-10 milliseconds slower for CPU! |
Charles Gildawie
2009-06-16 |
re: Timings of different techniques for finding missing records Weird. do you see any diff in the execution plan? also, does WHERE 1 = 1 give you the same results as the WHERE b.i = a.i did? |
Peso
2009-06-16 |
re: Timings of different techniques for finding missing records TableA 1,000,000 records, TableB 1,000,000 records -- NOT IN (...) CPU 281 Duration 288 -- NOT IN (...WHERE b.i = a.i) CPU 281 Duration 300 -- NOT IN (...WHERE 1 = 1) CPU 234 Duration 287 TableA 1,000,000 records, TableB 500,000 records -- NOT IN (...) CPU 281 Duration 3050 -- NOT IN (...WHERE b.i = a.i) CPU 234 Duration 2894 -- NOT IN (...WHERE 1 = 1) CPU 234 Duration 2909 TableA 1,000,000 records, TableB 0 records -- NOT IN (...) CPU 202 Duration 5802 -- NOT IN (...WHERE b.i = a.i) CPU 156 Duration 6046 -- NOT IN (...WHERE 1 = 1) CPU 172 Duration 5792 |
Charles Gildawie
2009-06-16 |
re: Timings of different techniques for finding missing records OK... These are averaged runs again? So it looks like it's better to have a WHERE clause of some description but it seems to make little difference if that establishes a relation or not. I guess the difference in cpu and duration must be just making up the query plan for the sub-query? I'm glad that NOT EXISTS is the better choice! |
MOHAN DEVAL
2011-04-07 |
re: Timings of different techniques for finding missing records Good N Great, .... |