Timings of different techniques for finding missing records

Fri Jun 12, 2009 by Peter Larsson in optimization, sql-server-2008, algorithms, sql-server-2005, sql-server-2000

I did some tests today to measure the different approaches for finding records present in one table but not in another. The results of CPU and Duration are presented below with some help from SQL Profiler.
Number of reads are equal between methods but different depending how many record in #TableB.

If there are other methods I haven't included, please let me know.

Method TableA TableB CPU Duration

---------- ------- ------- --- --------

GROUP BY 1000000 1000000 748 754

LEFT JOIN 1000000 1000000 328 321

NOT EXISTS 1000000 1000000 265 288

NOT IN 1000000 1000000 296 293
EXCEPT 1000000 1000000 312 288

GROUP BY 1000000 500000 577 2984

LEFT JOIN 1000000 500000 328 2930

NOT EXISTS 1000000 500000 187 2991

NOT IN 1000000 500000 312 2861
EXCEPT 1000000 500000 234 2882

GROUP BY 1000000 0 453 5866

LEFT JOIN 1000000 0 280 5791

NOT EXISTS 1000000 0 125 5855

NOT IN      1000000        0  250      5825
EXCEPT      1000000        0 234      5798

Here is the code for testing

CREATE TABLE     #TableA

(

i INT PRIMARY KEY CLUSTERED

)

INSERT #TableA

SELECT Number

FROM dbo.F_TABLE_NUMBER_RANGE(0, 999999)

ORDER BY Number

CREATE TABLE #TableB

(

i INT PRIMARY KEY CLUSTERED

)

INSERT #TableB

SELECT Number

FROM dbo.F_TABLE_NUMBER_RANGE(0, 999999)

ORDER BY Number

DELETE f

FROM (

SELECT TOP 500000

FROM #TableB

ORDER BY NEWID()

) AS f

-- GROUP BY

SELECT i

FROM (

SELECT 0 AS s, i FROM #TableA UNION ALL

SELECT 1, i FROM #TableB

) AS d

GROUP BY i

HAVING MAX(s) = 0

-- LEFT JOIN

SELECT a.i

FROM #TableA AS a

LEFT JOIN #TableB AS b ON b.i = a.i

WHERE b.i IS NULL

-- NOT EXISTS

SELECT a.i

FROM #TableA AS a

WHERE NOT EXISTS (SELECT b.i FROM #TableB AS b WHERE b.i = a.i)

-- NOT IN

SELECT a.i

FROM #TableA AS a
WHERE a.i NOT IN (SELECT b.i FROM #TableB AS b)

-- EXCEPT

SELECT i FROM #TableA
EXCEPT
SELECT i FROM #TableB

DROP TABLE #TableA,

#TableB

Legacy Comments


Charles Gildawie 2009-06-12	re: Timings of different techniques for finding missing records EXCEPT is pretty interesting. Never used that before. Why the WHERE clause in the two sub - queries? -- NOT IN SELECT a.i FROM #TableA AS a WHERE a.i NOT IN (SELECT b.i FROM #TableB AS b WHERE b.i = a.i) is that different from -- NOT IN (no WHERE clause) SELECT a.i FROM #TableA AS a WHERE a.i NOT IN (SELECT i FROM #TableB) --------- [NB] -- meant the NOT IN rather than the exists Charlie

Peso 2009-06-12	re: Timings of different techniques for finding missing records Great spot! I will have to run the NOT IN again, after the weekend. Come back beginning of next week for updated timings.

Saggi Neumann 2009-06-13	re: Timings of different techniques for finding missing records Hi Peter, Interesting post. Did you run each scenario & query once or did you try to take an average? The NOT EXISTS, NOT IN (which has the same QP as a non-corelated NOT IN) and the EXCEPT queries all have a very similiar query plan so they should on average be the same, with the logical difference between NOT EXISTS and EXCEPT vs. NOT IN and LEFT JOIN - that a row with NULL in the key column is returned by the former and not by the latter. Also, the duration increases as you return more rows (there are actually less reads and CPU in the 1,000,000 : 0 scenario). If you return the count of missing rows, you'll get the same ratio between the GROUP BY method and the rest of the queries (around 2:1). Cheers, S. Neumann

Peso 2009-06-13	re: Timings of different techniques for finding missing records The timings are the median values after 10 runs. Minimum and maximum values are deleted, so the median value are for 8 data points.

Carl Federl 2009-06-13	re: Timings of different techniques for finding missing records As Saggi has posted, the query plans for NOT EXISTS, NOT IN and EXCEPT are identical, which raises the question of why the resource utilization reported are different. From runs I have made using SQLCMD and Stored Procedures, the standard deviation of CPU Ms in about 10% of the average CPU Ms. Two points of interest: The "NOT IN" solution will only be viable is the primary key is a single column as SQL Server does not support this ISO SQL (where i and j are the primary key columns) SELECT i , j from #TableA where ( i ,j ) NOT IN (select i , j from #TableB ) If the desired result is all columns of TableA and not just the primary key columns, then the solutions will be similar for the 'NOT EXISTS", 'LEFT JOIN" and "NOT IN" but for the other SQL, a JOIN back to TableA will be required, such as: SELECT A.i, A.j, A.k FROM #TableA as A JOIN ( SELECT i , j FROM #TableA EXCEPT SELECT i m j FROM #TableB ) as M on M.i = A.i and M.j = A.j

Charles Gildawie 2009-06-16	re: Timings of different techniques for finding missing records So did removing the WHERE clause from the NOT IN (SELECT....) section make any difference at all? Charlie.

Peso 2009-06-16	re: Timings of different techniques for finding missing records Not at all, it actually become a about 5-10 milliseconds slower for CPU!

Charles Gildawie 2009-06-16	re: Timings of different techniques for finding missing records Weird. do you see any diff in the execution plan? also, does WHERE 1 = 1 give you the same results as the WHERE b.i = a.i did?

Peso 2009-06-16	re: Timings of different techniques for finding missing records TableA 1,000,000 records, TableB 1,000,000 records -- NOT IN (...) CPU 281 Duration 288 -- NOT IN (...WHERE b.i = a.i) CPU 281 Duration 300 -- NOT IN (...WHERE 1 = 1) CPU 234 Duration 287 TableA 1,000,000 records, TableB 500,000 records -- NOT IN (...) CPU 281 Duration 3050 -- NOT IN (...WHERE b.i = a.i) CPU 234 Duration 2894 -- NOT IN (...WHERE 1 = 1) CPU 234 Duration 2909 TableA 1,000,000 records, TableB 0 records -- NOT IN (...) CPU 202 Duration 5802 -- NOT IN (...WHERE b.i = a.i) CPU 156 Duration 6046 -- NOT IN (...WHERE 1 = 1) CPU 172 Duration 5792

Charles Gildawie 2009-06-16	re: Timings of different techniques for finding missing records OK... These are averaged runs again? So it looks like it's better to have a WHERE clause of some description but it seems to make little difference if that establishes a relation or not. I guess the difference in cpu and duration must be just making up the query plan for the sub-query? I'm glad that NOT EXISTS is the better choice!

MOHAN DEVAL 2011-04-07	re: Timings of different techniques for finding missing records Good N Great, ....