Peter Larsson Blog

Patron Saint of Lost Yaks

Timings of different techniques for finding missing records

I did some tests today to measure the different approaches for finding records present in one table but not in another. The results of CPU and Duration are presented below with some help from SQL Profiler.
Number of reads are equal between methods but different depending how many record in #TableB.

If there are other methods I haven't included, please let me know.


Method      TableA   TableB   CPU  Duration
----------  -------  -------  ---  --------
GROUP BY    1000000  1000000  748       754
LEFT JOIN   1000000  1000000  328       321
NOT EXISTS  1000000  1000000  265       288
NOT IN      1000000  1000000  296       293
EXCEPT      1000000  1000000  312       288
GROUP BY    1000000   500000  577      2984
LEFT JOIN   1000000   500000  328      2930
NOT EXISTS  1000000   500000  187      2991
NOT IN      1000000   500000  312      2861
EXCEPT      1000000   500000  234      2882
GROUP BY    1000000        0  453      5866
LEFT JOIN   1000000        0  280      5791
NOT EXISTS  1000000        0  125      5855
NOT IN      1000000        0  250      5825
EXCEPT      1000000        0  234      5798

Here is the code for testing

CREATE TABLE     #TableA
                 (
                          i INT PRIMARY KEY CLUSTERED
                 )
 
INSERT   #TableA
SELECT   Number
FROM     dbo.F_TABLE_NUMBER_RANGE(0, 999999)
ORDER BY Number
 
CREATE TABLE     #TableB
                 (
                          i INT PRIMARY KEY CLUSTERED
                 )
 
INSERT   #TableB
SELECT   Number
FROM     dbo.F_TABLE_NUMBER_RANGE(0, 999999)
ORDER BY Number
 
DELETE   f
FROM     (
                 SELECT   TOP 500000
                          i
                 FROM     #TableB
                 ORDER BY NEWID()
         ) AS f
 
-- GROUP BY
SELECT    i
FROM      (
              SELECT   0 AS s, i FROM #TableA UNION ALL
              SELECT   1, i FROM #TableB
          ) AS d
GROUP BY  i
HAVING    MAX(s) = 0
 
-- LEFT JOIN
SELECT      a.i
FROM        #TableA AS a
LEFT JOIN   #TableB AS b ON b.i = a.i
WHERE       b.i IS NULL
 
-- NOT EXISTS
SELECT   a.i
FROM     #TableA AS a
WHERE    NOT EXISTS (SELECT b.i FROM #TableB AS b WHERE b.i = a.i)
 
-- NOT IN
SELECT   a.i
FROM     #TableA AS a
WHERE    a.i NOT IN (SELECT b.i FROM #TableB AS b)

-- EXCEPT
SELECT i FROM #TableA
EXCEPT
SELECT i FROM #TableB
 
DROP TABLE    #TableA,
              #TableB

Legacy Comments


Charles Gildawie
2009-06-12
re: Timings of different techniques for finding missing records
EXCEPT is pretty interesting. Never used that before.

Why the WHERE clause in the two sub - queries?

-- NOT IN
SELECT a.i
FROM #TableA AS a
WHERE a.i NOT IN (SELECT b.i FROM #TableB AS b WHERE b.i = a.i)
is that different from

-- NOT IN (no WHERE clause)
SELECT a.i
FROM #TableA AS a
WHERE a.i NOT IN (SELECT i FROM #TableB)

---------
[NB] -- meant the NOT IN rather than the exists
Charlie

Peso
2009-06-12
re: Timings of different techniques for finding missing records
Great spot!
I will have to run the NOT IN again, after the weekend. Come back beginning of next week for updated timings.

Saggi Neumann
2009-06-13
re: Timings of different techniques for finding missing records
Hi Peter,

Interesting post.

Did you run each scenario & query once or did you try to take an average?

The NOT EXISTS, NOT IN (which has the same QP as a non-corelated NOT IN) and the EXCEPT queries all have a very similiar query plan so they should on average be the same, with the logical difference between NOT EXISTS and EXCEPT vs. NOT IN and LEFT JOIN - that a row with NULL in the key column is returned by the former and not by the latter.

Also, the duration increases as you *return* more rows (there are actually less reads and CPU in the 1,000,000 : 0 scenario). If you return the count of missing rows, you'll get the same ratio between the GROUP BY method and the rest of the queries (around 2:1).

Cheers,
S. Neumann

Peso
2009-06-13
re: Timings of different techniques for finding missing records
The timings are the median values after 10 runs. Minimum and maximum values are deleted, so the median value are for 8 data points.

Carl Federl
2009-06-13
re: Timings of different techniques for finding missing records
As Saggi has posted, the query plans for NOT EXISTS, NOT IN and EXCEPT are identical, which raises the question of why the resource utilization reported are different. From runs I have made using SQLCMD and Stored Procedures, the standard deviation of CPU Ms in about 10% of the average CPU Ms.

Two points of interest:
The "NOT IN" solution will only be viable is the primary key is a single column as SQL Server does not support this ISO SQL (where i and j are the primary key columns)
SELECT i , j
from #TableA
where ( i ,j ) NOT IN (select i , j from #TableB )

If the desired result is all columns of TableA and not just the primary key columns, then the solutions will be similar for the 'NOT EXISTS", 'LEFT JOIN" and "NOT IN" but for the other SQL, a JOIN back to TableA will be required, such as:

SELECT A.i, A.j, A.k
FROM #TableA as A
JOIN (
SELECT i , j FROM #TableA
EXCEPT
SELECT i m j FROM #TableB
) as M
on M.i = A.i
and M.j = A.j


Charles Gildawie
2009-06-16
re: Timings of different techniques for finding missing records
So did removing the WHERE clause from the NOT IN (SELECT....) section make any difference at all?

Charlie.

Peso
2009-06-16
re: Timings of different techniques for finding missing records
Not at all, it actually become a about 5-10 milliseconds slower for CPU!

Charles Gildawie
2009-06-16
re: Timings of different techniques for finding missing records
Weird.

do you see any diff in the execution plan?

also, does WHERE 1 = 1 give you the same results as the WHERE b.i = a.i did?

Peso
2009-06-16
re: Timings of different techniques for finding missing records
TableA 1,000,000 records, TableB 1,000,000 records

-- NOT IN (...)
CPU 281 Duration 288

-- NOT IN (...WHERE b.i = a.i)
CPU 281 Duration 300

-- NOT IN (...WHERE 1 = 1)
CPU 234 Duration 287

TableA 1,000,000 records, TableB 500,000 records

-- NOT IN (...)
CPU 281 Duration 3050

-- NOT IN (...WHERE b.i = a.i)
CPU 234 Duration 2894

-- NOT IN (...WHERE 1 = 1)
CPU 234 Duration 2909

TableA 1,000,000 records, TableB 0 records

-- NOT IN (...)
CPU 202 Duration 5802

-- NOT IN (...WHERE b.i = a.i)
CPU 156 Duration 6046

-- NOT IN (...WHERE 1 = 1)
CPU 172 Duration 5792

Charles Gildawie
2009-06-16
re: Timings of different techniques for finding missing records
OK...

These are averaged runs again?

So it looks like it's better to have a WHERE clause of some description but it seems to make little difference if that establishes a relation or not.

I guess the difference in cpu and duration must be just making up the query plan for the sub-query?

I'm glad that NOT EXISTS is the better choice!

MOHAN DEVAL
2011-04-07
re: Timings of different techniques for finding missing records
Good N Great, ....