The simplest way to delete duplicates and compare two result sets in SQL Server
Comparing result sets
There are times, although not often that you have to compare 2 result sets. This usually happens when you're analyzing data for whatever reason or unit testing a database.
Usual methods include
- using the UNION of both queries because it filters duplicate data and checking if the row counts are the same
- using the EXCEPT if you have SQL Server 2005+
- using NOT EXISTS which is very ugly and not worth it
- using a FULL OUTER JOIN and looking at nulls or some other method I haven't mentioned
- returning result sets to client and comparing them there
- some other method I haven't mentioned
But for a while now I've been using this method using BINARY_CHECKSUM, CHECKSUM and CHECKSUM_AGG functions, since they are available in all versions of SQL Server. I prefer BINARY_CHECKSUM over CHECKSUM because BINARY_CHECKSUM is case sensitive. In a case-insensitive database CHECKSUM returns the same value for 'abcd' and 'ABCD' while BINARY_CHECKSUM does not.
Here's the code I use for result set comparison:
Deleting duplicates
Deleting duplicates becomes an easy task with this method both in SQL Server 2000 and 2005+:
It's fast, simple, clean, understandable and works like a charm. If you have a better way do let me know.
Legacy Comments
Peso
2009-01-05re: The simplest way to delete duplicates and compare two result sets in SQL Server
Add this sample data
INSERT INTO #t1(ID, VALUE) VALUES (1,'bQ')
an try again.
Then read this post http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=70832
and the prelude why checksum is NOT reliable http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=70328Peso
2009-01-05re: The simplest way to delete duplicates and compare two result sets in SQL Server
DELETE f
FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, ... ORDER BY Col1) AS recID
) AS f
WHERE recID > 2Mladen
2009-01-05re: The simplest way to delete duplicates and compare two result sets in SQL Server
awsome!
i did not know that.
thanx Peter.Peso
2009-01-05re: The simplest way to delete duplicates and compare two result sets in SQL Server
The Microsoft CHECKSUM is a very, VERY, weak algorithm as proved in the links above.
It just a 32-bit value XOR'ed with previous value shifted 4 bits.
I strongly, but kindly, vote against this method of using CHECKSUM for comparison.Zehra Nasif
2009-01-05re: The simplest way to delete duplicates
This will delete all records from the table MyTable which have the same value for the field dupField, leaving that record which has the lowest value in uniqueField.
delete from T1 from MyTable T1, MyTable T2
where T1.dupField = T2.dupField
and
T1.uniqueField > T2.uniqueFieldToni
2009-01-15re: The simplest way to delete duplicates and compare two result sets in SQL Server
In doing any of the comparisons (if that is the way you must go e.g pre-SQL 2005) look out for the case where a field might contain Null. In that case, your duplicate_field_1 = duplicate_field_2 would not find the record.DBA
2009-01-23re: The simplest way to delete duplicates and compare two result sets in SQL Server
Hello.
There is another generic way to compare resultsets. I will try to show an example
select a1, a2, a3, a4, sum(v)
from (
select a1, a2, a3, a4, v = 1
from table1
union
select a1, a2, a3, a4, v = 2
from table2
) t
group by a1, a2, a3, a4,
having sum(v) in (1,2)