I want some Moore

Blog about stuff and things and stuff. Mostly about SQL server and .Net
posts - 218, comments - 2281, trackbacks - 33

My Links

Advertisement

News

Hi! My name is 
Mladen Prajdić  I'm from Slovenia and I'm currently working as a .Net (C#) and SQL Server developer.

I also speak at local user group meetings and conferences like SQLBits and NT Conference
Welcome to my blog.
SQL Server MVP

My Books

SQL Server MVP Deep Dives 2
The Red Gate Guide to SQL Server Team based Development Free e-book

My Blog Feed via Email
Follow MladenPrajdic on Twitter


Users Online: who's online

Article Categories

Archives

Post Categories

Cool software

Other Blogs

Other stuff

SQL stuff

The simplest way to delete duplicates and compare two result sets in SQL Server

Comparing result sets

There are times, although not often that you have to compare 2 result sets. This usually happens when you're analyzing data for whatever reason or unit testing a database.

Usual methods include

- using the UNION of both queries because it filters duplicate data and checking if the row counts are the same

- using the EXCEPT if you have SQL Server 2005+

- using NOT EXISTS which is very ugly and not worth it

- using a FULL OUTER JOIN and looking at nulls or some other method I haven't mentioned

- returning result sets to client and comparing them there

- some other method I haven't mentioned

 

But for a while now I've been using this method using BINARY_CHECKSUM, CHECKSUM and CHECKSUM_AGG functions, since they are available in all versions of SQL Server. I prefer BINARY_CHECKSUM over CHECKSUM because BINARY_CHECKSUM is case sensitive. In a case-insensitive database CHECKSUM returns the same value for 'abcd' and 'ABCD' while BINARY_CHECKSUM does not.

Here's the code I use for result set comparison:

 

USE AdventureWorks GO IF ( SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM ( -- first resultset to compare SELECT * FROM Person.Address ) t1 ) = ( SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM ( -- second resultset to compare SELECT * FROM Person.Address WHERE AddressID < 1000 ) t1 ) BEGIN SELECT 'Resultsets equal' END ELSE BEGIN SELECT 'Resultsets NOT equal' END

 

Deleting duplicates

Deleting duplicates becomes an easy task with this method both in SQL Server 2000 and 2005+:

CREATE TABLE #t1(ID INT NULL, VALUE VARCHAR(2)) INSERT INTO #t1(ID, VALUE) VALUES (1,'aa') INSERT INTO #t1(ID, VALUE) VALUES (2,'bb') INSERT INTO #t1(ID, VALUE) VALUES (1,'aa') INSERT INTO #t1(ID, VALUE) VALUES (1,'aa') INSERT INTO #t1(ID, VALUE) VALUES (3,'cc') INSERT INTO #t1(ID, VALUE) VALUES (3,'cc') GO -- BINARY_CHECKSUM(<column names>): <column names> are columns that we want to compare duplicates for -- if you want to compare the full row then change BINARY_CHECKSUM(<column names>) -> BINARY_CHECKSUM(*) -- for SQL Server 2000+ a loop -- save checksums and rowcounts for duplicates SELECT BINARY_CHECKSUM(ID, VALUE) AS ChkSum, COUNT(*) AS Cnt INTO #t2 FROM #t1 GROUP BY BINARY_CHECKSUM(ID, VALUE) HAVING COUNT(*)>1 DECLARE @ChkSum BIGINT, @rc INT -- get the first checksum and set the rowcount to the count - 1 -- because we want to leave one duplicate SELECT TOP 1 @ChkSum = ChkSum, @rc = Cnt-1 FROM #t2 WHILE EXISTS (SELECT * FROM #t2) BEGIN -- rowcount is one less than the duplicate rows count SET ROWCOUNT @rc DELETE FROM #t1 WHERE BINARY_CHECKSUM(ID, VALUE) = @ChkSum -- remove the processed duplicate from the checksum table DELETE #t2 WHERE ChkSum = @ChkSum -- select the next duplicate rows to delete SELECT TOP 1 @ChkSum = ChkSum, @rc = Cnt-1 FROM #t2 END SET ROWCOUNT 0 GO SELECT * FROM #t1 -- for SQL Server 2005+ a cool CTE ;WITH Numbered AS ( SELECT ROW_NUMBER() OVER (PARTITION BY ChkSum ORDER BY ChkSum) AS RN, * FROM ( SELECT BINARY_CHECKSUM(ID, VALUE) AS ChkSum FROM #t1 ) t ) DELETE FROM Numbered WHERE RN > 1; GO SELECT * FROM #t1 DROP TABLE #t1; DROP TABLE #t2;

It's fast, simple, clean, understandable and works like a charm. If you have a better way do let me know.

 

kick it on DotNetKicks.com
 

Print | posted on Monday, January 05, 2009 2:50 PM | Filed Under [ SQL Server ]

Feedback

Gravatar

# re: The simplest way to delete duplicates and compare two result sets in SQL Server

Add this sample data

INSERT INTO #t1(ID, VALUE) VALUES (1,'bQ')

an try again.

Then read this post http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=70832
and the prelude why checksum is NOT reliable http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=70328
1/5/2009 4:55 PM | Peso
Gravatar

# re: The simplest way to delete duplicates and compare two result sets in SQL Server

DELETE f
FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, ... ORDER BY Col1) AS recID
) AS f
WHERE recID > 2
1/5/2009 4:57 PM | Peso
Gravatar

# re: The simplest way to delete duplicates and compare two result sets in SQL Server

awsome!
i did not know that.
thanx Peter.
1/5/2009 5:01 PM | Mladen
Gravatar

# re: The simplest way to delete duplicates and compare two result sets in SQL Server

The Microsoft CHECKSUM is a very, VERY, weak algorithm as proved in the links above.
It just a 32-bit value XOR'ed with previous value shifted 4 bits.

I strongly, but kindly, vote against this method of using CHECKSUM for comparison.
1/5/2009 5:03 PM | Peso
Gravatar

# re: The simplest way to delete duplicates

This will delete all records from the table MyTable which have the same value for the field dupField, leaving that record which has the lowest value in uniqueField.

delete from T1 from MyTable T1, MyTable T2
where T1.dupField = T2.dupField
and
T1.uniqueField > T2.uniqueField

1/5/2009 10:42 PM | Zehra Nasif
Gravatar

# re: The simplest way to delete duplicates and compare two result sets in SQL Server

In doing any of the comparisons (if that is the way you must go e.g pre-SQL 2005) look out for the case where a field might contain Null. In that case, your duplicate_field_1 = duplicate_field_2 would not find the record.
1/15/2009 3:46 PM | Toni
Gravatar

# re: The simplest way to delete duplicates and compare two result sets in SQL Server

Hello.
There is another generic way to compare resultsets. I will try to show an example


select a1, a2, a3, a4, sum(v)
from (
select a1, a2, a3, a4, v = 1
from table1
union
select a1, a2, a3, a4, v = 2
from table2
) t
group by a1, a2, a3, a4,
having sum(v) in (1,2)
1/23/2009 11:12 PM | DBA
Comments have been closed on this topic.

Powered by:
Powered By Subtext Powered By ASP.NET