SQL Server - How to get the whole group of duplicate rows

Thu Oct 15, 2009 by Mladen Prajdić in sql-server

We all know how to find only duplicated rows in a table.

Since SQL Server 2005 this became really simple using a ROW_NUMBER() window function like this:

USE AdventureWorks
GO
;WITH cteDupes AS 
(
    -- find all rows that have the same AddressLine1 and City. 
    -- we consider those rows are duplicates so we partition on them    
    SELECT  ROW_NUMBER() OVER(PARTITION BY AddressLine1, City ORDER BY AddressID) AS RN,
            *
    FROM    Person.Address
)
SELECT  *
FROM    cteDupes
WHERE   RN > 1
ORDER BY AddressLine1, City, AddressID

The IO for this method is great. We get only one pass through the table:

Table 'Address'. Scan count 1, logical reads 280, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

However the problem with this code is that it returns only the duplicated rows and not the original row the duplicates originated from.

Good way

We want a fast way of getting the whole duplicated group. My first attempt was to use the Count(*) with OVER to get the group count in each row like this:

USE AdventureWorks
GO
;WITH cteDupes AS 
(
    -- find all rows that have the same AddressLine1 and City. 
    -- we consider those rows are duplicates so we partition on them
    SELECT  COUNT(*) OVER(PARTITION BY AddressLine1, City ) AS CNT,
            *
    FROM    Person.Address
)
SELECT  *
FROM    cteDupes
WHERE   CNT > 1
ORDER BY AddressLine1, City, AddressID

But looking at the IO this caused made my head explode and i wanted a better way.

Table 'Worktable'. Scan count 3, logical reads 117473, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Address'. Scan count 1, logical reads 280, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Better way

We could get the same thing by using 2 ROW_NUMBER()-ed column, one in ascending and the other on descending order like this:

USE AdventureWorks
GO
;WITH cteDupes AS 
(
    -- find all rows that have the same AddressLine1 and City. 
    -- we consider those rows are duplicates so we partition on them
    SELECT  ROW_NUMBER() OVER(PARTITION BY AddressLine1, City ORDER BY AddressID DESC) AS RND,
            ROW_NUMBER() OVER(PARTITION BY AddressLine1, City ORDER BY AddressID)      AS RNA,
            *
    FROM    Person.Address
)
SELECT  *
FROM    cteDupes
-- this condition removes the rows that don’t have duplicates 
-- RNA-RND = 0 gets all odd rows in the group: 1st, 3rd, etc...
-- RNA = 1 AND RND = 1 limit those rows to those groups that don’t have duplicates 
-- because if there’s only one row both RNA and RND will be 1
-- by negating the condition we return the whole duplicated group
WHERE NOT (RNA-RND = 0 AND RNA = 1 AND RND = 1)

ORDER BY AddressLine1, City, AddressID

This made the IO a normal single pass at the table.

Table 'Address'. Scan count 1, logical reads 280, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

If anyone has a better way do let me know.

Legacy Comments


Peso 2009-10-15	re: SQL Server - How to get the whole group of duplicate rows You can't get less logical reads than that, but you can do a lot for the CPU overhead. In my tests, this cuts the CPU in half WHERE RNA + RND > 2

Jonathan R 2009-10-18	re: SQL Server - How to get the whole group of duplicate rows How much IO does this old fashioned way use? USE AdventureWorks GO SELECT * FROM Person.Address A INNER JOIN (SELECT AddressLine1, City FROM Person.[Address] GROUP BY AddressLine1, City HAVING COUNT(*) > 1) AS A1 ON A.AddressLine1 = A1.AddressLine1 AND A.City = A1.City ORDER BY A.AddressLine1, A.City, A.AddressID

Mohamma Tariq Ansari 2009-10-22	How to write recursive function in SQL Server I am getting problem in make a function for the following situation. Suppose i have a table named Opportunity_Duplicate_Tbl Opp_ID Dup_ID ================== R1 111 200 R2 111 300 R3 111 400 R4 222 56 R5 300 200 R6 101 400 R7 112 50 R8 100 101 R1,R2.... are just row number and not part of actual data. Here we have two cols Opp_ID and Dup_ID. Here 200,300,400 are duplicates of 111. Total Duplicates of 111 are 3 right now. But 200 and 400 are duplicates of 300 and 101 respectively. Think relation like a=b and b=c, then c=a or a=c also. So total duplicates of 111 are 5.As 101 is duplicate of 100 and 400 is duplicate of 101.. we have ommited the R5 row since 200 and 300 are direct duplicates of 111. So 200,300,400,101 and 100 are total duplicates of 111. I have solve the problem of getting total number of "Dup_ID" for each "Opp_ID".Please help me out. Thanks in advance.

gopal 2009-10-26	re: SQL Server - How to get the whole group of duplicate rows select Firstname,LastName,Deptno from Empinfo where Deptno=(select Deptno from Empinfo group by Deptno having count(*)>1 )