Mladen Prajdić Blog

Blog about stuff and things and stuff. Mostly about SQL server and .Net

SQL Server - How to get the whole group of duplicate rows

We all know how to find only duplicated rows in a table.

Since SQL Server 2005 this became really simple using a ROW_NUMBER() window function like this:

USE AdventureWorks
GO
;WITH cteDupes AS 
(
    -- find all rows that have the same AddressLine1 and City. 
    -- we consider those rows are duplicates so we partition on them    
    SELECT  ROW_NUMBER() OVER(PARTITION BY AddressLine1, City ORDER BY AddressID) AS RN,
            *
    FROM    Person.Address
)
SELECT  *
FROM    cteDupes
WHERE   RN > 1
ORDER BY AddressLine1, City, AddressID
 

The IO for this method is great. We get only one pass through the table:

Table 'Address'. Scan count 1, logical reads 280, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
 

However the problem with this code is that it returns only the duplicated rows and not the original row the duplicates originated from.

Good way

We want a fast way of getting the whole duplicated group. My first attempt was to use the Count(*) with OVER to get the group count in each row like this:

USE AdventureWorks
GO
;WITH cteDupes AS 
(
    -- find all rows that have the same AddressLine1 and City. 
    -- we consider those rows are duplicates so we partition on them
    SELECT  COUNT(*) OVER(PARTITION BY AddressLine1, City ) AS CNT,
            *
    FROM    Person.Address
)
SELECT  *
FROM    cteDupes
WHERE   CNT > 1
ORDER BY AddressLine1, City, AddressID
 

But looking at the IO this caused made my head explode and i wanted a better way.

Table 'Worktable'. Scan count 3, logical reads 117473, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Address'. Scan count 1, logical reads 280, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Better way

We could get the same thing by using 2 ROW_NUMBER()-ed column, one in ascending and the other on descending order like this:

USE AdventureWorks
GO
;WITH cteDupes AS 
(
    -- find all rows that have the same AddressLine1 and City. 
    -- we consider those rows are duplicates so we partition on them
    SELECT  ROW_NUMBER() OVER(PARTITION BY AddressLine1, City ORDER BY AddressID DESC) AS RND,
            ROW_NUMBER() OVER(PARTITION BY AddressLine1, City ORDER BY AddressID)      AS RNA,
            *
    FROM    Person.Address
)
SELECT  *
FROM    cteDupes
-- this condition removes the rows that don’t have duplicates 
-- RNA-RND = 0 gets all odd rows in the group: 1st, 3rd, etc...
-- RNA = 1 AND RND = 1 limit those rows to those groups that don’t have duplicates 
-- because if there’s only one row both RNA and RND will be 1
-- by negating the condition we return the whole duplicated group
WHERE NOT (RNA-RND = 0 AND RNA = 1 AND RND = 1)
ORDER BY AddressLine1, City, AddressID
 

This made the IO a normal single pass at the table.

Table 'Address'. Scan count 1, logical reads 280, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
 

 

If anyone has a better way do let me know.

kick it on DotNetKicks.com

Legacy Comments


Peso
2009-10-15
re: SQL Server - How to get the whole group of duplicate rows
You can't get less logical reads than that, but you can do a lot for the CPU overhead.
In my tests, this cuts the CPU in half

WHERE RNA + RND > 2

Jonathan R
2009-10-18
re: SQL Server - How to get the whole group of duplicate rows
How much IO does this old fashioned way use?
USE AdventureWorks
GO
SELECT *
FROM Person.Address A
INNER JOIN (SELECT AddressLine1, City
FROM Person.[Address]
GROUP BY AddressLine1, City
HAVING COUNT(*) > 1) AS A1
ON A.AddressLine1 = A1.AddressLine1
AND A.City = A1.City
ORDER BY A.AddressLine1, A.City, A.AddressID

Mohamma Tariq Ansari
2009-10-22
How to write recursive function in SQL Server
I am getting problem in make a function for the following situation.

Suppose i have a table named Opportunity_Duplicate_Tbl

Opp_ID Dup_ID
==================
R1 111 200
R2 111 300
R3 111 400
R4 222 56
R5 300 200
R6 101 400
R7 112 50
R8 100 101

R1,R2.... are just row number and not part of actual data.
Here we have two cols Opp_ID and Dup_ID.
Here 200,300,400 are duplicates of 111. Total Duplicates of 111 are 3 right now.
But 200 and 400 are duplicates of 300 and 101 respectively.
Think relation like a=b and b=c, then c=a or a=c also.
So total duplicates of 111 are 5.As 101 is duplicate of 100 and 400 is duplicate of 101.. we have ommited the R5 row since 200 and 300 are direct duplicates of 111.

So 200,300,400,101 and 100 are total duplicates of 111.
I have solve the problem of getting total number of "Dup_ID" for each "Opp_ID".Please help me out.
Thanks in advance.

gopal
2009-10-26
re: SQL Server - How to get the whole group of duplicate rows
select Firstname,LastName,Deptno from Empinfo where Deptno=(select Deptno from Empinfo group by Deptno having count(*)>1 )