SQL Server - How to get the whole group of duplicate rows
We all know how to find only duplicated rows in a table.
Since SQL Server 2005 this became really simple using a ROW_NUMBER() window function like this:
USE AdventureWorks GO ;WITH cteDupes AS ( -- find all rows that have the same AddressLine1 and City. -- we consider those rows are duplicates so we partition on them SELECT ROW_NUMBER() OVER(PARTITION BY AddressLine1, City ORDER BY AddressID) AS RN, * FROM Person.Address ) SELECT * FROM cteDupes WHERE RN > 1 ORDER BY AddressLine1, City, AddressID
The IO for this method is great. We get only one pass through the table:
Table 'Address'. Scan count 1, logical reads 280, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
However the problem with this code is that it returns only the duplicated rows and not the original row the duplicates originated from.
Good way
We want a fast way of getting the whole duplicated group. My first attempt was to use the Count(*) with OVER to get the group count in each row like this:
USE AdventureWorks GO ;WITH cteDupes AS ( -- find all rows that have the same AddressLine1 and City. -- we consider those rows are duplicates so we partition on them SELECT COUNT(*) OVER(PARTITION BY AddressLine1, City ) AS CNT, * FROM Person.Address ) SELECT * FROM cteDupes WHERE CNT > 1 ORDER BY AddressLine1, City, AddressID
But looking at the IO this caused made my head explode and i wanted a better way.
Table 'Worktable'. Scan count 3, logical reads 117473, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Address'. Scan count 1, logical reads 280, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Better way
We could get the same thing by using 2 ROW_NUMBER()-ed column, one in ascending and the other on descending order like this:
USE AdventureWorks GO ;WITH cteDupes AS ( -- find all rows that have the same AddressLine1 and City. -- we consider those rows are duplicates so we partition on them SELECT ROW_NUMBER() OVER(PARTITION BY AddressLine1, City ORDER BY AddressID DESC) AS RND, ROW_NUMBER() OVER(PARTITION BY AddressLine1, City ORDER BY AddressID) AS RNA, * FROM Person.Address ) SELECT * FROM cteDupes -- this condition removes the rows that don’t have duplicates -- RNA-RND = 0 gets all odd rows in the group: 1st, 3rd, etc... -- RNA = 1 AND RND = 1 limit those rows to those groups that don’t have duplicates -- because if there’s only one row both RNA and RND will be 1 -- by negating the condition we return the whole duplicated group WHERE NOT (RNA-RND = 0 AND RNA = 1 AND RND = 1)
ORDER BY AddressLine1, City, AddressID
This made the IO a normal single pass at the table.
Table 'Address'. Scan count 1, logical reads 280, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
If anyone has a better way do let me know.
Legacy Comments
Peso
2009-10-15 |
re: SQL Server - How to get the whole group of duplicate rows You can't get less logical reads than that, but you can do a lot for the CPU overhead. In my tests, this cuts the CPU in half WHERE RNA + RND > 2 |
Jonathan R
2009-10-18 |
re: SQL Server - How to get the whole group of duplicate rows How much IO does this old fashioned way use? USE AdventureWorks GO SELECT * FROM Person.Address A INNER JOIN (SELECT AddressLine1, City FROM Person.[Address] GROUP BY AddressLine1, City HAVING COUNT(*) > 1) AS A1 ON A.AddressLine1 = A1.AddressLine1 AND A.City = A1.City ORDER BY A.AddressLine1, A.City, A.AddressID |
Mohamma Tariq Ansari
2009-10-22 |
How to write recursive function in SQL Server I am getting problem in make a function for the following situation. Suppose i have a table named Opportunity_Duplicate_Tbl Opp_ID Dup_ID ================== R1 111 200 R2 111 300 R3 111 400 R4 222 56 R5 300 200 R6 101 400 R7 112 50 R8 100 101 R1,R2.... are just row number and not part of actual data. Here we have two cols Opp_ID and Dup_ID. Here 200,300,400 are duplicates of 111. Total Duplicates of 111 are 3 right now. But 200 and 400 are duplicates of 300 and 101 respectively. Think relation like a=b and b=c, then c=a or a=c also. So total duplicates of 111 are 5.As 101 is duplicate of 100 and 400 is duplicate of 101.. we have ommited the R5 row since 200 and 300 are direct duplicates of 111. So 200,300,400,101 and 100 are total duplicates of 111. I have solve the problem of getting total number of "Dup_ID" for each "Opp_ID".Please help me out. Thanks in advance. |
gopal
2009-10-26 |
re: SQL Server - How to get the whole group of duplicate rows select Firstname,LastName,Deptno from Empinfo where Deptno=(select Deptno from Empinfo group by Deptno having count(*)>1 ) |