I want some Moore

Blog about stuff and things and stuff. Mostly about SQL server and .Net
posts - 218, comments - 2281, trackbacks - 33

My Links

Advertisement

News

Hi! My name is 
Mladen Prajdić  I'm from Slovenia and I'm currently working as a .Net (C#) and SQL Server developer.

I also speak at local user group meetings and conferences like SQLBits and NT Conference
Welcome to my blog.
SQL Server MVP

My Books

SQL Server MVP Deep Dives 2
The Red Gate Guide to SQL Server Team based Development Free e-book

My Blog Feed via Email
Follow MladenPrajdic on Twitter


Users Online: who's online

Article Categories

Archives

Post Categories

Cool software

Other Blogs

Other stuff

SQL stuff

SQL Server - How to get the whole group of duplicate rows

We all know how to find only duplicated rows in a table.

Since SQL Server 2005 this became really simple using a ROW_NUMBER() window function like this:

USE AdventureWorks
GO
;WITH cteDupes AS 
(
    -- find all rows that have the same AddressLine1 and City. 
    -- we consider those rows are duplicates so we partition on them    
    SELECT  ROW_NUMBER() OVER(PARTITION BY AddressLine1, City ORDER BY AddressID) AS RN,
            *
    FROM    Person.Address
)
SELECT  *
FROM    cteDupes
WHERE   RN > 1
ORDER BY AddressLine1, City, AddressID
 

The IO for this method is great. We get only one pass through the table:

Table 'Address'. Scan count 1, logical reads 280, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
 

However the problem with this code is that it returns only the duplicated rows and not the original row the duplicates originated from.

Good way

We want a fast way of getting the whole duplicated group. My first attempt was to use the Count(*) with OVER to get the group count in each row like this:

USE AdventureWorks
GO
;WITH cteDupes AS 
(
    -- find all rows that have the same AddressLine1 and City. 
    -- we consider those rows are duplicates so we partition on them
    SELECT  COUNT(*) OVER(PARTITION BY AddressLine1, City ) AS CNT,
            *
    FROM    Person.Address
)
SELECT  *
FROM    cteDupes
WHERE   CNT > 1
ORDER BY AddressLine1, City, AddressID
 

But looking at the IO this caused made my head explode and i wanted a better way.

Table 'Worktable'. Scan count 3, logical reads 117473, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Address'. Scan count 1, logical reads 280, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Better way

We could get the same thing by using 2 ROW_NUMBER()-ed column, one in ascending and the other on descending order like this:

USE AdventureWorks
GO
;WITH cteDupes AS 
(
    -- find all rows that have the same AddressLine1 and City. 
    -- we consider those rows are duplicates so we partition on them
    SELECT  ROW_NUMBER() OVER(PARTITION BY AddressLine1, City ORDER BY AddressID DESC) AS RND,
            ROW_NUMBER() OVER(PARTITION BY AddressLine1, City ORDER BY AddressID)      AS RNA,
            *
    FROM    Person.Address
)
SELECT  *
FROM    cteDupes
-- this condition removes the rows that don’t have duplicates 
-- RNA-RND = 0 gets all odd rows in the group: 1st, 3rd, etc...
-- RNA = 1 AND RND = 1 limit those rows to those groups that don’t have duplicates 
-- because if there’s only one row both RNA and RND will be 1
-- by negating the condition we return the whole duplicated group
WHERE NOT (RNA-RND = 0 AND RNA = 1 AND RND = 1)
ORDER BY AddressLine1, City, AddressID
 

This made the IO a normal single pass at the table.

Table 'Address'. Scan count 1, logical reads 280, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
 

 

If anyone has a better way do let me know.

kick it on DotNetKicks.com

Print | posted on Thursday, October 15, 2009 1:53 PM | Filed Under [ SQL Server ]

Feedback

Gravatar

# re: SQL Server - How to get the whole group of duplicate rows

You can't get less logical reads than that, but you can do a lot for the CPU overhead.
In my tests, this cuts the CPU in half

WHERE RNA + RND > 2
10/15/2009 4:14 PM | Peso
Gravatar

# re: SQL Server - How to get the whole group of duplicate rows

How much IO does this old fashioned way use?
USE AdventureWorks
GO
SELECT *
FROM Person.Address A
INNER JOIN (SELECT AddressLine1, City
FROM Person.[Address]
GROUP BY AddressLine1, City
HAVING COUNT(*) > 1) AS A1
ON A.AddressLine1 = A1.AddressLine1
AND A.City = A1.City
ORDER BY A.AddressLine1, A.City, A.AddressID
10/18/2009 11:22 AM | Jonathan R
Gravatar

# How to write recursive function in SQL Server

I am getting problem in make a function for the following situation.

Suppose i have a table named Opportunity_Duplicate_Tbl

Opp_ID Dup_ID
==================
R1 111 200
R2 111 300
R3 111 400
R4 222 56
R5 300 200
R6 101 400
R7 112 50
R8 100 101

R1,R2.... are just row number and not part of actual data.
Here we have two cols Opp_ID and Dup_ID.
Here 200,300,400 are duplicates of 111. Total Duplicates of 111 are 3 right now.
But 200 and 400 are duplicates of 300 and 101 respectively.
Think relation like a=b and b=c, then c=a or a=c also.
So total duplicates of 111 are 5.As 101 is duplicate of 100 and 400 is duplicate of 101.. we have ommited the R5 row since 200 and 300 are direct duplicates of 111.

So 200,300,400,101 and 100 are total duplicates of 111.
I have solve the problem of getting total number of "Dup_ID" for each "Opp_ID".Please help me out.
Thanks in advance.
10/22/2009 11:40 AM | Mohamma Tariq Ansari
Gravatar

# re: SQL Server - How to get the whole group of duplicate rows

select Firstname,LastName,Deptno from Empinfo where Deptno=(select Deptno from Empinfo group by Deptno having count(*)>1 )
10/26/2009 3:00 PM | gopal
Comments have been closed on this topic.

Powered by:
Powered By Subtext Powered By ASP.NET