Jeff Smith Blog

Random Thoughts & Cartesian Products with Microsoft SQL Server

Delete Duplicates And Resolve Foreign Key References in SQL

Here's a scenario many of us have encountered:

You inherit a database.  TableX has an identity column "ID" set up as the primary key, and contains many duplicates.  It is clear a better, natural primary key is a combination of columns A and B.  Many other tables have foreign key references to TableX through the ID column.   You'd like to delete duplicates and clean up TableX, but how do you resolve those foreign keys to their "new" ID's ?  For example, if ID's 3,4 and 6 are all duplicates, and you decide to keep only 3, you must first repoint all foreign tables referencing ID's 4 and 6 to reference ID 3 before you safely remove the duplicate rows.

Step 1 -- Double-check your Natural Key

It's very important to make absolutely sure that columns A and B truly are a good primary key candidate.  Create a report off a SELECT similiar to this:

SELECT t1.MinID, t2.*
FROM
  (SELECT MIN(ID) as MinID, A, B
   FROM TableX
   GROUP BY A,B) t1
INNER JOIN
  TableX t2 ON t1.A = t2.A AND t1.B = t2.B
ORDER BY t1.MinID

For each "MinID", verify that the those rows returned are true duplicates.  Scrub the data (e.g., make sure "Jeff" matches "Jeffrey") as necessary until you are please that this report produces the desired results.  If it's not even close, then those two columns will probably not be good candidates for your natural key.

(By the way, if you have MS Access or Crystal Reports or even Excel handy, do yourself a favor and don't try to generate this report using a combination of Query Analyzer and Notepad-- use the proper tools for the job.  But that's a whole 'nother blog I suppose ....)

Step 2 -- Decide on New Primary Key vs. Unique Constraint

Once you have scrubbed the data and determined that columns A and B form a good key, you have a choice to make:  Update all foreign tables to reference TableX by columns A and B, or leave ID as the primary key and add a unique constraint on columns A,B.  In many cases, it will be easier to choose the second option.  In my opinion, while it's not ideal, it's OK to leave that identity as your PK if you must, as long as you add a true natural unique constraint on the logical key of your table.  It does make it much easier if you have many tables referencing TableX, because if you alter the PK, you will need to alter all of those other tables.  Often, this is not a desirable (or possible) option.

Step 3 -- Create a Translation Table

If we are going to delete duplicate rows, we need to map all of the rows we are about to delete to the rows we are going to keep.  Similiar to the SELECT we ran earlier, we need to generate a temporary table that will map old ID's to the new ID's.  It should not be important which ID we will keep, so we will always keep the MIN() ID per group of matching rows. All others will be deleted.

Generate a temporary "Translation" table using a method similiar to this:

SELECT t2.ID as OldID, t1.MinID as NewID
INTO #Translation
FROM
  (SELECT MIN(ID) as MinID, A, B
   FROM TableX
   GROUP BY A,B) t1
INNER JOIN
  TableX t2 ON t1.A = t2.A AND t1.B = t2.B

Notice the above is very similiar to the first SELECT we ran earlier.  All we need here are two columns -- OldID and NewID.

Step 4 -- Update all Foreign Key References

Now that we have our translation table, you must identify all tables that reference TableX.  You can use sp_depends to help you out with this.

Once you have that list, you must then run an UPDATE on each of these tables like this:

UPDATE TableY
SET TableY.TableXID = NewID
FROM TableY
INNER JOIN #Translation T
ON TableY.TableXID = T.OldID
WHERE T.OldID <> T.NewID

You must run an update similiar to this for all of the tables and columns that you can find.  After this is done, all foreign key references to TableX will only reference the Minimum ID per group of duplicates.  The rest of the rows -- the ones we are going to delete -- should no longer have any foreign references to their ID's.

Note that even if your foreign key references are set to CASCADE updates, that will not help you in this scenario because you are not updating the duplicate rows -- you are deleting them completely.

Step 5 -- Remove the Duplicates

Now that we have taken care of everything else, we can delete the duplicate rows.  One way to do it is like this:

DELETE FROM TableX
WHERE TableX.ID Not IN (SELECT NewID FROM #Translation)

If the DELETE doesn't work, then you still have foreign key references to fix  -- go back to step 4 and double-check everything.

Step 6 -- Add a Unique Constraint

We're not done yet! We need to avoid this mess in the future by adding a UNIQUE CONSTRAINT on columns A and B:

ALTER TABLE TableX
ADD CONSTRAINT TableX_UC UNIQUE (A,b)

Assuming nothing in your table has changed since you generated the #Translation table and you have properly completed step 5, this should work with no problems.  As mentioned earlier, we are taking the approach here of leaving ID as the primary key of the table, but also adding this unique constraint for data integrity purposes.

----

That should do it -- your data is scrubbed, duplicates are removed, foreign key references are resolved, and you are protected from duplicate data in the future.   

Now you can get to work on normalizing all the tables and removing all the dynamic SQL and cursors !

see also:

Legacy Comments


rudy
2004-10-28
re: Deleting Duplicates And Resolving Foreign Key References
very nice article

*bookmarked*

Hani Varoqua
2006-01-18
re: Deleting Duplicates And Resolving Foreign Key References
Excellent and very helpful.

saurabh
2008-12-02
How to delete data from both table, which have foreignkey relation with nnother one
help!.......................

Land
2009-09-09
re: Delete Duplicates And Resolve Foreign Key References in SQL
I will fuck your sister Gandoooooooo