# Thinking outside the box

Patron Saint of Lost Yaks

## Delete all subset records

Today I helped a guy out with a special request. His goal was to remove all records that are subsets or parts of another record, a superset record.

See this sample data

DECLARE       @Sample TABLE
(
recID INT IDENTITY(1, 1),
col1 VARCHAR(1),
col2 VARCHAR(2),
col3 VARCHAR(3),
userID INT
)

INSERT @Sample
SELECT 'A', 'B', 'C', 1 UNION ALL
SELECT 'A', 'B', ' ', 1 UNION ALL
SELECT 'A', ' ', 'C', 1 UNION ALL
SELECT 'F', ' ', 'C', 1 UNION ALL
SELECT ' ', 'M', ' ', 2 UNION ALL
SELECT 'T', 'M', 'O', 2 UNION ALL
SELECT ' ', 'M', 'O', 2 UNION ALL
SELECT 'X', 'M', 'O', 2 UNION ALL
SELECT ' ', ' ', 'Y', 3 UNION ALL
SELECT ' ', 'X', 'Y', 3

SELECT *
FROM   @Sample

recID col1   col2   col3   userID
1      A      B      C      1
2      A      B             1
3      A             C      1
4      F             C      1
5             M             2
6      T      M      O      2
7             M      O      2
8      X      M      O      2
9                    Y      3
10             X      Y      3

Records 2 and 3 are subsets of Record 1, so they should be deleted.
Record 5 is a subset of Records 6, 7 and 8, so it should be deleted.
Record 7 is a subset of Records 6 and 8, so it should be deleted.
Record 9 is a subset of Record 10, so it should be deleted.

Remember that records are to be kept with userID.

This is my suggestion how deal with all subset deletes.

DELETE        s
FROM          (
SELECT recID,
col1,
col2,
col3,
userID,
CASE WHEN col1 = '' THEN 0 ELSE 1 END
+ CASE WHEN col2 = '' THEN 0 ELSE 1 END
+ CASE WHEN col3 = '' THEN 0 ELSE 1 END AS theItems
FROM   @Sample
) AS s
INNER JOIN    (
SELECT       MIN(recID) AS minID,
col1,
col2,
col3,
userID,
CASE WHEN col1 = '' THEN 0 ELSE 1 END
+ CASE WHEN col2 = '' THEN 0 ELSE 1 END
+ CASE WHEN col3 = '' THEN 0 ELSE 1 END AS theItems
FROM         @Sample
GROUP BY     col1,
col2,
col3,
userID,
CASE WHEN col1 = '' THEN 0 ELSE 1 END
+ CASE WHEN col2 = '' THEN 0 ELSE 1 END
+ CASE WHEN col3 = '' THEN 0 ELSE 1 END
) AS w ON w.userID = s.userID
AND w.minID <> s.recID
AND w.theItems >= s.theItems
WHERE         w.col1 = COALESCE(NULLIF(s.col1, ''), w.col1)
AND w.col2 = COALESCE(NULLIF(s.col2, ''), w.col2)
AND w.col3 = COALESCE(NULLIF(s.col3, ''), w.col3)

SELECT *
FROM   @Sample

Print | posted on Wednesday, October 15, 2008 5:26 PM | Filed Under [ Optimization SQL Server 2008 Algorithms SQL Server 2005 SQL Server 2000 ]

## #re: Delete all subset records

I guess I'm missing something...

DELETE a FROM @Sample a WHERE EXISTS (SELECT * FROM @Sample b WHERE
a.userID = b.userID AND a.recID <> b.recID
AND (a.col1 = b.col1 OR a.col1 = '')
AND (a.col2 = b.col2 OR a.col2 = '')
AND (a.col3 = b.col3 OR a.col3 = ''))

SELECT *
FROM @Sample

BTW - your comment about 9 and 10 should be the other way round I think.
10/15/2008 6:46 PM | Casual Observer

## #re: Delete all subset records

Fixed. Thank you.

The reason with using the "sum of value columns" is to have a triangular join ( n * m / 2) instead of a cross join (n * m).
10/15/2008 8:14 PM |

## #re: Delete all subset records

However, your suggestion needs to be changed to this

DELETE a
FROM @Sample AS a
INNER JOIN (
SELECT MIN(recID) AS recID,
col1,
col2,
col3,
userID
FROM @Sample
GROUP BY col1,
col2,
col3,
userID
) AS b ON b.userID = a.userID
AND b.recID <> a.recID
WHERE (a.col1 = b.col1 OR a.col1 = '')
AND (a.col2 = b.col2 OR a.col2 = '')
AND (a.col3 = b.col3 OR a.col3 = '')

for taking care of pure duplicates too.
10/15/2008 10:09 PM |
Comments have been closed on this topic.