Relational division

Wed Jun 30, 2010 by Peter Larsson in optimization, sql-server-2008, algorithms, sql-server-2005, sql-server-2000

I came across an interesting post on Microsoft SQL Server forum this afternoon. It was a question about Relational algebra and the poster wanted to have an efficient query to solve his problem.
The problem could be solved with relational division, but there is no such operator in SQL Server. Maybe there will be some day.
For a fully working solution, see http://weblogs.sqlteam.com/peterl/archive/2010/07/02/Proper-Relational-Division-With-Sets.aspx

But for now there is no such operator, so we as developers have to find our own ways.
First prepare and populate some sample data

-- Prepare sample data
DECLARE @Sample TABLE
        (
            ParentID INT NOT NULL,
            Keyword VARCHAR(25) NOT NULL,
            UNIQUE (ParentID, Keyword)
        )

-- Populate sample data
INSERT @Sample
        (
            ParentID,
            Keyword
        )
VALUES (1, 'one'),
        (1, 'two'),
        (1, 'three'),
        (1, 'four'),
        (2, 'one'),
        (2, 'two'),
        (2, 'three'),
        (3, 'one'),
        (3, 'two')

People had already been active and posted some solutions, of which this common query was present.

SELECT     s.ParentID
FROM        @Sample AS s
WHERE       s.Keyword IN ('one', 'two', 'three')
GROUP BY    s.ParentID
HAVING     COUNT(DISTINCT s.Keyword) = 3
            AND COUNT(DISTINCT s.Keyword) = (SELECT COUNT(*) FROM @Sample AS x WHERE x.ParentID = s.ParentID)

and this type of query

SELECT      s.ParentID
FROM        @Sample AS s
WHERE       s.Keyword IN ('one', 'two', 'three')
            AND NOT EXISTS (
                            SELECT *
                            FROM    @Sample AS x
                            WHERE   x.ParentID = s.ParentID
                                    AND x.Keyword NOT IN ('one', 'two', 'three')
                           )
GROUP BY    s.ParentID
HAVING      COUNT(DISTINCT s.Keyword) = 3

And even a XML query!

;WITH AggStr
AS (
        SELECT ParentId,
                (
                    SELECT      CAST(',' AS VARCHAR(MAX)) + c.Keyword
                    FROM        @Sample AS c
                    WHERE       c.ParentID = p.ParentID
                    ORDER BY    c.Keyword
                    FOR XML     PATH('')
                ) AS c1
        FROM    (
                    SELECT DISTINCT
                            ParentID
                    FROM    @Sample
                ) AS p
)
SELECT ParentID
FROM    AggStr
WHERE   c1 = ',one,three,two'

The good thing is that all three produce the same wanted result but the bad thing is the inefficient execution plans. Then one poster did his homework and read about Mr Celko and translated his algorithm to the current problem, and then the query looked like this

SELECT      ParentID
FROM        (
                SELECT ParentID,
                        Keyword,
                        COUNT(*) OVER (PARTITION BY ParentID) AS cnt
                FROM    @Sample
            ) AS w
WHERE       Keyword IN ('one', 'two', 'three')
            AND cnt = 3
GROUP BY    ParentID
HAVING      COUNT(cnt) = 3

With these queries in mind, I thought about the problem and realized the problem did in fact have a much simpler solution.
The query I came up with is the simplest of them all, and just does one pass of the source table. Yes, only one pass just as the first Celko query for relational division, but without the internal worktable.
This is the query I came up with

-- Peso
SELECT      ParentID
FROM       @Sample
GROUP BY    ParentID
HAVING      MIN(CASE WHEN Keyword IN ('one', 'two', 'three') THEN 1 ELSE 0 END) = 1
            AND SUM(CASE WHEN Keyword IN ('one', 'two', 'three') THEN 1 ELSE 0 END) = 3

How does the query work? The second aggregation filtering just makes sure all three keywords are present.
But the first aggregation filter? What does it do? To simplify, I just write that it takes care of the modulo part of the relational division. There cannot be a "fractional" part of the relational division, because it means that particular ParentID has more keywords than wanted.

Simple as that.

//Peso

PS. These are the textual execution plans for the four types of queries and then mine.

|--Filter(WHERE:([Expr1003]=CASE WHEN [Expr1007] IS NULL THEN (0) ELSE [Expr1007] END))
       |--Nested Loops(Left Outer Join, OUTER REFERENCES:([s].[ParentID]))
            |--Filter(WHERE:([Expr1003]=(3)))
            |    |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1014],0)))
            |         |--Stream Aggregate(GROUP BY:([s].[ParentID]) DEFINE:([Expr1014]=Count(*)))
            |              |--Index Scan(OBJECT:(@Sample AS [s]), WHERE:(@Sample.[Keyword] as [s].[Keyword]='one' OR @Sample.[Keyword] as [s].[Keyword]='three' OR @Sample.[Keyword] as [s].[Keyword]='two') ORDERED FORWARD)
            |--Compute Scalar(DEFINE:([Expr1007]=CONVERT_IMPLICIT(int,[Expr1015],0)))
                 |--Stream Aggregate(DEFINE:([Expr1015]=Count(*)))
                      |--Index Seek(OBJECT:(@Sample AS [x]), SEEK:([x].[ParentID]=@Sample.[ParentID] as [s].[ParentID]) ORDERED FORWARD)

|--Filter(WHERE:([Expr1007]=(3)))
       |--Compute Scalar(DEFINE:([Expr1007]=CONVERT_IMPLICIT(int,[Expr1010],0)))
            |--Stream Aggregate(GROUP BY:([s].[ParentID]) DEFINE:([Expr1010]=Count(*)))
                 |--Nested Loops(Left Anti Semi Join, OUTER REFERENCES:([s].[ParentID]))
                      |--Index Scan(OBJECT:(@Sample AS [s]), WHERE:(@Sample.[Keyword] as [s].[Keyword]='one' OR @Sample.[Keyword] as [s].[Keyword]='three' OR @Sample.[Keyword] as [s].[Keyword]='two') ORDERED FORWARD)
                      |--Index Seek(OBJECT:(@Sample AS [x]), SEEK:([x].[ParentID]=@Sample.[ParentID] as [s].[ParentID]), WHERE:(@Sample.[Keyword] as [x].[Keyword]<>'one' AND @Sample.[Keyword] as [x].[Keyword]<>'three' AND @Sample.[Keyword] as [x].[Keyword]<>'two') ORDERED FORWARD)

|--Filter(WHERE:([Expr1008]=N',one,three,two'))
       |--Nested Loops(Inner Join, OUTER REFERENCES:([ParentID]))
            |--Stream Aggregate(GROUP BY:([ParentID]))
            |    |--Index Scan(OBJECT:(@sample), ORDERED FORWARD)
            |--UDX(([Expr1007], [C].[Keyword]))
                 |--Compute Scalar(DEFINE:([Expr1007]=CONVERT(varchar(max),',',0)+@sample.[Keyword] as [C].[Keyword]))
                      |--Index Seek(OBJECT:(@sample AS [C]), SEEK:([C].[ParentID]=[ParentID]) ORDERED FORWARD)

   |--Filter(WHERE:([Expr1005]=(3)))
       |--Compute Scalar(DEFINE:([Expr1005]=CONVERT_IMPLICIT(int,[Expr1010],0)))
            |--Stream Aggregate(GROUP BY:([ParentID]) DEFINE:([Expr1010]=Count(*)))
                 |--Filter(WHERE:(([Keyword]='one' OR [Keyword]='three' OR [Keyword]='two') AND [Expr1004]=(3)))
                      |--Nested Loops(Inner Join)
                           |--Table Spool
                           |    |--Segment
                           |         |--Index Scan(OBJECT:(@Sample), ORDERED FORWARD)
                           |--Nested Loops(Inner Join, WHERE:((1)))
                                |--Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1009],0)))
                                |    |--Stream Aggregate(DEFINE:([Expr1009]=Count(*)))
                                |         |--Table Spool
                                |--Table Spool

|--Filter(WHERE:([Expr1004]=(1) AND [Expr1005]=(3)))
       |--Stream Aggregate(GROUP BY:([ParentID]) DEFINE:([Expr1004]=MIN([Expr1006]), [Expr1005]=SUM([Expr1006])))
            |--Compute Scalar(DEFINE:([Expr1006]=CASE WHEN [Keyword]='three' OR [Keyword]='two' OR [Keyword]='one' THEN (1) ELSE (0) END))
                 |--Index Scan(OBJECT:(@Sample), ORDERED FORWARD)

Legacy Comments

Joe Celko
2010-07-01

re: Relational division
exec [News_GetRedirectsForCampaignFilterByDate_testing_2] @startDate='2010-01-24 00:00:00',@endDate='2010-06-25 00:00:00',@LandingPageABID=7
exec [News_GetRedirectsForCampaignFilterByDate_testing] @startDate='2010-01-24 00:00:00',@endDate='2010-06-25 00:00:00',@LandingPageABID=7
exec sp_News_GetRedirectsForCampaignFilterByDate @startDate='2010-01-24 00:00:00',@endDate='2010-06-25 00:00:00',@LandingPageABID=7

-- Prepare sample data
CREATE TABLE Dividend
(group_id INTEGER NOT NULL,
item_name VARCHAR(10) NOT NULL,
PRIMARY KEY (group_id, item_name)
);
GO

-- Populate sample data
INSERT INTO Dividend (group_id, item_name)
VALUES (1, 'one'),
(1, 'two'),
(1, 'three'),
(1, 'four'),
(2, 'one'),
(2, 'two'),
(2, 'three'),
(3, 'one'),
(3, 'two');
GO

CREATE TABLE Divisor
(item_name VARCHAR(10) NOT NULL PRIMARY KEY);
GO
-- Populate sample data
INSERT INTO Divisor(item_name)
VALUES ('one'), ('two'), ('three');
GO

SELECT D1.group_id
FROM Dividend AS D1
WHERE D1.item_name IN (SELECT item_name FROM Divisor)
GROUP BY D1.group_id
HAVING COUNT(DISTINCT D1.item_name)
= (SELECT COUNT(*) FROM Divisor)
AND COUNT(DISTINCT D1.item_name)
= (SELECT COUNT(*)
FROM Dividend AS D2
WHERE D2.group_id = D1.group_id);

SELECT D1.group_id
FROM Dividend AS D1
WHERE D1.item_name IN (SELECT item_name FROM Divisor)
AND NOT EXISTS
(SELECT *
FROM Dividend AS D2
WHERE D2.group_id = D1.group_id
AND D2.item_name NOT IN (SELECT item_name FROM Divisor)
)
GROUP BY D1.group_id
HAVING COUNT(DISTINCT D1.item_name)
= (SELECT COUNT(*) FROM Divisor)

SELECT D1.group_id
FROM (SELECT group_id, item_name,
COUNT(*) OVER (PARTITION BY group_id) AS cnt
FROM Dividend
) AS D1
WHERE D1.item_name
IN (SELECT item_name FROM Divisor)
AND cnt = (SELECT COUNT(*) FROM Divisor)
GROUP BY D1.group_id
HAVING COUNT(D1.cnt) = (SELECT COUNT(*) FROM Divisor);

SELECT group_id
FROM Dividend
GROUP BY group_id
HAVING MIN(CASE WHEN item_name
IN ('one', 'two', 'three')
THEN 1 ELSE 0 END) = 1
AND SUM(CASE WHEN item_name
IN ('one', 'two', 'three')
THEN 1 ELSE 0 END) = 3;

-- fails because of an aggregate or a subquery inside an aggregate function.
SELECT group_id
FROM Dividend
GROUP BY group_id
HAVING MIN(CASE WHEN item_name
IN (SELECT item_name FROM Divisor)
THEN 1 ELSE 0 END) = 1
AND SUM(CASE WHEN item_name
IN (SELECT item_name FROM Divisor)
THEN 1 ELSE 0 END) = (SELECT COUNT(*) FROM Divisor);

-- fails because of an aggregate or a subquery inside an aggregate function.

;WITH Divisor2
AS
(SELECT group_id, MIN(CASE WHEN item_name
IN (SELECT item_name FROM Divisor)
THEN 1 ELSE 0 END)
OVER(PARTITION BY group_id)
AS single,
SUM(CASE WHEN item_name
IN (SELECT item_name FROM Divisor)
THEN 1 ELSE 0 END)
OVER(PARTITION BY group_id)
AS full_basket
FROM Dividend)
SELECT D.group_id
FROM Dividend AS D, Divisor2
WHERE D.group_id = Divisor2.group_id
AND Divisor2.single = 1
AND Divisor2.full_basket = (SELECT COUNT(*) FROM Divisor)
GROUP BY D.group_id;