Thinking outside the box

Patron Saint of Lost Yaks
posts - 199, comments - 687, trackbacks - 4

My Links

Advertisement

News

Archives

Post Categories

Friday, July 02, 2010

Proper Relational Division With Sets

I got an email from Mr Celko and he correctly stated that my previous solution was not truly working with sets, so he posted me some solutions and proper sample data.
With this information at my hand, I started to investigate what really is needed to get this Relational Division to work properly with two sets; Dividend and Divisor.

Some of you know me well, and know I am not satisfied with just solving the problem. There have to be some tweaks, and I did that too with this solution. Not only is it only touching the Dividend table once and Divisor table once, you can also set if you want a division with no remainder (which means all records in Divisor should match and not a single record more), or allow a division with remainder (which means all the records should match and maybe more records).
Great? Just set 1 for "No remainder" and 0 for "Allow remainder".

Simple as that. So why does it work? Remember your old algebra? "Divide is the same thing as multiply with the inverse number..."


Now for the sample data (courtesy of Mr Celko)


CREATE TABLE    dbo.Dividend
                (
                    group_id INTEGER NOT NULL,
                    item_name VARCHAR(10) NOT NULL,
                    PRIMARY KEY (
                                    group_id,
                                    item_name
                                )
                )

INSERT INTO     dbo.Dividend
                (
                    group_id,
                    item_name
                )
VALUES          (1, 'one'),
                (1, 'two'),
                (1, 'three'),
                (1, 'four'),
                (2, 'one'),
                (2, 'two'),
                (2, 'three'),
                (3, 'one'),
                (3, 'two')

CREATE TABLE    dbo.Divisor
                (
                    item_name VARCHAR(10) NOT NULL PRIMARY KEY
                )

INSERT INTO     dbo.Divisor
                (
                    item_name
                )
VALUES          ('one'),
                ('two'),
                ('three')


Now for the 4 solutions posted by Mr Celko


-- Celko 1
SELECT      D1.group_id
FROM        Dividend AS D1
WHERE       D1.item_name IN (SELECT item_name FROM Divisor)
GROUP BY    D1.group_id
HAVING      COUNT(DISTINCT D1.item_name) = (SELECT COUNT(*) FROM Divisor)
            AND COUNT(DISTINCT D1.item_name) = (SELECT COUNT(*) FROM Dividend AS D2 WHERE D2.group_id = D1.group_id)

-- Celko 2
SELECT      D1.group_id
FROM        Dividend AS D1
WHERE       D1.item_name IN (SELECT item_name FROM Divisor)
            AND NOT EXISTS  (
                                SELECT  *
                                FROM    Dividend AS D2
                                WHERE   D2.group_id = D1.group_id
                                        AND D2.item_name NOT IN (SELECT item_name FROM Divisor)
                            )
GROUP BY    D1.group_id
HAVING      COUNT(DISTINCT D1.item_name) = (SELECT COUNT(*) FROM Divisor)

-- Celko 3
SELECT      D1.group_id
FROM        (
                SELECT  group_id,
                        item_name,
                        COUNT(*) OVER (PARTITION BY group_id) AS cnt
                FROM    Dividend
            ) AS D1
WHERE       D1.item_name IN (SELECT item_name FROM Divisor)
            AND cnt = (SELECT COUNT(*) FROM Divisor)
GROUP BY    D1.group_id
HAVING      COUNT(D1.cnt) = (SELECT COUNT(*) FROM Divisor)

--Celko 4
;WITH Divisor2
AS (
        SELECT  group_id,
                MIN(CASE WHEN item_name IN (SELECT item_name FROM Divisor) THEN 1 ELSE 0 END) OVER(PARTITION BY group_id) AS single,
                SUM(CASE WHEN item_name IN (SELECT item_name FROM Divisor) THEN 1 ELSE 0 END) OVER(PARTITION BY group_id) AS full_basket
        FROM    Dividend
)
SELECT      D.group_id
FROM        Dividend AS D,
            Divisor2
WHERE       D.group_id = Divisor2.group_id
            AND Divisor2.single = 1
            AND Divisor2.full_basket = (SELECT COUNT(*) FROM Divisor)
GROUP BY    D.group_id


You can copy and paste the code to a query window and run them. Investigate the execution plan and compare the 4 of them.
And now to my solution.


-- Peso 1
SELECT      group_id
FROM        (
                SELECT      t.group_id,
                            SUM(CASE WHEN t.item_name = n.item_name THEN 1 ELSE 0 END) AS cnt,
                            COUNT(*) AS Items
                FROM        dbo.Dividend AS t
                CROSS JOIN  dbo.Divisor AS n
                GROUP BY    t.group_id,
                            t.item_name
            ) AS d
GROUP BY    group_id
HAVING      SUM(cnt) = MIN(Items)
            AND MIN(cnt) >= 1    -- 1 means no remainder, 0 means remainder


After some challenging with MVP Adam Machanic, here is another version


-- Peso v2
SELECT      t.group_id
FROM        (
                SELECT      group_id,
                            COUNT(*) AS cnt
                FROM        dbo.Dividend
                GROUP BY    group_id
            ) AS kc
INNER JOIN  (
                SELECT  COUNT(*) AS cnt
                FROM    dbo.Divisor
            ) AS nc ON nc.cnt = kc.cnt
INNER JOIN  dbo.Dividend AS t ON t.group_id = kc.group_id
INNER JOIN  dbo.Divisor AS n ON n.item_name = t.item_name
GROUP BY    t.group_id
HAVING      COUNT(*) = MIN(nc.cnt)


Here is an algorithm (exact division) which is really fast, but not 100% accurate due to the implementation of CHECKSUM (see http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=70832).


-- Peso 3
SELECT
      group_id
FROM        (
                SELECT      group_id,
                            CHECKSUM_AGG(CHECKSUM(item_name)) AS ca1,
                            CHECKSUM_AGG(CHECKSUM(REVERSE(item_name))) AS ca2
                FROM        dbo.Dividend
                GROUP BY    group_id
            ) AS t
INNER JOIN  (
                SELECT  CHECKSUM_AGG(CHECKSUM(item_name)) AS ca1,
                        CHECKSUM_AGG(CHECKSUM(REVERSE(item_name))) AS ca2
                FROM    dbo.Divisor
            ) AS n ON n.ca1 = t.ca1
                AND n.ca2 = t.ca2


Now copy my solutions and compare them to the other 4.


-- Celko 1
Table 'Dividend'. Scan count 3, logical reads 6.
Table 'Divisor'. Scan count 2, logical reads 20.

-- Celko 2
Table 'Divisor'. Scan count 3, logical reads 32.
Table 'Dividend'. Scan count 4, logical reads 8.

-- Celko 3
Table 'Divisor'. Scan count 3, logical reads 10.
Table 'Worktable'. Scan count 3, logical reads 31.
Table 'Dividend'. Scan count 1, logical reads 2.

-- Celko 4
Table 'Dividend'. Scan count 2, logical reads 9.
Table 'Worktable'. Scan count 3, logical reads 31.
Table 'Divisor'. Scan count 1, logical reads 38.

-- Peso 1
Table 'Divisor'. Scan count 1, logical reads 19.
Table 'Dividend'. Scan count 1, logical reads 2.

-- Peso 2
Table 'Divisor'. Scan count 1, logical reads 8.
Table 'Dividend'. Scan count 2, logical reads 4.

-- Peso 3

Table 'Divisor'. Scan count 1, logical reads 2.
Table 'Dividend'. Scan count 1, logical reads 2.

So it seems my solution is cleaner and faster than the previous existing. But the best thing is yet hidden. My solution cares for multi-column division (just expand the CASE and GROUP BY clauses) whereas the previous 4 do not. Well, not easily anyway.
It will involve some replacing for IN with EXISTS, and some string concatenation for the DISTINCT clauses.


//Peso

posted @ Friday, July 02, 2010 12:48 AM | Feedback (36) | Filed Under [ Optimization SQL Server 2008 Algorithms SQL Server 2005 SQL Server 2000 ]

Powered by:
Powered By Subtext Powered By ASP.NET