Thinking outside the box

Patron Saint of Lost Yaks
posts - 199, comments - 687, trackbacks - 4

My Links

Advertisement

News

Archives

Post Categories

Wednesday, June 30, 2010

Relational division

I came across an interesting post on Microsoft SQL Server forum this afternoon. It was a question about Relational algebra and the poster wanted to have an efficient query to solve his problem.
The problem could be solved with relational division, but there is no such operator in SQL Server. Maybe there will be some day.
For a fully working solution, see http://weblogs.sqlteam.com/peterl/archive/2010/07/02/Proper-Relational-Division-With-Sets.aspx

But for now there is no such operator, so we as developers have to find our own ways.
First prepare and populate some sample data


-- Prepare sample data
DECLARE @Sample TABLE
        (
            ParentID INT NOT NULL,
            Keyword VARCHAR(25) NOT NULL,
            UNIQUE (ParentID, Keyword)
        )

-- Populate sample data
INSERT  @Sample
        (
            ParentID,
            Keyword
        )
VALUES  (1, 'one'),
        (1, 'two'),
        (1, 'three'),
        (1, 'four'),
        (2, 'one'),
        (2, 'two'),
        (2, 'three'),
        (3, 'one'),
        (3, 'two')


People had already been active and posted some solutions, of which this common query was present.


SELECT      s.ParentID
FROM        @Sample AS s
WHERE       s.Keyword IN ('one', 'two', 'three')
GROUP BY    s.ParentID
HAVING      COUNT(DISTINCT s.Keyword) = 3
            AND COUNT(DISTINCT s.Keyword) = (SELECT COUNT(*) FROM @Sample AS x WHERE x.ParentID = s.ParentID)


and this type of query


SELECT      s.ParentID
FROM        @Sample AS s
WHERE       s.Keyword IN ('one', 'two', 'three') 
            AND NOT EXISTS (
                            SELECT  *
                            FROM    @Sample AS x
                            WHERE   x.ParentID = s.ParentID
                                    AND x.Keyword NOT IN ('one', 'two', 'three')
                           )
GROUP BY    s.ParentID
HAVING      COUNT(DISTINCT s.Keyword) = 3


And even a XML query!


;WITH AggStr
AS (
        SELECT  ParentId,
                (
                    SELECT      CAST(',' AS VARCHAR(MAX)) + c.Keyword
                    FROM        @Sample AS c
                    WHERE       c.ParentID = p.ParentID
                    ORDER BY    c.Keyword
                    FOR XML     PATH('')
                ) AS c1
        FROM    (
                    SELECT  DISTINCT
                            ParentID
                    FROM    @Sample
                ) AS p
)
SELECT  ParentID
FROM    AggStr
WHERE   c1 = ',one,three,two'


The good thing is that all three produce the same wanted result but the bad thing is the inefficient execution plans. Then one poster did his homework and read about Mr Celko and translated his algorithm to the current problem, and then the query looked like this


SELECT      ParentID
FROM        (
                SELECT  ParentID,
                        Keyword,
                        COUNT(*) OVER (PARTITION BY ParentID) AS cnt
                FROM    @Sample
            ) AS w
WHERE       Keyword IN ('one', 'two', 'three')
            AND cnt = 3
GROUP BY    ParentID
HAVING      COUNT(cnt) = 3


With these queries in mind, I thought about the problem and realized the problem did in fact have a much simpler solution.
The query I came up with is the simplest of them all, and just does one pass of the source table. Yes, only one pass just as the first Celko query for relational division, but without the internal worktable.
This is the query I came up with


-- Peso
SELECT      ParentID
FROM        @Sample
GROUP BY    ParentID
HAVING      MIN(CASE WHEN Keyword IN ('one', 'two', 'three') THEN 1 ELSE 0 END) = 1
            AND SUM(CASE WHEN Keyword IN ('one', 'two', 'three') THEN 1 ELSE 0 END) = 3


How does the query work? The second aggregation filtering just makes sure all three keywords are present.
But the first aggregation filter? What does it do? To simplify, I just write that it takes care of the modulo part of the relational division. There cannot be a "fractional" part of the relational division, because it means that particular ParentID has more keywords than wanted.

Simple as that.

//Peso


PS. These are the textual execution plans for the four types of queries and then mine.


  |--Filter(WHERE:([Expr1003]=CASE WHEN [Expr1007] IS NULL THEN (0) ELSE [Expr1007] END))
       |--Nested Loops(Left Outer Join, OUTER REFERENCES:([s].[ParentID]))
            |--Filter(WHERE:([Expr1003]=(3)))
            |    |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1014],0)))
            |         |--Stream Aggregate(GROUP BY:([s].[ParentID]) DEFINE:([Expr1014]=Count(*)))
            |              |--Index Scan(OBJECT:(@Sample AS [s]),  WHERE:(@Sample.[Keyword] as [s].[Keyword]='one' OR @Sample.[Keyword] as [s].[Keyword]='three' OR @Sample.[Keyword] as [s].[Keyword]='two') ORDERED FORWARD)
            |--Compute Scalar(DEFINE:([Expr1007]=CONVERT_IMPLICIT(int,[Expr1015],0)))
                 |--Stream Aggregate(DEFINE:([Expr1015]=Count(*)))
                      |--Index Seek(OBJECT:(@Sample AS [x]), SEEK:([x].[ParentID]=@Sample.[ParentID] as [s].[ParentID]) ORDERED FORWARD)


  |--Filter(WHERE:([Expr1007]=(3)))
       |--Compute Scalar(DEFINE:([Expr1007]=CONVERT_IMPLICIT(int,[Expr1010],0)))
            |--Stream Aggregate(GROUP BY:([s].[ParentID]) DEFINE:([Expr1010]=Count(*)))
                 |--Nested Loops(Left Anti Semi Join, OUTER REFERENCES:([s].[ParentID]))
                      |--Index Scan(OBJECT:(@Sample AS [s]),  WHERE:(@Sample.[Keyword] as [s].[Keyword]='one' OR @Sample.[Keyword] as [s].[Keyword]='three' OR @Sample.[Keyword] as [s].[Keyword]='two') ORDERED FORWARD)
                      |--Index Seek(OBJECT:(@Sample AS [x]), SEEK:([x].[ParentID]=@Sample.[ParentID] as [s].[ParentID]),  WHERE:(@Sample.[Keyword] as [x].[Keyword]<>'one' AND @Sample.[Keyword] as [x].[Keyword]<>'three' AND @Sample.[Keyword] as [x].[Keyword]<>'two') ORDERED FORWARD)


 |--Filter(WHERE:([Expr1008]=N',one,three,two'))
       |--Nested Loops(Inner Join, OUTER REFERENCES:([ParentID]))
            |--Stream Aggregate(GROUP BY:([ParentID]))
            |    |--Index Scan(OBJECT:(@sample), ORDERED FORWARD)
            |--UDX(([Expr1007], [C].[Keyword]))
                 |--Compute Scalar(DEFINE:([Expr1007]=CONVERT(varchar(max),',',0)+@sample.[Keyword] as [C].[Keyword]))
                      |--Index Seek(OBJECT:(@sample AS [C]), SEEK:([C].[ParentID]=[ParentID]) ORDERED FORWARD)


   |--Filter(WHERE:([Expr1005]=(3)))
       |--Compute Scalar(DEFINE:([Expr1005]=CONVERT_IMPLICIT(int,[Expr1010],0)))
            |--Stream Aggregate(GROUP BY:([ParentID]) DEFINE:([Expr1010]=Count(*)))
                 |--Filter(WHERE:(([Keyword]='one' OR [Keyword]='three' OR [Keyword]='two') AND [Expr1004]=(3)))
                      |--Nested Loops(Inner Join)
                           |--Table Spool
                           |    |--Segment
                           |         |--Index Scan(OBJECT:(@Sample), ORDERED FORWARD)
                           |--Nested Loops(Inner Join, WHERE:((1)))
                                |--Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1009],0)))
                                |    |--Stream Aggregate(DEFINE:([Expr1009]=Count(*)))
                                |         |--Table Spool
                                |--Table Spool


  |--Filter(WHERE:([Expr1004]=(1) AND [Expr1005]=(3)))
       |--Stream Aggregate(GROUP BY:([ParentID]) DEFINE:([Expr1004]=MIN([Expr1006]), [Expr1005]=SUM([Expr1006])))
            |--Compute Scalar(DEFINE:([Expr1006]=CASE WHEN [Keyword]='three' OR [Keyword]='two' OR [Keyword]='one' THEN (1) ELSE (0) END))
                 |--Index Scan(OBJECT:(@Sample), ORDERED FORWARD)

 

posted @ Wednesday, June 30, 2010 9:41 PM | Feedback (1) | Filed Under [ Optimization SQL Server 2008 Algorithms SQL Server 2005 SQL Server 2000 ]

Another bin-packaging algorithm using recursion and XML

This time I will show you an algorithm to do the dreaded bin-packaging using recursion and XML.
First, create some sample data like this

-- Prepare sample data
DECLARE @Sample TABLE
        (
            RowID INT IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
            Expense SMALLMONEY NOT NULL
        )
 
-- Populate sample data
INSERT  @Sample
        (
            Expense
        )
VALUES  (12.51),
        (45.63),
        (66.35),
        (92.66),
        (65.46),
        (54.01),
        (32.23),
        (27.16),
        (78.92),
        (14.58)
 
Next, we need to create a variable to hold the user's wanted total sum.

-- Prepare user supplied parameter
DECLARE @WantedSUM SMALLMONEY = 111.09

And we also need to create a temporary staging table to hold the valid combinations
 
-- Prepare temporary staging table
DECLARE @Temp TABLE
        (
            CombID INT IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
            TrackPath XML NOT NULL
        )

Now we only have to do the calculations!
Here I am using a special trick to get the unique combination, since the path of records 1>2>3 is the same as 1>3>2, 2>1>3, 2>3>1, 3>1>2 and 3>2>1. See explanation between combination and permutation at Wikipedia here.
To keep track of which records I already have used in the total sum, I simply remove the record id (RowID) from the Hits list.
And, to give the correct answer at the end, I build a XML string with visited RowID's building up the correct sum.
 
-- Calculate all possible permutations using recursion
;WITH ctePack(Total, Hits, TrackPath)
AS (
        SELECT  s.Expense AS Total,
                (SELECT '#' + CAST(x.RowID AS VARCHAR(MAX)) FROM @Sample AS x WHERE x.RowID <> s.RowID ORDER BY x.RowID FOR XML PATH('')) + '#' AS Hits,
                '<ID>' + CAST(s.RowID AS VARCHAR(MAX)) + '</ID>' AS TrackPath
        FROM    @Sample AS s
        WHERE   s.Expense <= @WantedSum
 
        UNION ALL
 
        SELECT      p.Total + s.Expense,
                    REPLACE(p.Hits, '#' + CAST(s.RowID AS VARCHAR(MAX)) + '#', '#') AS Hits,
                    p.TrackPath + '<ID>' + CAST(s.RowID AS VARCHAR(MAX)) + '</ID>' AS TrackPath
        FROM        @Sample AS s
        INNER JOIN  ctePack AS p ON p.Hits LIKE '%#' + CAST(s.RowID AS VARCHAR(MAX)) + '#%'
        WHERE       p.Total + s.Expense <= @WantedSum
)
INSERT      @Temp
            (
                TrackPath
            )
SELECT      MIN(TrackPath)
FROM        ctePack
WHERE       Total = @WantedSum
GROUP BY    Hits

When the iterations are over, and we have the wanted combniation(s), the task left is to report the records giving us the correct sum.
We also need the records grouped so that we can see which group each expense belong to. In same cases, one and the same record may used in multiple groups.
 
-- Display the final resultset
SELECT      t.CombID,
            s.RowID,
            s.Expense
FROM        @Temp AS t
CROSS APPLY t.TrackPath.nodes('/ID') AS f(n)
INNER JOIN  @Sample AS s ON s.RowID = f.n.value('.', 'INT')
ORDER BY    t.CombID,
            s.RowID

If you don't want to use XML, you can write the recursive cte like this, to get all included records directly.

-- Calculate all possible permutations using recursion
;WITH ctePack(RowID, Expense, Total, Tracker)
AS (
        SELECT  s.RowID,
                s.Expense,
                s.Expense AS Total,
                (
                    SELECT      '#' + CAST(x.RowID AS VARCHAR(MAX))
                    FROM        @Sample AS x
                    WHERE       x.RowID <> s.RowID
                    ORDER BY    x.RowID
                    FOR
XML     PATH('')
                ) + '#' AS Tracker
        FROM    @Sample AS s
        WHERE   s.Expense <= @WantedSum

        UNION ALL

        SELECT      s.RowID,
                    s.Expense,
                    p.Total + s.Expense,
                    REPLACE(p.Tracker, '#' + CAST(s.RowID AS VARCHAR(MAX)) + '#', '#') AS Tracker
        FROM        @Sample AS s
        INNER JOIN  ctePack AS p ON p.Tracker LIKE '%#' + CAST(s.RowID AS VARCHAR(MAX)) + '#%'
        WHERE       p.Total + s.Expense <= @WantedSum
)
SELECT  DISTINCT
        DENSE_RANK() OVER (ORDER BY Tracker) AS CombID,
        RowID,       
        Expense
FROM    ctePack
WHERE   Total = @WantedSum

posted @ Wednesday, June 30, 2010 1:44 PM | Feedback (1) | Filed Under [ Optimization SQL Server 2008 Algorithms SQL Server 2005 ]

Powered by:
Powered By Subtext Powered By ASP.NET