Jeff Smith Blog

Random Thoughts & Cartesian Products with Microsoft SQL Server

Using GROUP BY to avoid self-joins

Sometimes, it appears that a necessary solution to common SQL problems is to join a table to itself.   While self-joins do indeed have their place, and can be very powerful and useful, often times there is a much easier and more efficient way to get the results you need when querying a single table.

For example, suppose we have "budget" and "actual" transactions together in a single table like this:

create table GLTrans
(
    TranID int primary key,

    Company varchar(10) not null,
    Type varchar(10) not null,
    TimePeriod int not null,
    Amount money not null
)

insert into GLTrans
select 1,'Company A', 'Actual', 1,20 union all
select 2,'Company A', 'Budget', 1,15 union all
select 3,'Company A', 'Actual', 2,30 union all
select 4,'Company A', 'Budget', 2,50 union all
select 5,'Company B', 'Actual', 1,25 union all
select 6,'Company B', 'Budget', 1,30 union all
select 8,'Company B', 'Budget', 2,25

If we'd like to see a variance (Actual-Budget) calculation for all companies, it is often thought that a self-join is necessary, and the SELECT might be written like this:

select a.Company, a.Actual, b.Budget, a.Actual-b.Budget as Variance
from
    (select company, sum(Amount) as Actual
    from GLTrans
    where type='Actual'
    group by company
    ) a
inner join
    (select company, sum(Amount) as Budget
    from GLTrans
    where type='Budget'
    group by company
    ) b
    on a.company = b.company

Company    Actual                Budget                Variance
---------- --------------------- --------------------- ---------------------
Company A  50.00                 65.00                 -15.00
Company B  25.00                 55.00                 -30.00

(2 row(s) affected)


Basically, we have one derived table that retrieves the actual per company, another that retrieves the budget per company, and we join them together.  However, in this case, there is no need for a join at all!  All of the data we need is already in one table, and a simple GROUP BY combined with CASE does the trick perfectly, and also much more efficiently:

select x.*, Actual-Budget as Variance
from
(
    select
        company,

        sum(case when type='Actual' then amount else 0 end) as Actual,
        sum(case when type='Budget' then amount else 0 end) as Budget
    from
        GLTrans
    group by
        company

) x

company    Actual                Budget                Variance
---------- --------------------- --------------------- ---------------------
Company A  50.00                 65.00                 -15.00
Company B  25.00                 55.00                 -30.00

(2 row(s) affected)

There are other advantages as well; suppose we add a company with an Actual transaction but no Budget transaction, and one with the opposite situation:

insert into GLTrans
select 9,'Company C','Actual',1,20 union all
select 10,'Company D','Budget',1,15

If we run the SELECT with the self-join, we don't get results back for either of those companies.  (Try it!)  We must use an ugly FULL OUTER JOIN to get the results we need.  However, with the GROUP BY solution, the results come out perfectly:

company    Actual                Budget                Variance
---------- --------------------- --------------------- ---------------------
Company A  50.00                 65.00                 -15.00
Company B  25.00                 55.00                 -30.00
Company C  20.00                 0.00                  20.00
Company D  0.00                  15.00                 -15.00

(4 row(s) affected)

Another situation is if you need to checking to see if a one-to-many relation contains a specific set of values.  For example, suppose we have a table like this:

create table Example2
(
    Letter char(1) not null,

    Number int not null,
    primary key (Letter,Number)
)


insert into Example2
select 'A',1 union all
select 'A',2 union all
select 'A',3 union all
select 'B',2 union all
select 'B',3 union all
select 'C',2 union all
select 'D',3 union all
select 'E',1 union all
select 'E',2 union all
select 'E',3

And we'd like to find out which letters have both the numbers 2 and 3 in this table.   Again, many times people think a self-join is the way to go and write something like this:

select
    a.Letter

from
    Example2 a

inner join
    Example2 b on a.letter = b.letter

where
    a.number = 2 and b.number = 3


Letter
------
A
B
E

(3 row(s) affected)

And that does indeed work, but it is not as efficient as can be.  Why? Because it requires two passes through the table, but a simple GROUP BY without any joins at all will do the job just fine:

select
    Letter

from
    Example2

where
    number in (2,3)

group by
    Letter

having
    count(*)=2


Letter
------
A
B
E

(3 row(s) affected)

In addition to being faster and shorter, the GROUP BY has another benefit: it is much more flexible. This SELECT can be easily adjusted to find N matching numbers without too much modification at all. For example, to find all of the letters with the numbers 1,2 and 3, using JOINs we would have to add another self-join like this:

select
    a.Letter
from
    Example2 a
inner join
    Example2 b on a.letter = b.letter
inner join
    Example2 c on a.letter = c.letter

where
    a.number = 2 and b.number = 3 and c.number =1

Letter
------
A
E

(2 row(s) affected)

But with our GROUP BY technique, we simply edit our WHERE and HAVING clauses:

select
    Letter

from
    Example2

where
    number in (1,2,3)

group by
    Letter

having
    count(*)=3

Letter
------
A
E

(2 row(s) affected)

This can be extended to find any number of matching values, and we can even put them in a table and JOIN to it.  That would be impossible to do using self-joins without some crazy dynamic SQL going on.

So, the next time you want to query a table and you think a self-join is necessary, consider first if a GROUP BY might do the trick.  Of course, self-joins have their place, and often can and should be used, but sometimes we can do things more efficiently without any joins at all.  By summarizing the table properly, and making clever use of CASE expressions, you can often get the results you need with only one efficient pass through the table.

see also:

Legacy Comments


Saggi Neumann
2007-06-12
re: Using GROUP BY to avoid self-joins
Hi there,

Correct me if I'm wrong, but the GROUP BY query you posted for all letters that have 2 and 3, will also return results for those that have two 2s or two 3s and not only those that have a 2 and a 3, so it's not really equivalent to your original query.

The query for reference:

select
Letter
from
Example2
where
number in (2,3)
group by
Letter
having
count(*)=2

Cheers,
S. Neumann

Jeff
2007-06-12
re: Using GROUP BY to avoid self-joins
The table has a composite primary key of (Letter/Number), so it is not an issue. However, even without that pk constraint in place, you can easily just summarize the table first in a derived table, or use HAVING COUNT(Distinct Number) =2 as well.

Of course, in general, if you don't have well modeled natural primary keys in your data, lots of efficient sql techniques won't work very well.

Rick O
2007-06-12
re: Using GROUP BY to avoid self-joins
Saggi- I was going to point that out, too, until I went back and looked at the table definition:

primary key (Letter,Number)

Thus, each letter can only have one number.

A S
2008-10-03
re: Using GROUP BY to avoid self-joins
Can you please tell me whether a similar GROUP by can be used for the following without using a temporary table or cursor?

Key 1 Key 2
---------------
1 2
1 3
2 4
3 5
4 7
10 11
11 12
13 14

I would like the output to show the related rows in the chain:
Group 1 --> 1 , 2 , 3 , 4 , 5, 7 (1 is related to 2 and 3 based on the first two rows, 2 is realed to 4 based on the third row, and so on)
Group 2 --> 10, 11, 12
Group 3 --> 13, 14

Thanks
A S

Mahmood Hassan
2009-08-24
re: Using GROUP BY to avoid self-joins
Hi,

I have a Log Table. Table structure is like below.

PK Value DateTime NewValue oldValue
100 2009-01-01: 00:01:01 10 9

100 2009-01-002: 00:10:10 11 10

What would be the efficient ms sql query to pick the latest entry?

something like ?

SELECT PK Value, MAX(DateTime), NewValue, OldValue


Regards,
Mahmood