Sometimes, it appears that a necessary solution to common SQL problems is to join a table to itself. While self-joins do indeed have their place, and can be very powerful and useful, often times there is a much easier and more efficient way to get the results you need when querying a single table.
For example, suppose we have "budget" and "actual" transactions together in a single table like this:
create table GLTrans
(
TranID int primary key,
Company varchar(10) not null,
Type varchar(10) not null,
TimePeriod int not null,
Amount money not null
)
insert into GLTrans
select 1,'Company A', 'Actual', 1,20 union all
select 2,'Company A', 'Budget', 1,15 union all
select 3,'Company A', 'Actual', 2,30 union all
select 4,'Company A', 'Budget', 2,50 union all
select 5,'Company B', 'Actual', 1,25 union all
select 6,'Company B', 'Budget', 1,30 union all
select 8,'Company B', 'Budget', 2,25
If we'd like to see a variance (Actual-Budget) calculation for all companies, it is often thought that a self-join is necessary, and the SELECT might be written like this:
select a.Company, a.Actual, b.Budget, a.Actual-b.Budget as Variance
from
(select company, sum(Amount) as Actual
from GLTrans
where type='Actual'
group by company
) a
inner join
(select company, sum(Amount) as Budget
from GLTrans
where type='Budget'
group by company
) b
on a.company = b.company
Company Actual Budget Variance
---------- --------------------- --------------------- ---------------------
Company A 50.00 65.00 -15.00
Company B 25.00 55.00 -30.00
(2 row(s) affected)
Basically, we have one derived table that retrieves the actual per company, another that retrieves the budget per company, and we join them together. However, in this case, there is no need for a join at all! All of the data we need is already in one table, and a simple GROUP BY combined with CASE does the trick perfectly, and also much more efficiently:
select x.*, Actual-Budget as Variance
from
(
select
company,
sum(case when type='Actual' then amount else 0 end) as Actual,
sum(case when type='Budget' then amount else 0 end) as Budget
from
GLTrans
group by
company
) x
company Actual Budget Variance
---------- --------------------- --------------------- ---------------------
Company A 50.00 65.00 -15.00
Company B 25.00 55.00 -30.00
(2 row(s) affected)
There are other advantages as well; suppose we add a company with an Actual transaction but no Budget transaction, and one with the opposite situation:
insert into GLTrans
select 9,'Company C','Actual',1,20 union all
select 10,'Company D','Budget',1,15
If we run the SELECT with the self-join, we don't get results back for either of those companies. (Try it!) We must use an ugly
FULL OUTER JOIN to get the results we need. However, with the GROUP BY solution, the results come out perfectly:
company Actual Budget Variance
---------- --------------------- --------------------- ---------------------
Company A 50.00 65.00 -15.00
Company B 25.00 55.00 -30.00
Company C 20.00 0.00 20.00
Company D 0.00 15.00 -15.00
(4 row(s) affected)
Another situation is if you need to checking to see if a one-to-many relation contains a specific set of values. For example, suppose we have a table like this:
create table Example2
(
Letter char(1) not null,
Number int not null,
primary key (Letter,Number)
)
insert into Example2
select 'A',1 union all
select 'A',2 union all
select 'A',3 union all
select 'B',2 union all
select 'B',3 union all
select 'C',2 union all
select 'D',3 union all
select 'E',1 union all
select 'E',2 union all
select 'E',3
And we'd like to find out which letters have both the numbers 2 and 3 in this table. Again, many times people think a self-join is the way to go and write something like this:
select
a.Letter
from
Example2 a
inner join
Example2 b on a.letter = b.letter
where
a.number = 2 and b.number = 3
Letter
------
A
B
E
(3 row(s) affected)
And that does indeed work, but it is not as efficient as can be. Why? Because it requires two passes through the table, but a simple GROUP BY without any joins at all will do the job just fine:
select
Letter
from
Example2
where
number in (2,3)
group by
Letter
having
count(*)=2
Letter
------
A
B
E
(3 row(s) affected)
In addition to being faster and shorter, the GROUP BY has another benefit: it is much more flexible. This SELECT can be easily adjusted to find N matching numbers without too much modification at all. For example, to find all of the letters with the numbers 1,2 and 3, using JOINs we would have to add another self-join like this:
select
a.Letter
from
Example2 a
inner join
Example2 b on a.letter = b.letter
inner join
Example2 c on a.letter = c.letter
where
a.number = 2 and b.number = 3 and c.number =1
Letter
------
A
E
(2 row(s) affected)
But with our GROUP BY technique, we simply edit our WHERE and HAVING clauses:
select
Letter
from
Example2
where
number in (1,2,3)
group by
Letter
having
count(*)=3
Letter
------
A
E
(2 row(s) affected)
This can be extended to find any number of matching values, and we can even put them in a table and JOIN to it. That would be impossible to do using self-joins without some crazy dynamic SQL going on.
So, the next time you want to query a table and you think a self-join is necessary, consider first if a GROUP BY might do the trick. Of course, self-joins have their place, and often can and should be used, but sometimes we can do things more efficiently without any joins at all. By summarizing the table properly, and making clever use of CASE expressions, you can often get the results you need with only one efficient pass through the table.
see also: