As many of you know, I strongly recommend that you avoid using
RIGHT OUTER JOINs, since they make your SQL code less readable and are easily rewritten as
LEFT OUTER JOINs. In addition, I have yet to find a situation where a
FULL OUTER JOIN makes sense or is necessary -- I have found that in just about every case other techniques work better. (Feel free to suggest some FULL OUTER JOIN situations in the comments and we can discuss)
Let's take a common situation where you need to "merge" the data from two tables, and rows may or may not exist in
either table. For example, suppose we have budgets in one table and actuals in another and we wish to present them side by side. Many people would use a FULL OUTER JOIN to accomplish this, perhaps writing something like this:
SELECT
coalesce(a.company,b.company) as company,
coalesce(a.account, b.account) as account,
coalesce(a.year,b.year) as year,
coalesce(a.month, b.month) as month,
coalesce(a.amount,0) as Actual,
coalesce(b.amount,0) as Budget
FROM
Actuals a
FULL OUTER JOIN
Budgets b on
a.company = b.company and
a.account = b.account and
a.year = b.year and
a.month = b.month
The above FULL OUTER JOIN will effectively "merge" the two tables and allow you to see all actuals and budgets for every company/account/year/month, and it will show values from either table even if there is not a matching value in the other table. Essentially, it gets the job done.
There are some important things to note in the above, however:
- There is no relation between the Budgets and Actuals table (i.e., they do not have a 1:1 or 1:M relationship to each other) yet we are joining them together! To me, this does not make logical sense.
- Suppose that the PK of the Budgets table allows for more than 1 budget row per Company/Account/Year/Month -- this would result in two Budget rows matching a single Actual row, duplicating the Actual amount. You must be absolutely sure when doing a FULL OUTER JOIN with any sort of totaling that the two tables will never have more than 1 matching row to the other table.
- What is the "driving" table behind this SQL statement? Even though we are clearly selecting FROM the Actuals table, the Actuals do not "drive" our results. And even though we are joining FROM the Actuals TO the Budgets, some rows will have data only from the Budget table without matching Actual rows. To me, this is difficult to read and logically interpret; it doesn't follow the standard "select from table A join to table B" logic that is clear and easy to understand and work with. (note: this is basically my same argument against RIGHT OUTER JOINs.)
- Every column that is returned from either table is potentially nullable. All of them. You cannot reference a column in either table without handling the case where that column might have a null value. Thus, every column in either table must be wrapped in an ISNULL(), CASE, or COALESCE() expression. Most importantly: this includes all non-nullable primary key columns in both tables!
- Once you've wrapped every column in an expression, no further use of indexes from those columns can be used. For example, if I join the result of this FULL OUTER JOIN to other tables to show Account or Company names or information, no existing indexes on the Actual or Budget table can be used on that join since every column is an expression! Thus, as soon as you use a FULL OUTER JOIN, you completely eliminate all indexes from both tables involved.
- Since no indexes are usable in the results, any sorting done on the results is done on expressions and it will not be optimally efficient as well.
So, based on the above, I feel there's two basic issues with FULL OUTER JOINS: The code itself is difficult to interpret and isn't especially clear, and the result returned is just a big mess of nullable columns all wrapped in expressions which isn't a clean and efficient set of data to work with.
I have always felt that it is very important in any SELECT to clearly establish your FROM clause as the primary, "driving" data source, and then to join from that primary source to the auxiliary tables or SQL statements via JOINS. However, in our example and with FULL OUTER JOINs in general, we don't really want to relate one table to another, we want to MERGE these two tables together. When merging the data from two tables, doesn't a
UNION make more sense? Why express our intentions as a (FULL OUTER) JOIN when the true operation you are after is actually a UNION?
Consider this SQL statement as an alternative:
SELECT
company,
account,
year,
month,
sum(actual) as actual,
sum(budget) as budget
FROM
(
SELECT
company, account, year, month, amount as actual, 0 as budget
FROM
Actuals
UNION ALL
SELECT
company, account, year, month, 0 as actual, amount as budget
FROM
Budgets
) x
GROUP BY
company, account, year, month
(note: we are basically using the technique I described
here).
Some notes:
- Since there is technically no direct relation between Budgets and Actuals (i.e., there is not a 1:1 or 1:M relation between the two tables) this SQL make more sense because it is not implying or stating that we are joining these two tables together.
- If there are multiple budget amounts that match a single Actual amount (or vice versa) for a given company/account/year/month, the correct totals are still calculated -- since we are not joining the two tables, the Actual row will not be duplicated if it matches multiple Budget rows.
- The UNION makes it very clear that we are taking the results from two tables together, and merging them into 1 set of rows. The FROM clause truly and accurately describes the primary data that this SELECT will use.
- The primary key columns returned are not wrapped in a COALESCE() functions; they can never be NULL since they come directly from one of the two tables.
- Joins or sorts on our indexed columns can now be used.
Even if the UNION ended up being a little longer or even in some cases slightly less efficient, I still feel it is clearer and more readable to use a UNION instead of a FULL OUTER JOIN and it is usually worth the cost.
Other ways of handling this situation would be to use a CROSS JOIN as well, which might even be necessary depending on the results that we want. If we want to always return
all combinations of Companies/Accounts/Months/Years even if there is no Budget or Actual data available for that combination, a CROSS JOIN is the answer you need:
SELECT
All.Company,
All.Account,
All.Year,
All.Month,
coalesce(Actuals.Amount,0) as Actual,
coalesce(Budgets.Amount,0) as Budget
FROM
(
SELECT
C.Company, A.Account, M.Year, M.Month
FROM
Companies C
CROSS JOIN
Accounts A
CROSS JOIN
Months M
WHERE
M.Year = 2006 -- or whatever criteria you need here ....
) All
LEFT OUTER JOIN
Actuals
ON Actuals.company = All.Company and
Actuals.Account = All.Account and
Actuals.Year = All.Year and
Actuals.Month = All.Month
LEFT OUTER JOIN
Budgets
ON Budgets.company = All.Company and
Budgets.Account = All.Account and
Budgets.Year = All.Year and
Budgets.Month = All.Month
(The technique shown above is discussed further
here.)
In the above, our primary "driving" data source is a series of CROSS JOINS that produces the combination of rows we need to return, and then from there we do two separate LEFT OUTER JOINS to our transactional tables. (Note that, as with a FULL OUTER JOIN, if the primary key of either the Budgets or the Actuals table isn't Co/Acct/Year/Month, you should join instead to summarized derived tables that group on those key columns to prevent duplicating rows in the result.)
Thus, even though it seems that maybe a FULL OUTER JOIN or UNION is necessary to "merge" or two tables, sometimes a CROSS JOIN/LEFT OUTER JOIN can work as well, and it indeed fulfills a different requirement that is not possible (or easily achieved) with those other techniques by guaranteeing the exact set of rows to be returned, even if no matching transactions exist.
In conclusion, think carefully before unleashing your next FULL OUTER JOIN. I'm sure it has its place, and someday I'll see a situation where it makes sense for me, but for now, I'm going to stick with UNION, CROSS JOINS and LEFT OUTER JOIN to solve these SQL problems. After all, who wants to wrap every single column in a coalesce() function when it's so damn hard to spell !
Updates:
- I have published a test SQL script here that you can play around with to see some of the performance differences between a FULL OUTER JOIN and an UNION ALL (more on this in the comments). Preliminary testing shows that UNION ALL is about twice as fast.
- There is also a script here comparing a FULL OUTER JOIN with an equivalent CROSS JOIN. If the columns you are joining/grouping on are foreign key references to indexed tables in your database, a CROSS JOIN also can be slightly more efficient (in addition to being shorter and clearer) than using a FULL OUTER JOIN.
see also: