Simplify Your SQL with Variables and Derived Tables (or Common Table Expressions)

Thu Dec 20, 2007 by Jeff Smith in t-sql, techniques, efficiency, group-by

As with any programming language, it is important in SQL to keep your code short, clear and concise. Here are two quick tips that I find are very helpful in obtaining this goal.

Tip 1: For any relatively complicated constant expression, always declare a variable

Consider the following:

select *
from Transactions
where TranDate >= {some long, complicated expression to determine the start date} and
TranDate < {some long, complicated expression to determine the end date}

This is one of the most common things that I see in the SQLTeam forums, from both the people asking questions and the people giving answers. If the starting date and ending date range are constants for the entire SELECT (i.e., they don't vary by row, they are based on the current date or some parameters passed to the stored procedure), then simply declare them as variables, set the values once, and reference those variables in your WHERE clause:

declare @start datetime, @end datetime
set @start = {some long, complicated expression to determine the start date}
set @end = {some long, complicated expression to determine the end date}

select * from Transactions
where TranDate >= @start and TranDate < @end

Now, you could argue that we just technically made the code longer. But, we have made our code clearer and much easier to read and debug! We can now print and/or analyze the @start and @end dates to make sure that our formulas are working, without repeatedly running the entire SELECT over and over and glancing at the results and trying to determine if they "look right".

We can also potentially use these variables to make these formulas (or others) much simpler. For example, if the end date is always one day later than the start date, we can write:

declare @start datetime, @end datetime
set @start = {some long, complicated expression to determine the start date}
set @end = dateadd(day, 1, @start)

select * from Transactions
where TranDate >= @start and TranDate < @end

That is much, much easier to read, write and debug then if we repeated the long complicated expression twice, and it makes it very clear what the relationship is between the start date and the ending date. If we need to tweak our starting date formula, we can do it in one place and again it is easier to debug since we can just print out the value of the variable. We also have the ability to name our variables intelligently to help with the readability of our code.

Tip 2: To avoid repeating non-constant expressions, use a Derived Table or a Common Table Expression (CTE)

Another common thing found in lots of SQL code is to repeat the non-constant expressions over and over. For example, if we have a DateTime column called TranDate, this expression will round that date to the first day of the month:

dateadd(m, datediff(m,0, TranDate),0)

Using that knowledge, if we need to summarize transactions by month, the result often ends up looking something like this:

select
dateadd(m, datediff(m,0, TranDate),0) as [Month], sum(value)
from
table
where
dateadd(m, datediff(m,0, TranDate),0) >= @min and
dateadd(m, datediff(m,0, TranDate),0) <= @max
group by
dateadd(m, datediff(m,0, TranDate),0)
order by
dateadd(m, datediff(m,0, TranDate),0)

Notice that the date expression is repeated five times in the SQL statement! Looks kind of silly, but I see it over and over again from both beginner and experienced programmers alike. The solution to simplifying this is to calculate the formula once and assign it a meaningful alias using a derived table:

select
[Month], sum(value)
from
(
   select
     dateadd(m, datediff(m,0, TranDate),0) as [Month], value
   from
     table
) x
where
    x.[Month] >= @min and x.[Month] <= @max
group by
    x.[Month]
order by
    x.[Month]

Now the SELECT is much easier to read and maintain, and we can access the result of the calculation in the outer SELECT as many times as we need by referencing the "month" column returned by the derived table. Also, if our "month" formula is wrong, or needs tweaking or testing, it can all be done in one place and it doesn't need to be repeated over and over.

If you are using SQL Server version 2005 or greater, you can also use a Common Table Expression (CTE) in the same way. In fact, in many ways CTEs are even more readable and easier to work with than derived tables since the code doesn't require as much nesting and is more linear. Here's the previous example as a CTE:

with MonthSummary as
(
   select
     dateadd(m, datediff(m,0, TranDate),0) as [Month], value
   from
     table
)
select
[Month], sum(value)
from
MonthSummary
where
[Month] >= @min and [Month] <= @max
group by
[Month]
order by
[Month]

Repeating expressions is common for mathematical formulas as well, such as using CASE to avoid dreaded "divide by zero" errors when the denominator is an expression:

select
c as Numerator, a+b as Denominator,
case when a+b = 0 then null else c / (a+b) end as Result
from
tbl

Notice that a+b is repeated three times in the previous SELECT statement. Again, using a derived table, we can easily change this to:

select
Numerator,
Denominator,
case when Denominator = 0 the null else Numerator / Denominator end as Result
from
(
select
c as Numerator, a+b as Denominator
from
tbl
) x

... and once again, we can now clearly label our expression and see exactly what it is and our code is much clearer and easier to maintain. This example is quite simple and the difference doesn't seem like much, but if that expression is long and complicated and used in other places as well, the improvement in the code can be tremendous.

Conclusion

It is not just readability that we are concerned with, it is also eliminating and finding bugs and optimizing code. Repeating expressions in more than one place, or embedding them deep into SQL statements where we can never really be sure they are working, can result in bugs that are tough to track down. It is also can force the optimizer to calculate these expressions over and over as well, when they should optimally only be calculated once.

So, remember: for constant expressions that don't change, just declare and set variables, and examine the variables' contents before using them anywhere to ensure that your expressions are working as you intend. For repeated expressions that are based on values in the tables you are querying, consider calculating them once within a Derived Table or a Common Table Expression. In addition, be sure to use meaningful variable names and aliases which will help to greatly improve the clarity and intentions of your code.

It all comes back to the "golden rule" of programming: If you need to cut and paste, you're probably doing it wrong!

Legacy Comments


Bob 2007-12-20	re: Simplify Your SQL with Variables and Derived Tables All of this has to do with simplifying maintenance and debugging. All too often I find myself getting working code and moving it into production because time demands that I move on to the next emergency. I DO find myself using variables a lot for all the reasons you mention. Probably a good habit I carried over from my software coding days. The derived table solution is a nice trick but i don't expect I'll ever find myself using it. It feels like it actually complicates the query, making maintenance and debugging more difficult. Lately, I've found myself making regular use of a SQL formatter before moving code to production. Frequently I'll hack together a rather unintelligible query in my haste to get it working but realize that if I ever come back to it I'll never know what I was thinking. A SQL formatter ensures that everything is lined up, spaced and capitalized before I commit. I've been using this online tool: http://www.wangz.net/cgi-bin/pp/gsqlparser/sqlpp/sqlformat.tpl Very flexible with lots of options.

Jeff Smith 2007-12-20	re: Simplify Your SQL with Variables and Derived Tables >>The derived table solution is a nice trick but i don't expect I'll ever find myself using it. It feels like it actually complicates the query, making maintenance and debugging more difficult. Really? That is pretty surprising. I cannot think of any reasonable way to assume that editing a single expression multiple times in multiple places is easier to maintain/better than having it in exactly one place clearly labeled.

Seth Petry-Johnson 2007-12-20	What about CTEs? How exactly does your derived table example differ from using a SQL Server 2005 Common Table Expression? It looks to me like they are both ways of "aliasing" a complex calculation to facilitate easier query statements... can derived tables and CTEs be used interchangeably, or are there performance characteristics that make one better than the other in certain situations?

Jeff Smith 2007-12-20	re: Simplify Your SQL with Variables and Derived Tables Hi Seth -- great point! CTEs are equally useful, and in many ways result in more readable code than using Derived Tables. When I get a chance, I will add in a note about CTEs with perhaps an example. Thank you!!

Jeff Smith 2007-12-20	re: Simplify Your SQL with Variables and Derived Tables (or Common Table Expressions) * update * The article has been updated to include CTEs.

WesleyB 2007-12-21	re: Simplify Your SQL with Variables and Derived Tables (or Common Table Expressions) Using local variables could cause serious performance hits because the optimizer has a hard time determining the selectivity when using local variables. Therefore he uses 30% of the rows as a 'guess' which may render sub-optimal query plans.

newtosql 2008-11-24	re: Simplify Your SQL with Variables and Derived Tables (or Common Table Expressions) Good!!