Jeff Smith Blog

Random Thoughts & Cartesian Products with Microsoft SQL Server

"Nested WHERE-IN" Anti-Pattern Follow-up; More on Derived Tables (sub-queries)

A quick follow up to the "Nested WHERE-IN" anti-pattern post from yesterday …  If you didn't get a chance, be sure to read the comments from that post as well, there are some great points in there so far. 

Going back to Northwind and our example of returning all customers that have ordered productID 1, let's assume that we didn't check the schema carefully and we did not realize that the CustomerID column is not in the Order Details table.  So, we might try to write our SELECT like this:

select *
from customers
where customerID in
  (select customerID from [order details] where productID = 1)

Now, try running that ... it works!  Well, sort of ... it runs, returns results, generates no errors -- and it produces the wrong results!  All customers are returned, not just those that ordered productID 1.  This is an issue with correlated sub-queries in general, in that if a column is not prefixed with a table name, and it doesn't exist in the table(s) in the sub-query, SQL will assume it is a reference to a column in the parent query.  In other words, the above SQL is checking to see if customerID is equal to itself, so it returns all of the results.  This is completely logical and makes perfect sense from a processing perspective (I would not call it a bug), but it can lead to confusion when programmers are not aware of why and how this can happen.

If we try this with a standard JOIN:

select customers.*
from customers
inner join
  (select distinct customerID
  from [order details]
  where productID = 1
  ) x on x.customerID = customers.customerID

we get:

Server: Msg 207, Level 16, State 3, Line 1
Invalid column name 'customerID'.

Which is what we expect -- an error message telling us that we are trying to use a column that does not exist in the table.

Perhaps this is a minor issue, but it does happen all the time, and it can be maddening to troubleshoot .. especially for beginners just learning the language.  As mentioned, this is a general issue with correlated sub-queries in general, and part of the reason why I prefer to use derived tables.

When using derived tables, each sub-SELECT is fully self-contained and can contain no references to anything outside of it (except for parameters or variables); you can cut and paste that SELECT into a separate Query Analyzer window, run it, tweak it, test it, and so on, and when you are done, you can paste it right back in to your original SELECT confident that it works and does its job. To me, this is a huge benefit and a really great way to break down complicated SQL statements into small, more manageable parts.  (more on that here.)  I feel that learning to do this is crucial for a good SQL programmer, and using WHERE-IN and correlated subqueries as a first instinct instead doesn't cultivate a clean "break the problem down into smaller parts" mindset, but instead leads to a thought process of "keep adding things to the SELECT until it works." 

In general, just remember that while WHERE-IN clauses do have their place, learning to think in terms of JOINS and derived tables is a really important skill to acquire when working with SQL.  Sometimes, beginners find the "Nested WHERE-IN" technique to be most natural, but if instead you focus on learning JOINS and relations and derived tables, you will be a much better SQL programmer in the long run.

Legacy Comments


Sam
2007-07-13
re: "Nested WHERE-IN" Anti-Pattern Follow-up
why not the folowing ?

select distinct c.*
from customers c
inner join orders o on o.CustomerID = c.CustomerID
inner join [order details] od on od.orderID = o.orderID
where od.productID = 1


Just so you dont do a "select distinct *" ? what if i just wanted .CustomerID is this ok?

select distinct c.CustomerID
from customers c
inner join orders o on o.CustomerID = c.CustomerID
inner join [order details] od on od.orderID = o.orderID
where od.productID = 1

Are subquerys Better?

Jeff
2007-07-13
re: "Nested WHERE-IN" Anti-Pattern Follow-up
Sam --

You do not want DISTINCT on all columns in the customer table; you just want to return 1 row per customer. Thus, the distinct (or GROUP BY) should be done in the derived. Examine both the execution plan and the performance and you will see that using a derived table is more efficient.

As for your second question: if you just want to return customerID, there is no need to include the Customer table -- that is unnecessary and adds overhead to your SELECT. If you just want to return a distinct list of CustomerID's, you just write it the same way I did in my derived query:

select distinct o.CustomerID
from orders o
inner join [order details] od on od.orderID = o.orderID
where od.productID = 1

That's kind of the entire point of this post and the last; you know you want to find distinct CustomerID's for those that ordered productID 1, so you write the SQL for that. Then, you know you want to return the customer columns such as Name and so on, so you join those results to the Customer table. And there's your results.

Peter
2007-07-20
re: "Nested WHERE-IN" Anti-Pattern Follow-up
WHERE-IN is a disaster, but I prefer WHERE-EXISTS to an inner joined derived table any day:

select * from customers a
where exists (
select *
from [order details] b
where a.customerID = b.customerID
and b.productID = 1
)

it clearly states your purpose, you do not need to control for expanding joins, and the inverse, NOT EXISTS, is more compact than a combination left join/is null.

*** NOTE ON DANGERS OF WHERE-NOT-IN ***

on MSSQL (2K and 2005), a SELECT clause with WHERE-NOT-IN will return no rows if the column you are selecting in subselect has any null values. try this:

create table #tmp1 (col1 int)
create table #tmp2 (col1 int)
insert #tmp1 values (1)
insert #tmp1 values (2)
insert #tmp1 values (3)

insert #tmp2 values (1)
insert #tmp2 values (2)
insert #tmp2 values (null)

select * from #tmp1 where col1 not in (select col1 from #tmp2)

You'd expect to see one row returned, col1 = 3, but the result set is empty.

preferred form:

select * from #tmp1 a
where not exists (
select * from #tmp2 b
where a.col1 = b.col1)

this returns the one row, as expected.