There's a fascinating technique that I see many beginners use when writing SQL statements, and I call it the "Nested WHERE-IN" anti-pattern. It is, unfortunately, a common SQL technique used to avoid JOINS at all costs. For example, using Northwind, suppose we want to find all customers that have ordered ProductID 1. The Nested WHERE-IN technique results in code like this:
select *
from Customers
where CustomerID in
(select customerID
from Orders
where OrderID in
(select orderID
from [Order Details]
where ProductID = 1
)
)
That's just a simple example, but of course it gets nested more deeply (or branches out) as more conditions are added.
I think beginners gravitate towards this approach because:
- No "tricky" joins are required
- It is written more or less the way you would describe what you want: "return customers that have an Order that has ProductID 1 in it."
- No "strange" duplication of results occur, which can happen with joins if you don't group things properly
In short, I think this technique surfaces because you can just start writing, without any planning or thinking, and the code sort of naturally comes out as you go. And, it does technically return the right results most of the time, so there doesn't seem to be a reason not to do things this way.
So, you might ask, how would you suggest that it should be written? As I often say, break it down into small, separate parts and put them all together in the end. Our first task is to write a SELECT that returns a distinct list of CustomerID's that have ordered productID 1:
select o.customerID
from orders o
inner join [order details] od on od.orderID = o.orderID
where od.productID = 1
group by o.customerID
With that, we know we can simply inner join from the customers table to this and our Customer table will be filtered with the exact results we need:
select c.*
from customers c
inner join
(
select o.customerID
from orders o
inner join [order details] od on od.orderID = o.orderID
where od.productID = 1
group by o.customerID
) x on c.CustomerID = x.CustomerID
Now, if we look at the execution plans of the two techniques, we will actually see that (in SQL 2000, anyway) they are exactly the same. So, what's the problem? Why shouldn't you use the "Nested WHERE-IN" anti-pattern?
Well, the main issue is maintainability, and how it affects not only the complexity of the SELECT, but performance as well.
Let's suppose we also need to return the total number of times that ProductID 1 has been ordered for each customer. The Nested WHERE-IN SELECT has no way to do this; our only option is to add a correlated subquery or an additional JOIN. Since we are avoiding JOINS in general with this approach, let's use a corrected sub-query:
select *, (select count(*)
from orders
where orderID in
(select orderID
from orders
where orders.customerID=customers.customerID)) as num
from Customers
where CustomerID in
(select customerID
from Orders
where OrderID in
(select orderID
from [Order Details]
where ProductID = 1
)
)
Notice the redundancy; we are basically expressing the same "join" to the orders tables in two different places. Adding more columns from the nested WHERE-IN clause will require more sub-queries or additional joins.
When using a standard JOIN, we simply add a count(*) expression to our derived table and return it:
select c.*, x.num
from customers c
inner join
(
select o.customerID, count(*) as num
from orders o
inner join [order details] od on od.orderID = o.orderID
where od.productID = 1
group by customerID
) x on c.CustomerID = x.CustomerID
Take a look at the execution plans now. Not only is the JOIN code shorter, it is twice as fast. In general, that's the way a SQL statement should be built -- joining SELECT statements together and return the columns or calculations that you need, not by deeply nesting WHERE-IN clauses or correlated sub-queries over and over.
The other problem with this technique is that it leads you to believe that every table in your database must have a single column primary key. This leads to bad database designs where tables that relate entities don't have proper composite primary keys, but just meaningless identities. This is because you cannot use WHERE IN to return entities that have more than one column define them.
Finally, you might also have issues with NULL values and performance problems in general as you nest these IN clauses one inside another, especially if you use the even more dreaded "Nested WHERE-NOT IN" anti-pattern. (A little more on that
here.)
In short, try to avoid using Nested WHERE-IN clauses. Learn to use proper JOINS and consider thinking carefully and planning your SQL statements. Use derived tables to logically group things into pieces. I promise that if you take the time to learn the proper joining techniques, you will not regret it.
(
Update: more on all this
here.)
see also: