Jeff Smith Blog

Random Thoughts & Cartesian Products with Microsoft SQL Server

The "Nested WHERE-IN" SQL Anti-Pattern

There's a fascinating technique that I see many beginners use when writing SQL statements, and I call it the "Nested WHERE-IN" anti-pattern.  It is, unfortunately, a common SQL technique used to avoid JOINS at all costs.  For example, using Northwind, suppose we want to find all customers that have ordered ProductID 1.  The Nested WHERE-IN technique results in code like this:

select *
from Customers
where CustomerID in
    (select customerID
    from Orders
    where OrderID in
        (select orderID
        from [Order Details]   
        where ProductID = 1
        )
    )

That's just a simple example, but of course it gets nested more deeply (or branches out) as more conditions are added.

I think beginners gravitate towards this approach because:

  1. No "tricky" joins are required
  2. It is written more or less the way you would describe what you want:  "return customers  that have an Order that has ProductID 1 in it."
  3. No "strange" duplication of results occur, which can happen with joins if you don't group things properly
In short, I think this technique surfaces because you can just start writing, without any planning or thinking, and the code sort of naturally comes out as you go.  And, it does technically return the right results most of the time, so there doesn't seem to be a reason not to do things this way.

So, you might ask, how would you suggest that it should be written?  As I often say, break it down into small, separate parts and put them all together in the end.  Our first task is to write a SELECT that returns a distinct list of CustomerID's that have ordered productID 1:

select o.customerID
from orders o
inner join [order details] od on od.orderID = o.orderID
where od.productID = 1
group by o.customerID

With that, we know we can simply inner join from the customers table to this and our Customer table will be filtered with the exact results we need:

select c.*
from customers c
inner join
(
    select o.customerID
    from orders o
    inner join [order details] od on od.orderID = o.orderID
    where od.productID = 1
    group by o.customerID
) x on c.CustomerID = x.CustomerID

Now, if we look at the execution plans of the two techniques, we will actually see that (in SQL 2000, anyway) they are exactly the same.   So, what's the problem?  Why shouldn't you use the "Nested WHERE-IN" anti-pattern?

Well, the main issue is maintainability, and how it affects not only the complexity of the SELECT, but performance as well.

Let's suppose we also need to return the total number of times that ProductID 1 has been ordered for each customer.  The Nested WHERE-IN SELECT has no way to do this; our only option is to add a correlated subquery or an additional JOIN. Since we are avoiding JOINS in general with this approach, let's use a corrected sub-query:

select *, (select count(*)
    from orders
    where orderID in
        (select orderID    
        from orders
        where orders.customerID=customers.customerID)) as num
from Customers
where CustomerID in
    (select customerID
    from Orders
    where OrderID in
        (select orderID
        from [Order Details]   
        where ProductID = 1
        )
    )

Notice the redundancy; we are basically expressing the same "join" to the orders tables in two different places. Adding more columns from the nested WHERE-IN clause will require more sub-queries or additional joins. 

When using a standard JOIN, we simply add a count(*) expression to our derived table and return it:

select c.*, x.num
from customers c
inner join
(
    select o.customerID, count(*) as num
    from orders o
    inner join [order details] od on od.orderID = o.orderID
    where od.productID = 1
    group by customerID
) x on c.CustomerID = x.CustomerID

Take a look at the execution plans now.  Not only is the JOIN code shorter, it is twice as fast.   In general, that's the way a SQL statement should be built -- joining SELECT statements together and return the columns or calculations that you need, not by deeply nesting WHERE-IN clauses or correlated sub-queries over and over.

The other problem with this technique is that it leads you to believe that every table in your database must have a single column primary key.   This leads to bad database designs where tables that relate entities don't have proper composite primary keys, but just meaningless identities.  This is because you cannot use WHERE IN to return entities that have more than one column define them. 

Finally, you might also have issues with NULL values and performance problems in general as you nest these IN clauses one inside another, especially if you use the even more dreaded "Nested WHERE-NOT IN" anti-pattern.  (A little more on that here.)

In short, try to avoid using Nested WHERE-IN clauses.  Learn to use proper JOINS and consider thinking carefully and planning your SQL statements.  Use derived tables to logically group things into pieces.  I promise that if you take the time to learn the proper joining techniques, you will not regret it.

(Update: more on all this here.)

see also:

Legacy Comments


Adam Machanic
2007-07-12
re: The "Nested WHERE-IN" SQL Anti-Pattern
I don't agree that this is an anti-pattern. Depending on indexes, you can get a different query plan with IN vs. an inner join (especially true in SQL Server 2000, not so much in 2005), and using nested WHERE IN might result in better performance. Best to test both. Plus, I think it's important that a query is written (whenever possible) to declaratively express exactly what you want. JOINs don't do that in the case of "find all customers that have ordered product 1". The nested WHERE IN does state exactly that -- show me all of those customers, and exactly those customers, with orders that included product 1. IMO, your extensibility examples seem contrived; I see no point in introducing a derived table "just in case" I might need to add another column later. If I need that, I can simply re-write the query...

Jeff
2007-07-12
re: The "Nested WHERE-IN" SQL Anti-Pattern
Adam -- I see your point, but I have yet to see any query where a nested IN is more efficient than an a JOIN, but I have definitely seen the opposite many, many times. Can you provide a specific example where a nested in performs better? I'm definitely curious, it would be very interesting to see something that performs better with all the INs, and I'm willing to revise my statements if we can find that scenario.

In terms of extensibility, if technique A always performs equally or better than technique B, they have approximately the same complexity, and technique A also has the benefit of better extensibility, I will personally stick with technique A. As you said, you can always completely re-write your SQL to add more info from the related tables using the nested WHERE-IN technique, but why not just write it that way in the first place?

thanks for your input, you make good points and I definitely don't entirely disagree.

Adam Machanic
2007-07-12
re: The "Nested WHERE-IN" SQL Anti-Pattern
Jeff,

I will see if I can find an example in SQL Server 2005. I distinctly recall fixing some issues in 2000 using IN instead of a join on a certain project. The join was producing a hash match, whereas the IN produced a nested loop, which was much more appropriate to the situation. Things seem more consistent in 2005. I'm also aware of at least one scenario in 2005 where IN is detremental to performance compared to a JOIN :) ... But I still use it, for the reasons I mentioned (as long as I'm not hitting the perf issues, of course!)


Jeff
2007-07-12
re: The "Nested WHERE-IN" SQL Anti-Pattern
Thanks, Adam -- any examples would be great! I rememebr reading quite a bit about IN versus EXISTS and JOINS in other blog posts recently, but I can't seem to find them ... maybe someone else can help out with some links?? :)

I definitely agree that for a specific SQL statement, if performance is not an issue, you should go with the shortest and most intuitive one, and in some cases WHERE-IN may indeed be the way to go. I've just seen so many beginners use it as a crutch to avoid learning or using JOINS and making big, complicated messes instead of writing simple SQL statements ....

Bart Czernicki
2007-07-12
re: The "Nested WHERE-IN" SQL Anti-Pattern
I agree with Adam. Also your performance "gains" I attribute more to the in-memory derived tables than joins.

Furthermore, remember in SQL 2008 Katmai you will be able to pass in table variables into sprocs. I don't think the "where-in" design is going to be going away just yet.

Just like anything with SQL, if you can write it and it works...u are pretty much 90% of the way clear. Unless performance is of the upmost importance. For example, I WASTED a ton of time before tuning stuff for temp tables vs. table variables and SQL 2005 came out and in some cases it works BACKWARDS. So, even having your queries designed/patterned/tuned one-way you might have to redo it next release I have learned.

Jeff
2007-07-12
re: The "Nested WHERE-IN" SQL Anti-Pattern
Thanks, Bart, I appreciate your comments, but can you explain what you mean by this:

>>Also your performance "gains" I attribute more to the in-memory derived tables than joins.

I am not sure what you are saying. Yes, the performance gains are due to the derived tables -- that's my entire point. Joining to derived tables instead of nested Where-In clauses.

Also, do you mind expanding on this:

>>Furthermore, remember in SQL 2008 Katmai you will be able to pass in table variables into sprocs. I don't think the "where-in" design is going to be going away just yet.

I am not sure what that has to do with this at all. You can join to a table variable or nest it in a WHERE clause just like anything else -- it really has absolutely no bearing at all on anything I've written.


gbn
2007-07-13
re: The "Nested WHERE-IN" SQL Anti-Pattern
What about composite keys: how do we IN on 2 columns?

The nested IN *may* work better for smaller tables (eg 100 rows)
But I'd like to see this scale up. What about IN (250,000 IDs)?

Adam Machanic
2007-07-13
re: The "Nested WHERE-IN" SQL Anti-Pattern
"IN" on 2+ columns == "EXISTS":

SELECT *
FROM Tbl1
WHERE EXISTS (SELECT * FROM Tbl2 WHERE Tbl2.Col1 = Tbl1.Col1 AND Tbl2.Col2 = Tbl1.Col2)

IN, EXISTS, and an inner join will generally be optimized the same way by the QP and produce the same exact plans. I have seen some counter-examples, as mentioned earlier in the comments... Best is to test all three when in doubt :)


gbn
2007-07-13
re: The "Nested WHERE-IN" SQL Anti-Pattern
EXISTS <> IN.

In your example, what if tbl2 is a child or tbl1?
Say 10 rows average in tbl2 for each row in tbl1.

EXISTS: do we have a row or not? = short circuit
IN: IN on (id, id, id, id, id..): does this short circuit? Does it de-duplicate?

Test:
90k rows Tbl1, 230k rows Tbl2

EXISTS 33% of the batch, IN 67% (SELECT *)
EXISTS 45%, In 55% with (SELECT NamedCOlumn in covering index of kex col)


ammoQ
2007-07-14
re: The "Nested WHERE-IN" SQL Anti-Pattern
The where-in antipattern is (in my experience) typical for ad-hoc sql that somehow makes it into the program.
Starting with the innermost SELECT, it's less work to add the outer layers than to turn it into a join.

Adam Machanic
2007-07-15
re: The "Nested WHERE-IN" SQL Anti-Pattern
gbn: No clue what you're talking about. Can you show us exactly what you're testing? DDL to create the tables/indexes, insert statements to populate them, and the exact SQL you're using.

Arnold Fribble
2007-07-16
re: The "Nested WHERE-IN" SQL Anti-Pattern
Don't know about faster, but I can find you an example of NOT IN (or NOT EXISTS) being 40 times slower in SQL Server 2005 than in SQL Server 2000:
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=83651

Renato Sabo
2007-07-16
re: The "Nested WHERE-IN" SQL Anti-Pattern
Nice post, but I do have to speak against something you said.

"The other problem with this technique is that it leads you to believe that every table in your database must have a single column primary key. This leads to bad database designs where tables that relate entities don't have proper composite primary keys, but just meaningless identities."

I do believe you are familiar with the concept of surrogate keys in database. I do generally use surrogate keys, with some very loose exceptions. I guess within this matter, there will be plenty of people saying that meaning pKs are betters and all... but I still found, for my apps that the surrogate model works best.

Jeff
2007-07-16
re: The "Nested WHERE-IN" SQL Anti-Pattern
Renato -- thanks for the comment. If I have a table of Customers, PK of CustomerID, and a table of Consultants, pk of ConstultantID, and I want to create a table that relates many customers to many consultants, what should the PK of the CustomerConsultants table be? An identity, or a composite primary key?

If you say identity, then you need an additional constraint on CustomerID/ConsultantID, and your PK is completely extra and meaningless.

If you say a composite PK of CustomerID/ConsultantID, then you are correct and you have a nice, solid, simple database design with good integrity.

People can argue all they want about whether entity tables should have a surrogate or natural PK, that's fine and I understand the pros and cons of each; but when a table relates two or more entities, it's pK should the composite PK's of those entities -- pretty much no exceptions there. To *not* use an intelligent, meaningful PK that constrains and defines the data in that case would be a bad database design.

John Fx
2007-07-20
re: The "Nested WHERE-IN" SQL Anti-Pattern
I think this article and many of the responses illustrates beautifully how the language of SQL hasn't lived up to its full potential. We chide developers repeatedly for treating SQL as a procedural language instead of a declarative language, then hypocritically nitpick the code in procedural terms such as whether joins are nested, or specified in the WHERE or FROM clause.

I hate the dictionary quoters, but from the all mighty wikki god the following definition pretty much sums up my opinion on this:
"A declarative programming language is a high-level language that describes a problem rather than defining a solution... Data-oriented programming languages and query languages are declarative in that queries are not given in terms of how to find data but instead give criteria for the desired data (SQL is an example)."

In a nutshell, the SQL optimizer is supposed to handle the procedural elements of whether you grab data from a second table using an INNER JOIN, an IN statement, the WHERE clause, etc. As for the whole arguments about index scans, correlated subqueries, other SQL innards gobeldy-gook, where is my d*mn abstraction layer! I should be able to write the most attrocious CURSOR heavy, TEMP table using, 20 level deep nested subquery and let the SQL optimizer in all its infinite wisdom solve the problem for the best performance.

You may burn me at the stake now....


Keith
2007-11-30
re: The "Nested WHERE-IN" SQL Anti-Pattern
i am using sql server 2005 and

SELECT *

FROM

table t1

WHERE

(SELECT
t2.id
FROM
table2 t2
WHERE
t2.id LIKE 'some string'
) = t1.id

is much faster than

SELECT *

FROM

table t1
INNER JOIN table t1 ON t1.id = t2.id

WHERE
t2.id LIKE 'some string'


Jeff Smith
2007-11-30
re: The "Nested WHERE-IN" SQL Anti-Pattern
Keith -- Can you post a short, simple, actual example (with DDL, DML, and the actual SELECT) to demonstrates this?

nickdu
2008-03-26
re: The "Nested WHERE-IN" SQL Anti-Pattern
First of all, I'm not a DBA. I'm a developer and know enough to be able to get by on the database side. My initial thought was that nested selects looked pretty bad and my guess is that they would not perform as well as a join. I have not done any testing regarding this, and don't really plan to. After reading John Fx's post a thought occurred to me. Why is any of this necessary? If I have created the relationships then SQL should be able to determine how to grab the data. I should be able to write the query something like this:

select * from Customers where [Order Details].ProductID=1

If the table from the where clause does not match the table from the select then SQL should use the relationships to determine the join. If there is no relationship or the path between the two tables is ambiguous then an error should be returned.

Thato
2008-08-11
re: The "Nested WHERE-IN" SQL Anti-Pattern
I need to know how to DISTINCTIVELY return more than one column in the same table.