A quick follow up to the
"Nested WHERE-IN" anti-pattern post from yesterday ... If you didn't get a chance, be sure to read the comments from that post as well, there are some great points in there so far.
Going back to Northwind and our example of returning all customers that have ordered productID 1, let's assume that we didn't check the schema carefully and we did not realize that the CustomerID column is not in the Order Details table. So, we might try to write our SELECT like this:
select *
from customers
where customerID in
(select customerID from [order details] where productID = 1)
Now, try running that ... it works! Well, sort of ... it runs, returns results, generates no errors -- and it produces the
wrong results! All customers are returned, not just those that ordered productID 1. This is an issue with correlated sub-queries in general, in that if a column is not prefixed with a table name, and it doesn't exist in the table(s) in the sub-query, SQL will assume it is a reference to a column in the parent query. In other words, the above SQL is checking to see if customerID is equal to itself, so it returns all of the results. This is completely logical and makes perfect sense from a processing perspective (I would not call it a bug), but it can lead to confusion when programmers are not aware of why and how this can happen.
If we try this with a standard JOIN:
select customers.*
from customers
inner join
(select distinct customerID
from [order details]
where productID = 1
) x on x.customerID = customers.customerID
we get:
Server: Msg 207, Level 16, State 3, Line 1
Invalid column name 'customerID'.
Which is what we expect -- an error message telling us that we are trying to use a column that does not exist in the table.
Perhaps this is a minor issue, but it does happen all the time, and it can be maddening to troubleshoot .. especially for beginners just learning the language. As mentioned, this is a general issue with correlated sub-queries in general, and part of the reason why I prefer to use derived tables.
When using derived tables, each sub-SELECT is fully self-contained and can contain no references to anything outside of it (except for parameters or variables); you can cut and paste that SELECT into a separate Query Analyzer window, run it, tweak it, test it, and so on, and when you are done, you can paste it right back in to your original SELECT confident that it works and does its job. To me, this is a huge benefit and a really great way to break down complicated SQL statements into small, more manageable parts. (more on that
here.) I feel that learning to do this is crucial for a good SQL programmer, and using WHERE-IN and correlated subqueries as a first instinct instead doesn't cultivate a clean "break the problem down into smaller parts" mindset, but instead leads to a thought process of "keep adding things to the SELECT until it works."
In general, just remember that while WHERE-IN clauses do have their place, learning to think in terms of JOINS and derived tables is a really important skill to acquire when working with SQL. Sometimes, beginners find the "Nested WHERE-IN" technique to be most natural, but if instead you focus on learning JOINS and relations and derived tables, you will be a much better SQL programmer in the long run.