You know... that "handy" keyword that eliminates duplicates from your result set. Yeah, that DISTINCT. I Hate it! My team thought I was crazy (maybe I am, but this is not proof of it). I am frequently railing against the use of DISTINCT by any of my developers. "It's a sign of weakness!", I would say. Or, "It just shows you don't know how to write SQL!"
Okay, maybe hate is too strong of a word. But I certainly dislike its use within my applications. Actually, today I slowed down long enough to put into coherent terms what my issue is with the DISTINCT keyword. The truth is that I really dislike its over-use. Really, DISTINCT is not evil by itself. It's just a tool. Like a gun. Or money. The tool is not evil in and of itself, but in the wrong hands, it can be used for evil. And in the right hands, it can be used for good.
If I need a quick list of the states where we currently have business, I might issue a statement like this
SELECT DISTINCT OH.State
FROM OrderHeader OH
Great! It gives me the list. No muss, no fuss. I have no problem with the rapid, ad-hoc usage of DISTINCT like this. But I find that 9 times out of 10, I am more likely to use a GROUP BY because I also want to know something about those states such as which are most active (i.e. how many orders are in each state) or which generate the most revenue, so it is far more likely that I will use a statement like
SELECT OH.State, COUNT(*)
FROM OrderHeader OH
GROUP BY OH.State
Where things start to go sour in my opinion is when a developer is building substantial query, joining tables together, and all of a sudden he realizes that it looks like he is getting duplicate (or even more) rows and his immediate response...his "solution" to this "problem" is to throw on the DISTINCT keyword and POOF all his troubles go away. I hope that to you, the reader, this sounds ridiculous, but unfortunately I have been in many a session where I hear this "solution" offered up.
Here's my concern...this is what is really behind me lashing out verbally at the poor, helpless developer...He does not know why there are duplicates, only that he wants to get rid of them in order to move on to the next development task. But I want to know why. Is a table ill-defined and we actually have duplicate data? Is a JOIN ill-defined and the developer really needs more fields in the ON statement or other additional criteria in the JOIN clause? Are there other columns that really should be included in the result set in order to distinguish between two similar-looking rows?
I find the additional fields in the ON statement a common error. Too often developers want to just join two tables together based on one field in each table (FK to PK) but many times you need more. Perhaps the table has a multi-part Primary Key. Or perhaps your one column is just not distinct enough on its own (pardon the reuse of the word). For example if you are talking geographically, you cannot just join two tables on City, or you may end up with records for Portland, Oregon joined to records from Portland, Maine. Not a pretty sight. In this case you need to JOIN ON both City and State.
So here's my tip to you. Whenever you have the urge to use the DISTINCT keyword, stop for a couple of minutes and ask yourself, "WHY do we have duplicates in the results"? And more importantly, "HOW can I fix my query without resorting to the DISTINCT hack?"
posted @ Tuesday, November 11, 2008 11:43 PM