Why I Hate DISTINCT

You know... that "handy" keyword that eliminates duplicates from your result set.  Yeah, that DISTINCT.  I Hate it!  My team thought I was crazy (maybe I am, but this is not proof of it).  I am frequently railing against the use of DISTINCT by any of my developers.  "It's a sign of weakness!", I would say.  Or, "It just shows you don't know how to write SQL!"

Okay, maybe hate is too strong of a word.  But I certainly dislike its use within my applications.  Actually, today I slowed down long enough to put into coherent terms what my issue is with the DISTINCT keyword.  The truth is that I really dislike its over-use.  Really, DISTINCT is not evil by itself.  It's just a tool.  Like a gun.  Or money.  The tool is not evil in and of itself, but in the wrong hands, it can be used for evil.  And in the right hands, it can be used for good.

The Good

If I need a quick list of the states where we currently have business, I might issue a statement like this

SELECT DISTINCT OH.State 
FROM OrderHeader OH

Great!  It gives me the list.  No muss, no fuss.  I have no problem with the rapid, ad-hoc usage of DISTINCT like this.  But I find that 9 times out of 10, I am more likely to use a GROUP BY because I also want to know something about those states such as which are most active (i.e. how many orders are in each state) or which generate the most revenue, so it is far more likely that I will use a statement like

SELECT OH.State, COUNT(*)
FROM OrderHeader OH
GROUP BY OH.State

The Bad

Where things start to go sour in my opinion is when a developer is building substantial query, joining tables together, and all of a sudden he realizes that it looks like he is getting duplicate (or even more) rows and his immediate response...his "solution" to this "problem" is to throw on the DISTINCT keyword and POOF all his troubles go away.  I hope that to you, the reader, this sounds ridiculous, but unfortunately I have been in many a session where I hear this "solution" offered up.

Here's my concern...this is what is really behind me lashing out verbally at the poor, helpless developer...He does not know why there are duplicates, only that he wants to get rid of them in order to move on to the next development task.  But I want to know why.  Is a table ill-defined and we actually have duplicate data?  Is a JOIN ill-defined and the developer really needs more fields in the ON statement or other additional criteria in the JOIN clause?  Are there other columns that really should be included in the result set in order to distinguish between two similar-looking rows?

I find the additional fields in the ON statement a common error.  Too often developers want to just join two tables together based on one field in each table (FK to PK) but many times you need more.  Perhaps the table has a multi-part Primary Key.  Or perhaps your one column is just not distinct enough on its own (pardon the reuse of the word).  For example if you are talking geographically, you cannot just join two tables on City, or you may end up with records for Portland, Oregon joined to records from Portland, Maine.  Not a pretty sight.  In this case you need to JOIN ON both City and State.

The Summary

So here's my tip to you.  Whenever you have the urge to use the DISTINCT keyword, stop for a couple of minutes and ask yourself, "WHY do we have duplicates in the results"?  And more importantly, "HOW can I fix my query without resorting to the DISTINCT hack?"

posted @ Tuesday, November 11, 2008 11:43 PM

Print

Comments on this entry:

# re: Why I Hate DISTINCT

Left by Damian at 11/12/2008 12:37 AM
Gravatar
This can be further abstracted by saying "I hate idiots" :)

# re: Why I Hate DISTINCT

Left by Clarky at 11/12/2008 5:25 AM
Gravatar
Well, this article only points out to me that you are certainly not the department manager. You see, developers have be so ass whipped by management and sales people deciding on how the code should be written and how long it should take to write, a 'distinct' can be a great friend. A developer doesn't care that some over paid marketing head bought a junk list of data with tons of dupes. He just wants to load it and get on with his life. Any intelligent fight has probably been been beaten out of him. Developers are just trying to keep their hours under 60 a week and occasionally see sunlight. If their code represents a set of data that is incorrect or not producing results in a timely fashion, go ahead and bust their stones othewise I say. Rally the Distinct.

# re: Why I Hate DISTINCT

Left by Adam Machanic at 11/12/2008 11:47 AM
Gravatar
I hate DISTINCT, too, at least improperly used. My rule is to use it at the lowest possible level. Example:

"Which products did each customer buy?"

--Bad:
SELECT DISTINCT p.ProductName, c.CustomerName FROM Customers c, Products p, Orders o WHERE o.CustomerId = c.CustomerId AND o.ProductId = p.ProductId

--Much Better:
SELECT
p.ProductName,
c.CustomerName
FROM
(
SELECT DISTINCT
o.CustomerId,
o.ProductId
FROM Orders o
) x
JOIN Customers c ON c.CustomerId = x.CustomerId
JOIN Products p ON p.CustomerId = x.CustomerId

... although I agree, COUNT(*) might give more value here...

# re: Why I Hate DISTINCT

Left by jeff at 11/12/2008 12:28 PM
Gravatar
Nice article, Mark. I completely agree! A while back I wrote some similar to points to yours regarding DISTINCT here:

http://weblogs.sqlteam.com/jeffs/archive/2006/03/14/9289.aspx

# re: Why I Hate DISTINCT

Left by georgev at 11/13/2008 4:41 AM
Gravatar
Passed this article to a couple of colleagues.
Cheers

# re: Why I Hate DISTINCT

Left by Hey at 11/13/2008 12:14 PM
Gravatar
Hi Jeff,

Thanks for the Article. But this mistake everyone (including DBA) is doing.

Cheers
Hey

# re: Why I Hate DISTINCT

Left by Dave Edwards at 11/14/2008 11:22 AM
Gravatar

# re: Why I Hate DISTINCT

Left by AjarnMark at 11/14/2008 10:18 PM
Gravatar
Hmmm... looks like I hit a nerve here. Thanks for the great comments, folks! A few responses:

Damian: Good to see you didn't drop completely off the planet. ;-)

Adam: Good example!

Jeff: GREAT rules to live by!

Clarky: Yes, I am the manager; and I'm certainly glad I don't work in the environment you describe.

# re: Why I Hate DISTINCT

Left by Dave Holbon at 11/16/2008 5:37 PM
Gravatar
After many years of using the Access version of SQL, which is very forgiving, I moved over to SQL CE about three years ago developing Mobile apps.

The DISTINCT keyword here (which also pulls out nulls) I have never used since, along with a host of other things.

The application table design, indexes and normalisation of the database itself requires an entirely different approach as speed of execution in limited memory resources are paramount.

Properly designed databases with unique keys in place should never produce duplicate records that can’t be overcome by subsequent WHERE statements. If they do then the table design is incorrect or the query design is wrong.

# re: Why I Hate DISTINCT

Left by Peter at 11/16/2008 5:40 PM
Gravatar
If you use DISTINCT it's tell me that you don't really understand why there are duplicates and you don't really care, you don't care about the integrity of the database, all you care about it getting your unique list of things. This applies for the cases where distinct is misused, i.e. most of the places.

# re: Why I Hate DISTINCT

Left by Lee Crain at 11/16/2008 5:41 PM
Gravatar
Having been a developer and having turned away from the dark side of the force, (I am now a DBA), I can say with authority that most developers consider databases a nuisance, a distraction from the real business of coding, and they can't be bothered with questions about why there are duplicates when there should be none. So, they do what most coders do: they ignore the problem, do a quick and dirty workaround, and move on.

I found, during my 23 year career as a software developer, that this arrogant and irresponsible mindset permeated most of the software development community, especially among developers under 30 years of age.

If my disgust for these people sounds like it knows no bounds, you have perceived me correctly.

LC

# re: Why I Hate DISTINCT

Left by Robert at 11/17/2008 1:54 AM
Gravatar
Developers are dark side?
Well, it's true that most developers are coders, not programmers. A real programmer always tries to create a masterpiece of code, meaning code that is easily reusable, extensible and maintainable (as much as possible regarding time and other constraints), has excellent problem solving skills and attention to details. A real programmer will perform well in strictly programming as well as any related tasks like database design (tables, integrity, views, triggers, stored procs,... not just client sql).

You can't expect a novice to perform well from the beginning, but most lack potential, either talent, real interest or both, they just see good job opportunity. But are they dark side because of this? No, I'd say force insensitive crowd.

You have to be from special breed to become a real programmer, as well as jedi. You need talent, be force sensitive, but that's just prerequisite, you need a lot of effort to fulfill your destiny. Dark side is a real programmer who forgets ethics and moral principles and dives into hate, rage and greed. Dark side is not stronger, it's just quicker, easier, more seductive. :)

# re: Why I Hate DISTINCT

Left by Paul DB at 11/17/2008 5:59 AM
Gravatar
Well said, Robert!

# re: Why I Hate sterotypes

Left by Corporate Survivor at 11/17/2008 9:19 AM
Gravatar
Nice job of venting and trashing people who try to get the job done so they can move on to handle the "big ticket items" required by managment. When you have a deadline and 6 months of work in your in box, why spend 2 hours writing the perfect query for a complicated report that only runs once a month? Sometimes its a matter of expediency and time to move on.

You've also defined the "dark side" without considering that programmers are doing what management wants. Handle the big ticket items that provide business value so marketing and user deparments can have thier important software. When does management put time in the project for DBA considered "perfect queries". Perhaps this is why some DBA's get trashed every now and then, if programmers spend thier time writing perfect queries when can we work on business logic, GUI's and new features?

Before you trash the guy in the next cubicle try to show some ledership and understanding before you bash his methods of corporate survival.

# re: Why I Hate DISTINCT

Left by Ivan Budiono at 11/17/2008 11:02 PM
Gravatar
Well, improper use of "distinct" not only hides the real problem (duplicate entries in the tables, lack of condition in the on clause) as discussed above but also degrades query performance. Because of the lack of condition, the query engine needs to examine more data than necessary and mostly result in inefficient table scan instead of index seek. This will bring the IO Cost (logical reads) of the query goes up significantly. Then later on you will realize that the query times out!

# re: Why I Hate DISTINCT

Left by SQLDBA at 11/18/2008 9:01 AM
Gravatar
Although the DISTINCT keyword does have its place in the SQL language, it has been my experience that those developers that use it frequently also spend a significant amount of time revisiting their code (SQL and application code) since they don't understand the additional benefits of using correct JOINs and the GROUP BY clause. Not a slam, but merely an observation.
Comments have been closed on this topic.
«April»
SunMonTueWedThuFriSat
303112345
6789101112
13141516171819
20212223242526
27282930123
45678910