Ajarn Mark Caldwell Blog

Bringing Business Sense to the IT World…

The Tipping Point - DISTINCT Causes Timeouts

Everything worked fine for the last six years…and then today we hit the tipping point.  I lost many hours today due to the misuse of DISTINCT which was the root cause of timeout errors in our web application.

In one of my most frequently read posts, Why I Hate Distinct, I talk about how this SQL keyword is often abused by SQL developers who are just trying to get their job done, but don’t know enough about either the SQL language or the data architecture to really do it right.  Now, I realize that in the development world (of which I am a part, not just a spectator) that using a phrase like “do it right” can be fighting words.  In software development, there are often many different ways to accomplish the same goal and what may be right in one situation or environment may not be right in another.  But there are also certain principles which should not be undermined or overridden without some serious analysis.  One set of those principles in the relational database world are the rules of Normalization, “the key, the whole key, and nothing but the key, so help me Codd.”.  These principles are important for preventing duplicates in your data.  What does the DISTINCT keyword do?  It removes duplicates from your query result set.  This question that should now be coming to mind is, “If I have a normalized (no duplicates) database, why would I end up with duplicates in my result set?”.  An excellent question!  The simple answer is that you shouldn’t, and if you do, I think you need to analyze your query, not just slap the DISTINCT Band-Aid on it.

But as I mentioned in my opening paragraph, today this became an issue about much more than undermining principles or my sense of right and wrong.  Today it became an issue of query performance, and I spent several hours trying to figure out what went wrong, including working late on one of the most beautiful Friday nights we have had in a long, long time, and missing dinner with my family or getting to see my kids before they went to bed.  For me, this was a large price to pay, all brought about by one developer’s carelessness or ignorance.  (Hey, what can I say?  Once I got married and had a family, I suddenly didn’t think staying at work all night working on a puzzle was all that cool any more.)  This performance effect of the misuse of the DISTINCT keyword was actually mentioned in a comment on that original post by Ivan Budiono, but I had not seen it in action until now.

In my story, it was an odd, but fortunate, coincidence that this problem showed up when it did.  This internal web application has been running for more than six years with very little change to this particular web page, which normally only takes about 4 or 5 seconds to gather its data and render, so it was very surprising this morning when it started throwing SQL Timeout errors; all the more so, knowing that we had nightly jobs to address index fragmentation and to update statistics.  The business users first jumped to the conclusion that something went wrong with the newest version release which we deployed last night, but I knew that there were no changes in there related to this particular web page nor the stored procedure that was timing out.  We had also purged a little bit of bad data from one of the related tables, and so some users thought the problem was caused by that, but again I was skeptical of their guesses.  I find it a good practice, when you are playing the role of troubleshooter, to be skeptical of guesses and people’s memory, and simply focus on the things that you can prove to be fact.

Fortunately, since I had just done the software deployment the night before, I had before and after backups of the database.  So, I restored the “before” image under a different name (we are fortunate to have sufficient hardware that I could put this on the same instance as the live system to eliminate that variable in my analysis and not significantly impact the live system).  Sure enough, executing the stored procedure in the “before” ran without trouble, returning results in a few seconds, but executing it in the current live database it ran nearly a minute.  So then I turned on Actual Execution Plans on both and they were remarkably different.  The before image was full of index seeks and put the tables together one way, but the after plan had a couple of time-consuming Clustered Index Scans, and interestingly, arranged the pieces in a very different way.

Now the research starts to get fun.  Since we have the very handy before and after copies of the database, let’s see if there is something different between them that might be causing the issue like an index got dropped.  So, I fire up Red Gate’s SQL Compare tool.  Everything looks good.  There are differences, but only the few changes I was expecting from the software update we released, and nothing that stood out as an issue.  Specifically, there was no difference in the tables involved, index definitions for those tables, or the stored procedure.  Hmmm…what else could it be?  Well, I knew that the business was doing some bulk importing of data into tables related in the queries, but they have done that before without a problem.  Nonetheless, I’m running out of ideas, so let’s see just how much has changed in the data.  So, I open up Red Gate’s SQL Data Compare tool and take a look.

Well, there were about 2900 rows added to a table that has 40,000 rows (about 7% growth).  That doesn’t sound too bad.  And another table involved had 4400 rows added to its nearly 100,000 rows, again a small percentage of change.  Seems hard to imagine that this little amount of data could make such a big difference, but it is one of the few things that I can verify are different between the databases and related to the query.  If only there was a way to replay that data change to see if it caused the problem…Wait!  There is!  SQL Data Compare will not only show you the difference between two databases, it also allows provides a Synchronization wizard in order to make the data changes necessary to make them the same.  Even better, you can do it step by step, one table at a time, one execution type (Insert / Update / Delete) at a time.  So I did that; I methodically ran through the data changes on only the tables involved in the query, and sure enough, when the before now looked like the after from the bulk loaded data perspective, the stored procedure there suddenly started taking much longer to run and gave the ugly execution plan instead of the efficient one.  And one of the weird things that the execution plan was showing was that two sources of a Hash Match Join brought 43,000 and 2,900 rows respectively, and the output was 2.1 million rows.  But the final result set was only 365 rows.  I figured that was not a good sign.

Armed with that information, I knew I was going to have to tackle the stored procedure to see if I could rewrite it in a more efficient manner.  It really only had two queries to it, each with about 8 tables JOINED together.  My first thought was that it was fairly ugly, so I ran SQL Prompt’s Format SQL feature to clean up the layout.  (With all these references to Red Gate tools, you might think I get paid for promoting their stuff.  I don’t.  It just happens to be that they make good tools that really helped me today.)  After the code was formatted the way I like to see it, it was still ugly, but this time strictly from a structural or syntactical point of view.  I noticed the DISTINCT clauses right away, but also spent quite a bit of time reading the code and running mental checks on all of the JOINs and WHEREs.  Along the way, I made a note to myself (commented the code) that one of the tables would normally be joined on two fields, and not just one.  After doing that study, I decided it was time to break this thing apart and see what was in it.  The first thing I wanted to do was get rid of that DISTINCT and see what came out.  That immediately changed my result set from 365 rows to nearly 17,000 rows with a whole bunch of “duplicates”.  I did some counting of how many duplicates of one key I was getting and thinking about what may cause that when I saw that note to myself on the JOIN criteria and decided to clean that up, too while I was there.  So, I quickly added the second field to the JOIN, re-ran the query to see how much difference it made, and voila! I was back to my target of 365 rows, and performance was dramatically better.  I made that change to the stored procedure in the before snapshot (with data loaded) that I was using for testing, and sure enough, results came back in just a few seconds and the execution plan was back much closer to the original, so I did an emergency update of the live system with that change and everything was back to working like it was supposed to.

I’m still surprised that such a relatively little amount of data would cause such a drastic change, but I guess that is explained by the multiplying effect of that bad JOIN syntax.  There are still some improvements that I think could be made in that procedure to make it more maintainable, and probably more efficient still.  And there are a couple of new indexes that I am going to explore to tune this up even more, but this was enough to get us back in business and allow me to get out the door to go have a late dinner with my wife.

And as you can imagine, I still hate DISTINCT.

Legacy Comments

re: The Tipping Point - DISTINCT Causes Timeouts
Nice recap Mark. So, given that within a well-developed database and using a well-developed query a DISTINCT is not needed, then playing the Devil’s advocate, why even allow it? One has to wonder why the keyword and functionality is even supported by relational database Structured Query Languages. Not defending the use of “DISTINCT” at all, but is there any application when it makes sense? Or is it always a last-resort play, an end-around when working with poor database design? Still, why support it even then? Is this industry acceptance of poor database design?

re: The Tipping Point - DISTINCT Causes Timeouts
@MRG – I could understand why `DISTINCT` might be handy if you were using SQL Server for ETL. I know that when we import data from other systems we often need to find and remove duplicates.

re: The Tipping Point - DISTINCT Causes Timeouts
You can't always include "the key, the whole key, and nothing but the key" in your queries. Therefore, we will always need DISTINCT, or at least GROUP BY, for our unique needs.

re: The Tipping Point - DISTINCT Causes Timeouts
There are legitimate uses for DISTINCT. For example, maybe you want to see a list of all customers who ordered something in July. Some customers might have placed many orders, but you just want to see each customer once.

However, Mark is dead-on about how DISTINCT gets abused - I see this so often, DISTINCT is basically a synonym for "BAD QUERY" to me now.

What happened to Mark was a developer made a bad JOIN in the original query. The developer noticed the results weren't what he expected, so he slapped a DISTINCT on his query and it "looked right" after that, so the code got deployed.

Here's the interesting part. The query was never right. Notice Mark said the stored procedure wasn't changed, only data in the source table. Once upon a time the data in the tables appeared to play nice with the DISTINCT, but this was a disaster waiting to happen.

For example, maybe your query works as long as all customers only have orders shipped to their home address, but once Christmas rolls around and people start using multiple shipping addresses, everything falls apart.

To me, the most important part of query writing is focusing on cardinality - knowing which JOINs are one-to-one and which are one-to-many.

The first thing I do when writing a query against tables I'm not familiar with is to figure out what columns I am joining on, and then I study which tables are unique on those columns and which aren't. Once you know that, you can design a proper query, and better yet, recognize when the results don't match your expectations.

Jacob Wagner
re: The Tipping Point - DISTINCT Causes Timeouts
I use distinct fairly often when I am receiving non-normalized data from an external system that I don't have control over. Group By will do the job but when you have a dozen (or more) columns it is at the very least tedious.

re: The Tipping Point - DISTINCT Causes Timeouts
In one of production servers, there is a query which involves in deadlocks regularly... And programmers are asking why? :))

Select distinct top 5000 ....
order by .. some 6 fields here

DBA's life is a hard life :)