Everything worked fine for the last six years…and then today we hit the tipping point. I lost many hours today to the misuse of DISTINCT, which turned out to be the root cause of timeout errors in our web application.
In one of my most frequently read posts, Why I Hate Distinct, I talk about how this SQL keyword is often abused by SQL developers who are just trying to get their job done, but don’t know enough about either the SQL language or the data architecture to really do it right. Now, I realize that in the development world (of which I am a part, not just a spectator) using a phrase like “do it right” can be fighting words. In software development, there are often many different ways to accomplish the same goal, and what may be right in one situation or environment may not be right in another. But there are also certain principles which should not be undermined or overridden without some serious analysis. One set of those principles in the relational database world is the rules of Normalization: “the key, the whole key, and nothing but the key, so help me Codd.” These principles are important for preventing duplicates in your data.

What does the DISTINCT keyword do? It removes duplicates from your query result set. The question that should now be coming to mind is, “If I have a normalized (no duplicates) database, why would I end up with duplicates in my result set?” An excellent question! The simple answer is that you shouldn’t, and if you do, I think you need to analyze your query, not just slap the DISTINCT Band-Aid on it.
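To make that concrete, here is a minimal sketch (using Python’s stdlib sqlite3 to run the SQL, with hypothetical table and column names, not the schema from my application) of how an incomplete join key manufactures duplicates and how DISTINCT merely hides them:

```python
import sqlite3

# Hypothetical two-part key: rows in both tables are keyed by
# (order_id, line_no). Joining on order_id alone multiplies rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_line (order_id INT, line_no INT, sku TEXT);
    CREATE TABLE shipment   (order_id INT, line_no INT, carrier TEXT);
    INSERT INTO order_line VALUES (1, 1, 'A'), (1, 2, 'B');
    INSERT INTO shipment   VALUES (1, 1, 'UPS'), (1, 2, 'FedEx');
""")

# Incomplete join key: each of the 2 order lines matches both shipments.
bad = conn.execute("""
    SELECT ol.order_id, ol.sku
    FROM order_line ol JOIN shipment s ON s.order_id = ol.order_id
""").fetchall()
print(len(bad))     # 4 rows: duplicates manufactured by the join

# DISTINCT "fixes" the output but not the underlying cross-multiplication.
masked = conn.execute("""
    SELECT DISTINCT ol.order_id, ol.sku
    FROM order_line ol JOIN shipment s ON s.order_id = ol.order_id
""").fetchall()
print(len(masked))  # 2 rows, but the engine still built all 4 first
```

The DISTINCT version returns the “right” answer, which is exactly why the underlying join bug can go unnoticed for years.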
But as I mentioned in my opening paragraph, today this became an issue about much more than undermining principles or my sense of right and wrong. Today it became an issue of query performance, and I spent several hours trying to figure out what went wrong, including working late on one of the most beautiful Friday nights we have had in a long, long time, and missing dinner with my family or getting to see my kids before they went to bed. For me, this was a large price to pay, all brought about by one developer’s carelessness or ignorance. (Hey, what can I say? Once I got married and had a family, I suddenly didn’t think staying at work all night working on a puzzle was all that cool any more.) This performance effect of the misuse of the DISTINCT keyword was actually mentioned in a comment on that original post by Ivan Budiono, but I had not seen it in action until now.
In my story, it was an odd, but fortunate, coincidence that this problem showed up when it did. This internal web application has been running for more than six years with very little change to this particular web page, which normally only takes about 4 or 5 seconds to gather its data and render, so it was very surprising this morning when it started throwing SQL Timeout errors; all the more so, knowing that we had nightly jobs to address index fragmentation and to update statistics. The business users first jumped to the conclusion that something went wrong with the newest version release which we deployed last night, but I knew that there were no changes in there related to this particular web page nor the stored procedure that was timing out. We had also purged a little bit of bad data from one of the related tables, and so some users thought the problem was caused by that, but again I was skeptical of their guesses. I find it a good practice, when you are playing the role of troubleshooter, to be skeptical of guesses and people’s memory, and simply focus on the things that you can prove to be fact.
Fortunately, since I had just done the software deployment the night before, I had before and after backups of the database. So, I restored the “before” image under a different name (we are fortunate to have sufficient hardware that I could put this on the same instance as the live system, eliminating that variable in my analysis without significantly impacting the live system). Sure enough, executing the stored procedure in the “before” database ran without trouble, returning results in a few seconds, but executing it in the current live database took nearly a minute. So then I turned on Actual Execution Plans on both, and they were remarkably different. The before image was full of index seeks and put the tables together one way, but the after plan had a couple of time-consuming Clustered Index Scans and, interestingly, arranged the pieces in a very different way.
Now the research starts to get fun. Since we have the very handy before and after copies of the database, let’s see if there is something different between them that might be causing the issue, such as a dropped index. So, I fire up Red Gate’s SQL Compare tool. Everything looks good. There are differences, but only the few changes I was expecting from the software update we released, and nothing that stood out as an issue. Specifically, there was no difference in the tables involved, the index definitions for those tables, or the stored procedure. Hmmm…what else could it be? Well, I knew that the business was doing some bulk importing of data into tables involved in the queries, but they had done that before without a problem. Nonetheless, I’m running out of ideas, so let’s see just how much has changed in the data. So, I open up Red Gate’s SQL Data Compare tool and take a look.
Well, there were about 2,900 rows added to a table that has 40,000 rows (about 7% growth). That doesn’t sound too bad. And another table involved had 4,400 rows added to its nearly 100,000 rows, again a small percentage of change. It seems hard to imagine that this little amount of data could make such a big difference, but it is one of the few things that I can verify are different between the databases and related to the query. If only there was a way to replay that data change to see if it caused the problem…Wait! There is! SQL Data Compare will not only show you the difference between two databases, it also provides a Synchronization wizard to make the data changes necessary to make them the same. Even better, you can do it step by step, one table at a time, one execution type (Insert / Update / Delete) at a time. So I did that; I methodically ran through the data changes on only the tables involved in the query, and sure enough, when the before now looked like the after from the bulk-loaded-data perspective, the stored procedure suddenly started taking much longer to run and gave the ugly execution plan instead of the efficient one. And one of the weird things the execution plan showed was that the two inputs of a Hash Match Join carried 43,000 and 2,900 rows respectively, yet the output was 2.1 million rows. But the final result set was only 365 rows. I figured that was not a good sign.
Armed with that information, I knew I was going to have to tackle the stored procedure to see if I could rewrite it in a more efficient manner. It really only had two queries in it, each with about 8 tables JOINed together. My first thought was that it was fairly ugly, so I ran SQL Prompt’s Format SQL feature to clean up the layout. (With all these references to Red Gate tools, you might think I get paid for promoting their stuff. I don’t. It just happens to be that they make good tools that really helped me today.) After the code was formatted the way I like to see it, it was still ugly, but this time strictly from a structural or syntactical point of view. I noticed the DISTINCT clauses right away, but also spent quite a bit of time reading the code and running mental checks on all of the JOINs and WHEREs. Along the way, I made a note to myself (commented the code) that one of the tables would normally be joined on two fields, not just one. After doing that study, I decided it was time to break this thing apart and see what was in it. The first thing I wanted to do was get rid of that DISTINCT and see what came out. That immediately changed my result set from 365 rows to nearly 17,000 rows with a whole bunch of “duplicates”. I was counting how many duplicates of one key I was getting, and thinking about what might cause that, when I saw that note to myself on the JOIN criteria and decided to clean that up, too, while I was there. So, I quickly added the second field to the JOIN, re-ran the query to see how much difference it made, and voila! I was back to my target of 365 rows, and performance was dramatically better.
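The shape of that fix, sketched with sqlite3 and hypothetical table names (my real procedure joined about 8 tables, but the principle is the same):

```python
import sqlite3

# Hypothetical schema: the real key of both tables is (order_id, line_no),
# but the original query joined on order_id alone and papered over the
# resulting duplicates with DISTINCT.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE header (order_id INT, line_no INT, status TEXT);
    CREATE TABLE detail (order_id INT, line_no INT, amount INT);
    INSERT INTO header VALUES (1, 1, 'open'), (1, 2, 'open');
    INSERT INTO detail VALUES (1, 1, 10), (1, 2, 20);
""")

# Before: one-column join plus DISTINCT -- right row count, wasted work.
before = conn.execute("""
    SELECT DISTINCT h.order_id, h.line_no, h.status
    FROM header h JOIN detail d ON d.order_id = h.order_id
""").fetchall()

# After: join on the whole key -- no duplicates, so no DISTINCT needed.
after = conn.execute("""
    SELECT h.order_id, h.line_no, h.status
    FROM header h
    JOIN detail d ON d.order_id = h.order_id
                 AND d.line_no  = h.line_no
""").fetchall()
print(len(before), len(after))  # both 2, but "after" never builds 4 rows
```

Both versions return the same rows; the difference is that the corrected join never produces the intermediate duplicates in the first place, which is where all my query time was going.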
I made that change to the stored procedure in the before snapshot (with the data loaded) that I was using for testing, and sure enough, results came back in just a few seconds and the execution plan was much closer to the original, so I did an emergency update of the live system with that change and everything was back to working like it was supposed to.
I’m still surprised that such a relatively small amount of data would cause such a drastic change, but I guess that is explained by the multiplying effect of that bad JOIN syntax. There are still some improvements that I think could be made in that procedure to make it more maintainable, and probably more efficient still. And there are a couple of new indexes that I am going to explore to tune this up even more, but this was enough to get us back in business and allow me to get out the door to go have a late dinner with my wife.
And as you can imagine, I still hate DISTINCT.