Troubleshooting Application Timeouts in SQL Server
I recently received the following email from a blog reader:
"We are having an OLTP database instance, using SQL Server 2005 with little to moderate traffic (10-20 requests/min). There are also bulk imports that occur at regular intervals in this DB and the import duration ranges between 10secs to 1 min, depending on the data size. Intermittently (2-3 times in a week), we face an issue, where queries get timed out (default of 30 secs set in application). On analyzing, we found two stored procedures, having queries with multiple table joins inside them of taking a long time (5-10 mins) in getting executed, when ideally the execution duration ranges between 5-10 secs. Execution plan of the same displayed Clustered Index Scan happening instead of Clustered Index Seek. All required Indexes are found to be present and Index fragmentation is also minimal as we Rebuild Indexes regularly alongwith Updating Statistics. With no other alternate options occuring to us, we restarted SQL server and thereafter the performance was back on track. But sometimes it was still giving timeout errors for some hits and so we also restarted IIS and that stopped the problem as of now."
Rather than respond directly to the blog reader, I thought it would be more interesting to share my thoughts on this issue in a blog.
There are a few things that I can think of that could cause abnormal timeouts:
- Blocking
- Bad plan in cache
- Outdated statistics
- Hardware bottleneck
To determine if blocking is the issue, we can easily run sp_who/sp_who2 or a query directly on sysprocesses (select * from master..sysprocesses where blocking <> 0). If blocking is present and consistent, then you'll need to determine whether or not to kill the parent blocking process. Killing a process will cause the transaction to rollback, so you need to proceed with caution. Killing the parent blocking process is only a temporary solution, so you'll need to do more thorough analysis to figure out why the blocking was present. You should look into missing indexes and perhaps consider changing the database's isolation level to READ_COMMITTED_SNAPSHOT.
The blog reader mentions that the execution plan shows a clustered index scan when a clustered index seek is normal for the stored procedure. A clustered index scan might have been chosen either because that is what is in cache already or because of out of date statistics. The blog reader mentions that bulk imports occur at regular intervals, so outdated statistics is definitely something that could cause this issue. The blog reader may need to update statistics after imports are done if the imports are changing a lot of data (greater than 10%). If the statistics are good, then the query optimizer might have chosen to scan rather than seek in a previous execution because the scan was determined to be less costly due to the value of an input parameter. If this parameter value is rare, then its execution plan in cache is what we call a bad plan. You want the best plan in cache for the most frequent parameter values. If a bad plan is a recurring problem on your system, then you should consider rewriting the stored procedure. You might want to break up the code into multiple stored procedures so that each can have a different execution plan in cache.
To remove a bad plan from cache, you can recompile the stored procedure. An alternative method is to run DBCC FREEPROCACHE which drops the procedure cache. It is better to recompile stored procedures rather than dropping the procedure cache as dropping the procedure cache affects all plans in cache rather than just the ones that were bad, so there will be a temporary performance penalty until the plans are loaded into cache again.
To determine if there is a hardware bottleneck occurring such as slow I/O or high CPU utilization, you will need to run Performance Monitor on the database server. Hopefully you already have a baseline of the server so you know what is normal and what is not. Be on the lookout for I/O requests taking longer than 12 milliseconds and CPU utilization over 90%. The servers that I support typically are under 30% CPU utilization, but your baseline could be higher and be within a normal range.
If restarting the SQL Server service fixes the problem, then the problem was most likely due to blocking or a bad plan in the procedure cache. Rather than restarting the SQL Server service, which causes downtime, the blog reader should instead analyze the above mentioned things. Proceed with caution when restarting the SQL Server service as all transactions that have not completed will be rolled back at startup. This crash recovery process could take longer than normal if there was a long-running transaction running when the service was stopped. Until the crash recovery process is completed on the database, it is unavailable to your applications.
If restarting IIS fixes the problem, then the problem might not have been inside SQL Server. Prior to taking this step, you should do analysis of the above mentioned things.
If you can think of other reasons why the blog reader is facing this issue a few times a week, I'd love to hear your thoughts via a blog comment.
Legacy Comments
Mbourgon
2010-03-08 |
re: Troubleshooting Application Timeouts in SQL Server Two thoughts, one the same as yours - it's busy recompiling the plan because of the statistics change. Try turning asynchronous stats on. Second, depending on the growth and I/O speed, it could be that the database is attempting to grow. Seems doubtful, but is an odd possibility that might not be considered. Personal bet, though, is blocking because it's waiting on the new data to finish being inserted. |
Subhodip
2010-03-09 |
re: Troubleshooting Application Timeouts in SQL Server Further to Mbourgon's comments, kindly note that we have checked for DB (mdf file) growth and there is still much room (secondary ndf is also present) left for growth, so presumably that is not the case. We also have created Maintenance plans for Rebuilding Indexes and Updating Statistics, so if these are the causes, that should get addressed. Moreover, we don't find any Implicit Conversions in Exec Plan, another likely reason for Clustered Index Scans. But, when I executed "SELECT * FROM sys.dm_exec_query_optimizer_info WHERE counter = 'timeout'", I can find the counter value increasing progressively (around 300-400 in a 12-14hr period). Currently it as at 890 and last DB restart was two days (around 42 hrs) back. Can anyone kindly tell me how to interpret and address this issue. |
Tara
2010-03-10 |
re: Troubleshooting Application Timeouts in SQL Server Subhodip, Your maintenance plans run on a scheduled basis, but update statistics may need to be run after bulk imports are run. Do your bulk imports affect 10% or more of the data in the table? If it can, then your statistics are outdated. The "SELECT * FROM sys.dm_exec_query_optimizer_info WHERE counter = 'timeout'" query is irrelevant to your issue. The timeout being logged there is not an application timeout, which is what your issue is. If I were in your situation, I would recompile the problematic stored procedures when the issue arises. If that doesn't fix it, then I'd update statistics on the tables that the stored procedures reference. Oh I'd first check for bad blocking, but I think your issue more likely a bad plan problem based upon your execution plan observation. ~Tara |
Subhodip
2010-03-11 |
re: Troubleshooting Application Timeouts in SQL Server Tara, The bulk imports are not affecting more than 10% of the data. Data inserted is not significant in size (200-400 KB) and insertion gets completed within a few secs. I believe the rowmodctr is also not being pushed over (checked it in sys.sysindexes). We have made a check & to-do list, based on your valued inputs, that needs to be performed next time we are facing this problem. Based on our findings, I will update my comments here for everyone's ready reference. Thanks a ton for all the suggestions!! |
Application Hosting
2010-03-25 |
re: Troubleshooting Application Timeouts in SQL Server I am just new to your blog and just spent about 1 hour and 30 minutes lurking and reading. I think I will frequent your blog from now on after going through some of your posts. I will definitely learn a lot from them. |
Subhodip Sengupta
2010-04-28 |
re: Troubleshooting Application Timeouts in SQL Server Based on the inputs as per this post, kindly note we have implemented Maintenance Plans for Rebuilding Indexes and Updating Statistics and didn’t encounter the problem of queries/SPs getting timed out for more than a month (previously avg. freq. of occurrence was 1-2 times/week). But, after more than a month we are again faced with this problem. An SP is taking more than 5-6mins to get executed when in normal circumstances it takes anywhere between 10-15secs. The SP name gets listed when I execute sp_blocker_pss08. We have checked for the following parameters and find them in place: 1) DB (mdf,ndf,ldf) file size not increasing. 2) CPU & Memory Usage is normal. 3) No Locked Processes and no Waiting Tasks. 4) No Table level IX, X or U Locks. 5) Statistics are up-to-date and not getting updated at that particular time. 6) Indexes are de-fragmented for all tables & their indexes in use. 7) Query execution Plan having Index Seek and no Table Scans. The problem occured today and thereafter resolved by itself within 20-30 mins. Kindly do let us know which other parameters we should be looking out for as we have almost exhausted options that we are cognizant of. If there are any more details that you need, kindly do let me know. |