With CLUE as (Select * from Random_Thought ORDER BY Common_Sense DESC)

SQL Server thoughts, observations, and comments
posts - 15, comments - 13, trackbacks - 0

Tuesday, April 15, 2008

Breakable

One of the major pain points in SQL Clustering is what is referred to by Microsoft as "servicing". Installation, Service Packs, Hotfixes, and Cumulative Updates bring special headaches to those of use that are responsible for the care and feeding of SQL Clusters. Sometimes you can end up with a Clustered system that appears un-patchable. This is one of those times.

The base system is a pretty typical basic cluster. One instance, two nodes whose primary purpose is to host SSIS. The SQL engine exists to drive the SQL Agent for scheduled SSIS jobs. The problem began with Node1 failing completely. Normally this is no problem, and we started building a replacement node and reconfiguring the clustering. Everything went according to plan until we got to the last step in BOL:

  1. All nodes of a failover cluster instance must be at the same version level. After completing SQL Server Setup, you must download and apply the latest SQL Server 2005 service pack and/or patches to ensure that all failover cluster nodes are at the same version level.

OK, lets apply SP2. Starting with Node2, I apply Service Pack 2 and it promptly fails. Digging into the Summary.txt log gives the following error details:

**********************************************************************************

Products Disqualified & Reason

Product                                   Reason

Database Services (MSSQLSERVER)           The product instance MSSQLSERVER been patched with more recent updates.

**********************************************************************************

Processes Locking Files

Process Name          Feature               Type          User Name                  PID

 

 

**********************************************************************************

Summary

     Product instances were disqualified due to build version mismatch

     Exit Code Returned: 11203

 

 

Not exactly unexpected since I actually needed to patch Node1. So I shift the SQL instance to the other node and start over. Summary.txt gives me this:

----------------------------------------------------------------------------------

Product                   : Database Services (MSSQLSERVER)

Product Version (Previous): 1399

Product Version (Final)   : 

Status                    : Failure

Log File                  : 

Error Number              : 11009

Error Description         : No passive nodes were successfully patched

----------------------------------------------------------------------------------

 

Hmm. I can't patch Node2 because it is already higher than the SP I am applying but I can't patch Node1 because I can't patch Node2 because Node2 is already patched. Unlike in SQL 2000, there is no "binary-only" option to patch the SQL code bits on Node1. After all, the actual databases were patched when the whole cluster was built up to build 3161 so all I need to update is the SQL executables. Therefore, the cluster recovery instructions as written in BOL are impossible to follow if the system is not at exactly at Service Pack revision. Given that SP2 was broken from the beginning and we MUST add a hotfix to get it to work correctly, this is more than merely painful.

All hope is not lost, however. There is a way to fix the cluster. The steps are actually pretty simple, but I strongly suggest testing this yourself before applying it to a production cluster. I tested this using a Virtual Server hosted cluster.

First, you remove Node2 (the fully patched node) from SQL following the steps in BOL. Do not evict it from the cluster unless you want to do a lot of extra work. Reboot Node2.

Then you re-add Node2 to the SQL instance. Now both nodes have RTM (Build 1399) binaries. You can then walk them up the patch chain together, rebooting as necessary, to get to whatever patch level you have decided is appropriate.

No SQL configuration gets lost since the clustered instance never goes away. This will take a significant amount of downtime due to the multiple installs, patches, and reboots. My production fix took about an hour and a half on good name-brand, current tech quad-socket servers.

posted @ Tuesday, April 15, 2008 7:00 PM | Feedback (0) |

Thursday, March 27, 2008

Speaking Schedule

If you feel an uncontrollable urge to listen to my voice or to meet me in person, here is your chance.

You can listen to me on Greg Low's SQL Down Under Podcast. I discuss SQL Clustering, SANs, and ramble on about other topics as well.

http://www.sqldownunder.com/PreviousShows/tabid/98/Default.aspx

I will be joining Kevin Kline and Hilary Cotter on a webcast for Quest Software on April 9th. We will be discussing features in the forthcoming SQL 2008 product.

http://www.quest.com/events/listdetails.aspx?contentid=7059&searchoff=true&technology=34&prod=&prodfamily=&loc=

And I will be speaking at the SQLSaturday #3 event in Jacksonville, Florida on May 3rd.

http://www.sqlsaturday.com/default.aspx

 

 

posted @ Thursday, March 27, 2008 10:36 AM | Feedback (0) | Filed Under [ SQL General ]

Wednesday, February 06, 2008

Leverage

One definition of leverage is "the use of a small initial investment, credit, or borrowed funds to gain a very high return in relation to one's investment, to control a much larger investment, or to reduce one's own liability for any loss." (Courtesy Dictionary.Com).

In programming, the smallest change we can make is a single bit. The next smallest change and the one we can make most effectively in human-readable source code is a single character. This story is how changing one character in a 300 line stored procedure removed 90% of the impact of the worst single query on the entire server.

This process began with a fairly normal analysis of a poorly performing server. As usual, the server was found to be I/O bound. Routine analysis indicated a particular query within a stored procedure to be the worst performing query. Comparing this query to the second worst query on the top ten list revealed it consumed 15 times the CPU and 20 times the IO cycles of the next worst offender. Calling Captain Obvious, please report to Performance Tuning.

I did an index analysis. No joy there. Everything looked correct and the query plans were fine. The query took a lookup string, hit a master lookup table, did key lookups across 4 other tables (three joins deep at most) and returned the extended attributes of the master key. The master table had about 400K rows and was constantly growing with new key values. This lookup was executed hundreds of thousands of times a day, so any improvement would be significant.

Since the schema and data was from an external data provider, I couldn't make any significant changes, nor was there any support for de-normalizing the data for a flat lookup.

My brilliant idea was to create an indexed view and let SQL Enterprise Edition do an under-the-covers substitution. Like most brilliant ideas, this did not work as expected. I could not get SQL to substitute the view for the underlying tables. ARRGHHH!

Finally I realized the problem was not in the query, nor was it in the index. It was something we usually take for granted. An implicit data type conversion was killing me. This was the culprit:

WHERE MasterKey = @LookupKey

Well, it was sort of the culprit. You see, Master_key was defined as varchar(10) by the external data provider and @LookupKey was defined by a wannabe senior developer. Naturally, a real developer uses forward-compatible data types like nvarchar(10) and not nearly obsolete types like varchar(10).

Most of us know not to write something like this:

WHERE Column_Value / 2 = 4

We write something like:

WHERE Column_Value = 8

The idea is to NOT force SQL to scan the column, run every value through a function, and then compare it to a constant. The correct way is to pre-calculate the constant and compare it to the column. Turns scans into seeks, cuts IO, promotes world peace, you get the idea.

So what does this have to do with the first WHERE clause. One rule on implicit conversions to larger data types is that the lower type always gets promoted. Thus the WHERE clause actually became:

WHERE convert(nvarchar(10), MasterKey) = @LookupKey)

Now that is pretty obviously bad.

By now, you have likely figured out the one character fix. Remove the 'N' from the declare function and use the old, bad, non-portable data type. Given the data definition for the real-world data type in this example only uses A-Z and 0-9 characters and the fix was at least a full order of magnitude faster, I can live with that.

Leverage. One character change fixes an entire problem server.

 

posted @ Wednesday, February 06, 2008 9:23 PM | Feedback (2) | Filed Under [ SQL General ]

Monday, January 21, 2008

MSDB Performance Tuning

Backup and Restore is arguably the most important task a DBA can do, but sometimes it is very frustrating to manage these elements. Some common problems include very slow UI interaction, both for SQL Enterprise Manager and for SQL Server Management Studio, mysterious timeouts, failure to delete old backup files, and other intermittent failures. Most of these failures can be traced back to the same underlying problem and thus have the same solution.

The MSDB database contains a series of system tables that track everything having to do with backups on that server. One of the limitations of vendor testing is that no single test platform ever runs for anything like the lifetime of a typical production SQL Server. Because of this, no test server accumulates the history in the MSDB tables that real-world systems do. Since SQL uses its standard table storage and retrieval mechanisms to store this information, scalability should not be a problem. Wrong. Just as with user tables, system tables will perform poorly if they are not correctly indexed. Now for the real kicker:

There are no indexes on the MSDB backup tracking tables.

That is right. Not a single index. No Referential Integrity, nothing.

In SQL 2000, this is even worse as there is no MSDB cleanout component in the maintenance plan wizard. SQL 2005 has it in SP2 and higher releases.

Fixing this problem is actually fairly easy. Here is a script to create indexes on the MSDB backup tables. This handles the basic performance issue, however it is still a good idea to clean out the old backup information. Unless you have an amazing archive system, tracking a transaction log backup from three years ago will not do you much good.

/************************************************************************

*                                                *

*    Title:    msdb index creation                        *

*    Author:    Geoff N. Hiten                            *

*    Purpose: Index msdb database                            *

*    Date:    12/12/2005                                    *

*    Modifications:                                    *

*                                                *

*    01-01-2001                                        *

*        Sample Entry                                *

*                                                *

*                                                *

************************************************************************/

use msdb

go

--backupset

Create index IX_backupset_backup_set_id on backupset(backup_set_id)

go

Create index IX_backupset_backup_set_uuid on backupset(backup_set_uuid)

go

Create index IX_backupset_media_set_id on backupset(media_set_id)

go

Create index IX_backupset_backup_finish_date on backupset(backup_finish_date)

go

Create index IX_backupset_backup_start_date on backupset(backup_start_date)

go

--backupmediaset

Create index IX_backupmediaset_media_set_id on backupmediaset(media_set_id)

go

--backupfile

Create index IX_backupfile_backup_set_id on backupfile(backup_set_id)

go

--backupmediafamily

Create index IX_backupmediafamily_media_set_id on backupmediafamily(media_set_id)

go

--restorehistory

Create index IX_restorehistory_restore_history_id on restorehistory(restore_history_id)

go

Create index IX_restorehistory_backup_set_id on restorehistory(backup_set_id)

go

--restorefile

Create index IX_restorefile_restore_history_id on restorefile(restore_history_id)

go

--restorefilegroup

Create index IX_restorefilegroup_restore_history_id on restorefilegroup(restore_history_id)

go

 

/************************************************************************

*    End Script                                        *

************************************************************************/

 

Notice that these are performance indexes only; they do not enforce any constraints.

For SQL 2000 users, there is one more bit of work to do. You need to schedule a job to clean out MSDB backup history on a regular basis. SQL 2005 users can set the maintenance plans to handle this. The command is:

use msdb

go

declare @OldestDate datetime

set @OldestDate = getdate() -120

 

exec sp_delete_backuphistory @OldestDate

If you try and run this command on a large set of data without indexing, it may take hours to days to complete, meanwhile it will block backups while it locks the tables.

120 is the number of days to retain backup data. This number should be slightly longer than your maximum live backup retention time. Ignore any backups archived to tape since those likely will get restored through an alternate path anyway.

 UPDATE:

As Uri pointed out, we can't do date math in the procedure call, so I wrote a minor modification to create a parameter and feed it to the procedure.  Good Catch.  Thanks. 

posted @ Monday, January 21, 2008 11:30 AM | Feedback (4) |

Wednesday, January 02, 2008

Worst IT Job (Update)

I spoke to someone at Ganz today about the Webkinz World site issues. The problem was twofold. First, the sales side didn't communicate expectations correctly to the folks who run the site. As such, they were totally overwhelmed by the numbers of kids "adopting" their new pets. The regular site could almost hold the load, but the "adoption center", essentially the code fulfillment center, couldn't handle the load. I still don't totally buy that since the valid codes had to be loaded into the system somehow. There was also some type of hardware failure as well, but I don't have the details. Still, the lesson in capacity planning and disaster recovery is clear. Murphy is the patron saint of Computer Science for a reason.

Ganz has also issued an apology on the Webkinz World site. As an offer for amends, they will extend registrations for one month and add 500 "KinzCash" points to everyone's account. There are certainly worse ways to handle this type of problem. I commend Ganz for trying to make the best out of a bad situation.

So, do you have a communication plan as part of your disaster mitigation and recovery strategy?

 

posted @ Wednesday, January 02, 2008 10:49 AM | Feedback (2) |

Friday, December 28, 2007

The worst IT job (at least for today)

There is an old saying in politics and entertainment that there is no such thing as BAD publicity. Unfortunately, the opposite is true for IT, especially in the Database realm. If your name gets mentioned outside of a technology publication, you can bet it is something bad. Data losses, system outages, privacy disclosures, and other public failures are how DBAs get famous. Unfortunately, that is the world we operate in. Today we have another poor soul to add to the roll call of failure. The anonymous person in charge of the Webkinz web site database is now one of the most reviled people in IT, not to mention one of the most overworked. He has managed to single-handedly kill Christmas joy for untold numbers of small children (one of mine amongst them).

For those who do not have kids in the 5-12 year old range, here is the quick version of Webkinz. Take a Beanie-Baby like plush toy that costs pennies to make in China (slight upcharge for the unleaded version). Put a "secret code" on a toe-tag. The "Secret Code" gets you access to an on-line "world" where a virtual avatar of your pet can play. More activity on the site earns you points towards virtual stuff for your "adopted" pet. Sell plush toys for $8-$15 each to parents of demanding children. Expire accounts that do not register a new code within one year to drive ongoing demand. When you get "critical mass" sell ads in the virtual world. Of course, don't mention your plans for the last step until you actually do have critical mass. J

Sheer genius. Moneymaking gold mine, right? Except what happens when you sell more plush toys than your web site can handle? What happens when you end up with one of the most popular toys of the Christmas season amongst the tween and pre-tween age set? You would think that scalability would be easy to plan for here. You write a code, you add a user slot into your infrastructure. It ain't like a Super Bowl ad where you can only guess at the response. Plus, you plan that ALL of your customers will want to play with their new toys during Christmas. Free clue to site administrator: Kids are out of school during this time of year.

As of this writing, the Webkinz World web site has been effectively unusable since sometime early on Christmas day. The site operators threw in the towel late yesterday and took the entire thing offline for "maintenance". So, either we have a hopelessly incompetent IT Staff and DBA that cannot do basic math, or a company that willfully refused to put enough resources to fulfill their promises. Promises made to children. I doubt this is the image or publicity that Ganz (the parent company) wanted. Of course, they can always make it worse with a bad or nonexistent public communications response, but that remains to be seen.

So, what are you doing to make sure your employer never gets this kind of publicity? Capacity planning is not optional in our profession.

posted @ Friday, December 28, 2007 3:24 PM | Feedback (1) | Filed Under [ Low Availability ]

Friday, November 16, 2007

SQL 2008 CTP5 First Glance

Microsoft began releasing SQL 2008 CTP5 yesterday through Connect. It should hit MSDN within a few days so everyone can play along. The most obvious change for this release is a much-needed improvement replacement of the installer. As I mentioned in an earlier post, the installer is being totally re-engineered. We still don't have clustering support in this CTP, but the basic installer changes are now included. While there are a few "fit and finish" items that need cleanup, the overall design of the installer is very good. The requirements checker is much lighter and cleaner. The install review presentation page is in a hierarchal form and actually makes sense. Best of all, you can specify separate default directories for system databases, user databases, user database log, tempdb data, tempdb log and backup target locations. Installation is one area where attention to detail and listening to customer feedback has given the SQL development team a significant win.

posted @ Friday, November 16, 2007 2:11 PM | Feedback (0) | Filed Under [ SQL General ]

Monday, November 12, 2007

Update: SQL 2005 Build 3186 Cluster problem

Microsoft does not have a complete resolution for this problem yet, but they have found some more details.  Evidently the problem with SQL Agent failure only occurs on systems using a domain admin account for the SQL Agent Service account.  Microsoft is not 100% sure yet, so this is just a preliminary finding.  However, it does match my own personal experiences.  Worst Practices always has a cost.  Several DBAs just found that out the hard way.  Just as a reminder, this problem only occurs on x64 clusters using SQL Server 2005 build 3186 and higher.

posted @ Monday, November 12, 2007 4:09 PM | Feedback (0) | Filed Under [ High Availability Low Availability Microsoft ]

Thursday, November 01, 2007

Online Reindex = ON

For the past several editions, SQL Books On-Line (BOL) has helpfully included a script to rebuild or defragment (since 2000) an index. Being Microsoft, this script is NOT located under the reindex or defragmentation topic, it is included in the fragmentation analysis section. For SQL 2000, this is DBCC SHOWCONTIG. For SQL 2005, they rewrote it to use the new system views and stashed it under sys.dm_db_physical_stats. It also uses the new ALTER INDEX command rather than the older DBCC DBREINDEX or DBCC INDEXDEFRAG command. However, this script does not allow you to take advantage of online indexing

My first attempt to modify the script consisted of dropping the defragmentation option and simply adding a WITH ONLINE = ON line to the script. As with most quick and dirty mods, this failed miserably. I soon discovered that some indexes cannot be rebuilt online. There are two conditions that will prevent online reindexing, both of which involve BLOB data types:

You cannot online reindex a non-clustered index that contains a BLOB column.

You cannot online reindex a clustered index of a table that contains a BLOB column.

Pretty straightforward, or so it seems. The code to identify BLOB columns from an index is a bit tricky, but not too impossible. I used actual data type names rather than the internal data type IDs to make the code more readable. I also added a filter condition to remove "trivial" sized indexes. These are typically allocated into mixed extents and do not respond to defragmentation. There is no real harm in including them, but they clutter up the result set when troubleshooting. I kept tweaking the number until I got it as low as possible without an index reappearing in subsequent runs. The test platform was a Microsoft CRM implementation. For those unfamiliar with the table design of MSCRM, it uses GUIDs as primary keys and accepts the default of clustering the primary key. This makes for a nicely fragmented database to test, especially if there is a lot of insert and delete activity.

So without further ado, here is the modified BOL code that allows for scripting online index rebuilds.

-- Adapted from BOL script by Geoff N. Hiten.

USE <Your Database name here>

GO

-- Ensure a USE <databasename> statement has been executed first.

 

 

SET NOCOUNT ON;

DECLARE @objectid int;

DECLARE @indexid int;

DECLARE @partitioncount bigint;

DECLARE @schemaname nvarchar(130);

DECLARE @objectname nvarchar(130);

DECLARE @indexname nvarchar(130);

DECLARE @partitionnum bigint;

DECLARE @partitions bigint;

DECLARE @frag float;

DECLARE @command nvarchar(4000);

DECLARE @HasBlobColumn int;

DECLARE @MaxFragmentation int

DECLARE @TrivialPageCount int

 

-- Tuning constants

SET @MAxFragmentation = 10

SET @TrivialPageCount = 12

 

-- Conditionally select tables and indexes from the sys.dm_db_index_physical_stats function

-- and convert object and index IDs to names.

SELECT

object_id AS objectid,

index_id AS indexid,

partition_number AS partitionnum,

avg_fragmentation_in_percent AS frag

INTO #work_to_do

FROM sys.dm_db_index_physical_stats (DB_ID(), NULL, NULL , NULL, 'LIMITED')

WHERE avg_fragmentation_in_percent > @MaxFragmentation -- arbitrary threshold. YMMV

    AND index_id > 0 -- cannot defrag a heap

    and page_count > @TrivialPageCount -- ignore trivial sized indexes

-- Declare the cursor for the list of partitions to be processed.

DECLARE partitions CURSOR FOR SELECT * FROM #work_to_do;

 

-- Open the cursor.

OPEN partitions;

 

-- Loop through the partitions.

WHILE (1=1)

BEGIN;

FETCH NEXT

FROM partitions

INTO @objectid, @indexid, @partitionnum, @frag;

IF @@FETCH_STATUS < 0 BREAK;

        Set @HasBlobColumn = 0 -- reinitialize

SELECT @objectname = QUOTENAME(o.name), @schemaname = QUOTENAME(s.name)

            FROM sys.objects AS o

            JOIN sys.schemas as s ON s.schema_id = o.schema_id

            WHERE o.object_id = @objectid;

SELECT @indexname = QUOTENAME(name)

            FROM sys.indexes

            WHERE object_id = @objectid AND index_id = @indexid;

SELECT @partitioncount = count (*)

            FROM sys.partitions

            WHERE object_id = @objectid AND index_id = @indexid;

        -- Check for BLOB columns

        if @indexid = 1 -- only check here for clustered indexes ANY blob column on the table counts

            Select @HasBlobColumn = case when max(so.object_ID) IS NULL then 0 else 1 end

                From sys.objects SO

                inner join sys.columns SC

                    on SO.Object_id = SC.object_id

                inner join sys.types ST

                    on SC.system_type_id = ST.system_type_id

                    and ST.name in ('text', 'ntext', 'image', 'varchar(max)', 'nvarchar(max)', 'varbinary(max)', 'xml')

                where SO.Object_ID = @objectID

            else -- nonclustered. Only need to check if indexed column is a BLOB

            Select @HasBlobColumn = case when max(so.object_ID) IS NULL then 0 else 1 end

                from sys.objects SO

                inner join sys.index_columns SIC

                    on SO.Object_ID = SIC.object_id

                inner join sys.Indexes SI

                    on SO.Object_ID = SI.Object_ID

                    and SIC.index_id = SI.index_id

                inner join sys.columns SC

                    on SO.Object_id = SC.object_id

                    and SIC.Column_id = SC.column_id

                inner join sys.types ST

                    on SC.system_type_id = ST.system_type_id

                    and ST.name in ('text', 'ntext', 'image', 'varchar(max)', 'nvarchar(max)', 'varbinary(max)', 'xml')

                where SO.Object_ID = @objectID

        SET @command = N'ALTER INDEX ' + @indexname + N' ON ' + @schemaname + N'.' + @objectname + N' REBUILD';

        if @HasBlobColumn = 1

            Set @command = @command + N' WITH( SORT_IN_TEMPDB = ON) '

        else

            Set @command = @command + N' WITH( ONLINE = ON, SORT_IN_TEMPDB = ON) '

IF @partitioncount > 1

SET @command = @command + N' PARTITION=' + CAST(@partitionnum AS nvarchar(10));

PRINT N'Executing: ' + @command + ' Has Blob = ' + convert(nvarchar(2),@HasBlobColumn);

        EXEC (@command)

-- PRINT N'Executing: ' + @command;

END;

 

-- Close and deallocate the cursor.

CLOSE partitions;

DEALLOCATE partitions;

 

-- Drop the temporary table.

DROP TABLE #work_to_do;

GO

posted @ Thursday, November 01, 2007 9:35 AM | Feedback (1) | Filed Under [ High Availability SQL General ]

Thursday, October 18, 2007

One step forward, two steps back.

We sometimes do things as DBAs that are self defeating, especially regarding high availability. We can get so focused on the One True Thing™ that will solve all of our problems that we don't realize that the way we implement something can end up costing us all the benefit. Clustering is often seen as the complete solution to availability. Unfortunately, clustering adds complexity to the system which can then impact stability. The way we remove that uncertainty is to use high quality hardware that is tested and approved for clustering. We have the Windows Catalog for Clustering (formerly the Hardware Compatibility List) that tells us that our proposed solution will work. When we stray from this list, we are courting disaster. Most of the failures I have seen in the past can be attributed to NOT following the guidelines on recommended hardware. I have two items in particular that stick out as Low Availability solutions.

The first Low Availability technique I want to talk about is clustering blade servers. Let's look at what clustering does for us. Clustering's primary benefit is an immediate hot-standby server to protect us from hardware failure. Clustering will do nothing to stop or recover from a "DROP DATABASE Payroll" command. Blades, on the other hand, exist to reduce data center cost by consolidating hardware. Blades share various components depending on the manufacturer. Power supplies are almost always shared in a blade chassis. Some chassis share network switches, KVM connections, or even centrally stored boot images. These common connections are common points of failure. Some failures will take out all blades in a chassis. Some central configuration changes can drop a chassis offline. Common failure points reduce the benefit of a cluster, sometimes below the availability a stand-alone server can offer.

The second Low Availability hardware component for SQL Server is iSCSI. Right now, there is limited support for iSCSI and SQL, but even Microsoft cautions that iSCSI is not typically a high-performance solution (http://support.microsoft.com/kb/833770). Every single cluster I have been involved with that used iSCSI has had performance and stability issues. Every. Single. One. The stability issues are a result of the low-performance. Sometimes, I/O latency on the device means the Quorum drive becomes unresponsive. Nothing good happens after that. I know iSCSI is a cheap way to buy a multi-connected storage device, but "cheap" and "Highly Available" just don't go together. Sometimes the best you can do is to get a good stand-alone server and leave clustering until you can get a system that solves more availability issues than it creates.

posted @ Thursday, October 18, 2007 3:24 PM | Feedback (0) | Filed Under [ Low Availability ]

Powered by: