Geoff N. Hiten Blog

SQL Server thoughts, observations, and comments

Breakable

One of the major pain points in SQL Clustering is what Microsoft refers to as "servicing". Installation, Service Packs, Hotfixes, and Cumulative Updates bring special headaches to those of us who are responsible for the care and feeding of SQL Clusters. Sometimes you can end up with a Clustered system that appears un-patchable. This is one of those times.

The base system is a pretty typical basic cluster: one instance on two nodes, whose primary purpose is to host SSIS. The SQL engine exists to drive the SQL Agent for scheduled SSIS jobs. The problem began with Node1 failing completely. Normally this is no problem, so we started building a replacement node and reconfiguring the clustering. Everything went according to plan until we got to the last step in BOL:

  1. All nodes of a failover cluster instance must be at the same version level. After completing SQL Server Setup, you must download and apply the latest SQL Server 2005 service pack and/or patches to ensure that all failover cluster nodes are at the same version level.

OK, let's apply SP2. Starting with Node2, I apply Service Pack 2 and it promptly fails. Digging into the Summary.txt log gives the following error details:

**********************************************************************************

Products Disqualified & Reason

Product                                   Reason

Database Services (MSSQLSERVER)           The product instance MSSQLSERVER been patched with more recent updates.

**********************************************************************************

Processes Locking Files

Process Name          Feature               Type          User Name                  PID

 

 

**********************************************************************************

Summary

     Product instances were disqualified due to build version mismatch

     Exit Code Returned: 11203

 

 

Not exactly unexpected since I actually needed to patch Node1. So I shift the SQL instance to the other node and start over. Summary.txt gives me this:

----------------------------------------------------------------------------------

Product                   : Database Services (MSSQLSERVER)

Product Version (Previous): 1399

Product Version (Final)   : 

Status                    : Failure

Log File                  : 

Error Number              : 11009

Error Description         : No passive nodes were successfully patched

----------------------------------------------------------------------------------
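When several instances or nodes are involved, it can help to pull the status fields out of Summary.txt programmatically instead of reading each log by eye. The following is a minimal Python sketch that assumes a product block uses the "Key : Value" layout shown above. The function name `parse_summary_block` and the `SAMPLE` text are illustrative only, and real logs may contain sections this sketch does not handle.

```python
# Sample product block, copied from the Summary.txt excerpt above.
SAMPLE = """\
----------------------------------------------------------------------------------
Product                   : Database Services (MSSQLSERVER)
Product Version (Previous): 1399
Product Version (Final)   :
Status                    : Failure
Log File                  :
Error Number              : 11009
Error Description         : No passive nodes were successfully patched
----------------------------------------------------------------------------------
"""


def parse_summary_block(text):
    """Return a dict of the 'Key : Value' pairs found in one product block."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values that contain colons
        # (a log file path, for example) stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


block = parse_summary_block(SAMPLE)
print(block["Product"], "|", block["Status"], "|", block["Error Number"])
```

Separator lines have no colon and are skipped, and empty values such as "Product Version (Final)" come back as empty strings.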

 

Hmm. I can't patch Node2 because it is already at a higher build than the SP I am applying, and I can't patch Node1 because Setup can't patch the passive node, Node2, which is already patched. Unlike in SQL 2000, there is no "binary-only" option to patch the SQL code bits on Node1. After all, the actual databases were patched when the whole cluster was built up to build 3161, so all I need to update is the SQL executables. Therefore, the cluster recovery instructions as written in BOL are impossible to follow if the system is not exactly at a Service Pack revision. Given that SP2 was broken from the beginning and we MUST add a hotfix to get it to work correctly, this is more than merely painful.

All hope is not lost, however. There is a way to fix the cluster. The steps are actually pretty simple, but I strongly suggest testing this yourself before applying it to a production cluster. I tested this using a Virtual Server hosted cluster.

First, you remove Node2 (the fully patched node) from the SQL instance following the steps in BOL. Do not evict it from the Windows cluster unless you want to do a lot of extra work. Reboot Node2.

Then you re-add Node2 to the SQL instance. Now both nodes have RTM (Build 1399) binaries. You can then walk them up the patch chain together, rebooting as necessary, to get to whatever patch level you have decided is appropriate.
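To see why the original state is a deadlock and why resetting Node2 breaks it, here is a toy Python model of the two failures. It is my own simplification of the behavior observed in the logs, not the actual Setup logic. The function `try_patch` and its rules are assumptions for illustration. Builds 1399 and 3161 and the two error codes come from the text above, and 3042 is the SQL Server 2005 SP2 build number.

```python
RTM_BUILD = 1399      # binaries on the freshly rebuilt Node1
SP2_BUILD = 3042      # SQL Server 2005 SP2 (9.00.3042)
HOTFIX_BUILD = 3161   # post-SP2 level that Node2 had already reached


def try_patch(active_build, passive_builds, target_build):
    """Toy model (assumed rules): return 0 on success, else an error code."""
    # Rule 1: an instance already newer than the patch is disqualified.
    if active_build > target_build:
        return 11203
    # Rule 2: at least one passive node must accept the patch.
    patchable = [b for b in passive_builds if b < target_build]
    if passive_builds and not patchable:
        return 11009
    return 0


# Instance running on Node2 (3161), Node1 passive (1399): disqualified.
print(try_patch(HOTFIX_BUILD, [RTM_BUILD], SP2_BUILD))
# Instance moved to Node1 (1399), Node2 passive (3161): no passive node patched.
print(try_patch(RTM_BUILD, [HOTFIX_BUILD], SP2_BUILD))
# After removing and re-adding Node2, both nodes are at 1399: patch can proceed.
print(try_patch(RTM_BUILD, [RTM_BUILD], SP2_BUILD))
```

In this model, neither ordering of the mixed-build nodes succeeds, and only bringing both nodes back to the same build lets the service pack apply.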

No SQL configuration gets lost since the clustered instance never goes away. This will take a significant amount of downtime due to the multiple installs, patches, and reboots. My production fix took about an hour and a half on good name-brand, current tech quad-socket servers.

Legacy Comments


Susan Van Eyck
2008-06-03
re: Breakable
Geoff,

Thanks for this post! I'm adding it to my arsenal of cluster tools. I've gone many rounds with a 6 node cluster hosting 5 SQL instances. Getting them up to SP 2 has been awful. I've still got one stuck at build 2047 that refuses to upgrade. (This after an absolutely perfect upgrade to the corresponding staging system!). Grrrr!

Susan