One step forward, two steps back.

Thu Oct 18, 2007 by Geoff N. Hiten in low-availability

We sometimes do things as DBAs that are self-defeating, especially regarding high availability. We can get so focused on the One True Thing™ that will solve all of our problems that we don't realize the way we implement it can end up costing us all of the benefit. Clustering is often seen as the complete solution to availability. Unfortunately, clustering adds complexity to the system, and that complexity can hurt stability. The way we manage that risk is to use high-quality hardware that is tested and approved for clustering. We have the Windows Catalog for Clustering (formerly the Hardware Compatibility List) that tells us that our proposed solution will work. When we stray from this list, we are courting disaster. Most of the failures I have seen can be attributed to NOT following the guidelines on recommended hardware. Two items in particular stick out as Low Availability solutions.

The first Low Availability technique I want to talk about is clustering blade servers. Let's look at what clustering does for us. Clustering's primary benefit is an immediate hot-standby server to protect us from hardware failure. Clustering will do nothing to stop or recover from a "DROP DATABASE Payroll" command. Blades, on the other hand, exist to reduce data center cost by consolidating hardware. Blades share various components depending on the manufacturer. Power supplies are almost always shared in a blade chassis. Some chassis share network switches, KVM connections, or even centrally stored boot images. These common connections are common points of failure. Some failures will take out all blades in a chassis. Some central configuration changes can drop a chassis offline. Common failure points reduce the benefit of a cluster, sometimes below the availability a stand-alone server can offer.

The second Low Availability hardware component for SQL Server is iSCSI. Right now, there is limited support for iSCSI with SQL Server, and even Microsoft cautions that iSCSI is not typically a high-performance solution (http://support.microsoft.com/kb/833770). Every single cluster I have been involved with that used iSCSI has had performance and stability issues. Every. Single. One. The stability issues are a result of the low performance. Sometimes, I/O latency on the device means the Quorum drive becomes unresponsive. Nothing good happens after that. I know iSCSI is a cheap way to buy a multi-connected storage device, but "cheap" and "Highly Available" just don't go together. Sometimes the best you can do is to get a good stand-alone server and leave clustering until you can get a system that solves more availability issues than it creates.