I was recently called in to what has to be the most interesting low-availability environment I have ever seen. No single thing there was new to me, but seeing all of it together in one place was truly amazing. I try very hard not to say bad things about systems or in-house DBAs; it is bad practice to call a baby ugly when you might be standing next to its momma. This time I lost it at about my fourth WTF moment.
The platform was a Windows 2003 and SQL Server 2005 cluster. And it was more of a US Marine cluster than a Microsoft Failover Cluster, if you catch my meaning. Let’s just run down the list of “interesting” issues.
The cluster was configured flat-out wrong. The second node had half the processors and memory of the first node. The hardware was about seven years old; not even the original vendor had parts in stock when a CPU power regulator blew. Failover had never been tested and did not work. The cluster had missing and broken resources that had never been cleaned up and that prevented proper patching.
Notice I wrote Windows 2003, not 2003 R2, as the operating system. Yes, it was a 32-bit box… with 64 GB of RAM. (That merited a facepalm all on its own.) 13.5% of the CPU capacity was spent managing AWE memory; I checked. That wouldn’t be a problem except that the main box was running flat out (95%-100% CPU) for twenty hours a day. NHibernate will do that to SQL Server. But the DBAs were OK with this.
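For readers who never had to live with it: AWE on a 32-bit instance works by mapping windows of physical memory above 4 GB in and out of the process address space, and that mapping is the CPU overhead I measured. A minimal sketch of how such a box would have been configured (the memory cap value here is illustrative, not taken from the client's system):

```sql
-- Hedged sketch: enabling AWE on a 32-bit SQL Server 2005 instance.
-- The 'awe enabled' option only exists (and only matters) on 32-bit builds;
-- on 64-bit SQL Server this whole dance is unnecessary.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'awe enabled', 1;
RECONFIGURE;
-- With AWE, SQL will not release memory dynamically, so cap it to leave
-- room for the OS. 57344 MB is an illustrative value, not the client's.
EXEC sp_configure 'max server memory (MB)', 57344;
RECONFIGURE;
```

The service account also needs the "Lock Pages in Memory" privilege, and the OS needs /PAE in boot.ini. Moving to 64-bit makes all of it go away, which is a big part of why we proposed what we proposed below.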
The LUNs were misaligned.
There were over 500K records in the msdb.dbo.backupset table, with no cleanout and no indexes added. Backups sat on the same SAN as the database. Again, the DBAs were OK with this.
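Fixing the history bloat is not hard; Microsoft ships a cleanup procedure for exactly this. A sketch of what routine maintenance should have looked like (the 90-day retention and the index name are my assumptions, not the client's policy):

```sql
-- Hedged sketch: trimming msdb backup history.
-- sp_delete_backuphistory removes rows older than the given date from
-- backupset and its related history tables.
USE msdb;
GO
DECLARE @cutoff datetime;
SET @cutoff = DATEADD(day, -90, GETDATE());  -- 90 days is an assumed retention
EXEC msdb.dbo.sp_delete_backuphistory @oldest_date = @cutoff;
GO
-- The cleanup scans by backup_finish_date, which msdb does not index
-- out of the box on this version; a supporting index keeps the delete fast.
-- Index name is illustrative.
CREATE INDEX IX_backupset_backup_finish_date
    ON msdb.dbo.backupset (backup_finish_date);
GO
```

Schedule the delete as a weekly Agent job and the table stays small forever. And backups belong on different spindles (and ideally a different array) than the data they protect.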
They even had a Disaster Recovery hope. Not a plan, a hope. It was never tested. Good thing, since SQL replication is not really suited for DR, especially if you have business logic that relies on foreign keys and triggers in your database. This was the solution the DBAs recommended for DR.
The DBAs proposed a physical-to-virtual migration of the broken cluster. The client has a nice VMware cluster (on supported hardware) with a very good storage solution behind it. The systems guys had tried this before: it took 26 hours to create the VM image of one node AND it failed about half the time. This time they “knew” it would work.
The CIO quit listening to the DBAs at this point. Evidently he had had enough of “what got us here” and wanted more of “what will get us out of here”.
We proposed a SQL migration onto a newly built VM (Windows 2008 R2 and 64-bit SQL Server 2005, not clustered) with log shipping to the DR site across town. The new system would have aligned LUNs on supported hardware, with all the proper performance tweaks a system like that is supposed to have.
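Unlike replication, log shipping moves the whole database, foreign keys, triggers, and all, which is why it fits DR. At its core it is just scheduled log backups and restores. A sketch of the mechanics (database name and UNC path are illustrative):

```sql
-- Hedged sketch of the log-shipping core; names and paths are made up.
-- On the primary: back up the transaction log on a schedule.
BACKUP LOG ProdDB
    TO DISK = N'\\drserver\logship\ProdDB_log.trn';

-- On the DR secondary: apply each log backup, leaving the database
-- in a restoring state so later backups can still be applied.
RESTORE LOG ProdDB
    FROM DISK = N'\\drserver\logship\ProdDB_log.trn'
    WITH NORECOVERY;

-- At failover time, bring the secondary online for users:
RESTORE DATABASE ProdDB WITH RECOVERY;
```

In practice the built-in log shipping wizard sets up Agent jobs that do the backup, copy, and restore steps on a timer; the point is that every schema object rides along in the log, and failover is one command you can actually test.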
We go live on our solution this weekend. And no, I won't tell you the company.