To help out in this thread I dug out a couple of URLs for good clustering resources on Microsoft's site. Looking through my home Wiki for the links got me thinking about a clustering job I did earlier this year, which was a pretty scary, nerve-wracking job.
The gig was for a reasonably high-traffic web site. They had load-balanced web and Java app servers and wanted to give the database some redundancy. The problem was that one of the servers intended for the cluster was the current database server. This meant that instead of building up and testing a cluster before deploying it, I had to:
- Build a single node cluster
- Bring the site down and transfer the live database to my single node cluster (the database was about 6GB backed up)
- Make a bunch of server IP address changes (the address was coded in various places in the app server)
- Bring up the site, hopefully pointing at the new virtual node.
- Breathe major sigh of relief
- Reformat and rebuild the spare machine (the old live database server)
- Apply all service packs etc
- Get the machine to join the live cluster (hoping like hell it didn't bring the other node down)
- Install an SQL Server node on the box (hoping like hell it didn't bring the other node down)
- Make sure it all worked by failing nodes over to each other (hoping like hell it would actually come back up)
* I do not recommend doing this; it in no way represents any sort of best practice (in fact, quite the opposite). Don't try this at home.
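The scariest step in that list was the IP address change, because the old database address was coded in various places in the app servers. If you're facing something similar, a sweep of the config tree for the old address saves a lot of grief. Here's a minimal sketch; the address and config directory below are made up for illustration:

```python
import os

OLD_IP = "192.168.1.50"  # hypothetical old database server address
CONFIG_ROOT = "/opt/appserver/conf"  # hypothetical app server config directory

def find_hardcoded_ip(root, ip):
    """Walk a directory tree and report every file and line that
    mentions the given IP address."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors="ignore") as f:
                    for lineno, line in enumerate(f, 1):
                        if ip in line:
                            hits.append((path, lineno, line.strip()))
            except OSError:
                pass  # unreadable file; skip it and keep scanning
    return hits

for path, lineno, line in find_hardcoded_ip(CONFIG_ROOT, OLD_IP):
    print(f"{path}:{lineno}: {line}")
```

The point isn't the script itself, it's having a definitive list of every place the old address lives *before* you bring the site down, rather than discovering stragglers afterwards.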
Thankfully, the whole process went off without any major hitches and the site only had about 40 minutes of downtime. There were a few VERY nervous minutes in there hoping it would all go OK. One interesting thing was that the application servers would queue requests when the DB server became unavailable. When we manually failed over to the secondary node it took up to a minute, by which time the app servers had a LOT of requests lined up ready to go. When the secondary node came to, it was rudely assaulted by two app servers' worth of requests. That really isn't a nice way to wake up, so it got a little grumpy :). We managed to tune things so it handled the rush a little better and everything was happy again.
The other fun thing was that we couldn't get the firewall reconfigured to allow us to use Terminal Services to do any of the work, and it was too much hassle removing the machines to work on them elsewhere. So the above list represents me going insane sitting in an uncomfortable chair in a noisy, cold data center for 3 days straight.
Obviously this isn't a great way to work, and I'm not holding it up as an example to follow. However, for anyone starting to plan or implement something like this, I do have a few suggestions.
- Document EVERYTHING. When you build a cluster you will probably be creating a handful of domain users, assigning some new IP addresses, looking up DNS settings etc. I kept lots of paperwork and notes. At the end of the job the client was really happy because I could hand over a document that contained every setting, configuration and bit of information they needed to maintain the cluster.
- Script as much as you can. This lets you do plenty of testing of things like restoring backups from one machine to the other (changing the paths on 10 filegroups) to make sure you minimise surprises. Then, when it comes to a time crunch like restoring the backup to the new live server, you are not messing around changing file paths in Enterprise Manager.
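To make that concrete: rather than retyping a path for every filegroup in a GUI, you can generate the whole `RESTORE ... WITH MOVE` statement from a mapping of logical file names to their new locations. A minimal sketch; the database name, backup path and logical file names below are invented for the example:

```python
def build_restore_script(database, backup_path, file_moves):
    """Generate a T-SQL RESTORE statement that relocates every logical
    file to its new path, instead of editing ten paths by hand."""
    moves = ",\n".join(
        f"    MOVE '{logical}' TO '{new_path}'"
        for logical, new_path in file_moves.items()
    )
    return (
        f"RESTORE DATABASE [{database}]\n"
        f"FROM DISK = '{backup_path}'\n"
        f"WITH REPLACE,\n{moves}"
    )

# Hypothetical logical file names and target paths on the new cluster node
script = build_restore_script(
    "LiveDB",
    r"E:\backups\LiveDB.bak",
    {
        "LiveDB_Data": r"F:\sql\data\LiveDB_Data.mdf",
        "LiveDB_Log": r"F:\sql\log\LiveDB_Log.ldf",
    },
)
print(script)
```

Because the output is just a script, you can test it over and over against a scratch server, then run the exact same thing during the real cutover when your hands are shaking.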
- When you start to bang your head against the wall, go outside, get some fresh air and a cold drink. Sitting in a datacentre all day can really mess with your head. If you think you must be missing something simple, the chances are good that you are. Take a break and the answer will probably jump out at you when you come back. You also lessen your chance of disaster when you stay calm.
All in all, I'm glad I took on that job. I learnt lots, but wow... talk about stress levels :)