Written by Ryan Doon, Microsoft Premier Field Engineer.
Lately, one question keeps coming my way, and it always starts with “What is the Microsoft Best Practice when it comes to ___________”. The blank can stand for a wide range of topics, but today’s discussion covers the following question:
“What is the Microsoft Best Practice when it comes to patching clusters?”
There is an article on the Microsoft support site that describes the process of patching clusters. Instead of diving into the specific commands for Windows 2003 and 2008, I want to explain the basic principle of the best practice, which applies to both versions. In this article, patching refers to installing a service pack or hotfix.
STEP 1: Backup
Before making any changes to a production server, it is essential to have a backup of that server. If the installed patch causes an issue, you will be able to revert the system to the state it was in before you attempted the patch. Here are some best practices for backing up and restoring server data in case an issue occurs during the patching process.
STEP 2: Check the system event logs
Before making any change to a server, such as installing a patch, check the system event logs to make sure the system is not already showing issues. Ideally, there are no errors or warnings in the system event log before you patch the node. If there are any errors or warnings, they should be resolved before patching the node. For more information about system event logs please refer to the following article.
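The check in this step amounts to a simple gate: any error or warning blocks the patch. Below is a minimal Python sketch of that gate. The `EventRecord` type and its fields are hypothetical stand-ins for whatever your event log export provides. They are not a Windows API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventRecord:
    """Hypothetical, simplified view of one system event log entry."""
    level: str    # e.g. "Information", "Warning", "Error"
    source: str
    message: str


# Levels that, per the step above, should be resolved before patching.
BLOCKING_LEVELS = {"error", "warning"}


def blocking_events(events):
    """Return the entries that must be resolved before the node is patched."""
    return [e for e in events if e.level.lower() in BLOCKING_LEVELS]


def ready_to_patch(events):
    """True only when the log contains no errors or warnings."""
    return not blocking_events(events)
```

The same gate can be reused after the patch and restart in Step 6, fed with the entries logged since the restart.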
STEP 3: Pause the node
Choose the first node in the cluster that you would like to patch and PAUSE that node. When you pause a node, the existing groups and resources stay online, but additional groups and resources cannot be brought online on the node.
STEP 4: Moving the resources
After you have paused the node, move the resources hosted by the node you are patching to a node that is not being patched. Moving the resources ensures that they remain online while the node is being patched. You can move the resources to any node in the cluster, but ideally choose one that is not heavily utilized. For those interested, here is an example of how this process can be scripted.
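As a rough illustration of Steps 3 and 4 together, the Python sketch below picks the least-utilized other node and builds the command lines to pause the node and move its groups. It only builds strings and does not run anything. The command syntax is my assumption based on the cluster.exe tool shipped with Windows Server 2003 and 2008, so verify it against your version before use. The utilization figures and the node and group names are hypothetical inputs.

```python
def pick_target(utilization, node_to_patch):
    """Choose the least-utilized node other than the one being patched.

    `utilization` maps node name -> load figure (any comparable number).
    """
    candidates = {n: u for n, u in utilization.items() if n != node_to_patch}
    if not candidates:
        raise ValueError("no other node is available to host the resources")
    return min(candidates, key=candidates.get)


def drain_commands(node_to_patch, groups, target):
    """Build (not run) the commands: pause the node first, then move each group.

    Assumed cluster.exe syntax; confirm on your Windows Server version.
    """
    commands = [f"cluster node {node_to_patch} /pause"]
    commands += [f'cluster group "{g}" /moveto:{target}' for g in groups]
    return commands
```

For example, with `utilization = {"NODE1": 70, "NODE2": 20, "NODE3": 45}` and `NODE1` being patched, `pick_target` selects `NODE2`, and `drain_commands("NODE1", ["SQL Group"], "NODE2")` yields the pause command followed by one move command.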
STEP 5: Patch the node
Now that the resources have been moved to another node, you can begin the patching process: patch the node, then restart it.
STEP 6: Check the system event logs
After patching and restarting the node, check the system event logs again to make sure no errors or warnings have appeared since patching. If there are any errors or warnings, they should be investigated and resolved before you continue patching the rest of the nodes in the cluster. This is where the backup becomes essential: if the issues seen in the event log cannot be resolved in a timely manner, you can revert the system. If there are no errors or warnings shown in the system event logs, then proceed to Step 7.
STEP 7: Resume the node and proceed to patch the next node in the cluster
You can now resume the node and proceed with patching the next node in the cluster. Ideally, the next node you patch is the node to which you moved the resources of the first node. Follow steps 1-7 again for the next node. While patching the nodes in the cluster, you are doing a balancing act with the resources to ensure they remain online. After the patching has been completed, move the resources back to their preferred node. Here is a great article that utilizes these concepts and provides the technical details behind patching a SQL Server 2008 failover cluster.
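To make the balancing act concrete, here is a small Python sketch that lays out the order of operations for a whole cluster. It is a planning model only: it runs no commands, and the step names and the `preferred_owner` mapping are hypothetical. The backup from Step 1 and the pre-patch log check from Step 2 are left out for brevity. Each node's resources go to the next node in line, so that node is the one patched next, as suggested above.

```python
def rolling_patch_plan(nodes, preferred_owner):
    """Return ordered (action, subject, detail) steps for a rolling patch.

    `nodes` is the patch order; `preferred_owner` maps group -> home node.
    """
    if len(nodes) < 2:
        raise ValueError("a rolling patch needs at least two nodes")
    owner = dict(preferred_owner)   # where each group currently lives
    remaining = list(nodes)
    steps = []
    current = remaining.pop(0)
    while True:
        # Move to the node that will be patched next; for the last node,
        # fall back to any other (already patched) node.
        target = remaining[0] if remaining else next(n for n in nodes if n != current)
        steps.append(("pause", current, None))
        for group in sorted(g for g, n in owner.items() if n == current):
            owner[group] = target
            steps.append(("move", group, target))
        steps.append(("patch_and_restart", current, None))
        steps.append(("check_event_log", current, None))
        steps.append(("resume", current, None))
        if not remaining:
            break
        current = remaining.pop(0)
    # Finally, return every group to its preferred node.
    for group, home in sorted(preferred_owner.items()):
        if owner[group] != home:
            owner[group] = home
            steps.append(("move", group, home))
    return steps
```

In this model a node is only ever patched after all of its groups have been moved away, which is the property the manual procedure is designed to guarantee.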
As for timelines, the nodes in a cluster should be patched one after the other, in as short a time as possible. You do not want to patch the nodes of a cluster days or weeks apart. That can easily introduce problems, because in the meantime the nodes are not truly equal.
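One way to spot nodes that have drifted apart is to compare the set of patches installed on each one. The Python sketch below does this for an inventory you have already collected. How you collect it is up to you, and the node names and patch identifiers used with it are hypothetical.

```python
def missing_patches(installed):
    """Map each lagging node to the patches it lacks.

    `installed` maps node name -> iterable of patch identifiers.
    A patch counts as missing if any other node in the cluster has it.
    Nodes that are fully up to date are omitted from the result.
    """
    all_patches = set().union(*installed.values()) if installed else set()
    report = {}
    for node, patches in installed.items():
        gap = all_patches - set(patches)
        if gap:
            report[node] = sorted(gap)
    return report
```

An empty result means every node carries the same patches. Anything else lists the nodes still waiting for their turn in the rolling patch.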
The process of patching clusters can be done manually, but for larger organizations this is something that should be automated. I am always curious to know how patching clusters has gone in other organizations, so I ask the following question:
What has your experience been like while patching clusters? And are you using an automated process or patching manually?