Microsoft SQL Server Cluster Patching

Article
05/05/2010

Caution

Test the script(s), processes and/or data file(s) thoroughly in a test environment, and customize them to meet the requirements of your organization before attempting to use it in a production capacity. (See the legal notice here)

Note: The workflow sample mentioned in this article can be downloaded from the Opalis project on CodePlex: https://opalis.codeplex.com

Overview

The Microsoft SQL Server Cluster Patching sample is designed to highlight a workflow design approach that could be used to orchestrate the activity between Microsoft System Center Configuration Manager, Operations Manager and the Windows platform to patch MS SQL Servers using a variety of HA options. The workflow itself would no doubt require considerable editing in order for it to work in a given environment. It is not expected it would run as a demonstration since it would require access to Microsoft System Center Configuration Manager, Operations Manager and a Windows platform hosting an application.

Workflow Walk-Through

“1.0 Submit Request”

This is a trivial workflow that consists of a single trigger. It kicks off the workflow that patches an application cluster. The reason this workflow exists is to illustrate how one would separate the initiation of a process (such as “patch a cluster”) from the acceptance of the request to do the work. In this sample a simple trigger launches the Patch Cluster workflow; however this trigger could be used as part of a larger workflow that accepted cluster patching requests from any number of possible sources including integration with a Service Desk or some other request processing system. The key concept here is that the sample has isolated the acceptance of the request from the processing of the request itself. Hence the workflow that actually patches a cluster can be re-used in different context from the intent of the original design. In order to patch multiple clusters (in parallel) one would simple trigger “Patch Cluster” multiple times.

“2.0 Patch Cluster”

This workflow is designed to show how one might apply the basic concepts of Opalis workflow design and the Opalis databus architecture to orchestrate activities between Operations Manager and Configuration Manager to patch MS SQL Servers using a variety of HA options. The process for manually patching such a cluster might involve the following activities:

Identify which MS SQL Servers are to be patched.
Patch each system sequentially, verifying that with each system patched the patch is successfully deployed and the server returns to a normal state as a member of the cluster.
- Patching an individual system will involve certain steps require to remove the node from the cluster, verify health after the patch, etc. These steps are detailed in “3. Patch Node” section below.
Once the last node has been patched, verify that the overall cluster is healthy.
If the health of the cluster does not return, even though all the nodes in the cluster were patched, indicate the failure and enable some sort of remediation to take place.

A sample workflow that might automate such a sequence of events might look like the workflow shown in the sample:

Notice how the “Get Collection Members” activity (from the Configuration Manager Integration Pack) gets a list of collection members associated with the cluster that is to be patched. This activity will return a list of systems. Note that the “For Each Node in Cluster” is a label on a link condition and is not part of the workflow logic itself. The Opalis databus takes each data instance from “Get Collection Members” and sequentially calls “Patch Cluster Node”. Essentially, what is shown is a process that sequentially patches an entire cluster of nodes one node at a time (the desired behavior). Each node will be patched before Opalis will move on to patch the next node. Note that if one were to open up the “Patch Cluster Node” workflow one would see the “Wait” flag checked in the trigger activity, indicating that the workflow is to wait for the child workflow to finish before returning to the parent. Were this option not checked, all the nodes in the cluster would have been patched in parallel.

Notice the “Junction for Patching”. This makes sure that all work from the prior activities has finished before proceeding to the next activity (“Get Cluster Monitor”). This is needed since it collapses the data bus down to a single instance of data. Without the junction, the “Get Cluster Monitor” would be called once for every node in the cluster (not desired).

Another noteworthy aspect of the workflow is the “Get Cluster Monitor” activity. This uses Looping to test/retest the state of the Operations Manager:

Every 30 seconds the Operations Manager monitor for the health of the Cluster is checked to see if the cluster has returned to Normal. Notice the Exit condition for “HealtState” indicates the loop will continue if the health is not “Normal”. The loop looks for a return to “Normal” state 10 times before giving up. Hence this looping property looks for the cluster to be seen as healthy by Operations Manager within 10 tries x 30 seconds/try = 5 minutes.

“2.1 Patch Node”

Recall that the “2.0 Patch Cluster” workflow calls a child workflow “Patch Node”. This workflow knows the process for patching a single node in a cluster. The process is beyond any doubt going to change with each application. In this sample, the activities between Configuration Manager and Operations Manager are orchestrated to patch a single cluster node. The manual process to perform this task would look something like this:

Put the server in “Maintenance Mode” in Operations Manager.
Drain the application pool using PowerShell cmdlets from Microsoft that can be found on TechNet as a resource to admins working with NLB clusters.
Remove the node from the NLB cluster, again using PowerShell cmdlets from Microsoft that can be found on TechNet as a resource to admins working with NLB clusters.
Wait for the patch to complete.
Reboot the system.
Run a test of the application and wait until it comes back from the reboot. Note this is an application-level test. We are not assuming the application comes up when the system is back on-line after the reboot.
If the application comes up, remove it from Operations Manager “Maintenance Mode”.
Add the system back into the NLB cluster, once again using PowerShell cmdlets from Microsoft that can be found on TechNet as a resource to admins working with NLB clusters.

The workflow itself is fairly straightforward, however it no doubt will have to be modified to reflect a given environments unique requirements:

Notice how looping is once again used to wait until a desired state is reached before proceeding with the workflow. This is one of many typical applications for looping.