Keep your Failover Clustering deployment healthy!

Post contributed by Craig Forster, our Premier Field Engineer (PFE) in Denmark. He shares his top tips to keep your clusters healthy. They are based on the collective experience of our PFEs, so treat this as the ‘inside scoop’.


During the transition from Windows Server 2003 to the newer versions, something new and revolutionary was driving a broad range of changes to the Clustering solution in Windows, something that required scalability and a new way of thinking about every component in the cluster, from networking, to storage, to heart beating to quorum – everything. That driver was Hyper-V.

The benefit to you is that every cluster gets to take advantage of the investment in Failover Clustering that Hyper-V drove, meaning that a cluster built on Windows Server 2008 or newer generally has far fewer technical issues than one built on Windows Server 2003.

Many of the technical issues we were concerned about were addressed by the Windows Product Group when they were building the Failover Clustering technology in Windows Server 2008, and later Windows Server 2008 R2. But the operational, non-technical process issues still remained.

CSRAP

The PFE team have years of experience analyzing our customers’ Failover Clustering deployments, looking for both health issues and potential risks. We do this as part of the Risk Assessment Program for Cluster Server, or CSRAP for short.

The CSRAP uses a custom tool which automatically evaluates cluster nodes, and we also manually gather a list of the operational processes used to manage the cluster. The tool incorporates all our ‘best practices’ and advice as scripted logic that reports the status of the nodes or the cluster.

Currently there are 579 different scenarios we check for using this tool. For this post, I’ve taken a snapshot of the top issues and risks we’ve seen at our Premier Support customers and mapped them to tasks you can perform to help you get your cluster as healthy as possible.

Top Issues

We can combine many of the top 40 issues into the same type or source of issue. For example, 11 of the top 40 issues relate to hotfixes and drivers. In fact, 62% of all of the top 40 issues are non-technical issues!

CSRAP Issue breakdown

So here are the best recommendations we would give to anyone running a Windows Server Cluster (in no particular order).

Installing updates sounds obvious, but there’s this weird ideology out there that “these systems are too important to install updates on them”. The very fact that they are important means that you absolutely should be installing updates, particularly security updates. Don’t rely solely on Anti-Malware to protect you. And the updates from the hardware manufacturers are as important as those for Windows and the services running on the cluster, such as SQL Server, Exchange, Hyper-V, DFS, etc.

Microsoft offers a free tool called MBSA (Microsoft Baseline Security Analyzer), which can compare the current hotfix status of a range of computers against the updates available in the Microsoft Update catalog. I’d strongly recommend you start there.

90% of our customers are missing updates

75% are missing critical security updates

And once you identify that updates (software, driver, firmware, service) are needed, you must also have an approved outage window agreed with the business to install them and reboot. Your SLA metric which measures the agreed availability of the services should not include this window of time in its calculation. But being a Cluster, it’s a much faster process to move resources to alternate nodes so nodes can be freed up to apply updates.
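To make the SLA point concrete, here is a minimal sketch of an availability calculation that leaves the approved outage window out of the measured period. The function name, the parameters, and the sample figures are all made up for illustration, and it assumes the unplanned downtime happened outside the approved window.

```python
from datetime import timedelta


def sla_availability(period, unplanned_downtime, approved_maintenance):
    """Return availability as a percentage of the measured period.

    Assumption for this sketch: the approved maintenance window is removed
    from the measured period, and the unplanned downtime fell outside it.
    """
    measured = period - approved_maintenance
    if measured <= timedelta(0):
        raise ValueError("the maintenance window covers the whole period")
    uptime = measured - unplanned_downtime
    # Dividing one timedelta by another yields a plain float ratio.
    return 100.0 * (uptime / measured)


# Hypothetical month: 30 days, a 4 hour approved window, 1 hour of real outage.
month = timedelta(days=30)
window = timedelta(hours=4)
outage = timedelta(hours=1)

with_window_excluded = sla_availability(month, outage, window)
# For comparison: the same month if the window were counted as downtime.
with_window_counted = sla_availability(month, outage + window, timedelta(0))
print(round(with_window_excluded, 2), round(with_window_counted, 2))
```

The gap between the two figures is the reason to agree the window with the business up front: patching should not count against the availability you are measured on.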

Once you start applying updates, don’t stop at the “offline nodes”. Make sure the nodes do not differ: software, driver and firmware versions should match on every node.

52% of all our customers have pending reboots

56% have mismatched driver versions between nodes
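If you already have a version inventory per node, spotting the mismatches is a simple comparison. The sketch below assumes you have collected that inventory yourself into a dictionary; how you gather it is out of scope here, and the node names, component names and version numbers are invented for the example.

```python
from collections import defaultdict


def find_mismatches(inventory):
    """Return the components whose version differs between nodes.

    inventory: {node_name: {component_name: version_string}}
    result:    {component_name: {version_or_None: [node_names]}}
    A version of None means the component is missing on those nodes.
    """
    components = set()
    for versions in inventory.values():
        components.update(versions)

    mismatches = {}
    for component in sorted(components):
        nodes_by_version = defaultdict(list)
        for node in sorted(inventory):
            nodes_by_version[inventory[node].get(component)].append(node)
        if len(nodes_by_version) > 1:
            mismatches[component] = dict(nodes_by_version)
    return mismatches


# Invented sample data, not taken from a real cluster.
sample_inventory = {
    "NODE1": {"hba_driver": "9.1.9.25", "nic_driver": "12.4.0", "hba_firmware": "5.03"},
    "NODE2": {"hba_driver": "9.1.8.17", "nic_driver": "12.4.0", "hba_firmware": "5.03"},
}

for component, versions in find_mismatches(sample_inventory).items():
    print(component, versions)
```

Grouping nodes by version, rather than just flagging "different", tells you which node is the odd one out when there are more than two nodes.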

Leave the firewall enabled

When you add the Failover Cluster feature, the ports needed for proper communication between the nodes and for remote management are automatically opened. Also, as the roles to run as Clustered resources are added (e.g. File Services, DFS, Print Spooler, MSMQ), the ports needed by those roles are automatically opened. So there’s no need to switch off the firewall.

It was also a common practice in Windows Server 2003 to disable the Client for Microsoft Networks on the heartbeat network. However, this is no longer recommended, as SMB is needed on all networks. So using the Windows Firewall to block SMB traffic on this network is also not a good idea.

Preferred Owners may not do what you expect

A resource group will list all of the nodes in the Cluster as owners so it knows which node it will move to next. We can set a property of the resource group so that this list is sorted with the Preferred Owners always at the top. But the nodes which are not in the list of Preferred Owners are still Possible Owners if the Preferred Owners aren’t available.

Use the Possible Owners setting if you want to prevent the Resource Group from running on specific nodes. See this article for further information.
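The difference is easier to see in a toy model. The sketch below is not how the Cluster service is implemented; it only mirrors the behaviour described above, where Preferred Owners change the order of the list and Possible Owners restrict it. The function and node names are hypothetical.

```python
def failover_candidates(all_nodes, preferred_owners, possible_owners, available):
    """Return, in order, the nodes a resource group could move to.

    Toy model of the behaviour described in this post:
    - Preferred Owners are only sorted to the top of the list.
    - Possible Owners decide which nodes are allowed at all.
    """
    preferred = [n for n in preferred_owners if n in all_nodes]
    others = [n for n in all_nodes if n not in preferred]
    ordered = preferred + others
    return [n for n in ordered if n in possible_owners and n in available]


nodes = ["N1", "N2", "N3"]

# Preferred Owners only: N1 is down, so the group still lands on N2 or N3.
only_preferred = failover_candidates(
    nodes, preferred_owners=["N1"], possible_owners=set(nodes), available={"N2", "N3"}
)

# Possible Owners limited to N1 and N2: N3 is never a candidate.
with_possible = failover_candidates(
    nodes, preferred_owners=["N1"], possible_owners={"N1", "N2"}, available={"N2", "N3"}
)

print(only_preferred, with_possible)
```

In the first call the group happily moves to a non-preferred node, which is exactly the surprise many customers run into. Only the second call keeps it off N3.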

33% of all our customers have configured Preferred Owners when they wanted to use Possible Owners instead.

Windows Server 2012 also adds Anti-Affinity and Affinity settings for VMs, to keep specific VMs either off or on the same node.

Get a Disaster Recovery Plan and test it

A backup isn’t a backup until you have successfully run a restore. So, with that in mind, how would you restore your cluster configuration if there was an unauthorized change or a corruption? There are different scenarios you should be planning for:

  • Authoritative restore of a Sysvol backup to the existing cluster
  • Complete loss of the cluster requiring a restore to new hardware or VMs
  • A cluster with geographically dispersed nodes where the shared storage is replicated and access to one location is lost

 

55% of our customers don’t have a Disaster Recovery plan for their Clusters

Customers with a Premier Support agreement can take advantage of a new service run by Premier Field Engineers called Cluster Service Recovery Execution Service, or CSRES. We’ll walk you through your restore of your Cluster on isolated equipment, using your own backups to make sure you have a tested Disaster Recovery Plan.

Know when your servers are running slowly

Clusters running Hyper-V or other high-performance workloads like SQL Server, Exchange and File Servers are most frequently bottlenecked by their disks. A disk bottleneck will cause slow responsiveness in the applications faster than any other type of bottleneck.

Disks can slow down for a number of reasons:

  • Competition for IOs on a LUN which is sharing spindles with another server
  • Wrong RAID type for the IO profile (mostly reads prefers RAID 5, mostly writes prefers RAID 1)
  • Caching settings disabled or not tuned to match IO profile
  • Flooded ports, switches
  • Incompatible combinations between the driver versions, firmware versions and switch software

But what metrics can you quickly check to tell if you have slow disks?

  • The time it takes to do reads or writes is greater than 15ms (PhysicalDisk/LogicalDisk counters Avg. Disk sec/Read and Avg. Disk sec/Write)
  • The outbound disk queue on the controller to the storage is constantly over 2 (PhysicalDisk/LogicalDisk counters Avg. Disk Read Queue Length and Avg. Disk Write Queue Length), though this metric will differ for SANs
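As a rough illustration, the two checks above can be applied to a set of counter samples you have already exported from Performance Monitor. This is a sketch under assumptions: the function name and the sample values are invented, latency samples are in seconds (which is how the Avg. Disk sec counters report), and "constantly" is read as "every sample in the set".

```python
from statistics import mean

LATENCY_LIMIT_SECONDS = 0.015  # 15 ms, the threshold from the list above
QUEUE_LIMIT = 2                # will differ for SANs, as noted above


def disk_health(latency_samples, queue_samples):
    """Flag slow latency and constantly long queues in counter samples.

    latency_samples: Avg. Disk sec/Read or sec/Write values, in seconds.
    queue_samples:   Avg. Disk Read/Write Queue Length values.
    """
    if not latency_samples or not queue_samples:
        raise ValueError("need at least one sample of each counter")
    return {
        "slow_latency": mean(latency_samples) > LATENCY_LIMIT_SECONDS,
        # "Constantly over 2" is interpreted here as every sample over the limit.
        "long_queue": all(q > QUEUE_LIMIT for q in queue_samples),
    }


# Invented samples for two disks.
busy_disk = disk_health([0.012, 0.022, 0.031, 0.018], [3, 4, 2.5, 5])
quiet_disk = disk_health([0.004, 0.006, 0.005], [0.5, 3, 1])
print(busy_disk, quiet_disk)
```

A single spike in the queue does not trip the second flag, which matches the intent of "constantly": short bursts are normal, sustained queues are not.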

 

45% of all our customers who are running Failover Clustering have disks which are constantly slow or have long queues

While the Performance Monitor counters mentioned above do a great job of giving you a quick look into the current health of the disks, they don’t show you the cause of the problem. To do that, Premier Field Engineers run a hands-on workshop in advanced performance monitoring for all server roles called Vital Signs. You’ll learn about tools like Performance Monitor, Resource Monitor and Process Explorer for diagnosing problems like slow disks.

One of my colleagues, Yong Rhee, has written a good checklist on troubleshooting slow disks on Windows servers here.

A cluster is for life, not just when the project goes live(1)

80% of an IT department’s budget is spent on maintaining existing IT investments and 20% on implementing new technologies. The care you take once a new service has been released is crucial to the survival of that service. All too often we forget about it and move on to the next new thing.

Our colleagues in the world of mainframe support wouldn’t dream of operating their highly available system without air-tight procedures to keep it running well, including a solid test and release plan. Contrast that with typical Windows clusters:

42% of our customers aren’t alerted when changes are made to their clusters

41% have no performance baseline of their clusters

37% have insufficient or no monitoring of their clusters

If you can hit those 3 issues using SCOM, Cluster Auditing, and Performance Monitor (or performance alerting in SCOM), you’ll be in much better shape.


(1) = “A dog is for life, not just for Christmas”

Posted by MSPFE Editor, Arvind Shyamsundar