Symptom-based cluster disk failure troubleshooting

One of the most common issues with Windows Failover Clustering involves the storage attached to the cluster. This post is meant to give a high-level overview of disk troubleshooting methodology and how to narrow down the source of a problem before any logs or diagnostic data are gathered. It deals specifically with troubleshooting issues where a 'Physical Disk' resource is in a 'Failed' state. Issues with adding new disks to an existing cluster or migrating disks between clusters will be covered in future posts.

** Note: If you are using 3rd party disk resources such as 'Veritas Volume Manager Disk Group (vxres.dll)' or 'IBM ServeRAID Logical Disk (ipsha.dll)', engage that vendor directly; this article pertains only to cluster disk resources of type 'Physical Disk (clusres.dll)'.

The first step in troubleshooting disk failures is determining the extent of the problem. A few short tests can be performed on the cluster to narrow down where to begin more in-depth troubleshooting, and those tests are what this post covers. In the following paragraphs, I'll describe the possible symptom scenarios and give each one a potential root cause.

When troubleshooting disk resource failures, it's a good idea to set the disk resource in question to 'do not restart'. This keeps the group that contains the disk from failing back and forth between nodes. Once a resolution has been found, don't forget to set the disk resource back to the default 'restart/affect the group' settings.
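
If you prefer the command line to the GUI, this restart policy is exposed as the RestartAction common property of the resource. A quick sketch using cluster.exe, assuming the disk resource is named 'Disk Q:' (substitute your own resource name):

    REM Keep the resource from restarting or affecting the group while troubleshooting
    cluster res "Disk Q:" /prop RestartAction=0

    REM When finished, return to the default 'restart and affect the group' behavior
    cluster res "Disk Q:" /prop RestartAction=2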

Here are the four possible combinations of symptoms that can help you narrow down where the potential disk problem lies. By failing over the group containing the failed disk resource, you'll determine whether the issue is specific to that disk or to something broader in the disk subsystem. By also failing over groups that contain disks that are working properly, you'll get a good idea of what is working and can eliminate those areas from consideration.
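
These failovers can be done from the cluster management GUI or from the command line with cluster.exe. A quick sketch, assuming node names Node1 and Node2 and group names 'Group1' and 'Group2' (all placeholders; substitute your own):

    REM Move the group containing the failed disk resource to the other node
    cluster group "Group1" /moveto:Node2

    REM Move a group containing a known-good disk onto the problem node for comparison
    cluster group "Group2" /moveto:Node1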

  • Disk resource fails on one node but works on the other node(s).
    • If the failure is limited to just one disk.
      • The most common problem in this scenario is a zoning or LUN masking issue. In SAN environments, it's the SAN itself that is configured to control which server sees which disk. In a standalone environment, this is normally a 1:1 relationship; in a clustered environment, a SAN disk must be presented to two or more nodes. If this configuration is not done correctly, you can end up with a node that has no 'logical' connectivity to the disk, and that's why the disk resource fails there.
    • If no disk resources come online on the problem node.
      • The strongest culprit here is either the HBA itself or a configuration problem with the multipathing software. If the HBA is failing or has failed, that would explain why all disks work on the other node(s) but not on this one. We know it's not a signature issue, because if it were, the disks wouldn't work on any node. A good troubleshooting step is to shut down the OS on all but the problem node, disable the cluster service and cluster disk driver on the problem node, and reboot. (**For details on disabling the cluster disk driver, see the end of this blog.) This process effectively removes the cluster components from the equation. If you still can't see or access the disks at this point, it's time to engage your storage vendor.
  • Disk resource fails to come online on either node.
    • If the failure is limited to just one disk.
      • We could be looking at a couple of scenarios here. The most common is that the signature the cluster is looking for on the disk has changed (a quick way to check the expected signature is shown in the example after this list). Another problem I've seen is that another device holds an exclusive lock on the disk, or a SCSI reservation is held on the disk that the cluster cannot clear as part of its normal arbitration process.
    • If all disks fail to come online on either node.
      • If you are in this scenario, you are probably looking at a catastrophic Fibre Channel switch failure, or the SAN itself has been powered off. There aren't many issues at the OS level that would manifest symptoms like this.
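
To check the disk signature that the cluster expects for a particular disk resource (the single-disk scenario above), you can dump the resource's private properties with cluster.exe. Again assuming a resource named 'Disk Q:':

    REM The Signature value listed here is what the cluster expects to find on the disk
    cluster res "Disk Q:" /priv

Compare that value against the signature the disk actually reports (your disk management or storage vendor tools can show this) to confirm whether it has changed.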

** Disabling the cluster disk driver.

In Device Manager, open the View menu and select 'Show hidden devices'.

On the right, you should now see an entry for the cluster disk driver.

Right-click that driver and select Properties. On the Driver tab, set the startup type to 'Demand'.

Now set the cluster service to 'Disabled' and reboot. Once the server comes back up, it will be operating without any cluster components in place. Never do this on more than one node in a cluster at a time. To reverse the process, set the cluster disk driver back to 'System', start the driver, set the cluster service back to 'Automatic', and start the service.
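
The same changes can be made from the command line with sc.exe, assuming the default service names ClusDisk (cluster disk driver) and ClusSvc (cluster service):

    REM On the problem node only: set the driver to demand-start, disable the cluster service, then reboot
    sc config clusdisk start= demand
    sc config clussvc start= disabled

    REM To reverse the change after troubleshooting
    sc config clusdisk start= system
    sc start clusdisk
    sc config clussvc start= auto
    net start clussvc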

Author: Jeff Hughes
Microsoft Enterprise Platforms Support
Support Escalation Engineer