Windows Failover Cluster Troubleshooter Data Grab

This blog post is brought to you by eighteen-year veteran Microsoft Premier Field Engineer David Morgan.

Goal of this Post

Over the years my customers have asked what they should do first when they get a trouble ticket for a misbehaving Windows failover cluster. There are some fairly simple first steps that can provide a host of benefits during the troubleshooting process, such as:

  • Faster problem resolution
  • A successful and faster root cause analysis
  • Faster service response times from vendor support personnel
  • Data about the event and the surrounding environment, useful in post-mortems that can help prevent the same and similar problems in the future
  • And more

This particular post isn’t about doing actual troubleshooting. Here I’m only going to go into the primary steps one should take before undertaking in-depth troubleshooting activities. Actual troubleshooting scenarios and details will follow in future posts where you’ll see why having captured these resources in the beginning can make your IT life a bit better.

Summary

  1. Immediately Capture all Cluster Logs
  2. Write a Very Detailed Description of the Problem
  3. Capture Microsoft Cluster Diagnostics Outputs
  4. Create a Cluster Validation Report

Detail

  • The most important task – immediately gather the cluster logs from all nodes.

If this is not done within roughly 72 hours (the exact window varies), the data logged about your problem event will be overwritten when the log wraps. In almost all cases, if the cluster log is not available for the time of the event, a reliable root cause cannot be determined.

  • To capture a cluster log from each machine in the cluster and place all the files in a specific location execute either of the following commands:
    • PowerShell (recommended for 2012 & 2012 R2)
      • Get-ClusterLog -Destination "target-folder"
    • Cluster.exe (recommended for 2008 & 2008 R2)
      • Cluster.exe log /gen /copy:"target-folder"
        • Note: If you are using 2012 or 2012 R2, cluster.exe is an optional feature and must be added through the Add Roles and Features wizard. Cluster.exe is planned to be deprecated in future releases.
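
For convenience, the capture step above can be scripted. This is a minimal sketch, assuming 2012/2012 R2 (so the FailoverClusters PowerShell module is available) and that it runs on a cluster node with administrative rights; the destination path is only an example:

```powershell
# Sketch: capture cluster logs from every node into a timestamped folder.
Import-Module FailoverClusters

$dest = "C:\ClusterLogs\{0:yyyyMMdd-HHmm}" -f (Get-Date)   # example path
New-Item -ItemType Directory -Path $dest -Force | Out-Null

# -UseLocalTime writes timestamps in local time instead of UTC, which
# makes correlating entries with other event logs easier.
Get-ClusterLog -Destination $dest -UseLocalTime
```

Capturing into a dated folder keeps logs from successive incidents from overwriting each other, which matters if you end up comparing several occurrences of the same problem.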
  • At this time, consider raising the cluster log level to gain more insight into the issue if it recurs:
    • Considerations:
      • Increasing the log level may affect overall system performance.
      • Increasing the log level will cause the log to wrap more frequently.
      • If the problem is one you can reproduce then:
        • Recommended for 2008 & 2008 R2
          • Determine the current cluster logging level
            • Cluster /prop:ClusterLogLevel
          • Increase the log level to 5
            • Cluster log /loglevel:5
          • Reproduce the issue
          • Capture the cluster logs
          • Reset the cluster log level to its default of 3
            • Cluster log /loglevel:3
        • Recommended for 2012 & 2012 R2
          • Determine the current cluster logging level
            • Get-Cluster | fl ClusterLogLevel
          • Increase the log level to 5
            • Set-ClusterLog -Level 5
          • Reproduce the issue
          • Capture the cluster logs
          • Reset the cluster log level to its default of 3
            • Set-ClusterLog -Level 3
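
The 2012/2012 R2 reproduce-and-capture cycle above can be sketched in a few lines of PowerShell; the destination path is an example:

```powershell
# Sketch of the reproduce-and-capture cycle described above.
$before = (Get-Cluster).ClusterLogLevel        # record the current level
Set-ClusterLog -Level 5                        # maximum verbosity

# ... reproduce the issue here ...

Get-ClusterLog -Destination "C:\ClusterLogs"   # capture from all nodes (example path)
Set-ClusterLog -Level 3                        # reset to the default
```

Recording the level before changing it is cheap insurance: if your cluster was not at the default of 3 to begin with, you can restore the original value instead.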
  • As soon as possible, collect the following diagnostic results from the cluster.
    • You will need to log in to the Microsoft support diagnostics site using a Microsoft account such as live.com, outlook.com, hotmail.com, etc.
    • Once you are logged in, enter “Failover Cluster” in the search field.
    • Your search results should provide a link to the Windows FailoverCluster Diagnostic.
    • Click the Windows FailoverCluster Diagnostic link and choose Create.
    • Next, choose Download and save the file to a location of your choice, or choose Run.
    • After executing the download choose:
      • Run now on this PC
        if the desktop you are on is one of the cluster nodes.
      • Save to run later on another PC
        if the desktop you are on is not a cluster node.
    • After executing the diagnostics package, you will be taken to a screen allowing you to select which nodes you wish to collect information from.
      • It is best to have diagnostics for all the nodes in the cluster. However, there may be reasons for you to choose only a subset and run the diagnostics tool more than once with different nodes in the collection.
        • The primary reason for this is that the tool will compress no more than 2 GB of collected data, and with very large clusters it is easy to reach or surpass this threshold. If you run the tool against a large number of nodes, the collection is finished when the screen titled
          “Review the diagnostic results before you send the item”
          appears. Before choosing Next and compressing the data, check the temporary location of the captured files and determine their total size. If it is greater than 2 GB, copy all the files to another location first, because when the tool fails due to the size limitation the files in the temporary location are deleted.

          The temporary file location is:

          • %WINDIR%\TEMP\SDIAG_{GUID} (where GUID represents a diagnostic execution)
      • Next choose a location to save the diagnostics output
      • A folder named Upload Results will be created that contains a compressed file with a .cab extension. Save the file Results….cab and delete the remaining files in the folder.
      • If you run into other issues, this FAQ is extensive:
        • KB 2598970: Information about the Microsoft Automated Troubleshooting Services and Support Diagnostic Platform
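
The size check described above can be sketched in PowerShell (assuming PowerShell 3.0 or later; the backup path is an example):

```powershell
# Sketch: check whether the diagnostic capture exceeds the ~2 GB
# compression limit before choosing Next. SDIAG_{GUID} folder naming
# is per the temporary-file location noted above.
$folders = Get-ChildItem "$env:WINDIR\TEMP" -Directory -Filter "SDIAG_*"
$bytes = ($folders | Get-ChildItem -Recurse -File |
          Measure-Object Length -Sum).Sum
if (-not $bytes) { $bytes = 0 }

"{0:N2} GB collected" -f ($bytes / 1GB)
if ($bytes -gt 2GB) {
    # Copy the capture out before compressing, since a failed compression
    # deletes the temp files.
    Copy-Item $folders.FullName "C:\DiagBackup" -Recurse   # example path
}
```

Run this on the machine where the diagnostics package executed, while the “Review the diagnostic results” screen is still showing and the temp files still exist.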
  • Capture a Failover Cluster Validation Report
    • From within Failover Cluster Manager run the Cluster Validation Wizard and collect the results of all tests.
      • If there is no storage in your cluster that can be taken offline, the storage tests will not run. If your issue is likely a storage problem, then the storage test data is important. The simplest way around this is to present a single, free volume from the SAN to the cluster. The storage validation tests can then be run against this single disk and will return all storage test information, with the exception of specifics about the disks that are online when the storage tests run.
      • When completed, the final report (in .mht format) will be found in the following directory:
        • %windir%\cluster\reports
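
If you prefer PowerShell over the wizard, the validation report can also be generated with the Test-Cluster cmdlet (2008 R2 and later). A sketch of the single-spare-disk approach described above, where "Cluster Disk 9" is a hypothetical resource name, is:

```powershell
# Sketch: run only the storage validation tests against one spare disk.
# "Cluster Disk 9" is a hypothetical name - substitute your own spare
# cluster disk that can safely be taken offline.
$spare = Get-ClusterResource "Cluster Disk 9"
Test-Cluster -Disk $spare -Include "Storage"
# The .mht report lands in %windir%\cluster\reports, as noted above.
```

Restricting the run with -Include "Storage" avoids re-running the network and system-configuration tests when storage is the only area in question.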

Finally, store all these files in case you later work with a Microsoft support engineer. You’ll be amazed at how much faster your support call can go if you already have this data collected and ready to upload to your support vendor.