The most common question people ask before calling Microsoft Customer Service and Support (CSS) for help with a Root Cause Analysis (RCA) is:
Why did the resources fail over to the other node?
Sometimes a Root Cause Analysis for a Failover Cluster can be very time consuming, especially if it’s an eight-node Windows 2003 Failover Cluster. Even though the references listed below may state that they are for the Windows 2000 Advanced Server Cluster Service (MSCS), the same references can be used when analyzing a Windows 2003 Failover Cluster.
Here’s how we begin with the Root Cause Analysis:
- A Microsoft CSS Support Professional may ask the customer to gather a Cluster MPS Report, SQL MPS Report and/or Exchange MPS Report, depending on the role the cluster serves (SQL, Exchange, File/Print, etc.). This data gathering tool is slowly being replaced by the Microsoft Support Diagnostic Tool (MSDT), which started shipping in Windows Vista and is now widely used across operating systems.
- Once the data is captured for ALL nodes (yes, we would like to have it for all nodes, for several reasons to follow), we first review the System Event Logs to determine when the last failure occurred, which is usually marked by a very generic Event ID 1069. Then we work our way backward in time and compare it against any other events that may have been logged in the Application Event Logs.
- Once the Microsoft CSS Support Professional and the customer agree on the timeline of the failure, the Cluster Logs are reviewed for further analysis in determining the root cause.
NOTE: When reviewing Cluster Logs, keep in mind that the date/time logged is in Greenwich Mean Time (GMT). Also, the Cluster Log has a default maximum size of 8 megabytes (MB). This size can be increased to store a longer history by changing an environment variable called ClusterLogSize, as described in article ID 168801.
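Because the Cluster Log is stamped in GMT while most people read the event logs in local time, it helps to convert timestamps before lining up the two timelines. The following is a minimal Python sketch of that conversion, not a supported tool. It assumes the timestamp is laid out as `YYYY/MM/DD-HH:MM:SS.mmm` (the excerpts in this post elide it as `<date and time>`), and the function name and sample timestamp are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed timestamp layout; the excerpts in this post elide it as "<date and time>".
LOG_TIME_FORMAT = "%Y/%m/%d-%H:%M:%S.%f"


def cluster_time_to_offset(stamp, utc_offset_hours):
    """Parse a GMT cluster log timestamp and shift it to a fixed UTC offset."""
    gmt = datetime.strptime(stamp, LOG_TIME_FORMAT).replace(tzinfo=timezone.utc)
    return gmt.astimezone(timezone(timedelta(hours=utc_offset_hours)))


if __name__ == "__main__":
    # Hypothetical timestamp, shifted to UTC-6 for comparison with local event logs.
    print(cluster_time_to_offset("2009/03/14-18:22:41.585", -6).isoformat())
```

A fixed offset keeps the sketch simple but ignores daylight saving time, so pick the offset that was in effect on the day of the failure.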
- The common Cluster Log snippets listed below can usually be found by searching for the keywords ERR and/or WARN. Searching for the word status can also give you some indication of what’s happening, for example:
- status 170 – Means “The requested resource is in use.” This could be related to Persistent Reservation problems, but it can also point to MPIO, fibre/HBA drivers, and/or some type of lower-level file system driver or software such as anti-virus, quota management, or an open file agent for backup software:
00000c94.000008d4::<date and time>.585 INFO Physical Disk <Disk Q:>: [DiskArb] Issuing Reserve on signature 33af636f.
00000c94.000008d4::<date and time>.616 ERR Physical Disk <Disk Q:>: [DiskArb] Reserve completed, status 170.
00000c94.000008d4::<date and time>.616 INFO Physical Disk <Disk Q:>: [DiskArb] Arbitrate returned status 170.
- status 1117 – Means ERROR_IO_DEVICE (“The request could not be performed because of an I/O device error”), logged when Event ID 1123 occurs:
000015a0.000014a8::<date and time>.511 WARN IP Address <IP Address resource name>: IP Interface 4 (address 10.101.160.65) failed LooksAlive check, status 1117, address 0x10119e0, instance 0xf74d6fb8.
000015a0.000014a8::<date and time>.511 WARN IP Address <IP Address resource name>: IP Interface 4 (address 10.101.160.65) failed IsAlive check, status 1117, address 0x10119e0, instance 0xf74d6fb8.
- status 5 – Is usually a permissions-related problem. In this case, the Cluster Service Account (CSA) username/password was not synchronized between the nodes. This can also happen if the cluster loses its Secure Channel connection to the DC that the CSA needs in order to be authenticated. Another situation in which this can occur is when one of the domain Group Policy Objects (GPO) or one of the Local Policy Objects is missing a User Rights Assignment needed for the CSA to function properly.
000014a0.00001460::::<date and time>.629 WARN [JOIN] JoinVersion data for sponsor <Cluster Name> is invalid, status 5.
000014a0.000017d0::::<date and time>.629 WARN [JOIN] Unable to get join version data from sponsor 10.7.47.100 using NTLM package, status 5.
000014a0.000017d0::::<date and time>.629 WARN [JOIN] JoinVersion data for sponsor 10.7.47.100 is invalid, status 5.
000014a0.00000438::::<date and time>.629 WARN [JOIN] Unable to get join version data from sponsor 22.214.171.124 using NTLM package, status 5.
000014a0.00000438::::<date and time>.629 WARN [JOIN] JoinVersion data for sponsor 126.96.36.199 is invalid, status 5.
- Once the Cluster Logs have been reviewed, we review the other logs collected to look for outdated drivers and/or driver discrepancies (network interface cards (NICs), fibre/HBA, multi-path (MPIO), NIC Teaming software) or hardware discrepancies, such as a server or component not being listed in the Windows Server Catalog as one of the many valid Cluster Solutions.
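The keyword search described in the steps above (ERR, WARN, and the word status) can be sketched in a few lines of Python. This is only an illustration of the approach, not a Microsoft tool: the function name `scan_cluster_log` and the hint table are hypothetical, and the hints simply restate the three status codes discussed in this post.

```python
import re

# Hints restating the status codes discussed in this post (not an exhaustive list).
STATUS_HINTS = {
    5: "usually permissions related (CSA credentials, secure channel, user rights)",
    170: "requested resource is in use (reservations, MPIO, HBA or filter drivers)",
    1117: "ERROR_IO_DEVICE, an I/O device error",
}

LEVEL_RE = re.compile(r"\b(ERR|WARN)\b")
STATUS_RE = re.compile(r"\bstatus (\d+)")


def scan_cluster_log(lines):
    """Return (line number, level, status code, hint) for each line of interest.

    A line is of interest if it is logged at ERR or WARN level or mentions
    a status code. Level is None for other levels; code is None if absent.
    """
    findings = []
    for number, line in enumerate(lines, start=1):
        level = LEVEL_RE.search(line)
        status = STATUS_RE.search(line)
        if not level and not status:
            continue
        code = int(status.group(1)) if status else None
        findings.append((
            number,
            level.group(1) if level else None,
            code,
            STATUS_HINTS.get(code, ""),
        ))
    return findings


if __name__ == "__main__":
    # Hypothetical path; point this at a copy of the log gathered from each node.
    with open("cluster.log", encoding="utf-8", errors="replace") as handle:
        for finding in scan_cluster_log(handle):
            print(finding)
```

Running the scan against the log from every node, rather than only the node that failed, makes it easier to see which node reported the problem first.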
Since there’s no single point of reference for determining why the cluster resources failed over, the following are some of the references used to get started:
Techniques for Tracking the Source of a Problem
Anatomy of a Cluster Log Entry
Interpreting the Cluster log
892422 Overview of event ID 1123 and event ID 1122 logging in Windows 2000-based and Windows Server 2003-based server clusters
914458 Behavior of the LooksAlive and IsAlive functions for the resources that are included in the Windows Server Clustering component of Windows Server 2003
242450 How to query the Microsoft Knowledge Base by using keywords and query words
926079 Frequently asked questions about the Microsoft Support Diagnostic Tool (MSDT)
Thanks and remember, doing RCA is very tedious. This is just a guide to get you pointed down the right path.
Author: Mike Rosado
Microsoft – Windows Server – Enterprise Platforms Support – Core team (Setup, Cluster and Performance)