Hello, my name is Chuck Timon and this is my first blog post as a Premier Field Engineer. Before taking my current position, I posted to the Core Team blog and the System Center: Virtual Machine Manager Engineering Blog.
In this post, I examine a customer issue where two cluster network name resources in a 2-Node SQL 2012 Failover Cluster Instance (FCI) running in a Windows Server 2012 R2 cluster failed to come Online on one of the nodes in the cluster. The Network Name resources were associated with an MSDTC Resource Group and the SQL Server Resource Group.
When troubleshooting Failover Clusters and you can reproduce an issue on demand, it is best to examine the System log (the Failover Cluster provider registers events there) on the node where the failure occurs, and also to gather the cluster log itself, where more detailed logging is available. To gather the cluster log, use the Get-ClusterLog PowerShell cmdlet. For example, you can collect a cluster log on the node where the problem occurred, place it in the C:\Temp directory, limit it to the last 2 minutes (which contained the error), and generate it using the local time on the host.
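The collection described above can be sketched as follows. The node name is a placeholder, not a value from the customer environment, and the command assumes the FailoverClusters PowerShell module is available on the machine where it runs.

```powershell
# Hypothetical node name; replace with the node where the failure occurred.
$node = 'NODE1'

# -Destination : folder that receives the generated log file
# -TimeSpan    : how far back to collect, in minutes
# -UseLocalTime: write timestamps in the host's local time instead of UTC
Get-ClusterLog -Node $node -Destination 'C:\Temp' -TimeSpan 2 -UseLocalTime
```

Keeping the time span short makes the resulting log much easier to read, which is why reproducing the failure immediately before collecting the log is helpful.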
In the customer environment, the System log contained the usual Event ID 1069 errors, which are logged when any cluster resource fails. However, these events lack the detail needed to troubleshoot most cluster issues effectively, so examining the cluster log was warranted.
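One way to pull those events without opening Event Viewer is sketched below. It assumes the events are written by the Microsoft-Windows-FailoverClustering provider, and the limit of 20 events is an arbitrary choice for illustration.

```powershell
# List the most recent Event ID 1069 entries from the System log.
# Note: Get-WinEvent reports an error if no matching events exist.
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Id           = 1069
} -MaxEvents 20 |
    Select-Object TimeCreated, Id, Message |
    Format-List
```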
Examination of the cluster log from the node experiencing the issue revealed additional information, including a specific error code. The relevant cluster log entries showed error code 2114.
Decoding the 2114 error indicated that one potential cause was the Server Service was not starting.
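A quick way to decode such a code from the command line is sketched here. Error 2114 falls in the network management (NERR) range, where it is defined as NERR_ServerNotStarted (base 2100 plus 14), and the `net helpmsg` command can translate codes in that range into their message text.

```powershell
# Translate the numeric error code from the cluster log into its message text.
net.exe helpmsg 2114
```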
Initially, this seemed pretty odd because the Server service is very reliable, but it was worth a look. Opening the Services snap-in showed that the Server service had not started even though it was set to start automatically (the default setting). We tried starting it, and it failed with an error stating that a dependent service was marked for deletion.
Inspecting the Dependencies tab in the Properties of the Server service showed that information was missing in the customer environment.
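The same dependency information can also be read from the command line, which makes it easier to compare nodes side by side. This is a minimal sketch; it assumes the Server service uses its standard service name, LanmanServer, and reads the DependOnService value from that service's registry key.

```powershell
# Show the service configuration, including its DEPENDENCIES section.
# Use sc.exe explicitly: in PowerShell, "sc" is an alias for Set-Content.
sc.exe qc lanmanserver

# Read the raw dependency list straight from the service's registry key.
$key = 'HKLM:\SYSTEM\CurrentControlSet\Services\LanmanServer'
(Get-ItemProperty -Path $key -Name DependOnService).DependOnService
```

Running both commands on the healthy node and on the problem node gives two outputs that can be compared directly.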
Next, I inspected the following registry entries on both nodes in the cluster for differences and found none.
I was specifically looking for a registry value indicating that one or more dependent services were 'marked for deletion', as the error message stated. Not finding that value, I used the SC command-line utility to query the appropriate services and determined that the Lanmanserver service was stopped and could not be started on the cluster node. On a healthy node, the SC query reports a 'State' of RUNNING.
The customer's output reflected a 'State' of STOPPED.
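The query itself can be sketched as follows. The STATE line in the output is the one to check.

```powershell
# Query the current state of the Server service (service name: lanmanserver).
# Use sc.exe explicitly: in PowerShell, "sc" is an alias for Set-Content.
sc.exe query lanmanserver
```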
At that point, the decision was made to export the HKLM\System\CurrentControlSet\Services\Lanmanserver registry key (on the problem server) to the desktop, remove the DependOnService entry in the key (so something would be different when re-registering), and reboot the server. Following the reboot, the exported registry key was used to re-register the service, and the Server service was then able to start. The customer moved all cluster resources to the problematic node, and all the Network Name resources that had previously failed to come Online did so successfully, making the SQL services highly available once again in the 2-Node cluster.
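For readers who prefer the command line, the steps above might look roughly like this. This is a sketch rather than the exact procedure used with the customer: the backup file name is hypothetical, the commands must run from an elevated session, and the two halves run before and after the reboot respectively.

```powershell
# --- Before the reboot (on the problem node) ---
# Hypothetical backup location on the current user's desktop.
$backup = Join-Path $env:USERPROFILE 'Desktop\LanmanServer.reg'

# 1. Export the service's registry key so it can be restored later.
reg.exe export 'HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer' $backup /y

# 2. Remove the DependOnService value so the key differs when re-imported.
Remove-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\LanmanServer' -Name 'DependOnService'

# 3. Reboot the server.
Restart-Computer

# --- After the reboot ---
# 4. Re-import the exported key, then start the Server service.
$backup = Join-Path $env:USERPROFILE 'Desktop\LanmanServer.reg'
reg.exe import $backup
Start-Service -Name LanmanServer
```

Keeping the exported .reg file until the service is confirmed running provides a way back to the original configuration.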
Thanks for reading our blog, and I hope you found this helpful.