Hello, my name is Chuck Timon and this is my first blog post as a Premier Field Engineer. Before taking my current position, I posted to the Core Team blog and the System Center: Virtual Machine Manager Engineering Blog.
In this post, I examine a customer issue where two cluster Network Name resources in a 2-Node SQL 2012 Failover Cluster Instance (FCI), running in a Windows Server 2012 R2 cluster, failed to come Online on one of the nodes in the cluster. The Network Name resources were associated with an MSDTC Resource Group and the SQL Server Resource Group.
When troubleshooting Failover Clusters where you can reproduce an issue on demand, it is best to examine the System log on the node where the failure occurs (the Failover Cluster provider registers its events there) and also to gather the cluster log itself, where more detailed logging is available. To gather the cluster log, use the Get-ClusterLog PowerShell cmdlet. In this case, the log was collected on the cluster node where the problem occurred, placed in the C:\Temp directory, limited to the last 2 minutes (which contained the error), and generated using the local time on the host.
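The collection described above can be sketched in Python as a small helper that builds the PowerShell command line. This is only an illustration, not the exact command used with the customer: the function name and the node name are hypothetical, while -Destination, -TimeSpan (in minutes), -UseLocalTime and -Node are parameters of Get-ClusterLog. Without -Node, the cmdlet collects logs from every node in the cluster.

```python
import subprocess


def build_get_clusterlog_command(destination=r"C:\Temp", minutes=2, node=None):
    """Return the argument list for a PowerShell call to Get-ClusterLog.

    destination: folder that receives the generated cluster log
    minutes:     how far back to collect (Get-ClusterLog -TimeSpan is in minutes)
    node:        optional node name (hypothetical here); omit to collect from all nodes
    """
    cmdlet = (
        f"Get-ClusterLog -Destination '{destination}' "
        f"-TimeSpan {minutes} -UseLocalTime"
    )
    if node:
        cmdlet += f" -Node '{node}'"
    return ["powershell.exe", "-NoProfile", "-Command", cmdlet]


def collect_cluster_log(**kwargs):
    """Run the command; only meaningful on a Windows failover cluster node."""
    return subprocess.run(build_get_clusterlog_command(**kwargs), check=True)


# Show the command that would be run, without executing anything.
print(" ".join(build_get_clusterlog_command()))
```

Keeping the time span short, as in the post, limits the log to the window that contains the failure and makes it much easier to read.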
Examining the System log in the customer environment for cluster failure events showed the usual Event ID 1069 error messages, which are logged when any cluster resource fails. However, these cluster events lack the detail needed to troubleshoot most cluster issues effectively, so examination of the cluster log was warranted.
Examination of the cluster log from the node experiencing the issue revealed additional information, including a specific error code: 2114.
Decoding the 2114 error indicated that one potential cause was that the Server Service was not started.
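One way to decode such a code is sketched below. It assumes the code falls in the network-management (NERR) range defined in lmerr.h, where 2114 is NERR_ServerNotStarted; on a Windows machine, `net helpmsg 2114` prints the matching message text. The helper names are hypothetical, and the lookup table deliberately contains only the one code discussed in this post.

```python
NERR_BASE = 2100   # first network-management error code (lmerr.h)
MAX_NERR = 2999    # last code reserved for that range

# Only the code from this post; extend from lmerr.h as needed.
KNOWN_CODES = {
    2114: ("NERR_ServerNotStarted", "The Server service is not started."),
}


def is_network_error(code):
    """True when the code lies in the NERR range."""
    return NERR_BASE <= code <= MAX_NERR


def helpmsg_command(code):
    """Argument list for the Windows 'net helpmsg' lookup of a code."""
    return ["net", "helpmsg", str(code)]


code = 2114
name, text = KNOWN_CODES.get(code, ("unknown", "not in the local table"))
print(f"{code}: {name} - {text}")
print("Look up on Windows with:", " ".join(helpmsg_command(code)))
```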
Initially, this seemed pretty odd because the Server Service is very reliable, but it was worth a look. Opening the Services snap-in showed that the Server Service had not started even though it was set to start automatically (the default setting). We tried starting it, and it failed with an error indicating that a dependency service was marked for deletion.
Inspecting the Dependencies tab in the Properties of the Server Service showed that information was missing in the customer environment.
Next, I compared the relevant service registry entries on both nodes in the cluster and found no differences.
I was specifically looking for a key value indicating that one or more dependent services were ‘marked for deletion’, as the error message stated. Not finding that value, I used the SC command line utility to query the appropriate services and determined that the Lanmanserver service was stopped and could not be started on the cluster node. On a healthy node, the SC output for the service shows a ‘State’ of RUNNING.
The customer’s output reflected a ‘State’ of STOPPED.
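The check above can be scripted by parsing the text that `sc query lanmanserver` prints. The sketch below is a minimal example under stated assumptions: the sample text shows the typical layout of `sc query` output on a healthy machine and is not the customer's actual output, and the function name is hypothetical.

```python
import re

# Typical shape of `sc query lanmanserver` output on a healthy machine
# (illustrative sample, not captured from the customer environment).
SAMPLE_RUNNING = """\
SERVICE_NAME: lanmanserver
        TYPE               : 20  WIN32_SHARE_PROCESS
        STATE              : 4  RUNNING
                                (STOPPABLE, PAUSABLE, ACCEPTS_SHUTDOWN)
        WIN32_EXIT_CODE    : 0  (0x0)
        SERVICE_EXIT_CODE  : 0  (0x0)
        CHECKPOINT         : 0x0
        WAIT_HINT          : 0x0
"""


def parse_service_state(sc_output):
    """Return the state word from `sc query` text, or None if absent."""
    match = re.search(r"^\s*STATE\s*:\s*\d+\s+(\w+)", sc_output, re.MULTILINE)
    return match.group(1) if match else None


print(parse_service_state(SAMPLE_RUNNING))
```

On the affected node, the same parser would return STOPPED, which is the condition that kept the Network Name resources from coming Online.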
At that point, the decision was made to export the HKLM\System\CurrentControlSet\Services\Lanmanserver registry key (on the problem server) to the desktop, remove the DependOnService entry in the key (so something would be different when re-registering), and reboot the server. Following the reboot, the exported registry key was used to re-register the service, and the Server Service was then able to start. The customer moved all cluster resources to the previously problematic node, and all the Network Name resources that had failed to come Online came Online successfully, making the SQL services highly available once again in the 2-Node cluster.
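For readers who prefer the command line, the same sequence could be expressed with reg.exe and sc.exe. The sketch below only builds and prints the commands and runs nothing. It is an assumption that the command-line tools were used at all (the post does not say how the export and import were performed), and the backup path and function name are hypothetical. Editing service registry keys is risky, so treat this as an outline to review, not a script to run blindly.

```python
KEY = r"HKLM\System\CurrentControlSet\Services\Lanmanserver"


def remediation_commands(backup_path):
    """Return (before_reboot, after_reboot) command lists for the steps above.

    backup_path is a hypothetical .reg file location chosen by the operator.
    """
    before_reboot = [
        # Export the whole service key so it can be restored later.
        ["reg", "export", KEY, backup_path, "/y"],
        # Remove the DependOnService value so the key differs on re-import.
        ["reg", "delete", KEY, "/v", "DependOnService", "/f"],
    ]
    after_reboot = [
        # Re-register the service settings from the exported file.
        ["reg", "import", backup_path],
        # Try to start the Server service again.
        ["sc", "start", "lanmanserver"],
    ]
    return before_reboot, after_reboot


before, after = remediation_commands(r"C:\Temp\lanmanserver.reg")
for label, commands in (("Before reboot", before), ("After reboot", after)):
    print(label)
    for command in commands:
        print("  " + " ".join(command))
```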
Thanks for reading our blog, and I hope you found this helpful.