When should I evict a cluster node?

I thought I’d post a quick blog on this topic since we run into cases where evicting a cluster node is used as a troubleshooting step. That being said, evicting a node should NEVER be a primary troubleshooting step.

Evicting a node to try to resolve a cluster issue may dig you deeper into the hole and ultimately make the issue more complex than when it started. As an example, say you started with a failover issue. You evict the node, but now you can’t get the node back into the cluster. You now have a secondary issue that must be resolved before you can address your original problem.

In my experience working many cluster issues, I have never resolved an issue by evicting a node. The only times you should ever evict a node are in the following scenarios.

  • Replacing a node with different hardware.
  • Reinstalling the operating system.
  • Permanently removing a node from a cluster.
  • Renaming a node of a cluster.

Let’s take a look at some very common scenarios where I’ve seen evicting a node used improperly.

The cluster service won’t start on node 2 of a cluster, so node 2 is evicted from the cluster. Whatever kept the cluster service from starting is still there, but now that same problem also prevents node 2 from coming back into the cluster.

Resources don’t fail over to node 2. Every time a failover occurs, the disks don’t come online and the resources fail back to node 1. One of the nodes is evicted and then added back to the cluster. None of this addresses the disk issue, so the problem remains.

If the reason for the disk failure is an Error 2, then the drives are not seen properly by the evicted node. So when you try to add the evicted node back in and take the defaults, the join can fail with this error in CLCFGSRV.LOG:

Major Task ID: {B8C4066E-0246-4358-9DE5-25603EDD0CA0}
Minor Task ID: {3BB53C9E-E14A-4196-9066-5400FB8860C9}
Progress (min, max, current): 0, 1, 1
Description:
Checking that all nodes have access to the quorum resource
Status: 0x800713de
The quorum disk could not be located by the cluster service.
Additional Information:
For more information, visit Help and Support Services at
https://go.microsoft.com/fwlink/?LinkId=4441.
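
As a side note, the Status value in that log entry is a standard 32-bit HRESULT, and splitting it into its bit fields can help when you want to look the error up. The following is a minimal sketch in Python, not part of any cluster tooling; the function name decode_hresult and the dictionary keys are my own choices for illustration.

```python
FACILITY_WIN32 = 7  # HRESULT facility value used for wrapped Win32 error codes


def decode_hresult(hresult: int) -> dict:
    """Split a 32-bit HRESULT into its severity, facility, and code fields."""
    return {
        "failure": bool((hresult >> 31) & 1),   # bit 31: 1 means failure
        "facility": (hresult >> 16) & 0x7FF,    # bits 16-26: facility
        "code": hresult & 0xFFFF,               # bits 0-15: error code
    }


fields = decode_hresult(0x800713DE)  # status value from the log above
print(fields)
if fields["facility"] == FACILITY_WIN32:
    # The low 16 bits are an ordinary Win32 error number.
    print("Win32 error code:", fields["code"])
```

For the status in the log, the facility is 7 (Win32) and the code is 5086, which should correspond to the Win32 error ERROR_QUORUM_DISK_NOT_FOUND and matches the message text shown in the log.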

I could go on and on, but the point I am trying to make is that unless you fall into one of the four specific scenarios I mention, don’t evict your cluster nodes. Your Microsoft Support Engineers will thank you and your users will thank you.

Jeff Hughes
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support