Recently I was re-reading a 2012 blog post, Tuning Failover Cluster Network Thresholds, by Elden Christensen, a Principal PM on the Windows Failover Cluster team. I think Elden’s post is a must-read for anyone planning a highly-available Exchange deployment. One of the reasons I find it such an excellent read is that it addresses many things administrators need to understand when they decide to tune cluster heartbeat subnet delays and thresholds in Exchange environments.
The post first makes an excellent point: changes to these thresholds alter the amount of time it takes to detect that a node is down. Elden uses a great metaphor here:
“Think of it like your cell phone, when the other end goes silent how long are you willing to sit there going “Hello?… Hello?… Hello?” before you hang-up the phone and call the person back. When the other end goes silent, you don’t know when or even if they will come back.”
As the subnet delays and thresholds are adjusted up, it takes longer to detect a failure, and therefore longer to act on that failure. There is a balance between reacting quickly to a failure and providing resiliency to transient networking issues.
The other point that I think is worth understanding is how often these values are adjusted without any analysis or correction of the underlying networking issues. Elden sums this up, too, and I could not agree with him more:
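To make the trade-off concrete, here is a rough sketch of the arithmetic. It assumes that detection time is approximately the heartbeat delay multiplied by the number of missed heartbeats tolerated, and the function name and the sample values are illustrative only, not taken from Elden’s post or from any particular cluster’s configuration.

```python
def detection_time_seconds(delay_ms: int, threshold: int) -> float:
    """Approximate time before a silent node is declared down.

    Assumption: detection time ~= heartbeat interval (ms) multiplied by
    the number of consecutive missed heartbeats that are tolerated.
    """
    return delay_ms * threshold / 1000.0


# Hypothetical settings, chosen only to illustrate the trade-off.
baseline = detection_time_seconds(delay_ms=1000, threshold=5)
relaxed = detection_time_seconds(delay_ms=2000, threshold=10)

print(f"baseline: {baseline:.0f} s before a failure is detected")
print(f"relaxed:  {relaxed:.0f} s before a failure is detected")
```

Under these assumed numbers, doubling both values quadruples the time a genuinely failed node goes unnoticed, which is the cost side of tolerating transient network issues.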
“It is critical to recognize that cranking up the thresholds to high values does not fix nor resolve the transient network issue, it simply masks the problem by making health monitoring less sensitive. The #1 mistake made broadly by customers is the perception of not triggering cluster health detection means the issue is resolved (which is not true!). I like to think of it, that just because you choose not to go to the doctor it does not mean you are healthy. In other words, the lack of someone telling you that you have a problem does not mean the problem went away.”
I often find myself in conversations with customers who have changed these values and have the perception that something is “fixed.” There are legitimate cases where these values need to be changed, but I always encourage a networking analysis first, so that you understand what issues you are facing and how adjusting these values would help. Unfortunately, it seems that adjusting these thresholds without this understanding is far more common than it should be.
I strongly encourage all Exchange administrators to read Elden’s post.