Failover Clustering Validation Wizard Fails When an Enterprise Uses IPv6 and IPv4

Recently, I had to forcibly evict a server from our Hyper-V cluster farm because of downtime caused by hardware failures.  Instead of leaving it in the cluster offline, I migrated the virtual machines off it before the failure occurred, using System Center Virtual Machine Manager R2’s maintenance mode.  Then I used Failover Cluster Manager to evict the node.

Simple enough, right?  You’d think…

‘Validate This Cluster’ Failures - Networking

After the hardware was replaced, I went through the Add Node wizard and, much to my surprise, my cluster was no longer “healthy”.  The key thing I learned here is that the validation report is valuable not only as a gatekeeper for entry into the cluster but also any time “major” changes occur in the cluster infrastructure.  In my case, I had made some small, subtle changes at the network layer, and they caused my fun.

My first suggestion: whenever you modify hardware such as storage or network, you should absolutely run the validation report (see Figure 1) to ensure your cluster is still healthy.

Read the Manual Reminder

Figure 1:  Your Read the Manual Reminder

In my case, I had modified a few things to improve the performance of my cluster, in particular Live Migration.  Ironically, on the surface the network layer looked perfectly configured on my system.

Implications of IPv6

I ran into several failures on each of the nodes, and on each one I kept getting the exact same error.  It was frustrating, and like many engineers, I wished I hadn’t clustered this thing so that I could just reboot.  The idea behind clustering is to ensure that highly available applications are just that, highly available.  They aren’t accustomed to having an engineer just up and reboot them whenever little blips occur.

IPv6 is a powerful protocol that is just turning the corner and will one day be mainstream.  At Microsoft, we use it daily throughout our infrastructure and it is usually frowned upon if one disables the protocol.  The engineering lab we use for development purposes has some unique lab environments that utilize local Hyper-V networks that only run on a single server rather than throughout the cluster.  For the purposes of these local networks, we’ve disabled IPv6 bindings.

Disabling IPv6 Bindings

The way to disable the IPv6 binding is through the Advanced menu of the Network Connections control panel.  The binding order is important for clusters, so you might find yourself spending more time in there than you would like.  For my purposes, I had disabled IPv6 on the ‘Local Area Connection’ for the Hyper-V local network by following these steps:

  1. Open Control Panel
  2. Click Network and Sharing Center
  3. Click “Change Adapter Settings”
  4. Hit the ALT key and select Advanced

image

Figure 2:  Advanced Bindings

In the bindings list, uncheck IPv6.  For my sanity, I unbound it and then rebooted the server.

Disabling IPv6 – Don’t do it as it isn’t Needed

I only shared how to disable IPv6 bindings in order to say: don’t do it.  Disabling it doesn’t make sense in most cases, and leaving IPv6 enabled doesn’t hurt anything.  In fact, I’ve found that disabling IPv6 caused more headaches than if I had just left it all alone.  I sometimes try everything when I’m learning a new technology, especially when something is broken.  For our purposes, let’s go back into the machine and turn the binding back on.  Then run the Validate a Configuration wizard again…

IPv6 Addresses still Missing

A working IPv6 binding will usually include transition-technology addresses such as Teredo, since these matter for services like 6to4.  When you dump the IP configuration of your machine (for example, with ipconfig /all), you will often see addresses like the following:

image

Figure 3:  Investigate your Global Prefix/Subnet/Interface Id and verify correct Addresses

The IPv6 address, at first glance, will knock you for a loop if you’re used to the customary IPv4 addresses.  It looks foreign, like something from outer space.  As mentioned, Microsoft has used IPv6 for quite some time, and after a while you just start to recognize which IPv6 addresses are “valid” and which ones might be having issues.

For example, most IPv6 shops are not purely native IPv6, because they still need to support IPv4, so they use some transition technology to ensure that traffic can cross between the two.  It should be noted that IPv4 and IPv6 are two different protocols that behave as logically separate networks, and IPv4 hosts don’t know they have a bigger, more powerful “big brother” sharing the same wire.  Hence, if you see an address that has only a global prefix and subnet (for example, 2001:4448:0:fff:0:5efe) and no complete interface ID, you probably have an issue.
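
To make the prefix / subnet / interface ID split a little more concrete, here is a minimal Python sketch using the standard ipaddress module.  It assumes the common layout of a 48-bit global prefix, a 16-bit subnet, and a 64-bit interface ID, and it assumes the 0:5efe marker is the start of an ISATAP-style interface ID (0:5efe followed by the host’s IPv4 address).  The helper names are hypothetical, and the sample addresses use the 2001:db8::/32 documentation prefix rather than anything from my lab.

```python
import ipaddress

# Upper 32 bits of an ISATAP-style interface ID (assumption: 0:5efe or 200:5efe).
ISATAP_MARKERS = (0x00005EFE, 0x02005EFE)


def split_ipv6(text):
    """Split an address into (global prefix, subnet, interface ID) integers.

    Assumes a 48-bit global prefix, a 16-bit subnet and a 64-bit interface ID.
    """
    value = int(ipaddress.IPv6Address(text))
    global_prefix = value >> 80               # top 48 bits
    subnet = (value >> 64) & 0xFFFF           # next 16 bits
    interface_id = value & ((1 << 64) - 1)    # low 64 bits
    return global_prefix, subnet, interface_id


def embedded_ipv4(text):
    """Return the IPv4 address embedded in an ISATAP-style interface ID.

    Returns None when the marker is absent or the IPv4 part is all zeros,
    which is the "prefix and subnet but no usable interface ID" symptom.
    """
    _, _, interface_id = split_ipv6(text)
    if (interface_id >> 32) not in ISATAP_MARKERS:
        return None
    low = interface_id & 0xFFFFFFFF
    if low == 0:
        return None
    return ipaddress.IPv4Address(low)


# Hypothetical sample addresses: one complete, one missing the IPv4 part.
healthy = "2001:db8:0:fff:0:5efe:a01:203"
suspect = "2001:db8:0:fff:0:5efe::"
print(embedded_ipv4(healthy))
print(embedded_ipv4(suspect))
```

The point of the sketch is only the shape of the check: an address that carries the marker but nothing useful after it is worth a second look.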

In my case, I noticed …

Welcome IP Helper Service

Back to the topic at hand.  As I mentioned, I had removed a node from the failover cluster because of hardware issues that caused a substantial outage.  I thought I would fix the hardware and then re-introduce the host to the cluster.  Boy, was I wrong…

I got the hardware side working, but before long the validation report made it look as if everything was falling apart.  I could never seem to get things happy.  In my report, I started getting the following errors on the existing hosts in the cluster:

image

Figure 4:  Validation Report Error Reports

You may be telling me that it is obvious what the problem is.  If so, you are one up on me, because what I couldn’t figure out was why the complaints in the report were coming from two hosts that *were working* and hosting virtual machines.  Absolutely no problems at all!

What led me to the answer was a closer look at the IPv6 addresses on the two hosts, which, ironically, didn’t have a valid IPv6 address configured.  They had the global prefix and the subnet but no interface ID.  In Windows Vista/2008 and later, there is a critical service that lets your IPv6 adapters communicate not only with other IPv6 devices but also with IPv4 ones.  This service, the IP Helper service, is often ignored and honestly “just works.”  In our case, disabling and then re-enabling the IPv6 binding (see above) had left the service on these two servers in a confused state.

The first reaction many of you will have is: come on, just reboot.  Well, at our current capacity the lab couldn’t Live Migrate all of the virtual machines to other hosts without over-committing the cluster.  This isn’t production, so I fudged the math a bit to save some budget, and we can really only have 2 servers out before we hit over-commit.  Rebooting could therefore take machines offline, so I wanted to figure the problem out that day, that minute.  In short, I was impatient.

image

Figure 5:  IP Helper Service

<resolution>

To resolve the issue and start passing validation reports, I opened the server’s Services console (Start –> services.msc), located the IP Helper service, and restarted it.  After doing this on both of the servers, the issue was resolved.

</resolution>
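
If you would rather script the restart than click through services.msc, something like the following Python sketch could work.  I did the restart by hand, so treat this as an illustration only: it assumes the IP Helper service’s short name is iphlpsvc, that it is run from an elevated prompt on the affected node, and the function names are hypothetical.

```python
import subprocess

# Assumption: "iphlpsvc" is the short service name behind "IP Helper".
SERVICE_NAME = "iphlpsvc"


def restart_commands(service=SERVICE_NAME):
    """Build the stop/start command lines for the given Windows service."""
    return [["net", "stop", service], ["net", "start", service]]


def restart_service(service=SERVICE_NAME, run=subprocess.run):
    """Stop and then start the service, raising if either command fails.

    The run parameter exists so the commands can be inspected or dry-run
    without touching a real server.
    """
    for command in restart_commands(service):
        run(command, check=True)
```

Calling restart_service() on each affected host mirrors the manual steps above; re-run the validation report afterwards to confirm the errors are gone.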

Summary

Moral of the story: don’t mess with your bindings unless you have to.  If you absolutely have to, understand the impact and how the services are intertwined, and furthermore listen to what the validation report tells you.

Thanks,

-Chris
