What’s new in failover clustering: #05 Resilient private cloud

By Microsoft Windows Server Team

Content type: Updates
Product: Windows Server 2016
Solution: Storage
Tags: Failover Cluster

This post was authored by Subhasish Bhattacharya, Program Manager, Windows Server.

Introduction

In the past, in a world of reliable but expensive SANs, an aggressive high-availability strategy designed to fail fast was optimal. The health of the system was closely monitored so that issues could be detected and acted on quickly. This minimized downtime when catastrophic failures occurred.

In today’s cloud-scale environments, commonly built from commodity hardware, transient failures have become more common than hard failures. These transient compute and storage failures are triggered by routine events such as switch resets, packet loss, latency, and spanning tree convergence. In this new world, reacting aggressively to transient failures can cause more downtime than it prevents.

The storage and compute stack in Windows Server 2016 has been designed to optimize both high availability and resiliency. In a Software Defined Datacenter, we must assume that infrastructure will break, so it is imperative that the software is resilient. At the same time, degraded Virtual Machine (VM) availability is not acceptable.

Resilient private clouds: Compute and storage virtual machine resiliency

Windows Server 2016 introduces VM resiliency features that address two kinds of failure:

Compute failures: caused by transient east-west network failures.
Storage failures: caused by transient north-south storage failures.

Compute resiliency

Transient network failures impede intra-cluster communication for your private cloud. This results in cluster nodes being removed from active membership in a cluster. In Windows Server 2016, your cluster is resilient to intra-cluster communication failures. This resiliency is achieved by the following:

A VM continues to run on a node even when the node falls out of cluster membership. The node is then considered “isolated” and the VM “unmonitored”; that is, the cluster service is not actively monitoring the VM’s health.
If the network connectivity of the “isolated” node fails to recover within a certain duration, the VM is live-migrated to another node in the cluster. Note that this results in no downtime for the VM.
Additionally, “flapping” nodes, which constantly come in and out of cluster membership, are temporarily banished and placed in a “quarantined” state.
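
The three behaviors above can be read as a small state machine. The following Python sketch is a toy model of that reading, not the cluster service's implementation: the class, its fields, and the threshold values (`resiliency_period`, `quarantine_threshold`, `quarantine_window`) are illustrative assumptions and do not come from this post. How a quarantined node is later readmitted, and what happens to VMs on it, is deliberately left out.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class NodeState(Enum):
    UP = "up"
    ISOLATED = "isolated"
    QUARANTINED = "quarantined"


@dataclass
class NodeModel:
    """Toy model of one cluster node. All thresholds are assumed values."""
    resiliency_period: float = 240.0    # assumed: seconds a node may stay isolated
    quarantine_threshold: int = 3       # assumed: membership losses allowed per window
    quarantine_window: float = 3600.0   # assumed: seconds over which losses are counted
    state: NodeState = NodeState.UP
    vms_on_node: bool = True
    vms_monitored: bool = True
    isolated_since: Optional[float] = None
    loss_times: List[float] = field(default_factory=list)

    def lose_membership(self, now: float) -> None:
        """Node drops out of membership: VMs keep running, but unmonitored."""
        if self.state is not NodeState.UP:
            return
        # Keep only losses that are still inside the counting window.
        self.loss_times = [t for t in self.loss_times
                           if now - t < self.quarantine_window]
        self.loss_times.append(now)
        self.state = NodeState.ISOLATED
        self.isolated_since = now
        self.vms_monitored = False

    def tick(self, now: float) -> None:
        """Move VMs to another node once isolation outlasts the period."""
        if (self.state is NodeState.ISOLATED and self.vms_on_node
                and now - self.isolated_since >= self.resiliency_period):
            self.vms_on_node = False    # VMs now run elsewhere in the cluster
            self.vms_monitored = True   # and are monitored again there

    def rejoin(self, now: float) -> None:
        """Connectivity returns; a flapping node is quarantined instead."""
        if self.state is not NodeState.ISOLATED:
            return
        self.isolated_since = None
        recent = [t for t in self.loss_times
                  if now - t < self.quarantine_window]
        if len(recent) >= self.quarantine_threshold:
            self.state = NodeState.QUARANTINED
        else:
            self.state = NodeState.UP
            self.vms_monitored = True
```

In this model, a single short outage takes the node from `UP` to `ISOLATED` and back without touching its VMs, while a third loss inside the window leaves it `QUARANTINED` when it tries to rejoin.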

Storage resiliency

A transient storage failure leaves a VM unable to access its underlying VHDX file because read or write requests to disk fail. In Windows Server 2016, a VM can detect such transient failures and ride them out as follows:

On detecting a transient storage failure, the tenant VM session state is preserved.
Any failure in block- or file-based storage infrastructure is handled by the VM stack, triggering an intelligent and quick response.
The VM is moved to a “PausedCritical” state as it waits for the storage to recover.
On recovery from the transient failure, the session state is restored.
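
The pause-and-resume sequence above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not how the Hyper-V stack is implemented: `ResilientVm`, `write_to_storage`, and `storage_recovered` are hypothetical names, a failed request is modeled as an `OSError`, and the preserved session state is reduced to a queue of held writes. A real paused guest issues no new I/O; queuing writes while paused only keeps the toy simple.

```python
from enum import Enum
from typing import Callable, List


class VmState(Enum):
    RUNNING = "Running"
    PAUSED_CRITICAL = "PausedCritical"


class ResilientVm:
    """Toy VM that pauses on a storage failure instead of failing."""

    def __init__(self, write_to_storage: Callable[[bytes], None]) -> None:
        self._write = write_to_storage    # hypothetical storage backend
        self.state = VmState.RUNNING
        self.pending: List[bytes] = []    # writes held while paused

    def write(self, data: bytes) -> None:
        if self.state is VmState.PAUSED_CRITICAL:
            self.pending.append(data)     # hold the write, preserve order
            return
        try:
            self._write(data)
        except OSError:
            # Storage failed: pause the VM and keep the write for later.
            self.state = VmState.PAUSED_CRITICAL
            self.pending.append(data)

    def storage_recovered(self) -> None:
        """Replay held writes in order; resume only if all succeed."""
        while self.pending:
            try:
                self._write(self.pending[0])
            except OSError:
                return                    # still failing: stay paused
            self.pending.pop(0)
        self.state = VmState.RUNNING
```

The design point the sketch tries to capture is that the failure is absorbed below the tenant workload: nothing is discarded while storage is unavailable, and the VM only returns to `Running` once every held write has gone through.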

To try this new feature in Windows Server 2016, download the Technical Preview. For additional details, see the feature blog posts for compute and storage VM resiliency.

Check out the series:

–    #01 Cluster OS Rolling Upgrade
–    #02 VM Load Balancing
–    #03 Stretched Clusters
–    #04 Workgroup and multi-domain clusters
