Windows Azure Fault Domain and Upgrade Domain Explained (Reprised)

05/14/2013

Two notable advantages of running applications in Windows Azure are availability and fault tolerance, achieved through the so-called fault domains and upgrade domains. These two terms represent important strategies in cloud computing for deploying and upgrading applications. System Center 2012 SP1 has also integrated the concepts into Virtual Machine Manager service templates for deploying a private cloud.

Fault Domain

The scope of a single point of failure is essentially a fault domain, and the purpose of identifying and organizing fault domains is to prevent a single point of failure from taking down a service. In its simplest form, a computer connected to a power outlet is a fault domain: if the connection between the computer and its power outlet is lost, the computer is down, hence a single point of failure. Likewise, a rack of computers in a datacenter can be a fault domain, since a power outage at the rack takes out all the hardware in it, similar to what is shown in the picture here. Notice that how a fault domain is formed has much to do with how the hardware is arranged, and a single computer or a rack of computers is not automatically a fault domain.

In Windows Azure, however, a rack of computers is indeed identified as a fault domain, and the allocation of fault domains is determined by Windows Azure at deployment time. A service owner cannot control the allocation of a fault domain, but can programmatically find out which fault domain a service instance is running in. The Windows Azure Compute SLA guarantees the level of connectivity uptime for a deployed service only if two or more instances of each role of the service are deployed, so that the instances can be placed in different fault domains.
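To make the two-instance requirement concrete, here is a minimal Python sketch of why a single instance cannot survive a rack failure while two instances in separate fault domains can. It is a simulation only: the round-robin placement, the function names, and the instance names are assumptions made for illustration, and Windows Azure decides the real allocation at deployment time.

```python
from collections import defaultdict


def place_instances(instance_count: int, fault_domain_count: int) -> dict[int, list[str]]:
    """Spread role instances round-robin across fault domains (racks).

    Round-robin is an assumed placement policy used only for this sketch.
    """
    placement = defaultdict(list)
    for i in range(instance_count):
        # Hypothetical instance names, chosen for readability.
        placement[i % fault_domain_count].append(f"WebRole_IN_{i}")
    return dict(placement)


def survivors(placement: dict[int, list[str]], failed_domain: int) -> list[str]:
    """Return the instances still running after one fault domain goes down."""
    return [
        name
        for domain, names in placement.items()
        if domain != failed_domain
        for name in names
    ]


if __name__ == "__main__":
    for count in (1, 2):
        placement = place_instances(count, fault_domain_count=2)
        remaining = survivors(placement, failed_domain=0)
        print(f"{count} instance(s): {len(remaining)} still running after rack 0 fails")
```

With one instance, losing its rack leaves nothing running. With two instances in two fault domains, one instance keeps serving requests.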

Upgrade/Update Domain

An upgrade domain, on the other hand, is a strategy to ensure that an application stays up and running while it is being updated. Windows Azure, when possible, distributes instances evenly across multiple upgrade domains, with each upgrade domain serving as a logical unit of a deployment. An upgrade of a deployment is then carried out one upgrade domain at a time. The steps are: stop the instances running in the first upgrade domain, upgrade the application, bring the instances back online, and then repeat the steps in the next upgrade domain. An upgrade is complete when all upgrade domains have been processed. By stopping only the instances running within one upgrade domain at a time, Windows Azure ensures that an upgrade takes place with the least possible impact on the running service. A service owner can optionally control the number of upgrade domains with the upgradeDomainCount attribute in the Windows Azure Service Definition Schema (.csdef file).
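The upgrade walk described above can be sketched as a small Python simulation. This is not how Windows Azure implements upgrades. The function names, the round-robin assignment, and the version labels are assumptions, and the upgrade_domain_count parameter merely stands in for the upgradeDomainCount setting.

```python
def assign_upgrade_domains(instances: list[str], upgrade_domain_count: int) -> list[list[str]]:
    """Distribute instances evenly (round-robin, an assumed policy) into upgrade domains."""
    domains: list[list[str]] = [[] for _ in range(upgrade_domain_count)]
    for i, name in enumerate(instances):
        domains[i % upgrade_domain_count].append(name)
    return domains


def rolling_upgrade(
    instances: list[str], upgrade_domain_count: int, new_version: str
) -> tuple[dict[str, str], int]:
    """Upgrade one domain at a time; return final versions and the lowest online count."""
    versions = {name: "v1" for name in instances}  # "v1" is a placeholder label
    min_online = len(instances)
    for domain in assign_upgrade_domains(instances, upgrade_domain_count):
        # Step 1: stop only the instances in the current upgrade domain.
        online = len(instances) - len(domain)
        min_online = min(min_online, online)
        # Steps 2 and 3: upgrade the application, then bring the instances back online.
        for name in domain:
            versions[name] = new_version
    return versions, min_online


if __name__ == "__main__":
    roles = ["IN_0", "IN_1", "IN_2", "IN_3"]
    for ud_count in (2, 4):
        _, lowest = rolling_upgrade(roles, ud_count, "v2")
        print(f"{ud_count} upgrade domains: at least {lowest} of {len(roles)} instances stay online")
```

In this model, more upgrade domains mean fewer instances are offline at any moment, at the cost of more upgrade passes.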

Observations

Within a single fault domain, the concept of fault tolerance does not apply, since everything in it is either up or down together and no fault can be tolerated. Only when there is more than one fault domain, and the fault domains are managed as a whole, is fault tolerance applicable. In addition to fault domains and upgrade domains, Windows Azure has network redundancy built into routers, switches, and load balancers to ensure fault tolerance and service availability. The Fabric Controller (FC) also sets checkpoints and stores state data across fault domains to ensure reliability and recoverability.

Notice that the terms Upgrade Domain and Update Domain, as referenced in this article and in the Azure Service Definition Schema (.csdef file), the VMM service template properties, and the Azure SLAs, all represent the same concept: a logical construct for introducing changes into a running service without an outage.
