Storage, High Availability and Site Resilience in Exchange Server 2013, Part 3

Microsoft Exchange Server 2013 continues to innovate in the areas of storage, high availability, and site resilience. In this three-part blog series, I’ll describe the significant improvements in Exchange 2013 related to these three areas. Part 1 focused on the storage improvements we’ve made, Part 2 focused on high availability, and this final part focuses on site resilience.

Although Exchange 2013 continues to use DAGs and Windows Failover Clustering for Mailbox server role high availability and site resilience, site resilience is not the same as it was in Exchange 2010. Site resilience is much better in Exchange 2013 because it has been operationally simplified, and the underlying architectural changes made in Exchange 2013 have a significant impact on site resilience configurations and how you recover from site failures.

Challenges with Exchange 2010 Site Resilience

Exchange 2010 undoubtedly made achieving site resilience for the messaging service and data easier than any previous version of Exchange. By combining the native site resilience features in Exchange 2010 with proper planning, you could activate a second datacenter to serve a failed datacenter's clients. The process you perform to do this is referred to as a datacenter switchover. It is a well-documented and generally well-understood process, although it takes time to perform and requires human intervention to begin.

Anyone who has performed a datacenter switchover in Exchange 2010 will tell you that it is operationally complex. This is in part because in Exchange 2010, recovery of mailbox data (the DAG) and client access (the namespace) are tied together. That coupling leads to other challenges for Exchange 2010 in certain scenarios:

  • If you lost all or a significant portion of your Client Access servers, the VIP for the CAS array, or your top-of-rack (TOR) switch, or if you lost a significant portion of your DAG, you were in a situation where you needed to perform a datacenter switchover.
  • You could deploy a DAG across two datacenters, host the witness in a third datacenter, and enable failover of the Mailbox role for either datacenter. But that didn’t provide failover for the messaging service, because the namespace still needed to be switched over for the non-Mailbox server roles.

But all of that aside, the biggest challenge with Exchange 2010 is that the namespace is a single point of failure. In Exchange 2010, the most significant single point of failure in the messaging system is the FQDN that you give to users, because it tells the user where to go. Changing the IP address behind that FQDN is not easy: you have to change DNS and then deal with DNS latency, which in some parts of the world can be considerable. Browsers also maintain their own name caches, typically for around 30 minutes or more, which likewise have to be dealt with.

Exchange 2013 Addresses the Challenges

Significant changes have been made in Exchange 2013 that address the challenges with Exchange 2010 site resilience head on. With the namespace simplification, consolidation of server roles, removal of AD-sited-ness, separation of CAS array and DAG recovery, and load balancing changes, Exchange 2013 provides new site resilience options, such as the ability to use a single global namespace. In addition, for customers with more than two locations in which to deploy messaging service components, Exchange 2013 also provides the ability to configure the messaging service for automatic failover in response to failures that required manual intervention in Exchange 2010.

Specifically, site resilience has been operationally simplified in Exchange 2013, and the namespace no longer needs to move with the DAG. Exchange leverages fault tolerance built into the namespace through multiple IP addresses, load balancing, and, if need be, the ability to take servers in and out of service. One of the most significant changes in Exchange 2013 is that it takes advantage of the client’s ability to try more than one place to go. Almost all HTTP clients can do this, and because almost all of the client access protocols in Exchange 2013 are HTTP-based (Outlook, Outlook Anywhere, EAS, EWS, OWA, EAC, RPS, and so on), all supported clients can use multiple IP addresses, thereby providing failover on the client side. You can configure DNS to hand multiple IP addresses to a client during name resolution: the client asks for mail.contoso.com and gets back two IP addresses, or four IP addresses, for example. However many IP addresses the client gets back, it will use them reliably. This leaves the client much better off, because if one of the IP addresses fails, the client has one or more others to try. If a client tries one and it fails, it waits about 20 seconds and then tries the next one in the list. Thus, if you lose the VIP for a CAS array and you have a second VIP for a second CAS array, recovery for clients happens automatically, in about 21 seconds.

Modern HTTP clients (operating systems and Web browsers that are ten years old or less) simply work with this redundancy automatically. The HTTP stack can accept multiple IP addresses for an FQDN, and if the first IP it tries fails hard (e.g., it cannot connect), it will try the next IP in the list. In a soft failure (the connection is lost after the session is established, perhaps due to an intermittent failure in the service where, for example, a device is dropping packets and needs to be taken out of service), the user might need to refresh their browser.
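To make that client-side behavior concrete, here is a minimal Python sketch of the pattern described above: resolve every IP address published for the namespace, then try each one in turn with a connection timeout. This is not Exchange or browser code, just an illustration; the hostname, port, and timeout values are assumptions for the example.

```python
import socket

def connect_with_failover(fqdn, port=443, timeout=20):
    """Illustrative sketch of client-side IP failover for a single FQDN.

    DNS hands back every address published for the name; a hard failure
    on one address (no route, refused, timeout) simply moves the client
    on to the next one, roughly mirroring modern HTTP client stacks.
    """
    addresses = socket.getaddrinfo(fqdn, port, proto=socket.IPPROTO_TCP)
    last_error = None
    for family, socktype, proto, _canonname, sockaddr in addresses:
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)   # roughly 20 seconds before moving on
            sock.connect(sockaddr)
            return sock                # connected to a healthy VIP
        except OSError as exc:
            last_error = exc           # hard failure; try the next address
    raise last_error

# Example (hypothetical namespace with a VIP in each datacenter):
# conn = connect_with_failover("mail.contoso.com")
```

If the first VIP in the list is down, the sketch spends roughly one timeout interval before succeeding against the second, which is where the ~21 second figure above comes from.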

Operationally Simplified Site Resilience

So what does it mean that site resilience has been operationally simplified in Exchange 2013? Going back to the failure scenarios discussed above for Exchange 2010, if you encounter those scenarios in Exchange 2013, then depending on your site resilience configuration, you might not need to perform a datacenter switchover. With the proper configuration, failover happens at the client level: clients are automatically redirected to a second datacenter that has operating Client Access servers, and those Client Access servers proxy the communication back to the user’s Mailbox server, which remains unaffected by the outage (because you don’t do a switchover). Instead of working to recover service, the service recovers itself and you can focus on fixing the core issue (e.g., replacing the failed load balancer). Any administrator will tell you that the stress involved in replacing a failed piece of equipment that isn’t blocking service is much lower than the stress involved in restoring service and data access via a datacenter switchover.

By comparison, in Exchange 2010, if you lost the load balancer in your primary datacenter and you didn’t have another one in that site, you had to perform a datacenter switchover. In Exchange 2013, if you lose the load balancer in your primary site, you simply turn it off (or perhaps just turn off the VIP) and repair or replace it.

Clients that aren’t already using the VIP in the secondary datacenter will automatically fail over to the secondary VIP without any change of namespace, and without any change in DNS. Not only do you no longer have to perform a switchover, but all of the time normally spent on a datacenter switchover recovery is avoided as well. In Exchange 2010, you had to deal with DNS latency (hence the recommendation to set the TTL to 5 minutes, and the introduction of the Failback URL). In Exchange 2013, you don’t need to do that, because you get fast failover (~21 seconds) of the namespace between VIPs (datacenters).
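For illustration, a namespace with a VIP in each datacenter might be published in DNS something like the following zone fragment (the addresses and TTL here are assumptions for the example, not a recommendation):

```
; Illustrative zone fragment: one namespace, one VIP per datacenter
mail.contoso.com.    300    IN    A    192.0.2.10      ; VIP, datacenter 1
mail.contoso.com.    300    IN    A    198.51.100.10   ; VIP, datacenter 2
```

Because both records are always published, failover between the VIPs is handled by the client rather than by a DNS change, which is why the low-TTL guidance from Exchange 2010 is no longer necessary.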

Since you can now fail over the namespace between datacenters, all that is needed to achieve a datacenter failover is a mechanism for failover of the Mailbox role across datacenters. To get automatic failover for the DAG, you simply architect a solution in which the DAG is evenly split between two datacenters, and then place the witness server in a third location so that it can be arbitrated by DAG members in either datacenter, regardless of the state of the network between the datacenters that contain the DAG members. The key is that the third location is isolated from network failures that affect the first and/or second location (the locations containing the DAG members).
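The reason the third-site witness enables automatic failover comes down to majority arithmetic. The sketch below is not Exchange or Windows Failover Clustering code; it simply illustrates the vote counting, under the assumption of a DAG split evenly across two datacenters with a witness that contributes one tie-breaking vote.

```python
def has_quorum(local_members: int, total_members: int, witness_reachable: bool) -> bool:
    """Conceptual majority check: DAG member votes plus one witness vote."""
    total_votes = total_members + 1                        # every member plus the witness
    local_votes = local_members + (1 if witness_reachable else 0)
    return local_votes > total_votes // 2                  # strict majority retains quorum

# Four-member DAG split 2/2 across two datacenters, witness in a third site.
# If the link between the datacenters fails, only the side that can still
# reach the witness holds a majority (3 of 5 votes) and keeps service running.
print(has_quorum(local_members=2, total_members=4, witness_reachable=True))   # True
print(has_quorum(local_members=2, total_members=4, witness_reachable=False))  # False
```

Because the witness lives in a location isolated from failures in either datacenter, a surviving datacenter can always gain that tie-breaking vote, which is what makes the failover automatic rather than an administrator-driven switchover.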

In this scenario, the administrator’s efforts are geared toward simply fixing the problem, not toward restoring service. You simply fix the thing that failed; all the while, service has remained running and data integrity has been maintained. The urgency and stress you feel when fixing a broken device is nothing like the urgency and stress you feel when working to restore service. It’s better for the end user, and less stressful for the admin.

You can allow failover to occur without having to perform switchbacks (sometimes mistakenly referred to as failbacks). If you lose your Client Access servers in your primary datacenter and that results in a 20-second interruption for clients, you might not even care about failing back. At that point, your primary concern is fixing the core issue (e.g., replacing the failed load balancer). Once it is back online and functioning, some clients will start using it again, while others might remain operational through the second datacenter.

Exchange 2013 also provides functionality that enables administrators to deal with intermittent failures. An intermittent failure is one where, for example, the initial TCP connection can be made, but nothing happens afterward. An intermittent failure requires some sort of extra administrative action, because it might be the result of a replacement device being put into service. While the repair process is underway, the device might be powered on and accepting some requests, but not really ready to service clients until the necessary configuration steps are performed. In this scenario, the administrator can perform a namespace switchover by simply removing the VIP of the device being replaced from DNS. During that service period, no clients will try to connect to it. Once the replacement process is complete, the administrator can add the VIP back to DNS, and clients will eventually start using it again.

Conclusion

Microsoft Exchange Server 2013 continues to innovate in the areas of storage, high availability, and site resilience, with a wealth of new features, such as:

  • Multiple databases per volume
  • Autoreseed
  • Automatic recovery from storage failures
  • Lagged copy enhancements
  • Managed Availability
  • Best Copy and Server Selection enhancements
  • Maintenance Mode
  • Automatic DAG network configuration
  • Operationally simplified site resilience
  • Separation of Mailbox and Client Access recovery
  • Leveraging client-side DNS behaviors, such as IP failover for the namespace

It’s important to understand that major architectural changes had to take place in Exchange 2013 in order to enable these features. That means that while the Exchange 2010 design guidelines can still be applied to an Exchange 2013 organization, the enhanced Exchange 2013 design guidelines cannot be applied to Exchange 2010. All of the goodness above around new behaviors and design options applies only to Exchange 2013.