Understanding Internet Database Availability Groups

The high availability and site resilience platform built into Exchange 2010 provides automatic recovery from storage, network, server, and other failures that affect active mailbox database copies. In addition, it provides manual recovery from failures affecting an entire site or datacenter.

Customers have asked for, and many have tried desperately to create, an Exchange 2010 architecture or design that provides automatic failover from datacenter-level events. At the time we released the RTM version of Exchange 2010, we strongly felt that manual recovery from events that affect an entire datacenter (for example, loss of power, fire, or natural disaster) was preferable to automatic recovery by the system. After all, the system has no awareness of the nature of the failure, and it can’t really make intelligent decisions about whether users should have service and data access moved to an alternate datacenter.

But after more than five years of running Exchange in a cloud-based service environment in one form or another (Business Productivity Online Suite, Live@EDU, Office 365, etc.) we have realized that automatic recovery could be accomplished by the system in an intelligent way, provided the system had enough replication endpoints, and a sufficient number of points of Internet ingress and egress. So, we also started side development on a feature we internally called the “super DAG”. Obviously, using that name as the actual feature name does not work well for localization and other reasons, so we asked Marketing to come up with a more professional name.

The Internet Database Availability Group

Today, we are announcing the Internet database availability group (iDAG). The iDAG is an extension to the database availability group (DAG) component built into Exchange 2010. An iDAG is a group of an unlimited number of Mailbox servers that hosts an unlimited number of databases and provides automatic recovery from datacenter-level failure events.

Once a DAG has been extended with an iDAG, the iDAG becomes the new boundary for mailbox database replication, database and server switchovers and failovers, and a new internal component called Internet Active Manager (iAM). iAM, which runs on every server in an iDAG, manages switchovers and failovers for iDAGs. Will iAM enable better management and recovery scenarios for your iDAG? Absolutely!

The iDAG overcomes several limitations of the traditional DAG:

  • It can contain more than 16 Mailbox servers. In fact, there is no limit to the number of Mailbox servers that can be added to an iDAG.
  • It can contain more than 1600 databases. In fact, there is no limit to the number of mailbox databases that can be hosted on a Mailbox server in an iDAG.
  • It uses HTTPS instead of a dedicated TCP port for log shipping, making it much more firewall-friendly.
  • It provides automatic recovery from all failures.

Any server in an iDAG can host a copy of a mailbox database from any other server in the same iDAG.

Internet Database Availability Group Lifecycle

As with DAGs, iDAGs leverage the Exchange 2010 feature known as incremental deployment, which is the ability to deploy redundant Mailbox servers and databases after Exchange is installed. After you deploy Exchange 2010, you create an iDAG, add Mailbox servers to the iDAG, and then replicate mailbox databases between the iDAG members.

Note: It is supported to create an iDAG that contains a combination of physical Mailbox servers and virtualized Mailbox servers, provided that the servers and solution comply with the Exchange 2010 System Requirements. As with all high availability configurations, you must ensure that all Mailbox servers in the iDAG are sized appropriately to handle the necessary workload during scheduled or unscheduled outages.

An iDAG is created by using the New-InternetDatabaseAvailabilityGroup cmdlet. An iDAG is initially created as an empty Active Directory object. This directory object is used to store relevant information about the iDAG, such as server membership information, HTTP URL, and location. When you add the first server to an iDAG, an Internet-based failover cluster is automatically created. This failover cluster is used by the iDAG; however, unlike the cluster underlying a DAG, it does not need to be dedicated to the iDAG, and you can use it for any other purpose you like.

In addition to a failover cluster being created, the infrastructure that monitors the servers for network or server failures is initiated. The failover cluster heartbeat mechanism and cluster database are then used to track and manage information about the iDAG that can change quickly, such as database mount status, replication status, and last mounted location.

During creation, an iDAG is given a unique name, which must begin with a lowercase letter “i”. It is also assigned one or more static IPv6 addresses, and more than one address can be used for redundancy. Specify the addresses as a comma-separated list by using the InternetDatabaseAvailabilityGroupIPAddresses parameter.
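The naming and addressing rules above can be checked before the cmdlet is run. The following sketch uses a hypothetical Python helper; `validate_idag_config` is not part of Exchange. It enforces the lowercase "i" prefix and parses the comma-separated list with the standard `ipaddress` module, and it rejects any entry that is not an IPv6 address.

```python
import ipaddress


def validate_idag_config(name: str, ip_list: str) -> list:
    """Check an iDAG name and parse its comma-separated IPv6 address list.

    Hypothetical pre-flight check modeled on the rules in the text;
    it is not an Exchange cmdlet or API.
    """
    # The iDAG name must begin with a lowercase "i".
    if not name.startswith("i"):
        raise ValueError(f"iDAG name {name!r} must start with a lowercase 'i'")

    addresses = []
    for raw in ip_list.split(","):
        text = raw.strip()
        # ip_address() raises ValueError for malformed input, including "".
        addr = ipaddress.ip_address(text)
        if addr.version != 6:
            raise ValueError(f"{text} is not an IPv6 address")
        addresses.append(addr)
    return addresses


addrs = validate_idag_config(
    "iDAG1",
    "2001:0:4137:1f9a:1037:1d90:b3e4:3e79, 2001:4898:80a8:f019:d8d6:250e:baf0:9393",
)
```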

Consider an iDAG that will have three servers. Two servers (EX1 and EX2) are on the same subnet, and the third server (EX3) is on a different subnet.

# Create the iDAG with one static IPv6 address for each subnet (backtick continues the line).
New-InternetDatabaseAvailabilityGroup -Name iDAG1 `
    -InternetDatabaseAvailabilityGroupIPAddresses "2001:0:4137:1f9a:1037:1d90:b3e4:3e79","2001:4898:80a8:f019:d8d6:250e:baf0:9393"
# Add the three Mailbox servers. The cluster is created when EX1 is added.
Add-InternetDatabaseAvailabilityGroupServer -Identity iDAG1 -Server EX1
Add-InternetDatabaseAvailabilityGroupServer -Identity iDAG1 -Server EX2
Add-InternetDatabaseAvailabilityGroupServer -Identity iDAG1 -Server EX3

The cluster for iDAG1 is created when EX1 is added to the iDAG. During cluster creation, the Add-InternetDatabaseAvailabilityGroupServer cmdlet retrieves the IP addresses configured for the iDAG and ignores the ones that don't match any of the subnets found on EX1. In this example, the cluster for iDAG1 is created with an IP address of 2001:0:4137:1f9a:1037:1d90:b3e4:3e79, and 2001:4898:80a8:f019:d8d6:250e:baf0:9393 is ignored.

Then, EX2 is added, and the Add-InternetDatabaseAvailabilityGroupServer cmdlet again retrieves the IP addresses configured for the iDAG. There are no changes to the cluster's IP addresses because EX2 is on the same subnet as EX1.

Then, EX3 is added, and the Add-InternetDatabaseAvailabilityGroupServer cmdlet again retrieves the IP addresses configured for the iDAG. Because a subnet matching 2001:4898:80a8:f019:d8d6:250e:baf0:9393 is present on EX3, that address is added as an IP address resource in the cluster group. In addition, the Network Name resource is automatically configured with an OR dependency on each IP address resource, so it can come online when any one of those IP address resources is online. The 2001:4898:80a8:f019:d8d6:250e:baf0:9393 address will be used by the cluster when the cluster group moves to EX3.
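The address-matching behavior for EX1, EX2, and EX3 can be modeled in a few lines. This is a toy Python model and not Exchange or clustering code. The `ClusterModel` class is hypothetical. The /64 subnet assigned to each server is an assumption because the example does not state prefix lengths.

```python
import ipaddress


class ClusterModel:
    """Toy model of iDAG addresses becoming cluster IP address resources.

    Hypothetical: an illustration only, not Exchange or
    Windows failover clustering code.
    """

    def __init__(self, idag_addresses):
        self.idag_addresses = [ipaddress.IPv6Address(a) for a in idag_addresses]
        self.ip_resources = []  # addresses added to the cluster group, in order

    def add_server(self, server_subnets):
        """Add a server; return the addresses newly added as IP resources."""
        subnets = [ipaddress.IPv6Network(s) for s in server_subnets]
        added = []
        for addr in self.idag_addresses:
            # Addresses already in the cluster group are not added twice.
            if addr in self.ip_resources:
                continue
            # Addresses that match none of the server's subnets are ignored.
            if any(addr in net for net in subnets):
                self.ip_resources.append(addr)
                added.append(addr)
        return added

    def network_name_can_come_online(self, online_addresses):
        """OR dependency: any one online IP resource is sufficient."""
        online = {ipaddress.IPv6Address(a) for a in online_addresses}
        return any(addr in online for addr in self.ip_resources)
```

With the two addresses from the example, the first `add_server` call for EX1 returns the first address. The call for EX2 returns nothing new. The call for EX3 returns the second address, which matches the walkthrough above.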

This process is then repeated an unlimited number of times to extend the iDAG further and further into cyberspace.

Windows failover clustering registers the IP addresses for the cluster in the Domain Name System (DNS) when the Network Name resource is brought online. In addition, a cluster network object (CNO) is created in Active Directory. The name, IP addresses and CNO for the cluster are used only internally by the system to secure the iDAG and for internal communication purposes. Administrators and end users don't need to interface with or connect to the iDAG name or IP address for any reason.

In addition to a name and one or more IP addresses, the iDAG is also configured to use a witness server, witness directory, alternate witness server, and alternate witness directory. The witness server and witness directory are either automatically specified by the system, or they can be manually specified by the administrator. The alternate witness server and directory must be specified manually by the administrator. In addition, multiple alternate witness servers can be configured for increased redundancy. In the event of a datacenter failover, the iDAG will automatically reconfigure the witness server and alternate witness server for you.
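The text does not say how the iDAG chooses among multiple alternate witness servers after a datacenter failover. One plausible rule is to walk the configured list in preference order and pick the first witness whose datacenter is still available. The following Python sketch assumes that rule. All server, directory, and site names are hypothetical.

```python
from typing import NamedTuple


class Witness(NamedTuple):
    """One witness candidate; all values used with it are hypothetical."""
    server: str
    directory: str
    site: str


def select_witness(candidates, failed_sites):
    """Return the first witness, in preference order, located in a surviving site.

    Assumed selection rule for illustration; the article does not specify one.
    """
    for witness in candidates:
        if witness.site not in failed_sites:
            return witness
    raise RuntimeError("no witness candidate is available in a surviving site")


candidates = [
    Witness("HUB01", r"C:\Witness\iDAG1", "Redmond"),    # primary witness
    Witness("HUB02", r"C:\Witness\iDAG1", "Dublin"),     # first alternate
    Witness("HUB03", r"C:\Witness\iDAG1", "Singapore"),  # second alternate
]
```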

By default, an iDAG is designed to use the built-in continuous replication feature to replicate mailbox databases among servers in the iDAG. If you're using third-party data replication that supports the Third Party Replication API in Exchange 2010, you must create the iDAG for use with third-party replication mode by using the New-InternetDatabaseAvailabilityGroup cmdlet with the ThirdPartyReplication parameter. After this mode is enabled, it can't be disabled.

After the iDAG is created, Mailbox servers can be added to the iDAG. When the first server is added to the iDAG, a cluster is formed for use by the iDAG. iDAGs make limited use of Windows failover clustering technology, such as the cluster heartbeat, cluster networks, and the cluster database (for storing data that changes, such as database state changes from active to passive or vice versa, or from mounted to dismounted and vice versa). As each subsequent server is added to the iDAG, it's joined to the underlying cluster, the cluster's quorum model is automatically adjusted by the system, and the server is added to the iDAG object in Active Directory.

After Mailbox servers are added to an iDAG, you can configure a variety of iDAG properties, such as whether to use network encryption or network compression for database replication within the iDAG. You can also configure iDAG networks and create additional iDAG networks.

After you add members to an iDAG and configure the iDAG, the active mailbox databases on each server can be replicated to the other iDAG members. After you create mailbox database copies, you can monitor the health and status of the copies using a variety of built-in monitoring tools. In addition, you can perform database and server switchovers.

Internet Database Availability Group Quorum Models

Underneath every iDAG is a Windows failover cluster. Failover clusters use the concept of quorum, which uses a consensus of voters to ensure that only one subset of the cluster members (which could mean all members or a majority of members) is functioning at one time. Quorum isn't a new concept for Exchange 2010. Highly available Mailbox servers in previous versions of Exchange also used failover clustering and its concept of quorum.

What is new is the iDAG's use of a new quorum model called the Internet Node Majority and File Share Witness Model (iNMFSW). The iNMFSW is similar to traditional quorum models that require a minimum number of votes to maintain quorum. However, the minimum number of votes needed for quorum in an iDAG is one. This is because, unlike traditional DAGs (which are limited to using voting members and a single witness server), an iDAG uses all Internet-capable devices to maintain quorum. So unless the Internet is down or otherwise unavailable, an iDAG can never lose quorum. If quorum is lost, however, administrator intervention will be required to correct the quorum problem and restore iDAG operations.
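The following Python sketch contrasts the two quorum rules. The traditional DAG side reflects documented Exchange 2010 behavior. A DAG with an even number of members uses Node and File Share Majority, so the witness adds a vote. A DAG with an odd number of members uses Node Majority. The iNMFSW side encodes the one-vote rule stated above. All function names are hypothetical.

```python
def traditional_dag_voters(members: int) -> int:
    """Total votes in an Exchange 2010 DAG cluster.

    Even member counts use Node and File Share Majority (the witness
    adds one vote); odd member counts use Node Majority.
    """
    return members + 1 if members % 2 == 0 else members


def has_quorum_traditional(members: int, votes_up: int) -> bool:
    # A strict majority of all votes must be available and communicating.
    return votes_up >= traditional_dag_voters(members) // 2 + 1


def has_quorum_inmfsw(reachable_voters: int) -> bool:
    # iNMFSW rule as described in the text: a single reachable vote suffices.
    return reachable_voters >= 1
```

For example, a four-member DAG has five votes and needs three of them. Under iNMFSW, any single reachable voter keeps the iDAG running.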

Using an Internet Database Availability Group for High Availability

To illustrate how an iDAG can provide high availability and site resilience for your mailbox databases, consider an iDAG with five hundred members.

In this example, the database copies aren't mirrored across each server, but rather spread across hundreds of servers scattered across the Internet. This ensures that no two servers in the iDAG have the same set of database copies, providing the iDAG with greater resilience to failures, including failures that occur while other components are unavailable as a result of regular maintenance.
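The article does not say how an iDAG places copies. One simple way to get a distinct copy set on each server is to put each database's copies on consecutive servers, starting at an offset equal to the database index. The following Python sketch assumes that layout for illustration. When the number of databases equals the number of servers and each database has fewer copies than there are servers, every server hosts a different set. The included check flags a mirrored layout, in which every server holds the same set.

```python
from collections import defaultdict


def place_copies(num_servers, num_databases, copies_per_db):
    """Hypothetical layout: database d gets copies on servers d, d+1, ... (wrapping)."""
    if not 1 <= copies_per_db <= num_servers:
        raise ValueError("copies_per_db must be between 1 and num_servers")
    hosted = defaultdict(set)
    for db in range(num_databases):
        for k in range(copies_per_db):
            hosted[(db + k) % num_servers].add(db)
    return dict(hosted)


def copy_sets_are_distinct(hosted):
    """True if no two servers host exactly the same set of database copies."""
    sets = [frozenset(dbs) for dbs in hosted.values()]
    return len(sets) == len(set(sets))


layout = place_copies(num_servers=500, num_databases=500, copies_per_db=3)
mirrored = {"EX1": {0, 1, 2}, "EX2": {0, 1, 2}}  # every server holds the same set
```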

Consider the following scenario, using the preceding example iDAG, which illustrates resilience to multiple database and server failures.

Initially, all databases and servers are healthy. You need to install some operating system updates on EX212. You perform a server switchover, which activates the database copies on another Mailbox server somewhere on the Internet. A server switchover moves all active mailbox database copies from their current server to one or more other Mailbox servers in the iDAG in preparation for a scheduled outage for the current server. You can perform a server switchover quickly by running the following command in the Exchange Management Shell.

Move-ActiveMailboxDatabase -Server EX212

In this example, all of the active mailbox databases on EX212 are moved. Because the ActivateOnServer parameter is omitted from the preceding command, the system selects the best possible new active copy for each database.
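Exchange 2010's Active Manager picks the new active copy through a best copy selection process. The following Python sketch is a deliberately simplified, hypothetical stand-in for that process, and the real algorithm considers more criteria. The sketch skips the server being switched over and any unhealthy copies. It then prefers the copy with the shortest copy queue and breaks ties on replay queue length. All server names other than EX212 are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CopyStatus:
    server: str
    healthy: bool
    copy_queue_length: int    # logs not yet copied to this server
    replay_queue_length: int  # logs copied but not yet replayed into the database


def select_best_copy(copies, exclude_server) -> Optional[str]:
    """Pick a new active server; a simplified stand-in for best copy selection."""
    candidates = [c for c in copies if c.healthy and c.server != exclude_server]
    if not candidates:
        return None
    # Least potential data loss first (copy queue), then least replay work.
    best = min(candidates, key=lambda c: (c.copy_queue_length, c.replay_queue_length))
    return best.server


copies = [
    CopyStatus("EX212", True, 0, 0),   # server being switched over
    CopyStatus("EX17", True, 0, 5),
    CopyStatus("EX301", True, 0, 2),
    CopyStatus("EX44", False, 0, 0),   # unhealthy, never chosen
    CopyStatus("EX98", True, 4, 0),
]
```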

While you perform maintenance on EX212, EX306 experiences a catastrophic hardware failure and goes offline. Prior to going offline, EX306 hosted active database copies. To recover from the failure, the system automatically activates copies of those databases on alternate servers in the iDAG within 30 seconds.

After the scheduled maintenance is completed for EX212, you bring the server online. As soon as EX212 is available, the other members of the iDAG are notified, and the database copies hosted on EX212 are automatically synchronized with the active copy of each database.

After the failed hardware component in EX306 is replaced with a new component, EX306 is brought online. After EX306 is available, the other members of the iDAG are notified, and the databases hosted on EX306 are automatically synchronized with the active copy of each database.
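When EX212 or EX306 returns, each of its passive copies needs the transaction log generations it missed. The following Python sketch shows that calculation. It assumes log generations are consecutive integers and does not handle divergence. `missing_log_generations` is a hypothetical name.

```python
def missing_log_generations(passive_last_copied: int, active_current: int) -> list:
    """Return the log generations a passive copy must copy to catch up.

    Simplified sketch: assumes consecutive integer generations.
    """
    if passive_last_copied > active_current:
        # A passive copy that is ahead of the active copy has diverged; a real
        # system must resolve that divergence instead of shipping logs.
        raise ValueError("passive copy is ahead of the active copy")
    return list(range(passive_last_copied + 1, active_current + 1))
```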

Using an Internet Database Availability Group for Site Resilience

In addition to providing high availability within a datacenter, an iDAG can also be extended to an unlimited number of datacenters in a configuration that provides site resilience for all of them. Incremental deployment can be used to extend any iDAG to any datacenter by deploying Mailbox servers and the necessary supporting resources.

Using Multiple Internet Database Availability Groups for Site Resilience

Prior to iDAGs, achieving site resilience for multiple locations required multiple DAGs. With an iDAG, this is no longer necessary. A single iDAG can be extended across an unlimited number of datacenters, providing site resilience for all of your locations and databases. When a single iDAG provides site resilience in an environment where each datacenter to which you extend the iDAG has an active user population, the Internet eliminates all single points of failure, because quorum no longer requires a majority of the voters to be active and able to communicate with each other. It simply requires connectivity to another iDAG member.

Client Experience When Using Internet Database Availability Groups

As mentioned above, iDAGs can be used to provide both high availability and site resilience. The client experience when using an iDAG no longer depends on the type and version of the client and the protocol used by the client to access mailbox data. For example, if a datacenter failover occurs, the behavior and reconnection logic used by an Exchange ActiveSync, POP3, or IMAP4 client is the same as the behavior and reconnection logic used by Microsoft Outlook clients.