High Availability in Exchange Server 2010


Now that Exchange Server 2010 has had its first birthday, it's a good time to remind folks about the built-in features for high availability, site resilience and disaster recovery in Exchange 2010. If you're already running Exchange 2010, then you probably already know about database availability groups, mailbox database copies, and Active Manager. But if you're running Exchange Server 2007 or Exchange Server 2003, there will be new concepts and technology with new benefits for your organization as you upgrade to Exchange 2010, such as incremental deployment, datacenter switchovers, and recovery databases.

Building on the native replication capabilities introduced in Exchange Server 2007, Exchange 2010 integrates high availability into the core architecture of Exchange, enabling customers of all sizes and in all segments to economically deploy a messaging continuity service in their organization. Exchange 2010 reduces the cost and complexity of deploying a highly available and site resilient messaging solution while providing higher levels of end-to-end availability, simplifying administration, and supporting large mailboxes.

In previous versions of Exchange, service availability for the Mailbox server role was achieved by deploying Exchange in a Windows failover cluster. To deploy Exchange in a cluster, you had to first build a failover cluster, and then install the Exchange program files. This process created a special Mailbox server called a clustered Mailbox server (or Exchange Virtual Server prior to Exchange 2007). If you had already installed the Exchange program files on a non-clustered server and you decided you wanted high availability, you had to build a cluster using new hardware, or rebuild the existing server by removing Exchange, installing failover clustering, and reinstalling Exchange.

Exchange 2010 introduces the concept of incremental deployment, which enables you to deploy service and data availability for all Mailbox servers and databases after Exchange is installed. Service and data redundancy is achieved by using new features in Exchange 2010 such as database availability groups and mailbox database copies. In Exchange 2010, the days of building clusters and clustered mailbox servers, and the complexity that goes with those tasks, are gone. Mailbox servers can be added to a database availability group, and mailbox databases hosted on those servers can be replicated across the servers to provide automatic recovery at the mailbox database level instead of at the server level. Fast database-level failover times (under 30 seconds) that are transparent to end users, and the capability to switch between database copies, can dramatically improve an organization's uptime. Organizations can now deploy a fully redundant Exchange organization with as few as two Exchange servers, and customers get automatic, database-level failover without having to become experts in Windows failover clustering.

Moreover, you can add site resilience to your existing high availability deployments with less complexity by simply extending a database availability group across multiple physical locations (for example, primary and standby datacenters). By combining the native site resilience capabilities in Exchange 2010 with proper planning, a standby datacenter can be rapidly activated to serve a failed datacenter's clients. In the event of a disaster affecting your primary datacenter, you can use the built-in Exchange PowerShell cmdlets for site resilience to quickly perform a datacenter switchover, which moves the Exchange service namespaces and data endpoints from the primary datacenter to the standby datacenter. This transition is seamless for end users; they don't need to use separate accounts, maintain multiple passwords or learn a new URL. They use the same URLs, namespaces and accounts as in the primary datacenter, and they access the same data.

There's a lot of information out there that will help you plan, design and manage your high availability or site resilience solution. For example, you might want to start with this four-part video blogcast on high availability in Exchange 2010:

Next, visit the Exchange 2010 library in TechNet, where you can read topics that will help you understand high availability and site resilience in Exchange 2010. When you're ready, check out these TechEd presentations for deeper technical content:

Of course, as with previous versions of Exchange, Exchange Server 2010 also includes a rich backup, restore and disaster recovery feature set. For example, like Exchange 2007, Exchange 2010 supports VSS backups of mailbox databases, including both active and passive mailbox database copies. Exchange 2010 also includes support for a "recovery database" (previously called a recovery storage group in Exchange 2007).

But unlike any previous version of Exchange, Exchange 2010 also introduces several new features and core changes that, when deployed and configured correctly, can provide native data protection that eliminates the need to make traditional backups of your data. Using database availability groups to minimize downtime and data loss in the event of a disaster can also reduce the total cost of ownership of the messaging system. And by combining DAGs with other built-in features, such as Single Item recovery, organizations can reduce or eliminate their dependency on traditional point-in-time backups and reduce the associated costs. For more information on these features, see these resources:

To read all available library content for Exchange 2010 high availability and site resilience, see High Availability and Site Resilience in Exchange 2010.

Finally, to read more about continuity services, check out today's post on the UC Blog.

Scott Schnoll


Comments (24)
  1. camlost says:

    after a year, it is the right time to ask a simple question:

    what about high availability of public folders? is it provided by Exchange 2010 for all supported clients? (i.e. even for Outlook 2003 and 2007)

  2. Bharat Suneja [MSFT] says:

    Public Folders have had high availability built-in since their inception – using
    Public Folder replication.

    New High Availability and Site Resilience Functionality

    http://technet.microsoft.com/en-us/library/dd335211.aspx

    * Database copies are for mailbox databases only. For redundancy and high availability of public folder databases, we recommend that you use public folder replication. Unlike CCR, where multiple copies of a public folder database
    couldn’t exist in the same cluster, you can use public folder replication to replicate public folder databases between servers in a DAG.

  3. Frank T says:

    re: incremental deployment

    Wouldn’t we have to upgrade Windows Standard to Windows Enterprise on the server? Is this supported in-place? Otherwise, it’s not so incremental unless you’ve deployed Enterprise to your standalone mailbox servers for some reason.

  4. bday says:

    @Fred, in-place upgrades from Windows Std. to Ent. aren’t supported, but if you know HA is in your plans long-term then you can deploy Windows Server Enterprise Edition from the start. Some shops also only deploy Windows Server Enterprise Edition on all machines because they are taking advantage of Datacenter Edition licensing in fully virtualized environments.

  5. bday says:

    Uh yeah, that should have been to @Frank. :)

  6. camlost says:

    to Bharat Suneja [MSFT]:

    have you ever tried to switch off the database used by your clients? obviously not. otherwise you wouldn’t ever talk about HA in case of PFs.

    Outlook tries to connect to PF DB even if it’s not available. the interim update from October  (not sure about the date but it’s not so important) doesn’t provide solution – both Outlook 2003 and 2007 keep connecting to the unavailable server.

    how do you want to achieve HA if your clients are not able to connect to the available replica? not to mention that replication gets broken because of other HUGE bugs. (try to forward digitally signed messages with attachments – the interim update doesn’t fix it for all cases. it can be easily reproduced.)

  7. Scott – The videos are greatly appreciated. Thanks.

    Yes, DAG really rocks!

  8. Scott says:

    I’ve read in documentation that DAG members must be a part of the same AD domain.  Do AD sub domains count?  Such As:

    DAG Member 1 is in   domain.com

    DAG Member 2 is in   subdomain.domain.com

    Would this scenario work?

  9. Exchange says:

    @Scott – That scenario would not work, as all members must be in the same domain.  A subdomain is a different domain; subdomain.domain.com is not the same as domain.com.

  10. bday says:

    @Scott (the non-MS one): This is a Windows Server requirement, they must be the same AD domain. A child domain is considered a "different" domain. See "domain role" here; http://technet.microsoft.com/en-us/library/dd197454(WS.10).aspx#BKMK_Software_Requirements

  11. Avinash Lewis says:

    Hi all,

    I'm happy that in Exchange 2010 the coding team has been reducing support for public folders. In Exchange 2003/2000 it was a thread of the SA (System Attendant), and any issue in PF would bring the whole system down, which was a headache for Exchange admins, not to forget free/busy and a whole lot of other things connected to PF!

    Thanks

    Avinash Lewis

  12. johnredd says:

    Single Item recovery isn’t very helpful in the case of folder deletion.  Say someone permanently deletes a folder with many subfolders.  Each mail item will show up under Recover Deleted Items with no link to the original folder.  You can only restore back to "deleted items".  Non-Outlook clients will default to a hard delete, so I’ve seen this happen quite often with folders.  A good backup is still needed, preferably one that allows you to restore a mailbox to a point in time.

  13. dageeza says:

    Installed Exchange 2010 SP1 Rollup 2 on one of the CAS servers in the CAS Array yesterday expecting the clients (OL2007) to be unaffected, but in fact they stayed disconnected for the time the services were down being patched – about 5 minutes. Clients are in NON cached mode. This caused a major problem and lack of trust in the product.

    I suspect that the CAS Array is set up fine, as is the client referral (checked that all MBX DBs have the correct CAS Array assignment for RPCClientAccess), and that the problem here is the way Windows NLB works. It is set up as recommended with source IP affinity for RPC connections, but what I suspect happens is that because the server still responds (i.e. it is only the SERVICES which are down and NOT the server), the client affinity persists and so the client shows as disconnected, even though it is still connected to that server in the CAS Array. Does that sound reasonable?

    If this IS the case, is there any workaround other than reconfiguring the whole array to move the server being patched out (which I suspect will require downtime anyway)?

    I am going to test this theory next week by seeing if the same clients have the same problem if I physically down the server in the CAS Array – in which case NLB should reconfigure to point to another server, so maybe 30 second interruption maximum ?

    Anyone else experienced this ?

  14. dageeza says:

    Update to last comment.

    We found an issue with NLB and corrected it. Now we have failover in the event of a complete node failure, however, regarding failure of the RPCClientAccess service, this (strangely) will failover from node A to node B, but NOT vice versa. Can the team give us some in-depth on how this failover occurs – is there a PREFERRED node for this service to come online ? Thanks

  15. dageeza says:

    Deployed E2K10 SP1 UR2 on all servers, dismounted the PF database which is set in the MailboxDatabases for ClientAccess, and nobody able to access the PFs – as was the case before the rollup – in other words, this has not worked. Can the team give some insight into this – has anybody managed to get this to work ? What is supposed to happen with the new functionality ?

    Thanks

  16. Exchange says:

    Dageeza –

    What you are seeing is by design and is the same behavior that existed in previous versions of Exchange.  What initiates the client to use the OpenFlags.AlternateServer routine is receiving the error "MAPI_E_NETWORK_ERROR".  In your scenario, by simply dismounting the public folder store, the client is not getting the network error, but ecLogonFailed instead. Because it is hard to tell (on the client) if the ecLogonFailed is due to the PF store being dismounted vs. some other error that won’t be solved by going to a different PF server, the alternate server routine is not initiated.

    Ross
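
The decision described in this reply can be sketched in a few lines of Python. This is only an illustration of the logic as stated in the comment, not Outlook or Exchange code: the function name and the replica list are hypothetical, and the two error names are treated as plain strings.

```python
from typing import List, Optional

# Per the explanation above, only a network-level failure triggers the
# alternate-server routine; a logon failure (for example, against a
# dismounted public folder store) does not.
RETRY_ON = {"MAPI_E_NETWORK_ERROR"}


def next_public_folder_server(error: str, current: str,
                              replicas: List[str]) -> Optional[str]:
    """Return another replica to try, or None if the client stays put."""
    if error not in RETRY_ON:
        return None  # e.g. ecLogonFailed: cause is ambiguous, no failover
    for server in replicas:
        if server != current:
            return server
    return None  # no other replica exists
```

Calling it with "ecLogonFailed" returns None even when a second replica is listed, which matches the behavior dageeza observed after dismounting the store.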

  17. Exchange says:

    Dageeza,

    With respect to your questions regarding upgrading CAS members in a load balanced array, you should be following these steps (note I wrote these for RTM to SP1, but same steps apply for rollups):

    PREREQ: Do not upgrade any mailbox servers to SP1 before all CAS within the AD site are upgraded to SP1.

    1.  Drain stop X (RTM) servers, where X is a number of CAS members such that you don’t impact current CAS array load (e.g., if you have 6 CAS members in the LB array and designed the implementation such that you can survive 3 simultaneous CAS failures, then X = 3 or less).

    2.  Take X servers out of LB pool after verifying all current connections have been terminated.

    3.  Upgrade X servers to SP1.  Verify upgrade is successful.

    4.  Drain stop Y (RTM) servers (where Y = total number of CAS members in LB array – X) and add X (SP1) servers back into the LB array.

    5.  Confirm all connections on Y (RTM) servers have been terminated and then remove Y (RTM) servers from LB array.

    6.  Upgrade Y (RTM) servers to SP1.  Verify upgrade is successful.

    7.  Add Y (SP1) servers back to array.

    Ross
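
The seven steps above can be turned into an ordered plan. The following Python sketch only models the sequence as written in the comment; the function name, the action labels, and the server names are hypothetical, and nothing here talks to a load balancer or to Exchange.

```python
from typing import List, Tuple

Step = Tuple[str, List[str]]


def plan_rolling_upgrade(members: List[str], batch_size: int) -> List[Step]:
    """Split the CAS array into batch X and batch Y and order the actions.

    batch_size is X: how many members can leave the array at once
    without impacting the current load.
    """
    if not 0 < batch_size < len(members):
        raise ValueError("batch_size must be between 1 and len(members) - 1")
    batch_x = members[:batch_size]   # upgraded first
    batch_y = members[batch_size:]   # Y = total members - X
    return [
        ("drain-stop", batch_x),          # step 1
        ("remove-from-pool", batch_x),    # step 2, once connections end
        ("upgrade-and-verify", batch_x),  # step 3
        ("drain-stop", batch_y),          # step 4, first half
        ("add-to-pool", batch_x),         # step 4, second half
        ("remove-from-pool", batch_y),    # step 5, once connections end
        ("upgrade-and-verify", batch_y),  # step 6
        ("add-to-pool", batch_y),         # step 7
    ]
```

For a six-member array that tolerates three failures, `plan_rolling_upgrade(names, 3)` yields eight actions, because step 4 in the list covers two of them.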

  18. dageeza says:

    To those who answered thank you very much. To Exchange first point about failover PF.

    Can you advise what exactly will cause this mechanism to come into play – what kind of failure can I create to test this ? Can you advise how this functionality works please ?

    Regarding second point – failover works fine now if the node is down. Failover also works in one direction only for the RPCClientAccess service – can you advise how this functionality works and how to troubleshoot this – should it allow for failover in this circumstance ?

    Thanks again

  19. dageeza says:

    Hi Exchange

    Thanks for the info about upgrading CAS servers – if you only have two in the array, and you do this out of hours, I guess you can skip the drain stop part ? We have upgraded with no ill effects.

    I thought I would return the favour and provide a step by step deployment when you are running ForeFront for Exchange. I have fully tested this so it is based on successful experience. Obviously for the HT/CAS, your procedure can now substitute that which I used.

    Procedure for installation

    1. Move all MailboxDatabases to a desired alternative target server**

    2. Ensure both Copies and Search indexes are in Healthy state, that there are zero logs to be copied, and the latest copy timestamp is current.**

    3. Run the following script to put the server into Maintenance mode :-

    “C:\Program Files\Microsoft\Exchange Server\V14\Scripts\StartDagServerMaintenance.ps1”**

    (Note that you may receive a warning that this will only run when there are multiple copies of the MailboxDatabase – ignore this and continue running the script)

    4. Disable CRL checking (Internet Explorer\Tools\Internet Options\Advanced\Security\Check for Publisher’s Certificate Revocation – untick box)

    5. Stop ForeFront services. Open command prompt with elevated Administrator privileges.

    Exchange HT/CAS

    Path to “C:\Program Files (x86)\Microsoft Forefront Protection for Exchange Server”

    Run fscutility /disable

    At prompt, run Net Stop FSCController

    Again run fscutility /disable

    There will be a warning then confirmation that the MSTransport service link has been removed – Forefront is now disabled correctly

    Exchange MBX

    Path to “C:\Program Files (x86)\Microsoft Forefront Protection for Exchange Server”

    Run fscutility /disable

    At prompt, run Net Stop MSExchangeIS

    Again run fscutility /disable

    At prompt, run Net Stop FSCController

    Again run fscutility /disable

    There will be a warning then confirmation that the MSExchangeIS service link has been removed – Forefront is now disabled correctly

    6. Copy rollup pack to local machine (this should have been done in advance – location on each machine is “C:\SPs_Patches\Update_Rollups\<Rollup_Number>”)

    7. Open cmd with administrator elevated privileges or reuse Window already opened

    8. Install update – the next steps assume this installs without any problems

    9. (Optional but recommended) Reboot

    10. Check event logs

    11. Check services started correctly (Test-ServiceHealth)

    12. Restart ForeFront services (fscutility /enable)

    13. Re-enable CRL checking

    14. Run the following script to take the server out of Maintenance mode :-

    “C:\Program Files\Microsoft\Exchange Server\V14\Scripts\StopDagServerMaintenance.ps1”**

    15. Check MailboxDatabaseCopyStatus is normal **

    16. Check event logs

    17. Failover MailboxDatabases to this server (optional)**

    18. Monitor for any issues

    ** DAG Members only
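
The precondition in step 2 of this procedure (copies and search indexes Healthy, zero logs waiting to be copied, a current copy timestamp) is easy to get wrong by eye. A minimal Python sketch of that check follows. The `CopyStatus` fields are hypothetical stand-ins for whatever your status report provides, and the 15-minute freshness window is an assumption, since the comment only says the timestamp should be "current".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass
class CopyStatus:
    """Hypothetical snapshot of one mailbox database copy."""
    database: str
    copy_state: str         # expected "Healthy"
    index_state: str        # search index state, expected "Healthy"
    copy_queue_length: int  # logs still waiting to be copied
    last_copied: datetime   # timestamp of the most recent log copy


def ready_for_maintenance(copies: Iterable[CopyStatus], now: datetime,
                          max_age: timedelta = timedelta(minutes=15)) -> bool:
    """True only if every copy passes all of the step 2 checks."""
    return all(
        c.copy_state == "Healthy"
        and c.index_state == "Healthy"
        and c.copy_queue_length == 0
        and now - c.last_copied <= max_age  # assumed freshness window
        for c in copies
    )
```

A single lagging or stale copy makes the function return False, so the maintenance script would not be started until every copy is caught up.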

  20. Survey says:

    Hi team,

    i would like to know if following DAG scenario is supported:

    1 AD Forest/domain

    1 Exchange Org + 1 DAG

    EUROPEAN SITE: 2 DC + 1 EXCHANGE 2010 HUB-CAS-MBX with 500 mailboxes ( 1 active copy + 2 passive copy ) + FSW

    AMERICAN SITE:2 DC + 1 EXCHANGE 2010 HUB-CAS-MBX with 500 mailboxes ( 1 active copy + 2 passive copy )

    ASIAN SITE:2 DC + 1 EXCHANGE 2010 HUB-CAS-MBX with 500 mailboxes ( 1 active copy + 2 passive copy )

    Tx a lot.

  21. Exchange says:

    @Survey: A three-member DAG would not use a witness server.  Other than that, yes you can have DAG members in multiple sites, and those members can host a combination of active and passive mailbox databases, as configured by the administrator.  Note, however, that you would not have a true HA environment, as you only have a single CAS/Hub in each site.  For true Exchange HA, you need multiple instances of CAS and Hub in each site where HA is needed, and you’ll need some form of load balancing in front of your CAS servers.

    -Scott
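
The remark that a three-member DAG would not use a witness server follows from how quorum is counted: a DAG with an even number of members adds the witness as a tie-breaking vote, while a DAG with an odd number of members relies on the members alone. A small Python sketch of that arithmetic, with hypothetical function names and no claim to model every failure mode:

```python
def uses_witness(member_count: int) -> bool:
    """A witness vote is only counted when the member count is even."""
    return member_count % 2 == 0


def votes_needed(member_count: int) -> int:
    """Majority of all voters (members plus the witness, if used)."""
    voters = member_count + (1 if uses_witness(member_count) else 0)
    return voters // 2 + 1


def has_quorum(member_count: int, members_up: int,
               witness_up: bool = True) -> bool:
    """True if the reachable voters still form a majority."""
    votes = members_up
    if uses_witness(member_count) and witness_up:
        votes += 1
    return votes >= votes_needed(member_count)
```

In the three-site layout from the question, `has_quorum(3, 2)` is True and `has_quorum(3, 1)` is False: the DAG survives the loss of one site but not two.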

  22. dageeza says:

    Hi Survey

    Correct me if I am wrong, but you are considering the databases as parts of the server environment in your DAGs? You need to consider the mailbox databases as your failover units, NOT the servers. When you consider it like that, it does not really matter where the "live" copy is; what matters is how many copies you have. Scott is right that you will need to set up a CAS array for true HA, and ensure you repoint your mailbox databases to the CAS array instead of (as now) the individual CAS server. If you make the boxes you use host both CAS and HT roles, the HT load balancing and failover will take care of itself. Remember also to create a DNS record which points to your CAS array. Lastly, take note of the earlier comments about PF database failover (if using PFs). You will need to create replicas on alternative Public Folder databases. If you then suffer any of the conditions mentioned earlier, failover should be automatic, but if you simply make the primary PF database unavailable (e.g. dismount it), you will need to manually change over the PF database configured for each mailbox database. HTH

  23. dageeza says:

    One last thing to remember: there are additional considerations in your case of different sites in terms of setting up automatic failover. I cannot locate it now, but there is a great Microsoft article which explains how to set up the automatic failover options when your DAG covers multiple geographic sites.

Comments are closed.
