Tips for disaster recovery readiness


I have the pleasure of working for the Exchange Critical Situation team in North Carolina, so quite a few Disaster Recovery cases end up on my phone line.  The number one reason they end up there is that the company simply has not planned for a failure.  We all assume that nothing is going to happen to us (I am the same way: I live in hurricane country and yet I don’t have a hurricane kit or plan), but with email, and thus Exchange, being mission critical to most businesses running it today, that is not an assumption any reasonable IT administrator can afford to make.


 


With that in mind I would like to share some basic things that everyone who is running Exchange needs to do in order to be prepared.  While a lot of this is not "brand new" knowledge, it does address many of the things that I see go wrong in the cases that I work on. DR Happens - Will you be ready?


 


Service Level Agreement


 


The common scenario we run into here is that people have not thought about what matters most if the Exchange server fails.  I will be on the phone working with a customer and we determine that they need to do a restore from backup.  At this point (especially with Exchange 2003) there are a few ways we can go about doing that.  What is important to your company will help determine the best way to do the restore.


 


Not having this information decided in advance leads to long conversations with management to get the decisions made.  This can drastically slow down the pace of recovery.  In some rare cases I have ended up spending more time talking about what we could do than we spent actually doing it.


 


You need to have answered a few simple questions ahead of time:


 


1) Which one is more important to my users: Restoration of Mail Flow or Recovery of Historical Data?


2) How long can we afford to be down without any Mail Flow?


3) How long can we afford to be down with no Historical Data Recovered?


4) If Historical Data is our top priority, at what point does Mail Flow become more important, and vice versa?


 


These four questions will help to define your options and what you can and cannot do in order to restore Exchange to the functional level that you desire in the minimal amount of time.  Having these decided in general is the first step to having a smooth disaster recovery.
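One way to make sure the answers are actually written down is to record them in a small structure that anyone on the recovery call can consult.  The following is only a minimal sketch; the class name, field names and hour values are hypothetical and are not part of any Exchange tooling.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPriorities:
    """Answers to the four SLA questions (all names here are illustrative)."""
    top_priority: str                         # "mail_flow" or "historical_data"
    max_hours_without_mail_flow: float        # question 2
    max_hours_without_historical_data: float  # question 3
    switch_priority_after_hours: float        # question 4

    def __post_init__(self):
        if self.top_priority not in ("mail_flow", "historical_data"):
            raise ValueError("top_priority must be 'mail_flow' or 'historical_data'")

    def priority_at(self, hours_down: float) -> str:
        """Return which goal to pursue once the outage has lasted hours_down."""
        if hours_down < self.switch_priority_after_hours:
            return self.top_priority
        # Past the agreed cut-over point the other goal takes over.
        return "historical_data" if self.top_priority == "mail_flow" else "mail_flow"


# Hypothetical answers: historical data first, but after 6 hours
# getting mail flowing again becomes more important.
plan = RecoveryPriorities("historical_data", 8, 24, 6)
print(plan.priority_at(2))
print(plan.priority_at(7))
```

With the cut-over point agreed in advance, the question "what do we do now?" becomes a lookup instead of a meeting.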


 


Database Size


 


This is another common situation that I run into.  We are doing a restore from tape of a 120 GB storage group and suddenly someone realizes that the restore is going to take another 18 hours to finish, and it is 11pm right now.  So it is going to cut into the business day, and that can’t be allowed to happen.  Now we end up in a panic situation where people are willing to try any crazy scheme they can think of to get it back up before the morning.


 


This situation almost always comes about because people plan their database size based on their disk size and not the limitations of their Backup and Restore plan.  Database size should be determined almost solely by your SLA and your backup and restore speed.  This will ensure that when something goes wrong you will be able to get everything back up and running in a predictable and timely manner.


 


So what you need to do with database size is work it backwards.  Determine how long you can be without Historical Data.  Then determine how fast you can restore from tape.  Use those two numbers, with some padding for troubleshooting when the failure is discovered and some padding for log file replay after the restore is done, to determine how large your databases can be.
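The arithmetic of working backwards can be sketched in a few lines.  This is only an illustration: the function name and all of the figures (an 8 hour SLA, 10 GB per hour from tape, one hour each of padding) are assumptions, so substitute the numbers you measure in your own environment.

```python
def max_database_size_gb(sla_hours, restore_rate_gb_per_hour,
                         troubleshooting_hours, log_replay_hours):
    """Largest database that can be restored from tape inside the SLA window."""
    # Time actually left for the restore once the padding is taken out.
    usable_hours = sla_hours - troubleshooting_hours - log_replay_hours
    if usable_hours <= 0:
        raise ValueError("Padding uses up the entire SLA window")
    return usable_hours * restore_rate_gb_per_hour


# Hypothetical figures: 8 hour SLA, 10 GB/hour from tape,
# 1 hour to troubleshoot the failure, 1 hour of log file replay.
limit = max_database_size_gb(8, 10, 1, 1)
print(f"Maximum database size: {limit} GB")
```

With these example numbers, six hours remain for the restore itself, which caps the database at 60 GB no matter how much disk space is available.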


 


You also need to figure out whether that number will hold when you have to restore a whole storage group of 5 databases, or a whole server of 20 databases.  In most cases you will probably want an SLA for each of those three situations, since it clearly will take more time to restore 5 or 20 databases than it will to restore one.
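To check all three situations at once, a rough estimate like the following can help.  It assumes the databases are restored one after another from a single tape drive at a constant rate, which may not match your hardware; the function names, sizes, rates and SLA hours are all hypothetical.

```python
# Number of databases in each situation, taken from the text above.
SCENARIOS = {"single database": 1, "storage group": 5, "whole server": 20}


def restore_hours(database_count, size_gb, rate_gb_per_hour, padding_hours):
    """Estimated hours to restore the databases sequentially, plus padding."""
    return database_count * size_gb / rate_gb_per_hour + padding_hours


def scenarios_meeting_sla(size_gb, rate_gb_per_hour, padding_hours, sla_hours):
    """Map each scenario name to True if its estimate fits its SLA."""
    return {
        name: restore_hours(count, size_gb, rate_gb_per_hour, padding_hours)
        <= sla_hours[name]
        for name, count in SCENARIOS.items()
    }


# Hypothetical figures: 20 GB databases, 10 GB/hour, 2 hours of padding.
slas = {"single database": 8, "storage group": 12, "whole server": 24}
for name, ok in scenarios_meeting_sla(20, 10, 2, slas).items():
    print(f"{name}: {'fits SLA' if ok else 'misses SLA'}")
```

In this example a single database and a storage group fit their windows, but a full server restore (42 hours estimated against a 24 hour SLA) does not, which is the kind of gap you want to find before the failure rather than during it.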


 


Practice


 


Now let us say that you have been diligent: you have your SLA written out and you have your databases at a reasonable size.  You are all prepared, right?  Wrong.  When it is time to do a disaster recovery, mistakes are measured in hours.  Checking the wrong box in your backup software can cost you your entire SLA window.  Plus, you don’t want to spend 30 minutes reading the directions for your backup software, or calling your backup vendor to figure out how to get the restore off of tape, while you are offline.  You need to already be at least basically familiar with the restore process.


 


What you need to do is practice as if your Exchange server had failed.  We call this process running a Fire Drill.  You should run an Exchange Fire Drill at least once a quarter to keep everyone up to date on how the restore process works and how to perform it.


 


To run a Fire Drill you should set up a server (a beefy workstation will do) with sufficient drive space to accommodate the Exchange database from at least one of your servers.  You would then set it up on its own network with its own Domain Controller (if you are not testing full server restore then this can be a new domain).  Install Exchange and your backup software on the server and make sure you can get access to the data on tape.


 


Now you are ready to go.  Come in the next morning and declare “The Exchange server/Storage Group/Database (whichever you want to practice) just went down.  We need to get it back up and running, and we have “X” hours to do so.”  Those X hours should be the time from the SLA that you have laid out beforehand.  Also make sure that you have management involvement so that you can concentrate on doing the restore just as if the Exchange server were actually down.


 


Write a Cheat Sheet


 


Now you have gone through the process of doing a Fire Drill and you have learned what worked and what didn’t.  You have figured out all of the little check boxes and the fact that you have to keep the intern away from the tape drive power button.  Take all of that knowledge and make yourself a cheat sheet for next time.


 


This cheat sheet should contain an outline of the steps and processes that you need to go through in order to do your planned restore.  It should include reminders of the little steps that you found are easy to miss.  If possible, you should also include screen shots of all of the settings you need in your backup software to do the restore.  This cheat sheet will basically become your Restore Bible when it comes time for the real thing.


 


Practice some more


 


Last but not least, you need to bring that cheat sheet out on a regular basis and practice with it.  Make sure your organization is doing an Exchange Fire Drill at least once a quarter.  Make sure that the Exchange guy is not the only one there for it; he should have a backup, in case he is on vacation, who can use the cheat sheet if necessary.  After each of these practice sessions go back over the cheat sheet and make sure nothing needs to be updated.


 


If you do these simple, basic things you will be better prepared for when an Exchange disaster does happen.  This should help your disaster recovery go smoothly with the minimum amount of down time.  With Disaster Recovery, mistakes are measured in hours, so it pays to be prepared.


 


References:


 


Exchange Server 2003 Disaster Recovery Operations Guide


http://www.microsoft.com/technet/prodtechnol/exchange/2003/library/disrecopgde.mspx


 


Worksheet: Disaster Recovery Preparation for Exchange Server 2003


http://www.microsoft.com/technet/prodtechnol/exchange/2003/drchecklist.mspx


 


Preview: Exchange Server 2003 Disaster Recovery Planning Guide


http://www.microsoft.com/downloads/details.aspx?FamilyId=784BBEA2-28DD-409A-8368-F9914E993B28&displaylang=en


 


- Matthew Byrd

Comments (8)
  1. Unless an organization is made of money, the practice requirements (a beefy workstation, sizable disk, a domain controller, network infrastructure–and presumably licences for Win 2003 Server (x2) and Exchange(…right?…)) are not small.

    One way of mitigating this may be to use server virtualization software like VMware or your very own Virtual Server 2005. Could we have your thoughts on this.

    Also this is a perfect segue into another more serious issue/question (for me anyways and I’ve been searching high and low for an official answer):

    http://support.microsoft.com/Default.aspx?kbid=897614

    Quote: "Microsoft Exchange Server. Exchange is currently not supported running within Microsoft Virtual Server. Exchange will be supported within Virtual Server starting with Exchange 2003 Service Pack 2 and subsequent releases."

    Other than one line above saying that Exchange 2003 SP2 will be supported as a VM in Virtual Server 2005, I have seen no reference anywhere else talking about this issue, certainly not on any Exchange-related page.

    While I understand this blog may not be official, I’m looking for any quasi-official guidance on using Exchange 2003 SP2 on Virtual Server 2005. Certainly the right technical people are here on this blog. Some of us need to run Exchange 2003 on virtual hardware to consolidate servers. Thanks.

  2. Tim Jordan says:

    I just received some trial software on DVD including Exchange 2003. The package says it’s based on Virtual Machine 2005. I can test it and get back to you.

    Tim

  3. Rob Campbell says:

    Can an E2K3 Bridgehead (no mailboxes, but multiple virtual servers and connectors) be restored from just the AD information and a backup of the IIS Metabase (given that all the Windows server configuration settings match)?

  4. Adam Gates says:

    First steps for any "cheat sheet"

    1. Check your Application and System logs for anything that stands out. Most Exchange disasters don’t just happen; you can get a serious head start by doing this. If it is RED find out why and FIX it before it brings the server down.

    2. Check that your backups are running, successful, and available. If not FIX IT.

    3. Verify the hardware agreements and contact the vendor to upgrade to a faster response time if you need it. 24-72 hours is a LONG time to run on a degraded RAID 5 array; better get that to at least 4 hours.

    Do steps 1 and 2 daily. Do step 3 every 6 months.

  5. Vitrually Yours says:

    FYI,

    The first 2 of your three links point to your blog site instead of the address on show on the page… kinda like links in one of those fake Ebay Phish emails… ;-)

    You may want to correct this.

  6. Anonymous says:

    Please bear in mind that the ‘Exchange guy’ can also be female.

  7. Matthew Byrd says:

    Hi Virtually Yours,

    Remember that we are not trying to run the server; we are just trying to simulate that the server has failed. A simple high end user machine with some added hard drives will do from a hardware standpoint. As for the licenses, I would not worry about that, as the server is never going to have any users connect to it. It will just be used in test to verify your backup and restore process. Also you can make it an all in one box DC/GC/E2k3 server … this does have some limitations but saves you having to have another piece of hardware to do the test.

    You can use virtualization in order to do this test. Virtual Server or Virtual PC would work fine, along with any of the other virtualization software that is out there. What it comes down to from a Microsoft standpoint is that we have not tested Exchange running in virtualization, so we will not support it in production. We will do everything we can to help you get it working, but if we believe for any reason that your issue is being caused by it running in a virtual server we will require you to install it on standalone hardware in order to proceed.

    Hopefully this has addressed your questions.

    -Matthew Byrd

  8. Matthew Byrd says:

    Hi Rob Campbell,

    For the recovery of the BH server that contains no mailboxes you only need to run setup /disasterrecovery for the Install and the Service Pack, then start the server up with blank databases. No other steps are needed. All information about connectors in E2k3 is stored in AD. It will replicate down to the Metabase using DS2MB.

    -Matt

