How Microsoft Achieves 99.99% Uptime with Exchange 2003

I've been presenting sessions on clustering Exchange Server at various Microsoft and third-party events for many years. After each presentation, at least one person has asked how Microsoft clusters Exchange, and how we maintain our 99.99% uptime. So, here's how we run Exchange 2003 at Microsoft:

Our internal IT department, Microsoft IT (MSIT), achieves high availability for its mission-critical e-mail infrastructure by using systems and processes that are specifically designed for that purpose, including:

  • Microsoft Windows Server 2003
  • Exchange Server 2003
  • Microsoft Operations Manager 2005
  • Highly redundant and fault-tolerant hardware
  • Two clustering technologies — server clusters and network load balancing
  • High availability concepts and practices based on Microsoft Operations Framework (MOF)
  • People with the knowledge and experience required to manage mission-critical systems and applications

For MSIT, the right combination of the above components helps to minimize planned downtime for maintenance or service pack installations. Unplanned downtime, whether caused by a server failure or another hardware failure, has also been reduced.

At Microsoft, the processes and technologies that go into achieving 99.99% uptime can be broken down into three categories: Configuration (and Practices), Architecture, and Operations.

Configuration/Practices

  • MSIT has implemented Exchange 2003 on clustered Windows 2003 servers that are attached to SAN enclosures.
  • Strict SLAs are in place that include weekly service reviews. These reviews not only address whether system performance is within the limits set by the SLA, but they also continue to assess whether the right things are being measured, and whether particular measurements are contributing to improvement. Finally, the reviews also cover important statistics and trends, comparing current numbers to previous measurements. Regular reviews are VERY important to achieving high availability.
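To make the 99.99% figure concrete, here is a minimal Python sketch of the arithmetic behind a weekly service review. It is not MSIT's tooling: the function names, the weekly measurement window, and the idea of counting every minute of downtime against the target are all assumptions made for illustration.

```python
# Illustrative only: names and the weekly window are assumptions, not MSIT's method.
MINUTES_PER_WEEK = 7 * 24 * 60      # 10,080
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600 (non-leap year)


def downtime_budget(target_pct, period_minutes):
    """Minutes of downtime allowed in a period at the given availability target."""
    return period_minutes * (1 - target_pct / 100)


def availability_pct(downtime_minutes, period_minutes):
    """Availability as a percentage of the period."""
    return 100 * (1 - downtime_minutes / period_minutes)


def weekly_review(downtime_this_week, downtime_last_week, target_pct=99.99):
    """Compare this week's availability with the target and with last week."""
    current = availability_pct(downtime_this_week, MINUTES_PER_WEEK)
    previous = availability_pct(downtime_last_week, MINUTES_PER_WEEK)
    return {
        "availability_pct": current,
        "within_sla": current >= target_pct,
        "change_vs_last_week": current - previous,
    }


if __name__ == "__main__":
    print(round(downtime_budget(99.99, MINUTES_PER_YEAR), 2))  # yearly budget in minutes
    print(round(downtime_budget(99.99, MINUTES_PER_WEEK), 3))  # weekly budget in minutes
    print(weekly_review(downtime_this_week=2, downtime_last_week=1))
```

At a 99.99% target the budget works out to roughly 52.6 minutes of downtime per year, or about one minute per week, which is why a single two-minute outage is enough to put a week outside the target in this sketch.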

Architecture

  • Mailboxes are hosted on seven-node Windows Server 2003 clusters running Exchange Server 2003 with Service Pack 2.
  • Mailbox servers typically carry the full load of 20 databases, servicing approximately 4,000 mailboxes.
  • Users run Microsoft Office Outlook® 2003 with Service Pack 2 configured to use Exchange Cached Mode.
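One way to read the per-server figures above is as a simple ratio. The short Python sketch below assumes mailboxes are spread evenly across databases, which the numbers above do not confirm; the function names are hypothetical.

```python
# Illustrative arithmetic only; even distribution across databases is an assumption.
DATABASES_PER_SERVER = 20
MAILBOXES_PER_SERVER = 4_000  # approximate figure


def mailboxes_per_database(mailboxes, databases):
    """Average number of mailboxes hosted in each database."""
    return mailboxes / databases


def share_affected_by_one_database(databases):
    """Fraction of a server's mailboxes affected if a single database is offline."""
    return 1 / databases


if __name__ == "__main__":
    print(mailboxes_per_database(MAILBOXES_PER_SERVER, DATABASES_PER_SERVER))
    print(share_affected_by_one_database(DATABASES_PER_SERVER))
```

Under that assumption each database holds about 200 mailboxes, so taking one database offline for a restore would touch roughly 5% of the mailboxes on that server.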

Operations

  • MSIT personnel centrally monitor all Exchange servers 24×7×365 using Microsoft Operations Manager.
  • High availability is enhanced by Windows Server support for the following advanced backup/restore technologies:
    • Volume Shadow Copy Service (VSS) - MSIT uses VSS and clone backups to take volume shadow copies of Exchange 2003 databases and transaction log files. By using VSS, MSIT can restore databases within minutes, regardless of database size.
    • Recovery Storage Groups - This feature allows MSIT to mount a second copy of an Exchange mailbox database on the same server as the original database, or on any other Exchange server in the same Exchange administrative group. This reduces restore times for individual mailboxes or complete databases to just minutes.

Of course, 99.99% uptime would not be achievable without continuous learning, and without adhering to best practices and evolving them over time. Maintaining high availability is an active, ongoing endeavor requiring continuous response and adaptation to new situations and technologies. All system measurements are regularly reviewed by management, and action must be taken to keep actual performance within the tolerances set.

If you're interested in designing or deploying your own highly available Exchange 2003-based messaging infrastructure, check out the Exchange Server 2003 High Availability Guide.