Disaster Recovery for the Microsoft Cloud Platform System

We are glad to host Rochak Mittal, Senior Program Manager in the Azure Site Recovery (ASR) team. ASR provides disaster recovery capabilities in CPS. Rochak walks us through why DR matters and what steps are involved.


Most organizations consider disaster recovery (DR) a complicated process that is troublesome to manage and test. DR planning puts CIOs in a difficult position: most realize that ensuring application continuity in the case of a disaster is valuable to the business, but balk at the cost of the additional hardware and software needed, not to mention the time required to develop and test DR plans.

Recognizing the business value of disaster recovery, we integrated a DR solution based on Azure Site Recovery into the Microsoft Cloud Platform System (CPS). It allows users to create a disaster recovery plan between two CPS stamps. Service providers and enterprises that own multiple datacenters can use this service with CPS at no additional cost.

With ASR, disaster recovery is simplified and incorporated into the overall design of the system, keeping both tenant and management systems highly available. The Azure service uses only management metadata to structure the recovery. Tenant and management data are transferred directly from the primary site to the DR site and never go to Azure. ASR plans orchestrate the recovery of resources at a designated site (see figure). ASR further simplifies the disaster recovery process by enabling testing of failovers and restorations of systems.

CPS Site-to-Site recovery using ASR

I am pleased to share that as part of Update 2 of CPS (coming later this year), we will extend CPS disaster recovery capabilities to be able to use Microsoft Public Azure as the recovery site. Leveraging this capability will enable enterprises and service providers to avoid the expense of acquiring and managing a second CPS stamp for recovery. We will share more details of this closer to its release in August 2015.

In the rest of this post, we will look at how you can configure two CPS racks to offer managed DR services to your customers with minimal configuration and user training. We will assume that you have two CPS stamps and an Azure subscription.

Configuring Two CPS stamps for DR

Initial setup

As a CPS admin, the first step to offer DR is to register for the ASR service and connect your CPS stamp to it. This only requires a few steps:

Once the initial setup is complete, you are ready to roll out DR plans and leverage ASR capabilities that include automated protection, asynchronous replication and orderly recovery of the virtual workloads.

Adding DR capabilities to service plans

To offer DR to your tenants, you need to have a published service plan in Windows Azure Pack (WAP), and link a DR add-on to it. You can use an existing service plan, or create and publish a new one for this purpose.

Next, you will create a corresponding private plan on the secondary stamp. In WAP, a private plan is a hidden plan, visible only to administrators. Tenants cannot view a private plan or subscribe to it; administrators, however, can add subscriptions for tenants to private plans.

This private plan ensures that tenants’ subscriptions have exactly the same services and offerings on the secondary (or recovery) site. ASR automatically adds your tenants’ subscriptions from the primary plan to the private plan on the secondary CPS, which allows you to provide a consistent and seamless experience to tenants across both datacenters.

It is important to note that the name of the private plan must start with the primary plan's name, followed by a suffix. The suffix can be anything, but “-Recovery” is recommended for ease of identification. If this naming convention is not followed, protection will fail.
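The naming rule above can be checked before a plan is created. The following is a minimal sketch in Python, not part of CPS or WAP: the function names are hypothetical, and it assumes that the comparison is case-sensitive and that the suffix must be non-empty, neither of which is stated in this post.

```python
def recovery_plan_name(primary_name: str, suffix: str = "-Recovery") -> str:
    """Build a private plan name from the primary plan name (hypothetical helper)."""
    if not suffix:
        # Assumption: an empty suffix would make both plans share one name.
        raise ValueError("suffix must not be empty")
    return primary_name + suffix


def follows_naming_convention(primary_name: str, private_name: str) -> bool:
    """Return True if private_name is primary_name followed by some suffix."""
    return (
        private_name.startswith(primary_name)
        and len(private_name) > len(primary_name)
    )


if __name__ == "__main__":
    print(recovery_plan_name("GoldPlan"))
    print(follows_naming_convention("GoldPlan", "GoldPlan-Recovery"))
    print(follows_naming_convention("GoldPlan", "Recovery-GoldPlan"))
```

A check like this could run as part of whatever process you use to create plans, so that a misnamed private plan is caught before protection is attempted.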

Automating VM protection

Once the tenant has subscribed to the plan, VM protection can be automated using a set of ASR runbooks provided with CPS. The runbooks automate two tasks:

  • Detecting subscriptions with a DR-enabled plan in the primary Azure Pack admin portal and adding a copy of each subscription to the private plan in the secondary Azure Pack.
  • Enabling protection for the tenant virtual machines (taking away the pain of manually enabling protection for each tenant VM) and replicating all the virtual machines to the recovery stamp.

Only the master runbook, named “Invoke-AzureSiteRecoveryProtectionJob.ps1”, needs to be configured and scheduled. The other runbooks are invoked by the master runbook. They query tenant subscriptions, enable protection and add copies of subscriptions from the primary to the secondary WAP admin portal. The CPS admin schedules execution of the master runbook, defining its frequency and time. The complete details of the runbooks and the variables needed to run them can be found in the CPS administrator guide or on the Microsoft Script Center.
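To make the flow of the two tasks concrete, here is a small in-memory model in Python. It is only an illustration of the logic described above and is not the code of the actual PowerShell runbooks. Every name in it (`Subscription`, `Stamp`, `run_protection_job`, the `has_dr_addon` flag) is hypothetical, and the model assumes that a scheduled job should be safe to run repeatedly.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Subscription:
    tenant: str
    plan: str
    has_dr_addon: bool  # assumption: DR-enabled subscriptions carry a flag
    vms: List[str] = field(default_factory=list)


@dataclass
class Stamp:
    name: str
    subscriptions: List[Subscription] = field(default_factory=list)
    protected_vms: Set[str] = field(default_factory=set)


def run_protection_job(primary: Stamp, secondary: Stamp,
                       suffix: str = "-Recovery") -> List[str]:
    """One pass of a hypothetical master job; returns newly protected VMs."""
    newly_protected: List[str] = []
    existing: Set[Tuple[str, str]] = {
        (s.tenant, s.plan) for s in secondary.subscriptions
    }
    for sub in primary.subscriptions:
        if not sub.has_dr_addon:
            continue  # only DR-enabled subscriptions are processed
        private_plan = sub.plan + suffix
        # Task 1: copy the subscription to the private plan on the secondary.
        if (sub.tenant, private_plan) not in existing:
            secondary.subscriptions.append(
                Subscription(sub.tenant, private_plan, True))
            existing.add((sub.tenant, private_plan))
        # Task 2: enable protection for each VM not yet protected.
        for vm in sub.vms:
            if vm not in primary.protected_vms:
                primary.protected_vms.add(vm)
                newly_protected.append(vm)
    return newly_protected


if __name__ == "__main__":
    primary = Stamp("primary", [
        Subscription("contoso", "Gold", True, ["vm1", "vm2"]),
        Subscription("fabrikam", "Silver", False, ["vm3"]),
    ])
    secondary = Stamp("secondary")
    print(run_protection_job(primary, secondary))
    print(run_protection_job(primary, secondary))
```

In this model a second run finds nothing new to do, which is the property you would want from a job that runs on a schedule.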

Once the ASR runbooks have executed successfully, protection for tenants’ virtual machines is automatic, and the status is visible in the WAP portal.

Note: Tenants will need to have user accounts on the secondary CPS stamp to manage their VMs after a failover. User accounts will not be added automatically by the runbooks to the secondary CPS. You can use technologies like Active Directory Federation Services (ADFS) to synchronize user accounts between the two CPS stamps.

Onboarding Tenants

At this point, tenants can start enabling DR for their VMs using the WAP tenant portal. Tenants will subscribe to a DR Plan by going to the tenant portal account and signing up for the new plan. Once this is done, they can add the DR add-on to their subscription.

Performing Failover in the ASR portal

Through the Azure Site Recovery portal, CPS admins can monitor jobs and server status, and manage both planned (test) and unplanned failovers for customer applications. Administrators can take advantage of functionality like Recovery Plan, Test Failover, and other failover operations in the ASR portal to offer optimum Recovery Point Objective (RPO) and Recovery Time Objective (RTO) to their customers.
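As a reminder of what these two objectives measure, the sketch below computes them from timestamps you might record during a test failover. It is a generic illustration in Python, not an ASR API: the function names and the sample timestamps are made up for the example. RPO is taken here as the window of data that could be lost (time between the last replicated point and the outage), and RTO as the time from the outage until service is restored.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_replicated: datetime, outage_start: datetime) -> timedelta:
    """Data-loss window: time between the last replicated point and the outage."""
    return outage_start - last_replicated


def achieved_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Downtime: time between the outage and the restoration of service."""
    return service_restored - outage_start


def meets_objective(achieved: timedelta, objective: timedelta) -> bool:
    """An objective is met when the achieved value does not exceed it."""
    return achieved <= objective


if __name__ == "__main__":
    # Hypothetical timestamps from a test failover.
    last_replicated = datetime(2015, 6, 1, 9, 55)
    outage_start = datetime(2015, 6, 1, 10, 0)
    service_restored = datetime(2015, 6, 1, 10, 40)

    rpo = achieved_rpo(last_replicated, outage_start)
    rto = achieved_rto(outage_start, service_restored)
    print(rpo, meets_objective(rpo, timedelta(minutes=15)))
    print(rto, meets_objective(rto, timedelta(minutes=30)))
```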

Performing test failovers at regular intervals is a best practice. It gives you confidence that your disaster protection is working as you intend.

Accessing VMs after a Failover

After a failover, the CPS admin will share the link to the WAP tenant portal for the secondary CPS stamp with tenants. Remember that, as we noted above, you need to make sure that user accounts have been replicated to the secondary site. Tenants will be able to log in and access their virtual machines in the exact same way as they did on the primary CPS Azure Pack portal – their experience is going to be completely consistent.

Conclusion

Disaster recovery is a key capability for business applications. CPS provides an integrated way of protecting tenant VMs and ensuring business continuity in case of a disaster. With Update 2, available later this year, we will provide additional flexibility with the ability to use Azure as the recovery site. I will be back for a closer look at that scenario as soon as we release it.