As you are aware, Windows Azure Hyper-V Recovery Manager (HRM) can be used to protect your on-prem applications by orchestrating the protection and recovery of Hyper-V virtual machines in your datacenter. Configuring cloud protection and choosing the right host in your protecting site involve a lot of manual work that is hard to get right; both are offloaded to HRM, which uses a placement algorithm to pick the host. You can now protect a large number of virtual machines with a few clicks and sit back while replication continues and the latest data is synced to the protecting site. We are going to begin this series by looking at how recovery plans can make the disaster recovery process for your applications a super easy experience.
Once you have protected the virtual machines in your datacenter, you need to plan ahead for a quick recovery during a failover. Hyper-V Recovery Manager has many use cases, and our VP Brad Anderson talks about them here. He also looks at the 100% failover success rate in his blog here. Essentially, all scenarios require you to execute a failover, be it Planned, Unplanned, or part of a DR drill using Test Failover. HRM recovery plans can be used to ensure that the failover is seamless, repeatable, and automated. The top three needs for a recovery plan are:
- Defining a group of virtual machines that failover together.
- Defining the dependencies between the virtual machines so that the application comes up correctly.
- Automating the recovery along with custom actions so that tasks other than failover of the virtual machines can also be carried out.
Assume that you have two datacenters, London and Amsterdam, each managed by a VMM. Let’s say you have an application used by the finance team, which is based out of your office in London. It serves a front end, a web interface for getting at the data. An application server runs the logic that processes the inputs and places them into the back-end SQL database for later retrieval and report generation. Overall, the finance application comprises three Hyper-V virtual machines: ‘Finance Web Tier’, ‘Finance Middle Tier’ and ‘Finance SQL Tier’. You have protected the virtual machines by enabling HRM protection for them into a cloud in Amsterdam. The virtual machines are now replicating, and their Replication status column in the HRM management portal shows “Protected – OK”.
Obviously, you want to failover the entire application and not the individual virtual machines. For that, you create a recovery plan for the application. In the first step of the “Create Recovery Plan” wizard, specify the protected VMM (say, London) and the protecting VMM (Amsterdam) as the two sites. Next, select the three virtual machines that are part of the Finance app. Finally, name the plan ‘Finance App Recovery’.
Default recovery plan
When the recovery plan gets created, it is a default plan. The first step in a recovery plan is the ‘All group shutdown’. This is the first step to be executed in a planned failover. To ensure that there is no data loss in the recovered virtual machines, this step safely shuts down all virtual machines before proceeding with the rest of the recovery plan. If a virtual machine fails to shut down during execution, the recovery plan fails and halts, because that is a critical failure.
By default, the recovery plan gets created with one group named ‘Group 1 : Failover’. Expanding it shows that all three virtual machines are part of the group. All VMs of a group failover in parallel. If you run this recovery plan, all the virtual machines will be recovered to the protecting site and booted up.
Hyper-V Integration services
Where Hyper-V Replica ensures that the latest changes in the VHD are synchronized to the protecting site, HRM ensures that the virtual machines are correctly orchestrated to shut down and boot at the right time. Hyper-V integration components are required for successfully carrying out the failover for the following reasons.
- They are required to safely shut down the virtual machine.
- They are required to check whether the virtual machine has booted up and is sending a normal heartbeat.
Safely shutting down the virtual machine ensures that the most recent data can be synchronized to the secondary site. Similarly, if there is an issue during booting, the recovery process can be stopped to ensure all failures are resolved before moving ahead. Make sure you have the latest integration services installed in the virtual machine to ensure zero data loss.
Let us begin customizing the recovery plan for a quick recovery.
Recovery plan with groups
Typically, the VMs of an application depend on one another. The web front end does not work if there is no application server serving the web pages. The application server does not work unless there is a SQL backend ready to take queries and return data. To model these dependencies, the recovery plan comes with a feature to add new groups. When you add a new group, it gets created with 0 virtual machines in it. For our example, let us create two new groups so that we have three groups for three tiers. Move the ‘Finance Web Tier’ to Group 3 and the ‘Finance Middle Tier’ to Group 2; the backend remains in Group 1 because we want it to failover first. The image below shows a screenshot of the recovery plan with more than one group.
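To make the regrouping concrete, here is a minimal sketch of the plan as a data structure. It is only a model of the behavior described above, not an HRM API: the class `RecoveryPlanGroups` and its methods `add_group` and `move_vm` are hypothetical names made up for this illustration.

```python
# Illustrative model only: these names are hypothetical, not an HRM API.
from typing import Dict, List


class RecoveryPlanGroups:
    def __init__(self, vms: List[str]) -> None:
        # A default plan starts with every VM in Group 1.
        self.groups: Dict[int, List[str]] = {1: list(vms)}

    def add_group(self) -> int:
        """Append a new, empty group and return its number."""
        number = max(self.groups) + 1
        self.groups[number] = []
        return number

    def move_vm(self, vm: str, to_group: int) -> None:
        """Move a VM from whichever group holds it into `to_group`."""
        if to_group not in self.groups:
            raise KeyError(f"no such group: {to_group}")
        for members in self.groups.values():
            if vm in members:
                members.remove(vm)
                self.groups[to_group].append(vm)
                return
        raise ValueError(f"VM not in plan: {vm}")


plan = RecoveryPlanGroups(
    ["Finance Web Tier", "Finance Middle Tier", "Finance SQL Tier"]
)
plan.add_group()  # Group 2, created empty
plan.add_group()  # Group 3, created empty
plan.move_vm("Finance Middle Tier", 2)
plan.move_vm("Finance Web Tier", 3)
print(plan.groups)
```

After these moves, the SQL tier is alone in Group 1, the middle tier in Group 2, and the web tier in Group 3, which matches the dependency order of the application.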
Let’s do a planned failover. A planned failover of the recovery plan goes through the following stages during its execution.
- Pre-requisite check
  - Checks whether all virtual machines are in a ready-to-failover state.
- All groups shutdown stage
  - All the virtual machines of this recovery plan are shut down. The order of shutdown is the reverse order of groups: Group 3 shuts down first, then Group 2, and finally Group 1.
- Groups failover stage
  - Groups failover in increasing order of their numbers.
  - VMs in a group failover in parallel.
  - Only after the virtual machines in a group have been successfully failed over and booted does the failover of the next group begin.
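The ordering of these stages can be sketched in a few lines. This is only a model of the sequence described above; the function name `planned_failover_steps` and the stage labels are hypothetical and do not come from HRM.

```python
# Illustrative model only: these names are hypothetical, not an HRM API.
from typing import Dict, List, Tuple

# Group number -> virtual machines in that group (from the example above).
GROUPS: Dict[int, List[str]] = {
    1: ["Finance SQL Tier"],
    2: ["Finance Middle Tier"],
    3: ["Finance Web Tier"],
}


def planned_failover_steps(
    groups: Dict[int, List[str]],
) -> List[Tuple[str, int, List[str]]]:
    """Return the ordered (stage, group, vms) steps of a planned failover."""
    all_vms = [vm for g in sorted(groups) for vm in groups[g]]
    # Group 0 is used here only as a placeholder for "the whole plan".
    steps = [("prerequisite-check", 0, all_vms)]
    # Shutdown runs in reverse group order: highest group number first.
    for g in sorted(groups, reverse=True):
        steps.append(("shutdown", g, list(groups[g])))
    # Failover runs in increasing group order; the VMs inside one group
    # would be failed over in parallel.
    for g in sorted(groups):
        steps.append(("failover", g, list(groups[g])))
    return steps


for stage, group, vms in planned_failover_steps(GROUPS):
    print(stage, group, vms)
```

For the finance application this yields the shutdown of Groups 3, 2, 1 followed by the failover of Groups 1, 2, 3, so the SQL backend is the last to go down and the first to come up.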
In a planned failover, if any virtual machine fails to shut down, the complete recovery plan halts execution to avoid data loss. You can manually resolve the problem that is preventing the virtual machine from shutting down and redo the failover. Similarly, if any VM fails in any step during failover, the complete recovery plan halts after completing the group that the VM is part of. When the failover is re-triggered, the execution skips all completed actions and marks them accordingly.
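The halt-and-resume behavior of the failover stage can be sketched as follows. This is a simplified model under stated assumptions, not HRM's implementation: `run_planned_failover`, the `completed` set and the `fail_over_vm` callback are all hypothetical, the shutdown stage is left out, and the VMs of a group are processed one by one instead of in parallel.

```python
# Illustrative model only: hypothetical names, not an HRM API.
from typing import Callable, Dict, List, Set


def run_planned_failover(
    groups: Dict[int, List[str]],
    fail_over_vm: Callable[[str], bool],
    completed: Set[str],
) -> List[str]:
    """Fail over groups in increasing order and return a log of actions.

    `completed` persists across runs so that a re-triggered failover can
    skip VMs that already succeeded. `fail_over_vm` stands in for the real
    work and returns True on success.
    """
    log: List[str] = []
    for g in sorted(groups):
        group_ok = True
        for vm in groups[g]:
            if vm in completed:
                log.append(f"skipped {vm}")
                continue
            if fail_over_vm(vm):
                completed.add(vm)
                log.append(f"failed over {vm}")
            else:
                group_ok = False
                log.append(f"FAILED {vm}")
        # The rest of the group is still attempted; then the plan halts.
        if not group_ok:
            log.append(f"halted after group {g}")
            break
    return log


groups = {
    1: ["Finance SQL Tier"],
    2: ["Finance Middle Tier"],
    3: ["Finance Web Tier"],
}
done: Set[str] = set()
# First attempt: the middle tier fails, so the plan halts after Group 2.
first = run_planned_failover(groups, lambda vm: vm != "Finance Middle Tier", done)
# Re-trigger after fixing the problem: Group 1 is skipped, the rest proceeds.
second = run_planned_failover(groups, lambda vm: True, done)
print(first)
print(second)
```

Keeping the set of completed VMs outside the function is the design choice that makes the second run cheap: nothing that already succeeded is repeated.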
Consider a situation where your primary site is, God forbid, struck by a disaster. You would like to quickly recover your application to the protecting site to keep your business up and running. Executing an unplanned failover starts recovering all the virtual machines to the secondary site. It might incur data loss, because there is no guarantee that the latest changes in the virtual machines have been synchronized with the recovery site. If recovery of a virtual machine fails, the recovery plan does not halt: you have been struck by a disaster, and the recovery should continue as far as possible. Planned failover, on the other hand, ensures zero data loss and fails if the primary site is not available to synchronize the latest changes. Unplanned failover can be done in two variations, as discussed below.
Unplanned failover without primary site operation
This can be triggered when the primary VMM is no longer reachable and the virtual machines cannot synchronize their latest changes with the protecting sites. Failover skips the shutdown step and begins the recovery directly. Failover of the recovery plan follows the dependency and fails over the groups one after another – however if there is any failure in recovery of a virtual machine, the recovery plan does not halt and continues to execute till it completes.
Unplanned failover with primary site action
In most cases of a real disaster, the production site is hit but not totally destroyed. There might be a chance to recover the latest data from the virtual machines. In this case the protected site VMM needs to be available and have connectivity to the internet. If you execute the primary site operations in an unplanned failover, it attempts to safely shut down the virtual machines and to sync the latest data. If any step fails, it still continues with the rest of the recovery plan.
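The two variations of unplanned failover differ only in whether the primary site steps are attempted. The sketch below models that difference; it is not an HRM API. The names `run_unplanned_failover`, `primary_site_operations` and `shut_down_and_sync` are hypothetical, and the reverse group order used for the shutdown attempt is an assumption carried over from planned failover.

```python
# Illustrative model only: hypothetical names, not an HRM API.
from typing import Callable, Dict, List


def run_unplanned_failover(
    groups: Dict[int, List[str]],
    fail_over_vm: Callable[[str], bool],
    primary_site_operations: bool,
    shut_down_and_sync: Callable[[str], bool] = lambda vm: True,
) -> List[str]:
    """Run every step on a best-effort basis and return a log of results."""
    log: List[str] = []
    if primary_site_operations:
        # Assumption: same reverse group order as in a planned failover.
        # A failure here is logged but never stops the plan.
        for g in sorted(groups, reverse=True):
            for vm in groups[g]:
                ok = shut_down_and_sync(vm)
                log.append(f"shutdown+sync {vm}: {'ok' if ok else 'failed'}")
    # Groups still go in increasing order, but a failure does not halt.
    for g in sorted(groups):
        for vm in groups[g]:
            ok = fail_over_vm(vm)
            log.append(f"failover {vm}: {'ok' if ok else 'failed'}")
    return log


groups = {
    1: ["Finance SQL Tier"],
    2: ["Finance Middle Tier"],
    3: ["Finance Web Tier"],
}
# Primary site unreachable: skip shutdown, keep going even if a VM fails.
for line in run_unplanned_failover(
    groups, lambda vm: vm != "Finance SQL Tier", primary_site_operations=False
):
    print(line)
```

Compare this with a planned failover, where the same failure would halt the plan after Group 1.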
Now that you have finished failover, you need to commit. This commits the virtual machine’s current recovery point and all other recovery points are deleted. After the failover is complete, the recovery plan is in ‘commit waiting’ state. A commit on the recovery plan commits all the VMs in the plan – which ensures that all the virtual machines’ extra recovery points are deleted in one click.
Failback? Reverse replication and failover again
Did you notice that there is no gesture called failback? Does that mean once you failover from a protected site to a protecting site, you cannot come back? The answer is obviously “Yes you can failback” – using the failover button. This makes the failover from London to Amsterdam and vice versa exactly symmetric. But before you failback you need to re-protect the VMs.
Once commit is completed, you can see that the recovery plan is now waiting for reverse replication.
The ‘Reverse replicate’ gesture on the recovery plan begins tracking the changes in the failed-over virtual machines and sends the latest changes back to the old primary site. Once reverse replication is complete, the recovery plan is ready for failover in the reverse direction, AKA Failback!
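The failover, commit and reverse replicate cycle behaves like a small state machine, which the sketch below models. The state names, action strings and the `RecoveryPlan` class are hypothetical labels chosen for this illustration; only ‘commit waiting’ is taken from the portal wording above.

```python
# Illustrative model only: state and action names are hypothetical labels.
from enum import Enum


class PlanState(Enum):
    READY_FOR_FAILOVER = "ready for failover"
    COMMIT_WAITING = "commit waiting"
    WAITING_FOR_REVERSE_REPLICATION = "waiting for reverse replication"


TRANSITIONS = {
    (PlanState.READY_FOR_FAILOVER, "failover"): PlanState.COMMIT_WAITING,
    (PlanState.COMMIT_WAITING, "commit"): PlanState.WAITING_FOR_REVERSE_REPLICATION,
    (PlanState.WAITING_FOR_REVERSE_REPLICATION, "reverse replicate"): PlanState.READY_FOR_FAILOVER,
}


class RecoveryPlan:
    def __init__(self, failover_from: str, failover_to: str) -> None:
        # Direction that the next failover will take.
        self.failover_from = failover_from
        self.failover_to = failover_to
        self.state = PlanState.READY_FOR_FAILOVER

    def apply(self, action: str) -> None:
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {action!r} while {self.state.value!r}")
        self.state = TRANSITIONS[key]
        if action == "reverse replicate":
            # The next failover runs in the opposite direction: failback.
            self.failover_from, self.failover_to = (
                self.failover_to,
                self.failover_from,
            )


plan = RecoveryPlan("London", "Amsterdam")
for action in ("failover", "commit", "reverse replicate"):
    plan.apply(action)
print(plan.state.value, plan.failover_from, "->", plan.failover_to)
```

Because failback is just a failover with the direction swapped, running the same three actions again brings the plan back to London to Amsterdam.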
Notice in the below screenshot how the direction changes during failover and failback (failover in reverse direction).
We discussed how to model a simple recovery plan. In the next part of this series, we will see how to add custom actions to extend the recovery plan functionality.