Azure VM Backup: beware of Windows Server 2008 R2

Since March 2015, it has been possible to back up and restore entire VMs running in Azure. If you were not aware of this, have a look at the documentation here: https://azure.microsoft.com/en-us/documentation/services/backup/. Using the Backup Vault, you can automatically back up full VMs on a flexible schedule. For Active Directory, you could set this to daily, with a retention time of 60 days.

Assuming that you know all this already, there is still a small gotcha to be aware of when these VMs are Domain Controllers. Azure IaaS is a sort of fancy hypervisor, and potentially shares some of the problems of regular hypervisors. For Active Directory, the most relevant one is USN Rollback. The one-line summary of this problem would be (apologies to any experts out there): when you restore a DC without telling it that it has been restored, it will mess up the AD replication bookkeeping in a big and unrecoverable way.
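To make that one-line summary a bit more concrete, here is a toy Python model of why a silent restore hurts. It is a deliberate simplification and not how AD is implemented: the hypothetical `DC` class stamps each local write with an increasing USN, and a partner remembers, per invocation ID, the highest USN it has already pulled (a stripped-down up-to-dateness vector). Real AD replication also tracks per-attribute version metadata, which this sketch leaves out.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class DC:
    """Toy domain controller (hypothetical): a USN counter plus a simplified up-to-dateness vector."""
    name: str
    invocation_id: str
    usn: int = 0
    # Originating writes made on this DC, keyed by the USN they were stamped with.
    changes: dict = field(default_factory=dict)
    # Simplified up-to-dateness vector: source invocation ID -> highest USN already pulled.
    utd: dict = field(default_factory=dict)
    # Changes received from partners, in arrival order.
    received: list = field(default_factory=list)

    def write(self, change: str) -> None:
        self.usn += 1
        self.changes[self.usn] = change

    def pull_from(self, source: "DC") -> None:
        # Only changes above the watermark stored for the source's invocation ID are pulled.
        seen = self.utd.get(source.invocation_id, 0)
        for usn in sorted(source.changes):
            if usn > seen:
                self.received.append(source.changes[usn])
        self.utd[source.invocation_id] = max(seen, source.usn)


def simulate(reset_invocation_id: bool) -> list:
    dc1, dc2 = DC("DC1", "inv-1"), DC("DC2", "inv-2")
    dc1.write("create alice")
    dc1.write("create bob")
    snapshot = copy.deepcopy(dc1)     # backup taken at USN 2
    dc1.write("create carol")         # USN 3
    dc2.pull_from(dc1)                # DC2 now records "inv-1 seen up to USN 3"

    dc1 = copy.deepcopy(snapshot)     # restore: DC1's USN counter is back at 2
    if reset_invocation_id:
        dc1.invocation_id = "inv-1b"  # what happens when the DC knows it was restored
    dc1.write("create dave")          # stamped with USN 3 again
    dc2.pull_from(dc1)
    return dc2.received


print(simulate(reset_invocation_id=False))
print(simulate(reset_invocation_id=True))
```

With `reset_invocation_id=False`, the write stamped with the reused USN 3 never reaches DC2, because DC2 believes it already has everything from `inv-1` up to USN 3. Giving the restored DC a new invocation ID makes DC2 treat it as a fresh source, so the new write does arrive; in this toy model DC2 also re-pulls the older writes, which real AD would filter out using its per-attribute metadata.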

So how does a DC know it has been restored so that it can do the right thing? There are two ways:

  1. The backup has been made using a VSS-aware backup method. This is the normal, supported method, and the way it is supposed to go all the time. Azure calls this an application-consistent backup. When a VSS backup is restored, the DC will know.
  2. If VSS is not involved (think disk images or snapshots), VMs have a fail-safe feature called VM-GenerationID. For this to work, two things must be true: the OS needs to be Windows Server 2012 or later, and the hypervisor must support it. Most modern hypervisors can do this, including our own Hyper-V, VMware ESX, and of course, Azure. With these two conditions in place, the VM-GenerationID mechanism tells the DC when it has been restored, even without VSS in play.
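The two detection paths in the list above boil down to a small decision, sketched below in Python. The `BootContext` fields and the function name are hypothetical; the sketch only assumes what the list says: a VSS-aware restore always informs the DC, and otherwise the DC must run Windows Server 2012 or later (NT 6.2 or later) on a hypervisor that exposes a VM-GenerationID, in which case a changed ID means the VM was rolled back.

```python
from dataclasses import dataclass
from typing import Optional

WINDOWS_SERVER_2012 = (6, 2)  # NT version where VM-GenerationID support starts


@dataclass
class BootContext:
    """Hypothetical snapshot of what a DC can observe when it starts."""
    os_version: tuple                  # e.g. (6, 1) = 2008 R2, (6, 3) = 2012 R2
    restored_via_vss: bool             # was this an application-consistent (VSS) restore?
    hypervisor_gen_id: Optional[str]   # None if the hypervisor does not expose one
    stored_gen_id: Optional[str]       # value the DC saved last time, if it tracks one


def dc_knows_it_was_restored(ctx: BootContext) -> bool:
    # Path 1: a VSS-aware restore always tells the DC.
    if ctx.restored_via_vss:
        return True
    # Path 2: VM-GenerationID, which needs both a 2012+ OS and hypervisor support.
    if ctx.os_version >= WINDOWS_SERVER_2012 and ctx.hypervisor_gen_id is not None:
        # A changed ID means the VM was rolled back (for example restored from a backup or snapshot).
        return ctx.hypervisor_gen_id != ctx.stored_gen_id
    # Neither path applies: the DC boots as if nothing happened.
    return False


print(dc_knows_it_was_restored(BootContext((6, 1), False, "gen-B", None)))     # False
print(dc_knows_it_was_restored(BootContext((6, 3), False, "gen-B", "gen-A")))  # True
```

The first call is the dangerous case: a 2008 R2 DC restored without VSS has no way to notice, even though the hypervisor offers a generation ID.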

Now, Windows Server 2008 R2 is not aware of VM-GenerationID. With that in mind, let's have a look at Azure VM Backup using the Backup Vault. The backup itself is orchestrated by the Azure VM Agent in cooperation with the Azure fabric. The agent triggers a VSS snapshot in the VM when the backup begins, and when that is finished, the snapshot is saved to the Backup Vault. This is the normal mode of operation, and all is well.

Something different happens when the VM is not running. When the backup starts, the Azure fabric notices that the VM is offline. Instead of trying a VSS backup first, it simply copies the VM data (or, to be exact, the difference from the previous backup). This is called a crash-consistent backup.

You spotted the problem, right? When you restore a crash-consistent backup of a 2012 R2 DC, nothing bad happens because the VM-GenerationID mechanism will solve the problem. But with a 2008 R2 DC, the VM will just be recovered and started without any notification to AD. Once the restored box starts talking to other DCs, the USN Rollback happens and the DC is toast. The following table summarizes the findings:

Operating System | VM State | Backup Type      | Triggers VSS? | AD Restored OK?
2008 R2          | Online   | App-Consistent   | Yes           | Yes, VSS does it.
2008 R2          | Offline  | Crash-Consistent | No            | NO
2012 R2          | Online   | App-Consistent   | Yes           | Yes, VSS does it.
2012 R2          | Offline  | Crash-Consistent | No            | Yes, VM-GenerationID does it.
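The table can also be read as two small rules, which the Python sketch below reproduces: the VM's state at backup time picks the backup type, and for a crash-consistent backup only VM-GenerationID support in the guest OS (NT 6.2, that is Windows Server 2012, or later) lets AD recover safely. It assumes, as the text does, that the Azure hypervisor exposes VM-GenerationID; the enum and function names are illustrative only.

```python
from enum import Enum


class BackupType(Enum):
    APP_CONSISTENT = "App-Consistent"
    CRASH_CONSISTENT = "Crash-Consistent"


# NT kernel versions of the two OSes in the table; VM-GenerationID starts at 6.2 (Windows Server 2012).
OS_VERSIONS = {"2008 R2": (6, 1), "2012 R2": (6, 3)}


def backup_type(vm_running: bool) -> BackupType:
    # Running VM: the VM Agent triggers a VSS snapshot inside the guest.
    # Stopped VM: the fabric copies the disk data without involving VSS.
    return BackupType.APP_CONSISTENT if vm_running else BackupType.CRASH_CONSISTENT


def ad_restore_ok(os_name: str, btype: BackupType) -> bool:
    if btype is BackupType.APP_CONSISTENT:
        return True  # VSS tells the DC it was restored
    # Crash-consistent: only VM-GenerationID (assumed exposed by Azure) can warn the DC.
    return OS_VERSIONS[os_name] >= (6, 2)


for os_name in OS_VERSIONS:
    for running in (True, False):
        bt = backup_type(running)
        state = "Online" if running else "Offline"
        print(os_name, state, bt.value, ad_restore_ok(os_name, bt))
```

Only one combination comes out unsafe: a 2008 R2 DC that was offline when the backup ran.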

Management summary: if you back up Windows Server 2008 R2 virtual DCs running on Azure, make sure they are online during the backup window, or bad things will happen during a restore.