In Part II we continue with more discussion and guidance on how to successfully replicate machines when bandwidth restrictions are imposed.
One of the most overlooked questions in a project like this is how quickly you can fail back to your primary site when all is said and done. Windows Server 2012 takes this into consideration and allows for Reverse Replication automatically when a failback event occurs.
Now that we have a process to work from (if you missed Part 1 of the series, go here), remember that the process shown in that article works with a set of tools in the form of a spreadsheet, which you can get here as well. Still need Server 2012 trial bits? Click here.
Realistic Expectations! Can we get there from here?
The spreadsheet snapshot seen below lists many common applications. We will pick on the Payroll Server in this example. Human Resources requires the payroll application to be available nearly 24/7, as they issue expense reimbursements in near real time as well as the ever-crucial paycheck processing throughout the month. In the scenario below we assume that the current infrastructure uses nightly tape backups, hence the 24-hour RPO figure. We can also see that the HR group allows a maximum of 4 hours to return the system to operation; however, with the current tape recovery process it takes nearly 6 hours to perform this task. In short, it takes 6 hours to get the system back to the state it was in the night before, when the backup was run. Of course, this all hinges on a successful tape backup.
We then look to the right-hand side of the spreadsheet at the proposed solution, which in this example uses Hyper-V Replica. Hyper-V replication allows a virtual machine to be copied to another Hyper-V host, first as a full copy and then with incremental updates from that point forward. To avoid a lengthy initial replication, Hyper-V allows for "seeding the target host," which in essence means that an administrator can copy a virtual machine to a portable disk drive, send it to the recovery site, place the virtual machine on the target host, and then set up the ongoing replication so that only the virtual machine changes are sent going forward. For now we will assume that we have enough bandwidth to execute hourly replication, meaning we have researched the amount of actual change that occurs in this virtual machine on an hourly basis and believe that this amount of data can be successfully sent over the WAN circuit to the recovery site. I will cover bandwidth considerations in more detail later in this article. In this scenario we far exceed the requirements given by the HR department. However, we are only talking about a single application thus far.
When looking at the bigger picture, we have several applications that we need to involve in this planning exercise. By meeting with the proper members of each responsible team, this spreadsheet can be used going forward as a reference for the requirements, as well as a record of who agreed to the terms proposed. This is quite important because when a system crashes or a site goes offline for any reason, it is up to us to get it back up and running. Now let’s take a look at bandwidth considerations before we dive into the bits and bytes of how to put this plan into action.
Wide area network speeds are most often the biggest bottleneck in a replication scenario. If a company has the DR site connected via a standard T1, then they are looking at 1.544 Mb/sec in raw throughput. However, given the protocol stack and application overhead, we will most likely never fully realize that speed. Built into the spreadsheet on the second tab is a bandwidth calculator, which we will use to run numbers for our different scenarios. In the scenario below we will assume that the average amount of data that changes per hour is 80 megabytes per VM workload. I will cover ways to determine this figure for a given application later in the article. From this exercise we see that, given a standard T1 WAN circuit with 20% overhead, it will take Hyper-V approximately 8.3 minutes to replicate the data changes. On the right-hand side we can also see that if we have 3 workloads with the same average data set, it will take around half an hour (roughly 25 minutes) to complete:
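For readers without the spreadsheet in front of them, the arithmetic can be approximated in a few lines of Python. This is only a sketch: the spreadsheet's actual formula is not shown here, so I am assuming the 20% overhead is applied as extra time on top of the raw transfer, which happens to reproduce the 8.3-minute figure. The function and variable names are my own.

```python
def transfer_minutes(data_mb, link_mbps, overhead=0.20, workloads=1):
    """Estimate the minutes needed to send `workloads` change sets of `data_mb` megabytes each."""
    megabits = data_mb * 8 * workloads        # megabytes -> megabits
    raw_seconds = megabits / link_mbps        # time at the nominal line rate
    return raw_seconds * (1 + overhead) / 60  # pad for overhead (assumed model), convert to minutes


T1_MBPS = 1.544  # nominal T1 line rate in megabits per second

one_vm = transfer_minutes(80, T1_MBPS)
three_vms = transfer_minutes(80, T1_MBPS, workloads=3)

print(f"1 workload:  {one_vm:.1f} minutes")
print(f"3 workloads: {three_vms:.1f} minutes")
```

If you would rather treat overhead as a 20% reduction in usable throughput, divide by `link_mbps * (1 - overhead)` instead; the estimate then comes out slightly higher, which is the safer direction for planning.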
Most real-life scenarios will get a little more involved, of course, but this is a good starting point to work from when planning for replication. For this process to work, we now need to move on to discovering the data change rate in your environment. Those of you without virtual machines running on Windows Server 2012 Hyper-V may need to investigate third-party tools or backup reports, provided that incremental data set information is available to you. For the example to follow I will be working with Windows Server 2012.
To begin this exercise we must first build two Hyper-V enabled Windows Server 2012 hosts joined to the same domain. Windows Server 2012 can be downloaded here. For this exercise we will be expanding upon the replication lab instructions found here. In production, the initial replication of the virtual machine will most likely use the process known as “seeding the target host.” This means that we take a copy of the current virtual machine as it stands today, move the copy to our secondary site via physical shipment, and then set up ongoing replication to occur over the WAN circuit. In this planning phase I recommend testing the replication to a local server, provided that the same bandwidth restrictions are put in place (see later in this article for more info on restricting bandwidth). After the initial replication is synced up, you can view the replication health by right-clicking the VM in Hyper-V Manager and selecting Replication, then View Replication Health. Inside this window is a number representing the average replication size, as well as other pertinent info:
PowerShell users can get the same statistics from the Hyper-V module; the Measure-VMReplication cmdlet, for example, reports them for each replicating VM.
This should list out the AvgReplSize in megabytes. If you monitor this number throughout the day, it will give you a rough idea of just how much data is changing inside the VM and would be traversing the wire. Take this average number and place it into the Bandwidth spreadsheet tab next to the field labeled “Data size (MB)”. Make sure to also update the field labeled “Mb/Sec Bandwidth” with the circuit size representative of your connection to the secondary site. If you know that several of the machines targeted for this replication project have similar data change rates, enter that count in the field labeled “Number of Workloads” and look at the total time it would take to replicate the machines. This number is a guesstimate, but close enough for planning purposes.
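If you jot down the AvgReplSize reading a few times during the day, a short script can turn those readings into the number for the spreadsheet. The sketch below is illustrative only: the sample values are made up for the example (chosen so they average out to the 80 MB used earlier), and the variable names are my own.

```python
from statistics import mean

# Hypothetical AvgReplSize readings in MB, noted by hand at intervals during one day.
samples_mb = [60, 95, 110, 70, 65]

average_mb = mean(samples_mb)  # candidate value for the "Data size (MB)" field
peak_mb = max(samples_mb)      # busiest reading, handy as a worst-case check

print(f"Average change per interval: {average_mb:.0f} MB")
print(f"Peak change per interval:    {peak_mb} MB")
```

Planning with the average matches the approach described above; running the numbers a second time with the peak value gives a feel for how the circuit copes during the busiest part of the day.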
Bandwidth Restrictions - Advanced Configuration
Now let’s move on to a more advanced topic. Let’s suppose that you have many machines that will be replicating across the WAN circuit, but this WAN circuit is not limited to replication traffic. In other words, user traffic or site-to-site traffic may need priority during the day or during specific hours of every day. So let’s set up the process so that replication runs only when necessary. As seen below, the GUI options let you replicate at different intervals of time, whereas PowerShell gives the greatest flexibility. Stay tuned to future blogs on advanced PowerShell replication configurations.
Another method for limiting the bandwidth allowance is to set QoS restrictions on the given port number at the edge router or alternatively at the Hyper-V host level. Since companies have a wide range of routers in use, I will stick to an example that can actually apply to anyone. Next we need to do some math.
Let’s say you have a T1 that, on average, has around 900 Kb/sec of bandwidth available. We need to leave some extra room for the priority traffic, so let’s assume 800 Kb/sec is available for replication traffic. How does this number compare to the figures you discovered earlier? If 800 Kb/sec seems doable, then we need to limit replication to 800 Kb/sec in total across all Hyper-V source hosts. Let’s assume you have 4 hosts replicating a number of VMs. We take the 800 Kb/sec and divide it by 4, so each host should be limited to 200 Kb/sec, which is 25,000 bytes per second. I would suggest changing the TCP port used for replication in the Hyper-V settings from port 80 to a value that is unique in your environment. This can be done by editing the Settings for Hyper-V Replica within Hyper-V Manager. In the example below I use port 9999. Now we need to set a QoS rule so that any traffic the host sends to that port is restricted to 200 Kb/sec. The PowerShell command for doing so is listed here:
New-NetQosPolicy "Replication Traffic to 9999" -DestinationPort 9999 -ThrottleRateActionBytesPerSecond 25000
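Because the throttle parameter is expressed in bytes per second while circuit sizes are quoted in bits, it is easy to slip on the conversion. The following Python sketch works it through for the 4-host scenario; it assumes decimal units (1 Kb = 1,000 bits) and an even split across hosts, and the function name is my own.

```python
def per_host_throttle_bytes(total_kbps, hosts):
    """Split a replication budget (kilobits/sec) evenly and convert it to bytes/sec."""
    per_host_kbps = total_kbps / hosts
    # Assumption: 1 kilobit = 1,000 bits (decimal convention); 8 bits per byte.
    return int(per_host_kbps * 1000 / 8)


throttle = per_host_throttle_bytes(total_kbps=800, hosts=4)
print(f"Throttle each host to {throttle} bytes per second")
```

An even split is the simplest choice; if one host carries most of the changing data, you may prefer to weight the shares, as long as the total stays within the 800 Kb/sec budget.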
The New-NetQosPolicy command should be issued on each host that sends replication traffic. Given this scenario, our Bandwidth calculations would look like this:
Now we need to look at the “Total Transfer Time” field and see if this gets us to where we want to be with the Recovery Point Objective. Notice that it now takes over 3 hours for the same workloads to replicate. Since we know that with Hyper-V Replication the RTO is under a few minutes, we need not worry about that number any longer, other than filling in the “Proposed” area with the more successful time frames. Recovery Point Objective values are what truly matter going forward when you have the ability to fail over and fail back within minutes in Hyper-V Replica. Fine-tune the environment to get the best RPO you can while staying within the budgetary boundaries for bandwidth, and then go back to your application owners and explain the new best-case scenarios.
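To see where the "over 3 hours" comes from, and how much bandwidth a given RPO target would actually need, the same arithmetic can be run in both directions. This is a rough sketch under the same assumptions as before (80 MB per workload, 3 workloads, 20% overhead added to the transfer time); the one-hour window is the hourly replication target mentioned earlier in this article, and the function names are my own.

```python
def transfer_hours(data_mb, link_mbps, workloads=1, overhead=0.20):
    """Hours needed to send the change sets over a link of `link_mbps` megabits/sec."""
    megabits = data_mb * 8 * workloads
    return megabits / link_mbps * (1 + overhead) / 3600


def required_mbps(data_mb, workloads, window_hours, overhead=0.20):
    """Bandwidth (megabits/sec) needed to finish within `window_hours`."""
    megabits = data_mb * 8 * workloads * (1 + overhead)
    return megabits / (window_hours * 3600)


throttled = transfer_hours(80, 0.2, workloads=3)         # host throttled to 200 Kb/sec
needed = required_mbps(80, workloads=3, window_hours=1)  # to finish inside one hour

print(f"At 200 Kb/sec: {throttled:.1f} hours")
print(f"Needed for a 1-hour window: {needed:.2f} Mb/sec")
```

Comparing the two numbers shows the gap between what the throttle allows and what the target requires, which is the conversation to have with the application owners.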
As always, these tips and tricks are meant for guidance and process; actual results will vary based on hardware and circuit types. For more information on installing Windows Server 2012 Hyper-V and other tips and tricks, please visit my blog.