Sizing Volumes for Data Deduplication in Windows Server

Introduction

One of the most common questions the Data Deduplication team gets seems deceptively simple: "How big can I make my deduplicated volumes?"

The short answer is: It depends on your hardware and workload.

A slightly longer answer is: It depends primarily on how much and how frequently the data on the volume changes, and on the data access throughput rates of the disk storage subsystem.

The Data Deduplication feature in Windows Server performs a number of I/O- and compute-intensive operations. In most deployments, deduplication operates in the background or on a daily schedule, processing that day's new or modified data (the data "churn"). As long as deduplication is able to optimize all of the data churn on a daily basis, the volume size will work for deduplication. On the other hand, we've seen customers create a 64 TB volume, enable deduplication, and then notice low optimization rates. This is simply because deduplication cannot keep up with the incoming churn from a dataset that is too large for the configured volume. Deduplication jobs in Windows Server 2012 and 2012 R2 are scoped at the volume level and are single-threaded (one core per volume). Therefore, to exploit the additional compute power of a machine with deduplication-enabled volumes, distribute the dataset over more volumes instead of creating a single large volume with all the data.
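Because each optimization job is single-threaded per volume, one way to bring more cores to bear is simply to enable deduplication on several smaller volumes rather than one large one. A minimal sketch (the drive letters are illustrative, and the -UsageType parameter requires Windows Server 2012 R2):

```powershell
# Sketch: three smaller volumes give deduplication three parallel
# single-threaded optimization jobs instead of one.
# Drive letters D:, E:, F: are illustrative only.
Enable-DedupVolume -Volume "D:", "E:", "F:" -UsageType Default
```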

Checking Your Current Configuration

If you have an existing system with deduplication enabled on one or more volumes, you can do a quick check to see if your existing volume sizes are adequate.

The following script can help you quickly determine whether your current deduplication volume size is appropriate for the workload churn happening on the storage, or whether deduplication is regularly falling behind.

  • Run your workload normally on the volume as intended (store data there and use it as you would in production)
  • Run this script and note the result:

$ddpVol = Get-DedupStatus <volume>

switch ($ddpVol.LastOptimizationResult) {
   0 { write-host "Volume size is appropriate for this server." }
   2153141053 { write-host "The volume could not be optimized in the time available. If this persists over time, this volume may be too large for deduplication to keep up on this server." }
   Default { write-host "The last optimization job for this volume failed with an error. See the Deduplication event log for more details." }
}

If the result is that the volume size is appropriate for your server, then you can stop here (and work on your other tasks!)

If the result from the above script is that the volume cannot be optimized in the time available, administrators should determine the appropriate size of the volume for the given time window to complete optimization.
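Before re-sizing anything, it can help to sweep every deduplication-enabled volume at once rather than checking one at a time. A sketch using Get-DedupStatus and standard formatting cmdlets (this must run on the deduplication server itself, with the feature installed):

```powershell
# Sketch: summarize the last optimization outcome for every
# deduplication-enabled volume on this server.
Get-DedupStatus |
    Select-Object Volume, LastOptimizationResult, LastOptimizationTime, SavedSpace |
    Format-Table -AutoSize
```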

Estimating Deduplication Volume Size

Let’s start with some basic principles:

  • Deduplication optimization needs to be able to keep up with the daily data churn
  • The total amount of churn scales with the size of the volume
  • The speed of deduplication optimization depends significantly on the data access throughput rates of the disk storage subsystem.

Therefore, to estimate the maximum size for a deduplicated volume, you need to understand both the size of the daily data churn and the speed of optimization processing.

The following sections provide guidance on how to determine maximum volume size using two different methods to determine data churn and deduplication processing speed:

Method 1: Use reference data from our internal testing to estimate the values for your system

Method 2: Perform measurements directly on your system based on representative samples of your data

Scripts are provided to then calculate the maximum volume size using these values.

  

Estimating Deduplication Volume Size – Method 1 (Easier but less accurate)

From internal testing, we have measured deduplication processing, or throughput, rates that vary depending on the combination of the underlying hardware as well as the types of workloads being deduplicated. These measured rates can be used as reference points for estimating the rates for your target configuration. The assumption is that you can scale these values according to your estimate of your system and data workload.

For roughly estimating deduplication throughput, we have broken data workloads into two broad types.

  • General-Purpose File Server – Characterized by existing data that is relatively static or with few/infrequent changes and new data that is generally created as new files
  • Hyper-V (VDI and virtualized backup) – Characterized by Virtual Machine data which is stored in VHD files. These files are typically held open for long periods of time with new data in the form of frequent updates to the VHD file.

Notice that the form the data churn takes is very different between the general-purpose file server and the Hyper-V workloads. With the general-purpose file server, data churn usually takes the form of new files. With Hyper-V, data churn takes the form of modifications to the VHD file.

Because of this difference, for the general-purpose file server we normally talk about deduplication throughput in terms of time to optimize the amount of new file data added and for Hyper-V we normally talk about this in terms of time to re-optimize an entire VHD file with a percentage of changed data. The two sections below show how to do the volume size estimate for these two workloads for Method 1.

Notes on the script examples

The script examples given in this section make two important assumptions:

  • All storage size numbers assume binary prefixes. Throughout this article, binary prefixes are implied when measuring storage sizes: "megabyte" (MB) means 1024 × 1024 bytes, "gigabyte" (GB) means 1024 MB, and so on. All calculations and displayed numbers for storage sizes in Windows Server 2012 and Windows Server 2012 R2 follow this same convention.
  • Throughput processing time estimates are rounded up. Queuing theory states that a system’s processing (service) rate must be greater than (and not just equal to) the incoming job (generation) rate or eventually the queue will always grow to infinity. The scripts round up the ratio of the optimization time required to the optimization window length. However, it is recommended to be conservative when specifying your daily optimization window and in general to use a lower number than the maximum time expected. If your environment is expected to have a high level of variability in data churn, further scale down your estimated optimization window length accordingly.
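The round-up behavior can be seen directly. For instance, 15.2 hours of optimization work against a 12-hour daily window must be split across two volumes, not 1.27:

```powershell
# Ceiling, not rounding: 15.2 hours of work / 12-hour window -> 2 volumes
[System.Math]::Ceiling(15.2 / 12)   # 2
```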

General Purpose File Server – Method 1

As noted above, the deduplication of general-purpose file server workloads is primarily characterized by the optimization throughput of new data files. We have taken measurements of this throughput rate for two different hardware configurations running both Windows Server 2012 and Windows Server 2012 R2. The details of the system configurations are listed below. Since the throughput rate is primarily dependent on the overall performance of the storage subsystem, you can scale these rates according to your estimate of your system’s performance compared to these reference configurations. Scale up the throughput for higher performance storage and scale down the throughput for lower performance storage.

The table below lists the typical optimization deduplication throughput rates for new file data for General Purpose File Server workloads on the two tested reference systems.

Deduplication throughput rates for new file data (general-purpose file server workload)

                                      System 1                          System 2
Drive types                           SATA (7.2K RPM)                   SAS (15K RPM)
Raw disk speed (read/write)           129 MBps / 109 MBps               204 MBps / 202 MBps
Drive configuration                   3 drives, spanned (RAID 0)        4 drives, spanned (RAID 0)
                                      into a single volume              into a single volume
Memory                                12 GB                             16 GB
Processor                             2.13 GHz, 1 x L5630, quad         2.13 GHz, 1 x L5630, quad
                                      core with Hyper-Threading         core with Hyper-Threading
Throughput, Windows Server 2012       ~22 MB/s                          ~26 MB/s
Throughput, Windows Server 2012 R2    ~23-31 MB/s                       ~45-50 MB/s

  

Two points to note from the measured throughput rates in the table:

  • Throughput increases from System 1 to System 2 as expected given the increase in drive performance and number of drives used
  • Throughput increases overall in the Windows Server 2012 R2 release, and more for the SAS configuration. This is due to overall efficiency enhancements as well as the use of read-ahead which leverages the queueing capabilities of SAS drives.

Rough guidelines for estimating the typical churn rates of General Purpose File Servers are to use values in the 1% to 6% range. For the examples below, a conservative estimate of 5% is used.

Given the typical optimization throughput values from the table and using an estimate of the churn rates of the files, administrators can estimate if deduplication can keep up with their needs by using the following script to calculate a volume size recommendation.

# General Purpose File Server (GPFS) workload volume size estimation
# TotalVolumeSizeGB = total size in GB of all volumes that host data to be deduplicated
# DailyChurnPercentage = percentage of data churned (new data or modified data) daily
# OptimizationThroughputMB = measured/estimated optimization throughput in MB/s
# DailyOptimizationWindowHours = 24 hours for background mode deduplication, or daily schedule length for throughput optimization
# DeduplicationSavingsPercentage = measured/estimated deduplication savings percentage (0.00 - 1.00)
# FreeSpacePercentage = it is recommended to always leave some amount of free space on the volumes, such as 10% or twice the expected churn

write-host "GPFS workload volume size estimation"
[int] $TotalVolumeSizeGB = Read-Host 'Total Volume Size (in GB)'
[double] $DailyChurnPercentage = Read-Host 'Percentage data churn (example 5 for 5%)'
[double] $OptimizationThroughputMB = Read-Host 'Optimization Throughput (in MB/s)'
[double] $DailyOptimizationWindowHours = Read-Host 'Daily Optimization Window (in hours)'
[double] $DeduplicationSavingsPercentage = Read-Host 'Deduplication Savings percentage (example 70 for 70%)'
[double] $FreeSpacePercentage = Read-Host 'Percentage allocated free space on volume (example 10 for 10%)'

# Convert percentages to fractions
$DailyChurnPercentage = $DailyChurnPercentage/100
$DeduplicationSavingsPercentage = $DeduplicationSavingsPercentage/100
$FreeSpacePercentage = $FreeSpacePercentage/100

# Total logical data size
$DataLogicalSizeGB = $TotalVolumeSizeGB * (1 - $FreeSpacePercentage) / (1 - $DeduplicationSavingsPercentage)

# Data to optimize daily
$DataToOptimizeGB = $DailyChurnPercentage * $DataLogicalSizeGB

# Time required to optimize data (GB over MB/s, converted to hours)
$OptimizationTimeHours = ($DataToOptimizeGB / $OptimizationThroughputMB) * 1024 / 3600

# Number of volumes required
$VolumeCount = [System.Math]::Ceiling($OptimizationTimeHours / $DailyOptimizationWindowHours)

# Volume size
$VolumeSize = $TotalVolumeSizeGB / $VolumeCount

write-host
write-host "Data to optimize daily: $DataToOptimizeGB GB"
$OptimizationTimeHours = "{0:N2}" -f $OptimizationTimeHours
write-host "Hours required to optimize data: $OptimizationTimeHours"
write-host "$VolumeCount volume(s) of size $VolumeSize GB is recommended."
write-host

Example 1:

Assume a general-purpose file server with 8 TB of SAS storage is running Windows Server 2012 R2 with deduplication enabled and scheduled to operate in throughput mode at night for 12 hours. From the Server Manager UI or the Get-DedupVolume cmdlet, the admin sees deduplication is reporting 70% savings.

Using the table above, we get the typical optimization throughput for SAS (45 MB/s) and assume 5% file churn for the General Purpose File Server.

After plugging in these input values of the scenario into the script:

PS C:\deduptest> .\calculate-gpfs.ps1

GPFS workload volume size estimation
Total Volume Size (in GB): 8192
Percentage data churn (example 5 for 5%): 5
Optimization Throughput (in MB/s): 45
Daily Optimization Window (in hours): 12
Deduplication Savings percentage (example 70 for 70%): 70
Percentage allocated free space on volume (example 10 for 10%): 10

the calculation script outputs:

Data to optimize daily: 1228.8 GB
Hours required to optimize data: 7.77
1 volume(s) of size 8192 GB is recommended.

So, we can expect a server with a single 8 TB volume and 5% churn to be able to process the ~1.2 TB of changes in under 8 hours. The deduplication server should be able to complete the optimization work within the scheduled 12-hour night window.
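Example 1's numbers can be reproduced by plugging the values directly into the script's formulas, as a quick sanity check independent of the interactive prompts:

```powershell
# Reproduce Example 1 by hand: 8192 GB volume, 10% free space, 70% savings,
# 5% daily churn, 45 MB/s throughput, 12-hour window.
$DataLogicalSizeGB = 8192 * (1 - 0.10) / (1 - 0.70)                 # 24576 GB of logical data
$DataToOptimizeGB  = 0.05 * $DataLogicalSizeGB                      # 1228.8 GB of daily churn
$OptimizationTimeHours = ($DataToOptimizeGB / 45) * 1024 / 3600     # ~7.77 hours
$VolumeCount = [System.Math]::Ceiling($OptimizationTimeHours / 12)  # 1 volume
```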

Example 2:

If the same server used SATA instead of SAS ($OptimizationThroughputMB = 23 MB/s), the script would recommend 2 volumes to complete the optimization work within the same 12-hour window.

Data to optimize daily: 1228.8 GB
Hours required to optimize data: 15.20
2 volume(s) of size 4096 GB is recommended.

If a 17-hour optimization window were available for the same SATA hardware, only a single 8 TB volume would be needed.

Data to optimize daily: 1228.8 GB
Hours required to optimize data: 15.20
1 volume(s) of size 8192 GB is recommended.  

Hyper–V (VDI and Virtualized Backup) – Method 1

As noted above, the deduplication of Hyper-V workloads is primarily characterized by the re-optimization throughput of existing VHD files. We have taken measurements of this throughput rate for a VDI reference hardware deployment running Windows Server 2012 R2.

The table below lists the measured re-optimization deduplication throughput rates for Hyper-V VDI workloads running on the VDI reference system.

Deduplication throughput rates for VHD files (Hyper-V VDI workload) [2]

Storage Spaces configuration (SSD, HDD) [1], running Hyper-V on Windows Server 2012 R2:

  • Re-optimization (background mode) of VHD file: ~200 MB/s
  • Re-optimization (throughput mode) of VHD file: ~300 MB/s

[1] Using a VDI reference hardware deployment (with JBODs) detailed here

[2] Note that these rates are much larger than those listed for processing new file data for the general-purpose file server scenario in the previous section. This is not because the actual deduplication operation is faster, but rather because the full size of the file is counted when calculating the rates, and for VHD files in this scenario only a small percentage of the data is new.

Note from the table that the throughput rates will typically differ depending on the scheduling mode chosen for deduplication. When the “BackgroundModeOptimization” job schedule is chosen, the optimization jobs are run at low priority with a smaller memory allocation. When the “ThroughputModeOptimization” job schedule is chosen, the optimization jobs are run at normal priority with a larger memory allocation. (For more information on configuring deduplication, refer to Install and Configure Data Deduplication on Microsoft TechNet.)
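As a sketch of the scheduling choice described above, a nightly throughput-mode window can be created with the New-DedupSchedule cmdlet (the schedule name, start time, and duration here are illustrative, not prescribed values):

```powershell
# Sketch: a nightly 10-hour throughput-mode optimization window starting at 22:00.
# Name, start time, and duration are illustrative; adjust to your environment.
New-DedupSchedule -Name "NightlyOptimization" -Type Optimization `
    -Start (Get-Date "22:00") -DurationHours 10
```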

Typical churn rates for Hyper-V VDI workloads are around 5-10%, which is reflected in the deduplication throughput rates listed above. If you expect more or less churn, you can scale these values accordingly to estimate the impact on recommended volume size (the processing rate increases with less churn and decreases with more churn).

Given the typical optimization throughput in the table, administrators can estimate if deduplication can keep up with their needs by using the following script to calculate a volume size recommendation.

# Hyper-V VDI workload volume size estimation
# TotalVolumeSizeGB = total size in GB of all volumes that host data to be deduplicated
# VHDOptimizationThroughputMB = measured/estimated optimization of VHD file throughput in MB/s
# DailyOptimizationWindowHours = 24 hours for background mode deduplication, or daily schedule length for throughput optimization