Q&A With Matthias Wollnik, Senior Program Manager for Data Deduplication in Windows Server 2012 R2

Hi Folks –

Back in December, I wrote a blog article on Data Deduplication, which was first introduced in Windows Server 2012 and Windows Storage Server 2012. It’s been improved in Windows Server 2012 R2 and Windows Storage Server 2012 R2, and it is quickly becoming a very popular feature. In this post, I’ll share the perspective of Matthias Wollnik, a Senior Program Manager for Data Deduplication at Microsoft:

Q: Why did Microsoft develop Data Deduplication? What customer pain(s) did you set out to solve?

When we looked at what we could do to best serve our customers’ storage needs, we saw that:

  • The amount of data being created is continuing to grow very rapidly—to the point that companies are facing increased storage costs even as the average cost-per-gigabyte continues to fall.

  • Many companies are turning to deduplication to ease these pains, but until now those solutions have been fairly expensive.

Based on these factors, we saw an opportunity to deliver new customer value by building Data Deduplication into Windows Server—in a way that would both help companies save money on storage and reduce related storage management costs.

Q: Where can I get Data Deduplication and what does it cost?

Data Deduplication is built into Windows Server 2012 R2 Standard, Windows Server 2012 R2 Datacenter, and Windows Storage Server 2012 R2 Standard—as well as the 2012 (pre-R2) versions of those editions. It’s a configurable feature under the File and Storage Services role and can be managed via Server Manager or Windows PowerShell. And because it’s built-in and ready to use, there’s no additional cost beyond that of the operating system.

Q: Are there any system requirements for using Data Deduplication?

Unlike some other solutions, Data Deduplication does not require any additional hardware. System requirements and considerations include the following:

  • It uses at most half of the available server memory.
  • We recommend 2 GB of usable working memory for each terabyte of data in a volume. Because Data Deduplication uses at most half of the server’s memory, deduplicating a one-terabyte volume calls for at least 4 GB of total server memory.
  • Data Deduplication is also compute-intensive, which is why using it for live VDI workloads requires the storage and compute nodes to be connected remotely via SMB.
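
As a quick illustration of the memory guidance above, here is a minimal Python sketch of the arithmetic. The function and parameter names are made up for this example and are not part of any Microsoft tool; the only inputs are the two rules of thumb from this Q&A (2 GB of working memory per terabyte, and at most half of server memory used).

```python
def recommended_total_memory_gb(volume_tb, gb_per_tb=2.0, max_memory_share=0.5):
    """Estimate total server memory for deduplicating a volume.

    Hypothetical helper: gb_per_tb and max_memory_share simply encode the
    two rules of thumb quoted in the text.
    """
    working_memory_gb = volume_tb * gb_per_tb      # memory the dedup job should have
    return working_memory_gb / max_memory_share    # dedup may use only this share


print(recommended_total_memory_gb(1))  # 4.0 (the one-terabyte example above)
print(recommended_total_memory_gb(4))  # 16.0
```

Treat the result as a rough planning number only; it does not account for whatever else the server is running.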

Because it works entirely at the file server level, the clients that connect to the server can be running any operating system. Data Deduplication doesn’t care if you’re using SMB3, SMB2.1, or NFS file protocols to access a share where the data is stored, or if it’s just local data that isn’t exposed for remote access.

Q: What are the recommended use cases for Data Deduplication?

Data Deduplication is recommended for, and delivers significant results on, home directory shares, group file and collaboration shares, software deployment shares, and VHD libraries. There are several things to consider when determining whether to use Data Deduplication:

  • It is designed for NTFS data volumes.
  • It works on any “cold” files that are not currently in use.
  • It can also be used to optimize virtual disks for running VDI workloads—provided that the storage and compute nodes for the VDI infrastructure are connected remotely via the SMB protocol. (Note: This capability is new for Windows Server 2012 R2 and Windows Storage Server 2012 R2. Everything else discussed in this Q&A applies to pre-R2 versions as well.)
  • It does not support boot or system drives.
  • Microsoft does not recommend or support using it on SQL Server and Exchange Server files, which, even if cold, will not benefit much from deduplication.

Q: Microsoft says Data Deduplication can reduce required disk space by up to 90 percent. What kind of results can a company expect?

The amount of disk space you’ll save depends on the type of data being stored:

  • In both Microsoft internal testing and testing performed by ESG Lab, Data Deduplication has shown savings of 25-60 percent for general file shares and 90 percent for operating system VHDs.
  • You can use the Deduplication Evaluation Tool to determine the expected savings that you would get if deduplication were to be enabled on a particular local or remote folder. (Note: More information on this tool can be found here and here.)

Q: How does Data Deduplication work?

Data Deduplication reduces the amount of physical disk space required to store a given amount of logical data. During the deduplication process, it:

  • Examines and segments files into variable-sized chunks.
  • Identifies duplicate chunks that appear in more than one file.
  • Maintains a single copy of each chunk in a compressed format in a central repository.
  • Replaces each deduplicated file with a much-smaller reference that indicates which chunks are used by the file.

When a deduplicated file is read, a filter in the read-path reassembles the file in a manner that is transparent to the calling application or user.
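
To make those steps more concrete, here is a toy Python sketch of chunk-based deduplication. It is not the Windows Server implementation: the simple rolling checksum, the SHA-256 chunk IDs, the zlib compression, and every name in the code (`split_chunks`, `ChunkStore`, and so on) are assumptions chosen purely for illustration.

```python
import hashlib
import zlib


def split_chunks(data, mask=0x3F, min_size=16):
    """Cut data into variable-sized chunks at content-defined boundaries.

    Toy boundary rule (an assumption, not the real algorithm): end a chunk
    when the low bits of a simple rolling checksum are all ones.
    """
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        if i + 1 - start >= min_size and (rolling & mask) == mask:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # whatever is left is the final chunk
    return chunks


class ChunkStore:
    """Central repository that keeps one compressed copy of each chunk."""

    def __init__(self):
        self.chunks = {}  # chunk ID (SHA-256 hex digest) -> compressed bytes

    def add_file(self, data):
        """Store a file and return its list of chunk IDs (the 'reference')."""
        recipe = []
        for chunk in split_chunks(data):
            chunk_id = hashlib.sha256(chunk).hexdigest()
            if chunk_id not in self.chunks:  # duplicates are stored only once
                self.chunks[chunk_id] = zlib.compress(chunk)
            recipe.append(chunk_id)
        return recipe

    def read_file(self, recipe):
        """Reassemble the original bytes from a list of chunk IDs."""
        return b"".join(zlib.decompress(self.chunks[c]) for c in recipe)


store = ChunkStore()
file_a = bytes(range(256)) * 8
recipe_a = store.add_file(file_a)
stored_after_first = len(store.chunks)
recipe_b = store.add_file(file_a)  # a second, identical file

print(len(store.chunks) == stored_after_first)  # True: no new chunks were stored
print(store.read_file(recipe_b) == file_a)      # True: the read path rebuilds the file
```

In this sketch the chunk-ID list plays the role of the small reference that replaces each file, and `read_file` stands in for the read-path filter that reassembles the data transparently.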

Q: What opportunity does Data Deduplication present for Microsoft partners who help companies deploy Windows Server?

Data Deduplication is a great thing to offer to set up for customers for two reasons:

  • It can help them save significantly on storage costs.
  • It’s low-touch—just turn it on, optionally adjust the default configuration parameters such as file age, and walk away.

That said, the larger opportunity for Microsoft partners is that Data Deduplication is a great enabler for new VDI solutions, as supported by the ability to deduplicate running VDI workloads that we added in R2. Because it greatly reduces the amount of storage that’s required for VDI VHDs, it makes new VDI solutions a lot more cost-competitive.

Q: Can you provide an example of how much a company might save in a VDI scenario?

Let’s assume that you want to deploy 100 VDI VMs at 40 GB per desktop, and that for performance and reliability considerations you want to use mirrored, high-performance, dual-port SAS2 SSD drives:

  • Without Data Deduplication, you would need 8 TB of storage, which is 40 GB x 100 VMs x 2 copies of the data. Given that a one-terabyte SAS2 SSD drive costs upwards of $2,000 today, that’s at least $16,000 in SSD costs, not counting the cost of the storage server or appliance itself.
  • With Data Deduplication, which provides up to a 90 percent reduction in required disk space for VDI VHDs, you could reduce the total required disk space to about 0.8 TB (10 percent of 8 TB), or roughly one terabyte, and just pay a few thousand dollars for the SSDs.
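
The arithmetic in this example can be checked with a short Python sketch. The helper name `mirrored_ssd_plan` and its parameters are hypothetical; the figures (100 VMs, 40 GB each, 2 mirrored copies, 90 percent savings, $2,000 per one-terabyte drive) come from the scenario above, and the sketch assumes drives are bought in whole one-terabyte units for each mirror copy.

```python
import math


def mirrored_ssd_plan(vm_count, gb_per_vm, copies, savings_percent=0,
                      drive_gb=1000, drive_cost=2000):
    """Return (total stored GB, drive count, SSD cost) for a mirrored VDI store.

    Hypothetical helper for illustration; it assumes whole drives per mirror copy.
    """
    logical_gb = vm_count * gb_per_vm
    stored_gb = logical_gb * (100 - savings_percent) / 100  # per copy, after dedup
    drives = math.ceil(stored_gb / drive_gb) * copies       # whole drives per copy
    return stored_gb * copies, drives, drives * drive_cost


print(mirrored_ssd_plan(100, 40, 2))                      # (8000.0, 8, 16000)
print(mirrored_ssd_plan(100, 40, 2, savings_percent=90))  # (800.0, 2, 4000)
```

Under these assumptions, the deduplicated case needs one drive per mirror copy, which is where the "few thousand dollars" figure comes from. Actual savings depend on how much of the 90 percent best case a given set of VHDs achieves.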

Thanks to Matthias for taking the time to discuss some of the most commonly asked questions about Data Deduplication. For more information on Data Deduplication, see my blog here, and for more info on how to deploy it for VDI storage, see Matthias’s blog articles here and here.

Cheers,
Scott M. Johnson
Senior Program Manager
Windows Storage Server
@supersquatchy