Is your organisation storing more than ever before? Dedup might be the answer.

We all want to store more information. Be it our ever growing email archive, our collection of family photos, or our customer invoices, the volume of information that we and our businesses need to store keeps increasing. The amount of storage available to you or your organisation may of course not quite be able to grow at such a rate because, while disk is an ever cheaper resource, it’s still not free. There are many options for increasing your storage capability, off-premises archiving to cloud storage for example, but that can mean moving the cost elsewhere (to bandwidth, for instance). A better option could be to decrease what you need to store.

Of course I’m not suggesting that you should go around deleting a whole bunch of users’ files, which would be bad and would probably result in a P45 saying hello. You could ask your users to delete their own files, which some may do; many, however, will take the view that their time is more important than the storage costs. Some would also be pig-headed and ask why, when a disk costs £70 for 2TB, they should have to delete their stuff. Many will also be completely clueless as to their disk usage.

Windows Server 2012 comes to your rescue with a great feature called Data Deduplication (dedup) which cuts down the amount of data you need to store without losing any of it. Frankly it’s a little bit like magic.

Essentially what dedup does is look at what’s stored on a volume and search for matches between chunks of data. When it spots two chunks that are identical it removes the second copy, freeing up the disk space that was consumed by the duplicate chunk and pointing any disk requests for that chunk at the remaining copy. A simplified example will help understanding; don’t get too hung up on the detail here – like the fact we’re using words, those are just an abstraction for illustration.

Example:

Your disk stores words, the words HELLO MARY HOW ARE YOU and HELLO DAVID HOW ARE YOU TODAY. All we really need to store is what’s unique, everything else is just duplication, so we store HELLO MARY HOW ARE YOU DAVID TODAY. Doing that saves us the second HELLO, HOW, ARE and YOU, or 14 letters, which is about a third of the 42 letters the two original sentences needed.
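
To make that arithmetic concrete, here’s a throwaway PowerShell snippet that does the same word-level “dedup” on those two sentences. It’s purely an illustration of the idea; real dedup works on binary chunks, not words.

# Toy illustration only: real dedup works on binary chunks, not words.
$sentences = "HELLO MARY HOW ARE YOU", "HELLO DAVID HOW ARE YOU TODAY"
$allWords  = $sentences -split ' '
$unique    = $allWords | Select-Object -Unique

$before = ($allWords | Measure-Object -Property Length -Sum).Sum   # 42 letters
$after  = ($unique   | Measure-Object -Property Length -Sum).Sum   # 28 letters
"Letters saved: $($before - $after)"                               # 14, about a third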

Dedup doesn’t, however, look at your data and work out which words are duplicated over and over; that would be inefficient, as you store data in many formats that might not be actual words. All data is ultimately stored as bits on your disks, so dedup looks at the bits on a disk, but of course individual bits are too granular (they are all 1s and 0s, obviously) so context would be lost. Dedup instead looks at chunks of data that have identical patterns. When a chunk is spotted with an identical pattern it is considered a duplicate and deduplicated. What is very clever, though, is how dedup decides on those chunks, changing the chunk size to make the most efficient savings. Another example will help, again with words.

Example 2:

Your disk stores words, the words HELLO MARY HOW ARE YOU TODAY and HELLO MARY HOW ARE YOUR CHILDREN TODAY. This time the deduplicated disk only stores HELLO MARY HOW ARE YOU TODAY R CHILDREN. In this second example we don’t need to store the word YOUR in full, even though it’s a new word, because most of it matches the smaller chunk YOU; only the extra R is stored.

One of the coolest things about dedup is that because it works at this level, lower than the file but higher than the bits, it can dedupe across file types, across file boundaries and across any physical disk boundaries such as disk block size. This means, for example, that should an Excel file contain the word CONTOSO and that exact same word appear in a text file, the two could theoretically deduplicate against each other.

We’ve been introducing this topic at our IT camps and getting the audience to test their own file servers using the DDPEVAL.EXE tool. You can take this tool from any Windows Server 2012 computer with dedup enabled and run it, non-intrusively, against any volume or share to evaluate how much space dedup will save you (just follow up to step 2 below and you’ll find the exe in Windows\System32). Attendees are seeing between 22% and 75% potential savings on profile, development and file server shares.
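
By way of illustration, you just point DDPEVAL.EXE at the volume or UNC share you want to check; the paths below are placeholders, so swap in your own:

C:\Windows\System32\ddpeval.exe E:\
C:\Windows\System32\ddpeval.exe \\fileserver01\profiles$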

If you’re sat there reading this thinking about data integrity then you get extra marks. If you’re deduping you do put extra reliance on the one copy of the data that you do have. For that reason dedup will only reference a deduplicated chunk 1,000 times; when the 1,001st occurrence of the same chunk is spotted it is left in place, and subsequent duplicates are deduped against that copy instead. Furthermore, a deduped chunk is maintained by re-writing the chunk when a process writes any data that contains it. This, along with other controls, maintains consistency.
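
If you want to tune how many references a single chunk can gather before an extra copy is kept, the Set-DedupVolume cmdlet exposes a ChunkRedundancyThreshold parameter. A quick sketch, with E: standing in for your deduplicated volume and the value purely an example:

# E: and the threshold value are examples only; pick what suits your volume.
Set-DedupVolume -Volume "E:" -ChunkRedundancyThreshold 1000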

If you’re using BranchCache you should also be jolly happy because the two technologies work together to reduce duplication in branch environments too.

Enabling dedup is a case of adding the feature to Windows Server 2012, which itself is easy to do (and if you prefer PowerShell, there’s a scripted equivalent sketched after these steps).

1. From Server Manager select Manage > Add Roles and Features then select the server you want to add Dedup to.

2. On the Server Roles wizard page expand File and Storage Services > File and iSCSI Services and check Data Deduplication, then complete the wizard to install the feature.

3. Select the File and Storage Services node in Server Manager.

4. Select Volumes and locate the server you enabled deduplication on (hint – if you don’t see it you need to add the server into Server Manager). Then select the volume on the server you wish to dedup.

5. Right click the volume and select Configure Data Deduplication.

6. Check Enable data deduplication. From here you need to select a minimum age for files to be deduplicated; this prevents files that change too frequently from being needlessly deduplicated, saving server resources. Enter any particular file types to skip (VHDs, for example, because they are open for long periods), specify specific folders to include or exclude, and set a schedule for running dedup jobs. Click OK to apply the changes.
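
As promised, here’s the PowerShell equivalent of those steps as a minimal sketch. E:, the file age, the folder and the file types are all just example values; I’m assuming you want roughly the same sort of settings as the wizard offers:

# Install the Data Deduplication feature (step 2 above).
Install-WindowsFeature -Name FS-Data-Deduplication

# Turn dedup on for the volume (steps 4 to 6 above); E: is just an example.
Enable-DedupVolume -Volume "E:"

# Optional tuning: only dedup files older than 5 days, skip VHDs and a temp folder.
Set-DedupVolume -Volume "E:" -MinimumFileAgeDays 5 -ExcludeFileType "vhd","vhdx" -ExcludeFolder "E:\Temp"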

That’s all there is to it to enable deduplication; the first dedup job will run when the schedule allows. There is much more that can be done with PowerShell, but by way of a teaser the following commands are useful:

Get-DedupJob: Shows the current dedup job status if a job is running.

Get-DedupStatus: Shows how much deduplication has occurred – this will show the savings.

Start-DedupJob: Starts deduplication immediately.
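
For example (with E: once again standing in for your deduplicated volume):

# Kick off an optimisation job now rather than waiting for the schedule.
Start-DedupJob -Volume "E:" -Type Optimization

# Watch the job progress while it runs.
Get-DedupJob

# Once it has finished, see how much space has been reclaimed.
Get-DedupStatus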

Dedup is a great tool in the arsenal of any IT guy struggling with data storage costs. Give it a try using DDPEVAL and see if this one feature alone is going to make Windows Server 2012 right for you; it just might!

If you want more technical information on Data Dedup then check out the Data Deduplication TechNet library and download the Windows Server 2012 Evaluation.