Windows Server 2012 introduced a deduplication feature that provides large savings for file systems with a lot of redundant data. In most file systems, the bulk of the data is cold, meaning most files are rarely changed. Windows Server 2012 deduplicates the contents of cold files: the common “chunks” shared by various files are stored in a common area, and each file holds links to that chunk area. This yields large savings when the files contain a lot of duplicate content.
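To make the chunk-and-link idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: it uses tiny fixed-size chunks and SHA-256 hashes, and the names `deduplicate` and `rehydrate` are hypothetical. It does not reproduce the chunking algorithm or on-disk format that Windows actually uses.

```python
import hashlib

CHUNK_SIZE = 4  # assumption: tiny fixed-size chunks, chosen only to keep the example readable


def deduplicate(files):
    """Split each file into chunks and store every distinct chunk once.

    Returns (chunk_store, links): chunk_store maps a chunk hash to its bytes,
    links maps a file name to the ordered list of chunk hashes it refers to.
    """
    chunk_store = {}
    links = {}
    for name, data in files.items():
        refs = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            chunk_store.setdefault(digest, chunk)  # keep only the first copy
            refs.append(digest)
        links[name] = refs
    return chunk_store, links


def rehydrate(name, chunk_store, links):
    """Rebuild a file's full content by following its links into the chunk store."""
    return b"".join(chunk_store[digest] for digest in links[name])


files = {
    "a.txt": b"AAAABBBBCCCC",
    "b.txt": b"AAAABBBBDDDD",  # shares its first two chunks with a.txt
}
chunk_store, links = deduplicate(files)

logical_bytes = sum(len(data) for data in files.values())
stored_bytes = sum(len(chunk) for chunk in chunk_store.values())
print(f"logical size: {logical_bytes} bytes, chunk store size: {stored_bytes} bytes")
```

The two files share their first two chunks, so the chunk store holds four chunks instead of six, and each file can still be rebuilt in full from its links.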
DPM 2012 SP1 can now protect a Windows Server 2012 deduplicated volume efficiently. When a user chooses to protect a full volume that is deduplicated, DPM recognizes this and copies the content efficiently, which saves considerable network bandwidth and DPM storage. When a file system is deduplicated, Windows Server 2012 keeps all common chunks in a chunk store located under the volume's System Volume Information folder. The files hold links, called reparse points, that point to the chunk store. DPM initially copies the whole chunk store and all files in deduplicated format. In subsequent delta replications (DRs), DPM tracks changes in both the chunk store and the files and transfers only the changed content. In other words, DPM transfers a deduplicated volume in its deduplicated format.
Deduplication can be enabled on a volume as described here. Once it is enabled, the deduplication logic works on the “cold” files as configured by the user and deduplicates their data, reducing storage consumption on the volume. Once a protection group (PG) is created with a whole-volume backup, DPM detects that the volume is deduplicated and backs up the data efficiently. For example, if a volume holds 100 GB of files before deduplication and its storage consumption drops to 70 GB after deduplication, DPM transfers 70 GB over the wire during Initial Replication (IR) and stores it as 70 GB. This provides significant network and storage savings for DPM. Here are the steps to follow to leverage this capability.
1) Assume that PS1 is the production server where the file system volume (Vol1) resides and DPM1 is the DPM server.
2) Install the Deduplication role on PS1
3) Enable Deduplication on Vol1
4) Install the DPM server on DPM1
5) Install the Deduplication role on DPM1
6) Install the DPM agent on PS1
7) Create a Protection Group (PG) and select Vol1 on PS1 with appropriate protection settings
8) DPM not only recognizes that this is a deduplicated volume but also transfers the content efficiently
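To put numbers on the 100 GB example given before the steps, the following Python sketch works out the savings for the initial replication. The function name `ir_savings` is hypothetical, and the calculation assumes the only difference between the two cases is whether the data travels in deduplicated format.

```python
def ir_savings(logical_gb, deduped_gb):
    """Return (saved_gb, saved_percent) when IR transfers the deduplicated size
    instead of the full logical size of the files."""
    saved_gb = logical_gb - deduped_gb
    saved_percent = 100.0 * saved_gb / logical_gb
    return saved_gb, saved_percent


# Figures from the example: 100 GB of files, 70 GB on disk after deduplication.
saved_gb, saved_percent = ir_savings(100, 70)
print(f"IR transfers 70 GB instead of 100 GB: {saved_gb} GB saved ({saved_percent:.0f}%)")
```

The same 30 GB is saved twice: once on the network during IR and once on the DPM replica, which stores the data in the same format.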
Even though DPM backs up the file system efficiently, a backup admin can still use DPM’s Item Level Recovery (ILR) capability to recover a small set of files or directories instead of the whole volume. DPM achieves this by using Windows Server 2012 Dedup technology to understand the file system and recover the required items. This is why the DPM server must run on Windows Server 2012 with the Dedup role installed. Note that Dedup should not be enabled on the DPM storage (replica or shadow copy volumes). This efficient backup capability is available only when the full volume is backed up and restored. Here is the table that shows various scenarios and DPM’s protection and recovery efficiency capabilities.
Internal workings of DPM Dedup Backup and Recovery:
Windows Server 2012 deduplicates data at the volume level and stores all “dedup” chunks in a chunk store under the volume's System Volume Information folder. Every file that has “duplicate” content points to this chunk store. By keeping links in place of duplicate content, Windows reduces storage consumption. The DPM agent on the file server recognizes that the volume has dedup enabled, reads the files in “shallow” form, and stores them on DPM in the same “shallow” format. DPM also copies the whole chunk store from the System Volume Information folder as is. DPM has extended its Express Full technology to Dedup file system protection as well. This means DPM continues to track file changes and, at backup time, copies only the changed content.
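The effect of copying “shallow” files plus the chunk store, and then sending only changes, can be sketched in Python as follows. This is a simplified model, not DPM's actual change tracking: the function `compute_delta` is hypothetical, files are modelled as lists of chunk IDs, and deletions are ignored.

```python
def compute_delta(source_chunks, source_files, replica_chunks, replica_files):
    """Return what must be sent to bring the replica up to date.

    source_chunks / replica_chunks: dict of chunk id -> chunk bytes (the chunk store).
    source_files / replica_files: dict of file name -> list of chunk ids (the shallow file).
    Deletions are not handled in this sketch.
    """
    new_chunks = {cid: data for cid, data in source_chunks.items()
                  if cid not in replica_chunks}
    changed_files = {name: refs for name, refs in source_files.items()
                     if replica_files.get(name) != refs}
    return new_chunks, changed_files


# State of the replica after the initial replication.
replica_chunks = {"c1": b"AAAA", "c2": b"BBBB", "c3": b"CCCC"}
replica_files = {"a.txt": ["c1", "c2"], "b.txt": ["c1", "c3"]}

# On the production volume, one new file appears that reuses chunk c1 and adds chunk c4.
source_chunks = dict(replica_chunks, c4=b"DDDD")
source_files = dict(replica_files, **{"c.txt": ["c1", "c4"]})

new_chunks, changed_files = compute_delta(
    source_chunks, source_files, replica_chunks, replica_files)
bytes_sent = sum(len(data) for data in new_chunks.values())
print(f"chunks to send: {sorted(new_chunks)}, files to send: {sorted(changed_files)}")
```

Only the new chunk and the new shallow file cross the wire, because the chunk that `c.txt` shares with the existing files is already on the replica.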
DPM offers various kinds of recovery options. Each kind of recovery has specific requirements for leveraging the dedup efficiencies. All of these details are captured below. Note that all of the scenarios below assume that the source volume was deduplicated at protection time and that DPM protected the full volume efficiently.
Once a deduplicated file system is protected to a primary DPM server, it cannot be protected to a secondary DPM server. Protection of a deduplicated file system can be extended to online protection by opting in to Windows Azure Backup. Because Windows Azure Backup supports a “subset” of the protected resources, only the files selected for Azure protection are backed up to Azure, and they are sent in unoptimized format. The End User Recovery feature is not supported for deduplicated volume protection.
A frequently asked question is why dedup should not be enabled on DPM storage volumes (replica and shadow copy volumes). Answering it requires understanding how DPM stores its backup data. DPM has a replica volume that reflects the latest state of the production server. After IR, the replica volume mirrors the production server and a snapshot is created. At the next backup, DPM copies the new content onto the replica volume, which causes a Copy On Write (COW) onto the shadow copy volume because VolSnap keeps the old snapshot intact. After the backup completes, DPM creates a new snapshot on the replica volume, and VolSnap performs COW for any subsequent changes to this “snapshotted” volume. So when the dedup engine tries to deduplicate the content, every write it makes on the replica volume triggers COW and bloats the diff area, and actual DPM storage consumption goes up rather than down. Another issue is that DPM’s consistency check (CC) logic will not work, because the files on the DPM side no longer match those on the production side. CC then concludes that the backups are not valid and transfers all the content again. For these reasons, dedup should not be enabled on DPM storage volumes.
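The diff area growth described above can be illustrated with a toy copy-on-write model in Python. The class `SnapshottedVolume` and the block counts are assumptions made for illustration. The sketch models only the general COW idea (the first write to a block after a snapshot preserves the old block) and is not a description of VolSnap internals.

```python
class SnapshottedVolume:
    """A volume with one active snapshot that is preserved through copy-on-write."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.diff_area = {}  # block index -> content as it was when the snapshot was taken

    def write(self, index, data):
        # First write to a block since the snapshot: save the old content first.
        if index not in self.diff_area:
            self.diff_area[index] = self.blocks[index]
        self.blocks[index] = data

    def read_snapshot(self, index):
        # The snapshot view comes from the diff area if the block has been overwritten.
        return self.diff_area.get(index, self.blocks[index])


original = [f"block{i}" for i in range(10)]

# Case 1: a normal backup touches only the two blocks that changed in production.
after_backup = SnapshottedVolume(original)
for i in (3, 7):
    after_backup.write(i, f"new{i}")

# Case 2: a dedup pass on the replica rewrites most blocks, although the
# logical file content is unchanged.
after_dedup = SnapshottedVolume(original)
for i in range(8):
    after_dedup.write(i, f"optimized{i}")

print(f"diff area after backup: {len(after_backup.diff_area)} blocks")
print(f"diff area after dedup pass: {len(after_dedup.diff_area)} blocks")
```

In this model the dedup pass forces eight blocks into the diff area where a normal backup forces two, so whatever space dedup frees on the replica volume is paid for on the shadow copy volume.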
Another interesting scenario is when dedup is enabled on a volume that is already being backed up. When dedup is enabled, the dedup logic changes almost all of the files on the volume even though their actual content has not changed. In the next backup, DPM sees these as file changes and transfers all the deduplicated files. This leads to a one-time spike in DPM backup storage consumption.