Better backups with DPM – the story so far…
DPM has consistently aimed at making your backups more reliable. The transition from tape to disk as a medium for storing backups provided the first impetus to improved reliability of backups, and was delivered with DPM 2006, our first release. It also brought about novel concepts of improving the ease of use for administrators with two key innovations:
1. Think SLAs, not jobs: DPM lets you think in terms of business goals and SLA requirements, rather than doing the math to determine which jobs need to run at what frequency, and from which source to which target. You identify your business requirements and feed them to a wizard in DPM, and DPM figures out the rest (allocating disk space, scheduling jobs, and so on) for you.
2. Alert based monitoring: You don’t have to deal with individual failures to identify the root causes. DPM filters out the noise and shows what’s actionable.
Next, DPM 2007 built on top of the innovations already in place and expanded the scope of protection to reliable support for key Microsoft workloads, while adding tape backups and disaster recovery options. The focus was to back up application data in a consistent and supported way. This was made possible by leveraging the Volume Shadow Copy Service (VSS) technology in Windows, which guarantees recoverability and allows backups with minimal application downtime.
Improvements in DPM 2010
DPM 2010 continues on this path of improving reliability and makes substantial improvements toward reducing the burden on IT administrators to meet their backup SLAs. Feedback from customers indicates that SLA requirements are becoming more stringent while the IT staff bandwidth to deal with failures is increasingly limited. Given these problems, the upcoming DPM 2010 release evolves further. The key improvements take automatic action for common failures and environment changes, as follows:
1. Automatic rerun of failed jobs
2. Automatic consistency check for DPM replicas when they become inconsistent with the data source
3. Automatically grow volumes when used space approaches the allocated disk space
4. Automatically discover and protect new protectable items in SQL servers and SharePoint farms
This post describes the details of this automatic behavior, how it helps you meet your SLA, and its limitations.
Automatic rerun of failed jobs
DPM 2010 now automatically reruns failed jobs by default. Jobs that do not run frequently are automatically retried after a failure (for frequently scheduled jobs, the next scheduled run effectively serves as the retry). By default, DPM retries a failed job only once, 1 hour after the original failure.
This also changes how DPM shows failure-related alerts. If a job fails, you will not see an alert immediately; instead, DPM's automatic attempts to fix the issue kick in. The alert becomes visible when one of the following conditions is met:
1. DPM decides not to auto-rerun for the failure. This can happen if the next scheduled job is less than 4 hours away or if the job failed because it was canceled.
2. The issue is not fixed automatically within 3 hours of the alert being created.
3. DPM has attempted the automatic rerun as per the configured settings, but the retries have not fixed the issue.
4. You modify or delete a protection group while DPM has a pending issue for that protection group. In that case the alert becomes visible and DPM takes no automatic action for the issue.
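The conditions above can be summarized as a small decision function. This is an illustrative sketch of the documented rules, not DPM's actual code; the function name and parameters are hypothetical.

```python
from datetime import timedelta

def alert_becomes_visible(next_scheduled_in, job_canceled,
                          time_since_alert, retries_exhausted,
                          group_modified):
    """Sketch of when DPM surfaces an alert instead of retrying silently."""
    # 1. DPM decides not to auto-rerun: the next scheduled job is less than
    #    4 hours away, or the job failed because it was canceled.
    if next_scheduled_in < timedelta(hours=4) or job_canceled:
        return True
    # 2. The issue is still unresolved 3 hours after the alert was created.
    if time_since_alert >= timedelta(hours=3):
        return True
    # 3. The configured reruns were attempted but did not fix the issue.
    if retries_exhausted:
        return True
    # 4. The protection group was modified or deleted with the issue pending.
    if group_modified:
        return True
    return False
```

For example, a failure on a job whose next scheduled run is only 2 hours away produces a visible alert immediately, because an automatic rerun would add little over simply waiting for the next run.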
Configuring automatic rerun parameters
The following registry keys configure the automatic rerun behavior. We recommend that you do not modify them unless you know of specific situations in your environment where a change is likely to help.
All the keys are under the hive HKLM\Software\Microsoft\Microsoft Data Protection Manager\Configuration
| Key | Type | Setting the key controls | Implications |
| --- | --- | --- | --- |
|  |  | Whether DPM automatically reruns failed jobs | If set to a non-zero value, DPM will not automatically rerun jobs. |
| AutoRerunDelay |  | The delay before DPM attempts to automatically rerun failed jobs | Change this if your typical production server or network downtime exceeds the default of 60 minutes. |
| AutoRerunNumberOfAttempts | DWORD | The number of times a failed job is retried before giving up, if it consistently fails | Default is 1. Increasing this value may increase the load on your system. The reruns are spaced AutoRerunDelay apart. |
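As a sketch of the policy these keys control, the retry loop below uses the documented defaults (1 retry, 60 minutes apart). `run_job` is a hypothetical stand-in for a DPM backup job; this is a simplified model of the behavior, not DPM's actual scheduler, and the real delay between attempts is elided.

```python
AUTO_RERUN_DELAY_MINUTES = 60      # default gap between reruns (AutoRerunDelay)
AUTO_RERUN_NUMBER_OF_ATTEMPTS = 1  # default retries (AutoRerunNumberOfAttempts)

def run_with_reruns(run_job):
    """Run a job; on failure, retry up to the configured number of times.

    Returns (succeeded, list of per-attempt outcomes). In DPM each rerun
    would be scheduled AUTO_RERUN_DELAY_MINUTES after the previous failure.
    """
    outcomes = []
    for attempt in range(1 + AUTO_RERUN_NUMBER_OF_ATTEMPTS):
        ok = run_job()
        outcomes.append(ok)
        if ok:
            return True, outcomes
    return False, outcomes
```

With the defaults, a transient failure (for example, a brief network outage) is absorbed by the single rerun an hour later, while a persistent failure surfaces as an alert after two failed attempts.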
Automatic consistency check options
Why is consistency check needed?
Before I explain how automatic consistency checks work, let's understand why DPM sometimes requires a consistency check.
As you know, DPM provides fast and efficient backups by means of the express full technology, which is implemented with the help of a filter driver. This filter helps DPM identify the blocks changed since the last backup, which are synchronized to the replica on the DPM server during every express full. However, under some conditions (such as a forced shutdown of a server or an unclean failover in a cluster), this tracking cannot be relied upon, and a consistency check is needed to accurately track data. The consistency check is simply an operation that compares all the blocks of data on the source and target to identify the changed blocks. Once a consistency check completes successfully, a new baseline is established for the filter to track changes, and backups continue with the more efficient express fulls.
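Conceptually, the comparison works like the toy model below. Real DPM compares fixed-size blocks on volumes; here lists of byte strings stand in for them, and the function name is hypothetical.

```python
import hashlib

def consistency_check(source_blocks, replica_blocks):
    """Toy model of a consistency check: compare every block on the source
    with the corresponding block on the replica and return the indices
    that need to be re-synchronized to re-establish a baseline."""
    changed = []
    for i, (src, dst) in enumerate(zip(source_blocks, replica_blocks)):
        # Comparing checksums rather than raw blocks avoids shipping
        # every block over the network just to find the differences.
        if hashlib.sha256(src).digest() != hashlib.sha256(dst).digest():
            changed.append(i)
    return changed
```

This is why a consistency check is expensive relative to an express full: it must touch every block on both sides, whereas an express full only transfers the blocks the filter driver already knows have changed.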
When DPM detects that a consistency check is needed, it marks the replica as inconsistent. In this situation, all further synchronizations fail until an administrator intervenes and runs a consistency check. While this happens less often than an express full failure, it can have a higher impact on your SLAs if not acted upon. While you could schedule consistency checks in DPM 2007, DPM 2010 gives you a couple of options to choose from.
How do I run consistency checks automatically?
Some production server workloads run with enough spare capacity to support a consistency check during work hours; this is fine for most workloads. In other scenarios, server owners may prefer to defer consistency check operations to an off-peak hour to minimize the impact on production services. To address both scenarios and reduce the manual intervention required of administrators, DPM 2010 offers the following options to automate consistency checks:
1. Run consistency check soon after DPM detects that a replica is inconsistent. This causes DPM to run a consistency check automatically, 15 minutes after a replica becomes inconsistent. This option is recommended for most deployments. In our experience, for most scenarios, backup administrators have run consistency check operations immediately on detecting issues without impacting production workloads significantly. This option helps meet your SLA as defined in the protection group.
2. Run consistency check once a day during off peak hours, if needed. This causes DPM to run a consistency check if the replica is inconsistent at the scheduled time. This option helps maintain a “once a day” backup in the worst case. DPM 2007 users may recognize this option as being available from “performance optimization” for a protection group.
Both the above options are available when creating a protection group. To change the option later, you need to choose the option to modify a protection group and make the appropriate changes.
Note that unlike automatic rerun of jobs, automatic consistency checks are attempted once only. To do multiple attempts at fixing an inconsistent replica, you can choose both options.
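The two options (which can be combined) translate into the following scheduling sketch. This is a simplified model of the described behavior, with hypothetical names; `daily_at_hour` stands in for the off-peak time configured on the protection group.

```python
from datetime import datetime, timedelta

def next_consistency_check(became_inconsistent_at, run_immediately,
                           run_daily, daily_at_hour):
    """Return the candidate times at which an automatic consistency
    check would fire for a replica marked inconsistent at the given time."""
    runs = []
    if run_immediately:
        # Option 1: run 15 minutes after the replica becomes inconsistent.
        runs.append(became_inconsistent_at + timedelta(minutes=15))
    if run_daily:
        # Option 2: run at the scheduled off-peak hour, if the replica
        # is still inconsistent at that time.
        scheduled = became_inconsistent_at.replace(
            hour=daily_at_hour, minute=0, second=0, microsecond=0)
        if scheduled <= became_inconsistent_at:
            scheduled += timedelta(days=1)
        runs.append(scheduled)
    return runs
```

Enabling both options gives two chances to repair the replica: one 15 minutes after the failure, and a fallback at the next off-peak slot.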
Automatically growing DPM volumes
The disk to disk backup paradigm imposes another overhead on administrators – that of estimating and managing their disk space allocations. This can be hard to get right and there’s a chance of over-allocating or under-allocating disk space. Also, in some cases, applications can generate a large amount of churn which fills up the DPM recovery point volume. In other cases, the default allocation by DPM (which is based on average churn of data we’ve seen for different kinds of data sources) may turn out to be insufficient. In such cases, DPM can run out of allocated disk space and cause backups to fail until an administrator acts and provides more storage.
Using thin provisioning of storage on a SAN is an interesting related topic which can help prevent up-front over-allocation and repeated alerts, but we'll save that for a later blog post 🙂. It is important to note, however, that you should not deliberately under-allocate disk space and rely on DPM to automatically grow to the right size – you may run into issues in the long run, as described in Ruud's blog post here.
DPM 2010 offers an option to automatically grow volumes when the utilization approaches capacity. How to set this option? This can be set while creating a protection group, or by choosing the option to Modify Disk Allocation from the Protection task area of the DPM administration console. You can change this setting at any time.
What does this option do? The growth factor is fixed at 25%, with a minimum of 10 GB. Is that too high? It was chosen based on the following considerations, to ensure the overall reliability of backups:
1. DPM servers typically host a very large number of volumes on dynamic disks, which allows volumes to be spanned and non-contiguous. However, each extent of a volume consumes additional system resources in Windows. As a DPM installation grows older, we can expect higher fragmentation and a risk of hitting system limits that affect the reliability of backups. Hence, growing by larger amounts is advisable. This is well explained in Ruud's blog post here.
2. The default replica volume allocations are typically made with data growth in mind. When the volume approaches capacity, the 25% grow is attempted to keep space available for a longer period before another grow is needed.
3. With DPM 2010, some data sources may be co-located on a single physical replica volume. Hence the growth of data may be coming from multiple sources, requiring a larger quantum of grow.
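The grow policy itself is simple arithmetic: expand by 25% of the current size, or by 10 GB, whichever is larger. The sketch below models the documented policy; it is not DPM's actual implementation.

```python
GROWTH_FACTOR = 0.25  # volumes grow by 25% of their current size...
MIN_GROW_GB = 10      # ...but never by less than 10 GB

def grown_size_gb(current_size_gb):
    """Return the new volume size after one automatic grow operation."""
    return current_size_gb + max(current_size_gb * GROWTH_FACTOR, MIN_GROW_GB)
```

So a 100 GB volume grows to 125 GB, while a small 20 GB volume grows to 30 GB, because the 10 GB floor dominates 25% of 20 GB. The floor keeps small volumes from accumulating many tiny extents.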
There are situations where you may not want DPM's automatic grow functionality – for example, if your data size is strictly controlled (as in client data protection scenarios), or if your initial volume allocation was made with the assistance of the DPM storage calculators and already accounts for data growth over a long period.
Also, if you are using custom volumes to back up data, automatic grow will not happen. DPM automatically grows only the volumes it directly manages – that is, volumes created by DPM on disks you have added to the storage pool.
Automatic discovery and protection of SQL Server databases and SharePoint content databases
DPM 2010 also automates two key scenarios for automatically protecting databases:
1. SQL Server databases that are added to an instance being protected by DPM are automatically added to protection.
2. Once you’ve configured protection for a SharePoint farm, any new databases in the farm are automatically protected, as long as the DBs are on machines which already have a DPM protection agent configured and attached to the DPM server backing up the farm.
To achieve this automatic behavior, DPM queries the SQL writer or the SharePoint writer once every 24 hours via a discovery job that runs at 12:00 AM. The results are compared with the previously known set of databases, and any new databases are configured for protection with the same protection group settings, including tape backups. You do not need to take any action!
For a newly added SQL Server database, new space is allocated automatically if required. For a SharePoint farm, the new database occupies space within the storage already allocated for the farm's replica; if that was provisioned adequately for growth, protection continues seamlessly. If it was not, and automatic grow is enabled for the protection group, the first backup may fail due to lack of disk space, but DPM will automatically grow the allocation and the next backup will succeed.
The automatic addition of databases also carries over to the secondary DPM server if it is configured.
DPM however doesn’t automatically remove databases from protection if they are missing. Any missing databases cause backup jobs to fail and are shown as alerts to the administrator.
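The discovery job's core logic is a set difference between inventories. The sketch below models the described behavior, including the asymmetry that new databases are protected automatically while missing ones are only flagged; the function name is hypothetical.

```python
def discover_databases(previously_known, currently_present):
    """Model of the nightly discovery job.

    Databases present now but not in the previous inventory are added to
    protection. Databases that have gone missing are NOT removed from
    protection; they cause backup failures that surface as alerts.
    """
    to_protect = sorted(currently_present - previously_known)
    missing = sorted(previously_known - currently_present)  # alert, don't drop
    return to_protect, missing
```

Keeping missing databases under protection is a deliberately conservative choice: an administrator, not an automated job, decides whether a disappeared database was intentionally removed.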
How do I review the automatic activity DPM is doing?
So how do we know if DPM is doing the right thing? Is it really running jobs automatically? Considering that issues that are automatically fixed up by DPM don’t show up as alerts, this may be a little hard to do.
Here’s how you can find out:
1. Alerts tab: Check the “view inactive alerts” check box. You will see any alerts that DPM might have automatically fixed without alerting you.
2. Jobs tab: See failed jobs and verify that DPM reran them 1 hour later.
3. Report: DPM 2010 provides a customizable report to view your SLA. In this report, you can ask DPM to create a view of recovery points available in a given time period. This helps you measure DPM’s performance based on your SLAs rather than individual job failures. To view this, select the items in the protection view and click on the “Recovery point status…” task in the action pane on the right side.
I still see alerts – is DPM really taking automatic action?
The automatic actions taken by DPM depend on a new service in DPM 2010 – the DPM Access Manager service (DPMAccessManager). Besides supporting DPM in a number of other features, this service captures failure alerts raised by the DPM execution engine (MSDPM) service and automatically takes action, just as an administrator would manually from the user interface.
This also means that the health of this service is important for DPM automatic actions to work smoothly. Here are some things you can check for if you think things aren’t working as expected:
1. Check from the services management console to verify that the DPMAccessManager service is running.
2. Restarting the service may help.
3. Check the event viewer for any error events with source=MSDPM for further information.
4. Send us a post on the DPM community forum and we’ll try to help.
Help us now by testing!
The DPM 2010 beta already has some of this automatic behavior built in. The anecdotal feedback from customers on forums, and the statistics we've collected from MSIT's dogfood deployments and the CTP customers we work closely with, have been really encouraging. But don't take my word for it… test drive it now and let us know how it goes for you 🙂 (https://connect.microsoft.com/Downloads/DownloadDetails.aspx?SiteID=840&DownloadID=22070).
We’ll continue listening and working to improve DPM for you!
— Prashant Kumar | Program Manager | DPM Team | (Author)