How to troubleshoot scheduled backup job failures in DPM 2012 and later versions

~ Sekar Raju

Hi everyone, Sekar Raju here from the DPM support team with a few tips on troubleshooting scheduled backup job failures in System Center 2012 Data Protection Manager (DPM 2012 or DPM 2012 R2) and DPM 2016. We have seen cases where recovery point jobs did not run as scheduled, yet the protection status of the affected data sources continued to appear as Green/OK in the DPM console. Because the protection status is displayed as Green/OK, the DPM administrator might think that everything is going well when it is not. This issue usually occurs when the SQL Agent fails to run the scheduled job that invokes the DPM engine to execute the backup.

NOTE Ad hoc manual jobs still run and complete successfully, because the SQL Agent is not used when you run jobs manually from the DPM console.

How scheduled job execution works

When protection groups are set up, DPM creates jobs in SQL to run the backups (e.g. incremental syncs, express full, etc.) for each data source, along with other maintenance jobs. A component called TriggerJob.exe invokes the DPM engine, passing the Job Definition ID for the data source, to begin execution of the backup job. The SQL Agent Scheduler runs TriggerJob.exe at the scheduled time using the following command syntax:

triggerjob.exe <JobDefinitionID> <ScheduleID> <FQDN-DPMServer>

An example of a typical command run by the SQL Agent Scheduler to begin execution of the job at the scheduled time is below.

C:\Program Files\Microsoft System Center 2012 R2\DPM\DPM\bin\TriggerJob.exe 1bd305ae-f158-4948-93f8-e935103b168f 1e53fd39-0339-4d41-96ec-89fdf587f1e5 <FQDN-DPMServer>

(This path may differ depending on the DPM version and on whether the installation was an upgrade (same path) or a fresh install (different path).)
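When comparing the command stored in a job step against what you expect, it can help to split it into its four parts. The following is a minimal Python sketch, not part of DPM; the function and class names are hypothetical. It assumes the command follows the syntax shown above and takes the three arguments from the right-hand end, because the executable path usually contains spaces.

```python
import uuid
from typing import NamedTuple


class TriggerJobCommand(NamedTuple):
    executable: str
    job_definition_id: uuid.UUID
    schedule_id: uuid.UUID
    dpm_server: str


def parse_triggerjob_command(command: str) -> TriggerJobCommand:
    """Split a TriggerJob.exe command line into path, two GUIDs and server."""
    # Split from the right so that spaces inside the path are preserved.
    parts = command.strip().rsplit(None, 3)
    if len(parts) != 4:
        raise ValueError(
            "expected: <path> <JobDefinitionID> <ScheduleID> <FQDN-DPMServer>"
        )
    executable, job_definition_id, schedule_id, dpm_server = parts
    # uuid.UUID raises ValueError if an ID is not a well-formed GUID.
    return TriggerJobCommand(
        executable.strip('"'),
        uuid.UUID(job_definition_id),
        uuid.UUID(schedule_id),
        dpm_server,
    )
```

A malformed GUID or a missing argument raises ValueError, which is a quick way to spot a job step whose command was edited or truncated.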

If for some reason the command fails to run and call triggerjob.exe, the DPM engine will not be invoked and thus the backup job will not be executed. Since SQL failed to run the command, DPM will not know about this failure and will continue to display the protection status of the data sources as Green/OK.

Below are a few things that you can check to troubleshoot the scheduled backup job failures.

1. Check the Application Event Log

As you can imagine, when a scheduled backup job fails to be invoked by SQL, DPM doesn’t raise any alerts for the failure, since it happened on the SQL side. However, these failures are captured in the Application Event Log as SQL, Windows Error Reporting or MSDPM events, depending on the root cause of the problem. Be sure to check the Application Event Log in Event Viewer and look for any events from SQL that are related to the scheduled job failure. If your DPM server uses a remote SQL Server for DPMDB, review the Application Event Log on the remote server as well.

For example, the following event may be found in the Application Event Log which indicates that the SQL Agent encountered some problem when trying to run the command line.

Log Name: Application
Source: SQLAgent$MSDPM2012
Date: <Date & Time>
Event ID: 208
Task Category: Job Engine
Level: Warning
Keywords: Classic
User: N/A
Computer: <DPMServerName>
Description:
SQL Server Scheduled Job ‘00890b12-9058-4f42-8143-291dc3de4d78’ (0xC52C50485ED1754EB12D16117B258DD7) – Status: Failed – Invoked on: <Date & Time>- Message: The job failed. The Job was invoked by User <UserName>. The last step to run was step 1 (Default JobStep).

Here is a sample event from Windows Error Reporting:

Fault bucket , type 0
Event Name: DPMException
Response: Not available
Cab Id: 0

Problem signature:
P1: TriggerJob
P2: 3.0.7696.0
P3: TriggerJob.exe
P4: 3.0.7696.0
P5: System.UnauthorizedAccessException
P6: System.Runtime.InteropServices.Marshal.ThrowExceptionForHRInternal
P7: 20B9A72D
P8:
P9:
P10:

Another sample event, this one from the DPM engine:

Log Name: Application
Source: MSDPM
Date: <Date & Time>
Event ID: 976
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: <FQDN of DPMServerName>
Description:
The description for Event ID 976 from source MSDPM cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.
The DPM job failed because it could not contact the DPM engine.

Problem Details:
<JobTriggerFailed><__System><ID>9</ID><Seq>0</Seq><TimeCreated><Date & Time> </TimeCreated><Source>TriggerJob.cs</Source><Line>76</Line><HasError>True</HasError></__System><Tags><JobSchedule /></Tags></JobTriggerFailed>
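If you export many Event ID 208 entries, you may want to pull out the job name (which is the Schedule ID) and the status from each description. The following Python sketch shows one way to do that with a regular expression. It assumes the description text matches the sample shown earlier in this section; the function name is hypothetical, and both straight and typographic quotes are accepted because copied text may contain either.

```python
import re
from typing import Optional, Tuple

# Matches: SQL Server Scheduled Job '<GUID>' ... Status: <word>
_EVENT_208 = re.compile(
    r"SQL Server Scheduled Job\s+['\u2018\u2019]"
    r"(?P<job>[0-9A-Fa-f-]{36})['\u2018\u2019]"
    r".*?Status:\s*(?P<status>\w+)",
    re.DOTALL,
)


def parse_event_208(description: str) -> Optional[Tuple[str, str]]:
    """Return (job name in lower case, status), or None if nothing matches."""
    match = _EVENT_208.search(description)
    if match is None:
        return None
    return match.group("job").lower(), match.group("status")
```

The returned job name can then be looked up in the Jobs list under SQL Server Agent to see which scheduled job failed.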

2. Run the job manually from SQL Server Management Studio

You can also try running the job manually from SQL Server Management Studio. The steps for doing this are below.

a) Open SQL Server Management Studio and connect to the SQL instance used for the DPMDB database. Expand SQL Server Agent and then Jobs. The GUIDs listed under Jobs are the Schedule IDs of the individual jobs. Right-click a job, then click “Start Job at Step” on the context menu.

[Screenshot: the “Start Job at Step” option on the job's context menu]

b) If the job fails to run, you should see an error similar to the one below.

[Screenshot: the error shown when the job fails to run]

c) If this error occurs, it confirms that the SQL Agent was not able to run the job because of a permissions problem or some other reason. Please refer to the “Check the logon account credentials” section below for troubleshooting this issue further.

Run the job manually from Command Prompt

You can run the triggerjob.exe command line manually to check whether the command will get executed and that the backup job starts in DPM correctly. To do this, follow the steps below.

a) Open SQL Server Management Studio and connect to the SQL instance used for the DPMDB database. Expand SQL Server Agent and then Jobs. Right-click on one of the jobs and then click Properties.

[Screenshot: the Properties option on a job's context menu]

b) On the Properties dialog, click Steps on the left and click the Edit button at the bottom.

[Screenshot: the Steps page of the job's Properties dialog, with the Edit button]

c) On the Job Step Properties dialog, copy the command from the command window as shown below.

[Screenshot: the command shown in the Job Step Properties dialog]

i. Run the copied command from an elevated command prompt on the DPM server (with local SQL server):

Example: C:\Program Files\Microsoft System Center 2012 R2\DPM\DPM\bin\TriggerJob.exe F60C8734-2DF5-4E86-8C7D-43558BD5A071 2F481ACB-2C3D-4F48-8C70-CA989C3E8FF2 <FQDN of DPMServer>

If the command runs successfully then the problem is likely related to incorrect permissions or log-on credentials. Please refer to the “Check the logon account credentials” section below for troubleshooting this issue further.

ii. Run the command from an elevated command prompt on the remote SQL server (if applicable):

Example: C:\Program Files\Microsoft Data Protection Manager\DPM2012R2\SQLPrep\TriggerJob.exe F60C8734-2DF5-4E86-8C7D-43558BD5A071 2F481ACB-2C3D-4F48-8C70-CA989C3E8FF2 <FQDN of DPMServer>

In the remote SQL server scenario, if the command completes successfully on the DPM server but fails on the remote SQL server, focus your troubleshooting efforts on the remote SQL server to rule out any permission, network and firewall issues.

NOTE

When you look at the list of Schedule IDs for the jobs in SQL, it can be hard to tell which data source each Schedule ID belongs to. You can run the following SQL query to list the jobs along with some user-friendly information:

use DPMDB     -- Change to the actual name of DPMDB if it is different
select
      sche.ScheduleId as 'SQL agent Schedule Job Name',
      sche.JobDefinitionId,
      prot.FriendlyName as 'Protection Group',
      am.ServerName as 'Servername or NULL',
      case
            when jobd.Type = 'C9B259D2-6402-486D-8E36-C6C1ADAE0912' then 'Maintenance job that runs @ midnight'
            when jobd.Type = '3D859D8C-D0BB-4142-8696-C0D215203E0D' then 'Synchronization (file/volume) / Express Full (application)'
            when jobd.Type = '84021B5E-B4DC-9B27-2B7E-3B99BB1225FF' then 'Volume/Share/System State Recovery Point'
            when jobd.Type = '913afd2d-ed74-47bd-b7ea-d42055e5c2f1' then 'Backup to tape (D-T)'
            when jobd.Type = 'B5A3D25C-8EB2-4032-9428-C852DA5CE2C5' then 'Backup to tape (D-D-T)'
            when jobd.Type = 'C4CAE2F7-F068-4A37-914E-9F02991868DA' then 'Consistency Check'
            when jobd.Type = '5ECC82D0-3475-4E81-8ADD-55B1C1D23DB1' then 'SharePoint catalog generation'
            when jobd.Type = '6E7C76F4-A832-4418-A772-8E58FD7466CB' then 'Azure Online backup'
      end as Operation
from tbl_SCH_ScheduleDefinition sche
left join dbo.tbl_JM_JobDefinition jobd
      join tbl_IM_ProtectedGroup prot
      on jobd.ProtectedGroupId = prot.ProtectedGroupId
on sche.JobDefinitionId = jobd.JobDefinitionId
left join dbo.tbl_AM_Server am
on am.ServerId = jobd.ServerId
where sche.IsDeleted = '0' and jobd.ProtectedGroupId is not null
order by prot.FriendlyName

The output of the SQL query will look similar to the one below. Based on this output, you can pick a Schedule ID for a data source that is small and quick to test with.

[Screenshot: sample output of the SQL query]

Check whether SQL jobs are disabled

The scheduled jobs may have been disabled in SQL. To check and enable them, follow the steps below.

1. In SQL Server Management Studio, connect to the SQL instance for DPM and run the SQL query mentioned in the previous section to find the list of scheduled jobs.

2. Expand SQL Server Agent and then Jobs. Compare the jobs listed there with the output from the SQL query run in step 1. If a job from the query shows up as Disabled (with an arrow pointing down), right-click the job, click Enable, and then run the job manually from SQL by following the steps in the previous section.
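With many jobs, comparing the two lists by eye is error-prone. The following Python sketch shows the comparison itself. It assumes you have already exported two things: the SQL Agent job list as (name, enabled) pairs, for example from the name and enabled columns of msdb.dbo.sysjobs, and the Schedule IDs returned by the query in the previous section. The function name is hypothetical.

```python
from typing import Iterable, List, Tuple


def find_disabled_dpm_jobs(
    agent_jobs: Iterable[Tuple[str, int]],
    schedule_ids: Iterable[str],
) -> List[str]:
    """Return DPM Schedule IDs whose SQL Agent job is disabled.

    agent_jobs   -- (job name, enabled flag) pairs; 0 means disabled.
    schedule_ids -- Schedule IDs taken from the DPMDB query output.
    """
    # GUID text may differ in case between the two sources.
    wanted = {schedule_id.strip().lower() for schedule_id in schedule_ids}
    return sorted(
        name.strip().lower()
        for name, enabled in agent_jobs
        if name.strip().lower() in wanted and not enabled
    )
```

Jobs that are not in the DPM query output, such as other SQL Agent maintenance jobs, are ignored.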

[Screenshot: a disabled job, marked with a down arrow, in the Jobs list]

Check the logon account credentials

DPM enters the SQL Agent account name into the registry, and later DPM checks that account each time the DPM engine launches. The internal interfaces to DPM are secured using this account so the account name needs to match the account the SQL Agent is using.

NOTE The account used by the SQL Agent and SQL Server services for the SQL instance that hosts DPMDB should be a local account (typically MICROSOFT$DPM$Acct or NT AUTHORITY\SYSTEM). If these services are configured to run under a domain service account, check whether there is a specific reason for that. The scenarios that require a domain account for the SQL services include the following:

a) Remote SQL Server: DPM is configured to use a remote SQL Server instance to host its DPMDB database.

b) Library sharing is enabled: Check whether library sharing is enabled. If it is not, either change the account to a local account in both places (the SQL services and the registry values described in this section) or change the registry values to match the domain account used by the SQL services, depending on the situation.

Follow the steps mentioned below to verify account information and make changes as needed:

1) Check the logon account configured for the following services of the SQL instance used by DPM:

– SQL Server (InstanceName)
– SQL Server Agent (InstanceName)

2) Check the following values under the registry key below and verify that they match the user account used by the SQL Agent service. If they differ, update them to reflect that account.

HKLM\Software\Microsoft\Microsoft Data Protection Manager\Setup

SqlAgentAccountName

SchedulerJobOwnerName

The following KB article contains steps to verify the SQL accounts in the registry:

2930276 – System Center 2012 R2 Data Protection Manager upgrade fails and generates ID: 4323: “A member could not be added” (http://support.microsoft.com/kb/2930276/EN-US)

3) Restart the SQL Agent and the SQL Server services after changing the account information in the registry.

4) On the DPM server, select the protection group, click Modify on the ribbon at the top, then complete and update the protection group without making any changes. This step is necessary in order to re-generate the jobs in SQL with the updated account information.

5) If you are using an account other than the Microsoft$DPM$Acct service account, update DCOM launch and access permissions to match what was granted to Microsoft$DPM$Acct.

To do this, launch DCOMCNFG.exe from a command prompt, then navigate to Component Services -> Computers -> My Computer -> DCOM Config -> Microsoft System Center Data Protection Manager 2010 Service. Right-click the service name and select Properties. Choose the Security tab and select Edit in the Launch and Activation Permissions area. Now add the new account and give it all permissions.
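The comparison in step 2 can be sketched in Python as follows. This is an illustration only: it assumes you have already read the two registry values and the SQL Agent service logon account by some other means, and the function name and the sample account names in the usage note are hypothetical. Windows account names are not case-sensitive, so the comparison ignores case. It does not resolve aliases for the same account (for example, different spellings of the local system account).

```python
from typing import Dict, List

# Value names under the DPM Setup registry key, as listed in step 2.
REGISTRY_VALUES = ("SqlAgentAccountName", "SchedulerJobOwnerName")


def mismatched_values(
    agent_service_account: str, registry: Dict[str, str]
) -> List[str]:
    """Return the registry value names that do not match the service account.

    A value that is missing from the dictionary is reported as a mismatch.
    """
    expected = agent_service_account.strip().casefold()
    return [
        name
        for name in REGISTRY_VALUES
        if registry.get(name, "").strip().casefold() != expected
    ]
```

For example, with a service account of DPM01\Microsoft$DPM$Acct and a registry where only SchedulerJobOwnerName holds a different account, the function returns a list containing just SchedulerJobOwnerName, which is the value to correct before restarting the services.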

Check Permissions

Check the permissions on the following folders (where triggerjob.exe is located), as applicable:

1. DPM Server: C:\Program Files\Microsoft System Center 2012 R2\DPM\DPM\bin

2. Remote SQL Server: C:\Program Files\Microsoft Data Protection Manager\DPM2012R2\SQLPrep

The DPM service account, Microsoft$DPM$Acct, or the account used per the previous section (if the SQL server is remote) should have Full Control permissions.

Check the triggerjob.exe path on the remote SQL server

If you are using a remote SQL server instance for both DPM 2012 SP1 and DPM 2012 R2, then the DPM 2012 R2 SQL Prep overwrites the triggerjob.exe path on the remote SQL server for DPM 2012 SP1 and changes the path as shown below.

Before: %DPMInstall%\Program files\Microsoft Data Protection Manager\DPM2012\SQLPrep

After: %DPMInstall%\Program files\Microsoft Data Protection Manager\DPM2012R2\SQLPrep

This causes the SQL Agent to fail to find triggerjob.exe when DPM 2012 SP1 scheduled jobs are run. If this symptom matches your scenario, simply re-run DPM 2012 SP1 SQLPrep to resolve the issue.

Please review a blog post here for additional details about this specific symptom.

Check network and firewall settings

If you are using multiple NICs and different networks for SQL/DPM and the Agent, or if a hosts file on the SQL server is used to point to the DPM server, perform the following tests to rule out an incorrect IP address or firewall settings:

1) Ensure that triggerjob.exe is in the path specified.

2) Run the triggerjob.exe command manually using the hostname and then the IP address of the DPM server, each in turn, and check whether the command completes and invokes the DPM engine successfully.

3) Make sure DNS resolution is working properly and that a firewall is not blocking communications.

a. On the SQL Server, add a hosts file entry for the DPM server name and IP address.

b. Add the following firewall rules on the DPM server.

· netsh advfirewall firewall add rule name="SMB for installation (TCP-139,445-In)" dir=in action=allow profile=any localport=139,445 protocol=tcp remoteip=agentIPAddresses

· netsh advfirewall firewall add rule name="SMB for installation (UDP-137,138-In)" dir=in action=allow profile=any localport=137,138 protocol=udp remoteip=agentIPAddresses

· netsh advfirewall firewall add rule name="RPC for DPM (TCP-135,5718,5719,49152-65535-In)" dir=in action=allow profile=any localport=135,5718,5719,49152-65535 protocol=tcp remoteip=agentIPAddresses,SQLIPAddress
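To get a rough idea of whether the fixed TCP ports in the rules above are reachable from the SQL server, you could run something like the following Python sketch. It is an illustration rather than a DPM tool, and the function name is hypothetical. It only attempts a TCP connection, so it says nothing about the UDP ports or the dynamic RPC range, and a successful connection only shows that something is listening and not blocked.

```python
import socket
from typing import Dict, Iterable


def check_tcp_ports(
    host: str, ports: Iterable[int], timeout: float = 3.0
) -> Dict[int, bool]:
    """Try a TCP connection to each port; True means the connection opened."""
    results: Dict[int, bool] = {}
    for port in ports:
        try:
            # create_connection resolves the name and connects in one call.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts and name-resolution errors.
            results[port] = False
    return results
```

For example, calling check_tcp_ports with the DPM server name and the ports 135, 5718 and 5719, and then again with its IP address, helps separate name-resolution problems from blocked ports.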

Proactively monitor for scheduled job failures

You can set up an alert outside of DPM to monitor for SQL Agent Scheduler failures. For example, if you have System Center 2012 Operations Manager (OpsMgr 2012) implemented in your environment, you can configure it to monitor and generate alerts for warnings or errors raised by the source “SQLAgent$MSDPM2012”, or you can specifically monitor for Event ID 208.

Conclusion

I hope the tips in this post help should you ever need to troubleshoot this issue. To avoid surprises, be sure to periodically check the status of recovery point jobs and the availability of recovery points in the Recovery task area of the DPM console.

Sekar Raju | Senior Support Engineer | Microsoft C&E Management and Security Division
