Troubleshooting Service Manager work item (Incident, Change Request, Service Request) status stuck on “New”

Any time we see newly created work items stuck with a “New” status, it generally means that the Service Manager workflows are not processing or are processing slowly. Monitoring the “Minutes behind” of each workflow can be a useful method of troubleshooting.

The web page Troubleshooting Workflow Performance and Delays has good troubleshooting tips and a SQL query that can be used to display the “Minutes behind” of each workflow. A similar query, written for SQL Server Management Studio, is shown below:

-- Use ServiceManager
-- Select Name, is_broker_enabled from sys.databases Where name = 'ServiceManager'
-- The SELECT above confirms that is_broker_enabled is set to 1; if it is not, some Service Manager functions will not run.
-- It is commented out because it is not directly related to the purpose of this blog posting.
-- SubscriptionStatus.sql    -- Workflow / subscription status 
USE ServiceManager
DECLARE @MaxState INT, @MaxStateDate DATETIME, @Delta INT, @Language NVARCHAR(3)
SET @Delta = 0
SET @Language = 'ENU'
SET @MaxState = (
    SELECT MAX(EntityTransactionLogId)
    FROM EntityChangeLog WITH(NOLOCK)
)
SET @MaxStateDate = (
    SELECT TimeAdded
    FROM EntityTransactionLog
    WHERE EntityTransactionLogId = @MaxState
)
SELECT
    LT.LTValue AS 'Display Name',
    S.State AS 'Current Workflow Watermark',
    @MaxState AS 'Current Transaction Log Watermark',
    DATEDIFF(mi, (SELECT TimeAdded
                  FROM EntityTransactionLog WITH(NOLOCK)
                  WHERE EntityTransactionLogId = S.State), @MaxStateDate) AS 'Minutes Behind',
    S.EventCount,
    S.LastNonZeroEventCount,
    R.RuleName AS 'MP Rule Name',
    MT.TypeName AS 'Source Class Name',
    S.LastModified AS 'Rule Last Modified',
    S.IsPeriodicQueryEvent AS 'Is Periodic Query Subscription', -- Note: 1 means it is a periodic query subscription
    R.RuleEnabled AS 'Rule Enabled', -- Note: 4 means the rule is enabled
    R.RuleID
FROM CmdbInstanceSubscriptionState AS S WITH(NOLOCK)
LEFT OUTER JOIN Rules AS R
    ON S.RuleId = R.RuleId
LEFT OUTER JOIN ManagedType AS MT
    ON S.TypeId = MT.ManagedTypeId
LEFT OUTER JOIN LocalizedText AS LT
    ON R.RuleId = LT.MPElementId
WHERE
    S.State <= @MaxState - @Delta
    AND R.RuleEnabled <> 0
    AND LT.LTStringType = 1
    AND LT.LanguageCode = @Language
    AND S.IsPeriodicQueryEvent = 0
    /* To look at a specific workflow, uncomment one of the following */
    -- AND LT.LTValue LIKE '%Test%'
    -- AND S.RuleId = '1D74409B-B2D9-8C45-6702-AB8C94AA0694'  -- aka Display Name = "New Change Request Workflow"
ORDER BY S.State ASC

Troubleshooting Workflow Performance and Delays

We run the above SQL query several times, waiting a few minutes between executions, to see how the “Minutes Behind” for each workflow changes. We scroll to the bottom of the list to see how many workflows are in the normal range (2 minutes or less is normal). Between executions we ask the following questions:

  • Is the “Minutes Behind” static?
  • Is the “Minutes Behind” static for only a few workflows? It may be that the workflow is disabled, that a management pack override is disabling the workflow even though it shows up as enabled, or that it is a custom workflow that is not working properly.
  • Is the “Minutes Behind” continuously increasing for all workflows or only for some of the workflows?
  • Are all the workflows impacted (greater than 2 minutes behind)?
  • Is the “Minutes Behind” continuously increasing, or does it go down on occasion?

The solution in this blog is intended to be used when 98% or more of the workflow “Minutes Behind” values are static or continuously increasing over time. If the “Minutes Behind” goes up and down as you execute the SQL query over and over, then the troubleshooting steps in the web link above, Troubleshooting Workflow Performance and Delays, are more appropriate. Below is the list of common issues and solutions that we see from time to time on the Microsoft support lines when 98% or more of the workflow “Minutes Behind” values are static or continuously increasing over time:
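The comparison between executions can be sketched in PowerShell. The two functions below are hypothetical helpers, not part of Service Manager. They assume you have recorded the “Minutes Behind” value of each workflow on every run (oldest run first) in a hashtable keyed by display name, and they use the 2-minute normal range described above.

```powershell
# Hypothetical helper: classify how "Minutes Behind" changed across runs for one workflow.
function Get-MinutesBehindTrend {
    param(
        # "Minutes Behind" values for one workflow, oldest run first
        [Parameter(Mandatory)][int[]]$Samples,
        # Values at or below this limit are treated as normal
        [int]$NormalLimit = 2
    )
    if ($Samples.Count -lt 2) { return 'Unknown' }

    $max = ($Samples | Measure-Object -Maximum).Maximum
    if ($max -le $NormalLimit) { return 'Normal' }

    $decreased = $false
    $increased = $false
    for ($i = 1; $i -lt $Samples.Count; $i++) {
        if ($Samples[$i] -lt $Samples[$i - 1]) { $decreased = $true }
        if ($Samples[$i] -gt $Samples[$i - 1]) { $increased = $true }
    }
    if ($decreased) { return 'Fluctuating' }   # went down at least once
    if ($increased) { return 'Increasing' }    # only ever went up
    return 'Static'                            # never changed
}

# Hypothetical helper: percentage of workflows that are static or increasing.
function Get-StuckWorkflowPercent {
    param([Parameter(Mandatory)][hashtable]$SamplesByWorkflow)
    if ($SamplesByWorkflow.Count -eq 0) { return 0 }
    $stuck = 0
    foreach ($name in $SamplesByWorkflow.Keys) {
        $trend = Get-MinutesBehindTrend -Samples $SamplesByWorkflow[$name]
        if ($trend -in 'Static', 'Increasing') { $stuck++ }
    }
    return [math]::Round(100 * $stuck / $SamplesByWorkflow.Count, 1)
}
```

A result of 98 or higher from Get-StuckWorkflowPercent would point toward the issues listed below, while many 'Fluctuating' workflows would point toward the performance troubleshooting page instead.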

LIST OF ISSUES / SOLUTIONS:

– Most of the time the issue is resolved in a single minute by stopping the System Center services on the primary management server, deleting the “Health Service State” folder, and then restarting the services.

There are probably several causes; however, the most common is that the SQL server was restarted and the Service Manager services timed out trying to reach it. The following PowerShell steps can be used to reduce the time it takes to stop the services, delete the “Health Service State” subfolder, and restart the services. The best way to prevent this problem is to put a process in place to stop the Service Manager services before applying updates to the SQL server, and any other time that the Service Manager SQL server is restarted. After the SQL service has been up and running for 5 minutes, restart the Service Manager services.

## Ideal stopping order:

Stop-Service HealthService ; Stop-Service OMCFG; Stop-Service OMSDK

Get-Service HealthService,omcfg,omsdk;

## From the Service Manager folder, delete or rename the "Health Service State" subfolder

## You can use the following to open the Service Manager folder

$SMFolder = (Get-ItemProperty "HKLM:\SOFTWARE\Microsoft\System Center\2010\Common\Setup").InstallDirectory

Start-Process $SMFolder

## Ideal starting order (reversed from stopping)

Start-Service OMSDK ; Start-Service OMCFG; Start-Service HealthService

Get-Service HealthService,omcfg,omsdk;(Get-date).ToString()
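If you would rather not rename the folder by hand, the same steps can be combined into one function. This is only a sketch: Reset-HealthServiceState is a hypothetical name, the ".old-" timestamp suffix is an arbitrary choice, and it assumes an elevated prompt on the primary management server with the install directory stored in the registry value used above.

```powershell
# Sketch only: run from an elevated prompt on the primary management server.
function Reset-HealthServiceState {
    param(
        # Stopping order taken from the steps above
        [string[]]$Services = @('HealthService', 'OMCFG', 'OMSDK')
    )

    # Same registry value as above: the Service Manager install directory
    $smFolder = (Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\System Center\2010\Common\Setup').InstallDirectory
    $stateFolder = Join-Path $smFolder 'Health Service State'

    foreach ($name in $Services) { Stop-Service -Name $name }

    # Rename rather than delete so the old folder can still be inspected.
    # The ".old-<timestamp>" suffix is an arbitrary choice for this example.
    if (Test-Path $stateFolder) {
        $suffix = Get-Date -Format 'yyyyMMdd-HHmmss'
        Rename-Item -Path $stateFolder -NewName "Health Service State.old-$suffix"
    }

    # Start in the reverse of the stopping order
    for ($i = $Services.Count - 1; $i -ge 0; $i--) {
        Start-Service -Name $Services[$i]
    }

    Get-Service -Name $Services
}
```

Calling Reset-HealthServiceState with no arguments uses the service order shown above. Renaming instead of deleting is a design choice for this example, so the old folder can be removed later once the workflows have caught up.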

– The “Microsoft Monitoring Agent” in Control Panel should not have any management server listed on the Service Manager primary management server, or other Service Manager management servers. If you have a server listed in the “Microsoft Monitoring Agent Properties” it should be removed and the option “Automatically update management group assignments from AD DS” should be unchecked.

If you want to monitor the management server please review the following document:

Microsoft System Center Management Pack for System Center Service Manager

Under “Mandatory Configuration” page 6

“…You should also ensure that the Service Manager management servers are configured for agentless monitoring…”

I have seen customers run the SCOM agent locally on the management server. Sometimes it works for a long time and then comes the hair pulling. Do not be tempted. Running the SCOM agent locally will on rare occasions cause unexpected behavior.

The following items are unlikely to help if only some workflows are behind, or if the “Minutes Behind” for the workflows is going up and down.

It is normal to have workflows at 0, 1, or 2 minutes. If the “Minutes Behind” goes down on occasion, there is likely a SQL load issue, as mentioned in Troubleshooting Workflow Performance and Delays. If the “Minutes Behind” is not changing, or is increasing over time, please review the possible solutions below:

– The workflows only run from the Service Manager primary management server. Execute the following SQL query against the ServiceManager database and confirm the primary management server name. Is the server up and running?

-- Display the primary management server
Use ServiceManager
select DisplayName, [PrincipalName] from [MTV_Computer]
where [BaseManagedEntityId] =
    (SELECT ScopedInstanceId
     FROM dbo.[ScopedInstanceTargetClass]
     WHERE [ManagedTypeId] = dbo.fn_ManagedTypeId_MicrosoftSystemCenterWorkflowTarget()
    )

– Are the services running on the primary management server? From an elevated PowerShell prompt:

PS C:\> Get-Service HealthService,omcfg,omsdk
Status   Name               DisplayName
-----   ----               -----------
Running HealthService     Microsoft Monitoring Agent
Running omcfg             System Center Management Configuration
Running omsdk             System Center Data Access Service

– The “HKLM\SOFTWARE\Microsoft\System Center\2010\Common\MOMBins\Value1” registry value is required to connect to the SQL database. The encryption key in Value1 must also match the SQL Server database that it was generated from and the management server's FQDN. This means that the computer name of the Service Manager management server and the domain that it belongs to cannot be changed.

– Is the primary management server listed in the MT_HealthService table in SQL?

-- Display Service Manager Management servers
Use ServiceManager
Select * from MT_HealthService

Is the primary management server listed in MT_HealthService? If not, the Windows computer object for the primary Service Manager management server was deleted. This is rare; however, customers sometimes accidentally delete the Windows Computer object for the management server using PowerShell or via the Service Manager Console under “All Windows Computers”. If it was deleted via the GUI, it should still exist until the items are cleared from “Deleted Items” in the Service Manager Console. If it is missing, promote a secondary Service Manager management server to primary management server. If no management servers are present in the MT_HealthService SQL table, then the Service Manager database must be restored and existing tickets have to be recreated. Attempting to restore just the MT_HealthService table will not work. The Microsoft development team has confirmed that when the Windows computer object of a Service Manager management server is deleted, many other interrelated changes occur in the Service Manager database, which is why the ServiceManager database must be restored.

– If the password of the Service Manager workflow account has been changed, even if it has been changed back, reset the password in the Service Manager Console to see if it corrects the workflow problem.

Reset / retype the password of the Service Manager Workflow account stored using the following steps:

Service Manager Console > Administration > Administration > Security > Run as Accounts

Then double-click the account and type in the password.

– Service Account Authentication problem or SCSM workflow account authentication problem:

Log Name:      Operations Manager
Source:        HealthService
Event ID:      7000
Level:         Error
Description:
The Health Service could not log on the RunAs account CONTOSO\SvcMgrWork for management group ServiceMgmtGroup.  The error is The user name or password is incorrect.(1326L).  This will prevent the health service from monitoring or performing actions using this RunAs account.

Log Name:      Operations Manager
Source:        HealthService
Event ID:      7000
Level:         Error
Description:
The Health Service cannot verify the future validity of the RunAs account CONTOSO\SvcMgrWork for management group ServiceMgmtGroup.  The error is The user name or password is incorrect.(1326L).

The causes can vary: the account has been deleted from Active Directory, the password has expired, the account is disabled, or the time difference between systems is greater than 5 minutes, causing Kerberos authentication failures. From the Service Manager primary management server, you can run the following from an elevated PowerShell prompt against the System event log; it might confirm a Kerberos problem. You may need to re-enter the SCSM workflow account under Service Manager Console > Administration > Administration > Security > User Roles.

Get-WinEvent -Logname system | ?{$_.Message -like "*KRB_AP_ERR_MODIFIED*"}
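The HealthService event 7000 entries shown earlier can be pulled in a similar way. The function below is a sketch: Get-RunAsLogonFailure is a hypothetical name, the default of 20 events is arbitrary, and it assumes the log name, source, and event ID appear on your server exactly as in the sample events above.

```powershell
# Sketch: list recent HealthService RunAs logon failures (event 7000) on the local server.
function Get-RunAsLogonFailure {
    param(
        # Arbitrary default for this example
        [int]$MaxEvents = 20
    )
    # Log name, source, and event ID are taken from the sample events above.
    # SilentlyContinue suppresses the error Get-WinEvent raises when nothing matches.
    Get-WinEvent -FilterHashtable @{
        LogName      = 'Operations Manager'
        ProviderName = 'HealthService'
        Id           = 7000
    } -MaxEvents $MaxEvents -ErrorAction SilentlyContinue |
        Select-Object TimeCreated, Id, Message
}
```

Running Get-RunAsLogonFailure | Format-List from an elevated prompt shows the full message text, including the RunAs account name and the error code.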

The following can be used on different systems to determine whether the UTC time difference is near the 5-minute limit, replacing DomainControllerServerName with the name of your DC:

w32tm.exe /stripchart /computer:DomainControllerServerName

– Check if the PID of the HealthService service is changing often. This would indicate that the service is crashing and then restarting.
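One way to watch the PID is to sample it at intervals and count the changes. This is a minimal sketch: both function names are hypothetical, the sample count and interval defaults are arbitrary, and it reads the PID from the Win32_Service CIM class.

```powershell
# Hypothetical helper: count how many times the PID changed between consecutive samples.
function Get-PidChangeCount {
    param([Parameter(Mandatory)][int[]]$ProcessIds)
    $changes = 0
    for ($i = 1; $i -lt $ProcessIds.Count; $i++) {
        if ($ProcessIds[$i] -ne $ProcessIds[$i - 1]) { $changes++ }
    }
    return $changes
}

# Sample the HealthService PID several times and report how often it changed.
function Watch-HealthServicePid {
    param(
        [int]$Samples = 10,          # arbitrary default
        [int]$IntervalSeconds = 60   # arbitrary default
    )
    $seen = @(for ($i = 0; $i -lt $Samples; $i++) {
        # ProcessId is 0 while the service is stopped
        $svc = Get-CimInstance -ClassName Win32_Service -Filter "Name='HealthService'"
        [int]$svc.ProcessId
        if ($i -lt $Samples - 1) { Start-Sleep -Seconds $IntervalSeconds }
    })
    [pscustomobject]@{
        ProcessIds = $seen
        Changes    = Get-PidChangeCount -ProcessIds $seen
    }
}
```

A Changes value above 0 over a short window, with no planned service restart, would suggest that the service is crashing and restarting.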

Lastly, once your workflows start running properly, with “Minutes Behind” at 0, 1, or 2 minutes, new work items should work as expected. In some cases, previous work items may need to have their status reset with PowerShell.

Search keywords:

Workitem status not updating

Workitem stuck on new

Workitem status not changing

Incident status not changing

Service Request status not changing

Change Request status not changing

  • Austin Mack, Sr. Support Escalation Engineer, Microsoft