Getting 31552 ->Exception 'SqlException': Timeout expired very frequently in SCOM server

In the SCOM server we might start Getting 31552 ->Exception 'SqlException': Timeout expired very frequently and We will also see that the SQL server hosting the database role of the SCOM server going high on utilization of resources  and remaining for a long time.

Event Type: Error

Event Source: Health Service Modules

Event Category: Data Warehouse

Event ID: 31552

Date: 7/20/2009

Time: 1:33:04 AM

User: N/A

Computer: <RMS SERVER NAME>

Description:

Failed to store data in the Data Warehouse.

Exception 'SqlException': Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.

One or more workflows were affected by this.

Workflow name: Microsoft.SystemCenter.DataWarehouse.StandardDataSetMaintenance

Instance name: Client Monitoring data set

Instance ID: {ED9228A4-9CAD-5782-CA20-7067D119EAFA}

Management group: <Management GROUPNAME>

For more information, see Help and Support Center at <https://go.microsoft.com/fwlink/events.asp>.

 

This can happen if the alert are not processed and is on the staging table. In order to overcome this issue we can do the following.

IMP Note: As there are database changes make sure that you have a valid database backup before you go ahead and perform these actions.

Run the following quries

select count(*) from Alert.AlertStage

select count (*) from Event.eventstage

select count (*) from Perf.PerformanceStage

select count (*) from state.statestage

The count on these tables will have count increasing and decresing. IF you find that any particular table is having huge count in lakhs like the one I had for the alert stage table. Follow the following.

In my case I had around 16 lakh + rows in alert stage table which will never get processed. So what I did is backup of the table and cleared the backlogs. For the same used the following

 

SELECT count(*) from ALert.AlertStage

SELECT * INTO ALert.AlertStage_backup FROM ALert.AlertStage

SELECT count(*) from ALert.AlertStage_backup

TRUNCATE TABLE ALert.AlertStage

SELECT count(*) from ALert.AlertStage

 

 

Then to make sure that the alert data is processing we ran the query

exec StandardDatasetMaintenance @DatasetId=' 6AF799BD-2CCC-41CF-97FD-058E2CBF9248'. This GUID was available for the alert data when we ran the query select * from StandardDataset

Query Result.

============

6AF799BD-2CCC-41CF-97FD-058E2CBF9248 Alert 0 0 1 AlertProcessStaging 01:00 240 2009-07-20 11:17:19.690

6F9FBA1B-EAE3-4071-AFA9-2101A81AA463 Event 0 0 1 EventProcessStaging 01:00 240 2009-07-20 11:17:16.033

06268CB1-F9C4-4195-AA67-7BB2EFD5224D State 0 48 1 StateProcessStaging 01:00 240 2009-07-20 11:16:29.797

7EAE1679-17F5-4BA6-B685-9867A88C00E1 CM 0 28 1 AemProcessStaging 00:00 240 2009-07-20 11:17:25.017

C7E7D06C-6D00-46F2-819D-CB7D206EC361 Perf 0 48 1 PerformanceProcessStaging 01:00 240 2009-07-20 11:17:25.203

If we have any error when we run exec StandardDatasetMaintenance @DatasetId=' 6AF799BD-2CCC-41CF-97FD-058E2CBF9248'. Then need to troubleshoot on that error which the results in the query windows displays. Else we can try running the count query (SELECT count(*) from ALert.AlertStage) and make sure that the data is coming in and going out as well.

 

While running the exec StandardDatasetMaintenance @DatasetId=' 6AF799BD-2CCC-41CF-97FD-058E2CBF9248', check if we are getting errors . If we have are getting error
“Sql execution failed. Error 8115, Level 16, State 1, Procedure -, Line 1, Message: Arithmetic overflow error converting IDENTITY to data type int.”
Then follow these steps of backing up the alert staging table

Modify the IDENTITY values for the Alert_GUID table by the running the following query, since it was well beyond 800 million
For correcting this

 

Select max(AlertRowId) from Alert_GUID

 

This will give you a value . Please note this value.

 

DBCC CHECKIDENT(‘Alert.Alert_GUID’,RESEED,<maximum value of the result in the previous query + 1>)

Again try to run the stored procedure and ensure its success.

 

Once the same is done you can see that the CPU utilization will also come down as well as the frequency of 31552 errors.Now if you want to put back this data you can do it bit by bit using the following quires.

WHILE (SELECT COUNT(*) FROM ALert.AlertStage_backup) > 0

BEGIN

INSERT TOP(1000) INTO ALert.AlertStage

([AlertGuid],[AlertProblemGuid],[ManagedEntityRowId]

,[AlertName],[AlertDescription],[Severity],[Priority]

,[Category],[WorkflowRowId],[MonitorAlertInd],[DateTime]

,[RaisedDateTime],[SiteName],[RepeatCount] ,[AlertStringGuid]

,[ParameterHash],[DBCreatedDateTime],[DWCreatedDateTime],[DWLastModifiedDateTime])

SELECT

[AlertGuid],[AlertProblemGuid],[ManagedEntityRowId]

,[AlertName],[AlertDescription],[Severity],[Priority]

,[Category],[WorkflowRowId],[MonitorAlertInd],[DateTime]

,[RaisedDateTime],[SiteName],[RepeatCount] ,[AlertStringGuid]

,[ParameterHash],[DBCreatedDateTime],[DWCreatedDateTime],[DWLastModifiedDateTime]

FROM ALert.AlertStage_backup

DELETE TOP(1000) FROM ALert.AlertStage_backup

This will take a huge time depending on the number of rows.Once processed please drop table FROM ALert.AlertStage_backup.

 

What this will do it will put 1000 alerts at a time to the backup table to the staging table to process. And the completion of this will take time depending on the number of backlogs.

If alert data is not so important for you can drop this table ALert.AlertStage_backup.

The same can be checked for other (event, performance) staging tables as well.