Change Tracking (aka Syscommittab) Issues and Cleanup - Part 1


Replication in Configuration Manager is talked about pretty frequently. It's a pretty awesome feature, but also a pain at times. One of the reasons I find it to be so cool is that it takes advantage of a SQL feature called Change Tracking. This feature is the fundamental basis for replication - and is a SQL feature rather than a CM feature. There's so much to talk about when it comes to change tracking that I doubt this will be the only blog I write about it. So, stay tuned for additional posts on the subject.

The purpose of this post isn't to explain completely how change tracking works, but a bit of an overview is probably necessary:

Change Tracking Overview

When change tracking is enabled every change is recorded and kept for a defined period of time. These changes are stored in two different tables. There is a change table for the specific table (on which change tracking is enabled) which will contain the details about what has been changed. These tables are referred to as "side tables" and have a naming convention of "change_tracking_<the table's object_id>". The second place where the changes are stored is in a table which contains all the changes for all tables. This table doesn't contain the details but rather a pointer to the side tables of the changes. This table is called "syscommittab". Note: a DAC connection is required to look into these internal tables unless all you're looking for is a row count.

The side tables and syscommittab need to be cleaned up routinely based on the retention period defined for change tracking. SQL takes care of this cleanup as part of background tasks/maintenance. Specifically, (as far as I know), there is an automatic background thread which triggers every 30 minutes to cleanup the side tables. When this is kicked off it will delete up to 5,000 records at a time. The syscommittab records aren't handled by the same background task but rather when a CHECKPOINT is issued. When this is triggered up to 10,000 records will be deleted at a time. This means that SQL takes care of all the storing of changes as well as maintaining the cleanup and neither you nor CM needs to do any maintenance. However...that isn't entirely true. That's how it's supposed to be, but alas, it isn't.

The Issue

For some reason (or reasons?), there are times when the cleanup just doesn't happen. Or, if it happens not enough is performed. You may see this as not really a big deal because who cares if a table is a little larger or storing data longer than we really want to right? Well, the problem is, if these tables get large and aren't cleaning up correctly (thus they keep growing) replication can be greatly impacted, and that means the entire system will be impacted negatively. Replication (DRS/RCM) uses these tables to determine the changes which need to be replicated or sent to the other sites. So when the change tables are really big this process takes longer. And, because these same tables must be written to every time there is a change and the system is still writing changes - from replication information from other sites as well as processing the data from clients/MPs/etc - there is more contention. Writes take longer. Reads take longer. And, Deletes take longer...if they happen at all. So, the whole system is impacted and many times this is the start of a spiral - since the tables just get worse and worse so do the reads and writes.

If you have a hierarchy and haven't hit this issue then count yourself lucky, but you should at least know about the potential issue if not plan for it. If you're in the 100k+ club then you more than likely have this issue even if you don't know it. And, if you have hit this issue then I'm guessing you're reading this in hopes that I have an answer. Well, I believe I do. Actually, I think I have multiple answers, but I'm just going to talk about one solution in this post.

In my experience I've not had much of an issue with the side tables not getting cleaned up, but rather with syscommittab. So, this solution is specifically addressed at syscommittab.

The SQL team added a stored procedure which can be used to manually cleanup syscommittab in the same way a CHECKPOINT does. This sproc is named "sp_flush_commit_table_on_demand" and was a promising addition to being able to manage syscommittab. The problem is...it only works sometimes. Yeah, that's right. After looking into this over and over it looks as though whatever issue there is with the CHECKPOINT logic also exists in this stored procedure. And at this point, it looks like the issue is with the "safe_cleanup_version" that is returned in the sproc.

When this sproc is run it gets a couple of values (cleanup versions) and then starts to cleanup syscommittab based on those values. The output/results of running this sproc should look something like this (Note: you can provide a number of records to cleanup or leave that off to clean all records, in this example I used "50" to just cleanup 50 records):

As you can see, it did in fact cleanup 50 records as I asked it to. The reason for that is because it was able to return a value for the "safe_cleanup_version" (and there were records that needed cleaning up). However, the majority of the time (in my experience anyway) this sproc doesn't return a value for the safe_cleanup_version and instead returns "0" and therefore doesn't do anything regardless of how many records are needing cleanup. This is generally what I see returned (especially when there is an issue with syscommittab):

I've got a case open for this so hopefully there will be a fix to this in the future which would make manual intervention unnecessary. But, until then I'll be using manual methods to make sure it's cleaned up. I mentioned above that I have more than one way of taking care of the cleanup - a while back CSS wrote a stored procedure which I then modified and began using. I recently discovered an issue with the procedure's logic for cleaning up - I had misunderstood (and apparently CSS did too) how the cleanup versions (or watermarks) were supposed to be used in the cleanup. So, the sproc doesn't work as often as it should or as well as it should. I've started to fix that script and perhaps will share that in a future post when it is completed.

The Solution

So what's this post's solution? Well, it's to use the internal stored procedure to perform the cleanup by running the sproc as many times as it takes until it finds the safe_cleanup_version and performs the cleanup. Pretty simple really. However, in my experience, it can sometimes be very difficult to get the safe_cleanup_version to return anything but "0". Therefore, I wrote a PowerShell script to execute the sproc until it does return a value and cleans up.

I am using PowerShell because that was the easiest way I could think of to programmatically check if a value was returned and to capture the number of "row(s) affected". This is because the lines returned by the sproc are the results of PRINT commands and the "row(s) affected" line(s) only exist if NOCOUNT is off. For example, here's the result of the sproc when NOCOUNT is explicitly turned off (compare to the previous screenshot):

Notice how there are no lines for "row(s) affected"? That's because that isn't something displayed due to the sproc but rather the system - the record count from the statement completion. I don't know of an easy way in SQL to capture PRINT statements or the rows affected from a stored procedure (you don't write/modify yourself that is). But, by using PowerShell and adding some event handlers I can grab both of these things. And, since I have to capture and parse the output to determine if I need to try running the sproc again or not this was the simplest solution I thought of. And, the greatest benefit of this is that it still uses the internal SQL sproc to perform the work - so it's the recommended way to perform syscommittab cleanup! Lastly, this doesn't require you to be connected to the DAC so that's another added benefit.

One thing to know about this internal sproc is that it only deletes a maximum of 10,000 records at a time - so if there are a million records to cleanup it will delete them in batches of 10 thousand (as long as you haven't told it to delete less).

The script (for automation)

The script is parameterized so you should be able to use it as is just by passing in your values/desires. Here's a quick review of the parameters and their explanations:

Parameter Mandatory? Default Value Notes/Explanation
ServerName Y None Provide the SQL Server name (and Instance Name if not a default instance)
DatabaseName Y None Provide the Database Name
LogDirectory N The directory where the PS script is located The directory in which to write the log file
LogFileName N SqlSysCommitTabCleanup.log The name of the log file. This must end with '.log'.
LogTableName N DBA.dbo.CommandLog The 3 part name of the CommandLog table (i.e.: database.schema.table)
MaxIterations N 1800 The number of times to try to cleanup before stopping the script (waits a second between trials); 1800 = 30 min (if all trials don't find a safe cleanup)
ConnectionTimeout N 120 (2 min) How long to wait for a connection timeout (in seconds). Note: The max is 20 minutes.
CommandTimeout N 172800 (48 hours) How long to let the SQL script execute before stopping (in seconds).
TotalRecordsToDelete N Null (aka, delete all) The total number of records to delete. Note: if this is not specified then all records will be deleted
LogToFile N TRUE If this is set to true then a log file will be created and logged to
LogToTable N TRUE If this is set to true then summary info will be logged to a table
VerboseLogging N FALSE If this is set to true then Verbose information will be output

The first thing I want to call out is that if you are going to log to a table you need to make sure to update the parameter "LogTableName" to the full name of the table in which you want to log. The problem with this is that even if you properly enter a logging table the script will only work if that table contains the following fields (and doesn't require any other values):

Field datatype Allow Nulls? Notes
ID int No Must be unique; best if this is the primary key and/or an identity column
DatabaseName sysname Yes
ExtendedInfo xml Yes
Command nvarchar(max) No
CommandType nvarchar(60) No
StartTime datetime No
EndTime datetime Yes
ErrorNumber int Yes
ErrorMessage nvarchar(max) Yes

This is the case because I'm assuming you're using at least one of Ola Hallengren's solution for maintenance activities and that you have his "CommandLog" table installed. If you don't you can download just the table definition at https://ola.hallengren.com/scripts/CommandLog.sql or create your own table with the above fields (and definitions).

The other thing I'll mention about the script is that if it starts to cleanup and is deadlocked the records that were cleaned up are still counted as the total records deleted AND the script handles the deadlock and keeps trying. I chose to only allow it to hit 5 deadlocks before stopping the script. So, if it runs and hits a few deadlocks that is okay, it'll keep cleaning up and will even tell you how many times it was deadlocked as well as provide the total count of deleted records!

An example of how to call this PowerShell script is:

.\CleanupSyscommittabWithInternalSproc.ps1 -ServerName "SomeSqlSrvr" -DatabaseName "CM_123" -VerboseLogging $true -LogTableName "SomeDB.dbo.CommandLog"

The thing I love about this solution is that you can create a SQL job or a scheduled task to call this script and just check your logged records for information rather than run it manually from time to time.

I need to give a shout out to Sherry Kissinger and Brian Mason for their willingness to help test this out. And, as you'll notice in the script history, Sherry is responsible for adding the initial logging to file logic. Thanks Sherry and Brian!! (you can read their blogs at https://mnscug.org/blogs/sherry-kissinger and https://mnscug.org/blogs/brian-mason).

The script can be found here: https://1drv.ms/u/s!AmzqhVted60bxFdjcJUTVPJQY-bi

SQL for the table log

If you have the CommandLog table and are logging to the table you can use the following query to look at the log (and check the values stored in the xml). Just comment out or uncomment out the fields you're interested in.

SELECT  TOP 5 ID

       ,StartTime

       ,EndTime

       ,ErrorNumber

       ,ErrorMessage

       --,ExtendedInfo.value(N'(/CleanupInfo/ScriptStart)[1]',N'datetime') AS [ScriptStart]

       ,ExtendedInfo.value(N'(/CleanupInfo/SysCommitTabRecordsAtStart)[1]',N'bigint') AS [SysCommitTabRecordsAtStart]

       ,ExtendedInfo.value(N'(/CleanupInfo/SysCommitTabRecordsAtEnd)[1]',N'bigint') AS [SysCommitTabRecordsAtEnd]

       ,ExtendedInfo.value(N'(/CleanupInfo/TotalSysCommitTabRecordsDeleted)[1]',N'bigint') AS [TotalSysCommitTabRecordsDeleted]

       --,ExtendedInfo.value(N'(/CleanupInfo/RowsToDeletePerIteration)[1]',N'int') AS [RowsToDeletePerIteration]

       ,ExtendedInfo.value(N'(/CleanupInfo/SafeCleanupVersion)[1]',N'bigint') AS [SafeCleanupVersion]

       ,ExtendedInfo.value(N'(/CleanupInfo/NumberOfDeadlocks)[1]',N'smallint') AS [NumberOfDeadlocks]

       ,ExtendedInfo.value(N'(/CleanupInfo/NumberOfIterations)[1]',N'bigint') AS [NumberOfIterations]

       --,ExtendedInfo.value(N'(/CleanupInfo/DeadlockMessage)[1]',N'nvarchar(2000)') AS [LastDeadlockMessage]

       ,ExtendedInfo.value(N'(/CleanupInfo/DeadlockTime)[1]',N'datetime') AS [LastDeadlockTime]

       --,ExtendedInfo.value(N'(/CleanupInfo/EndTime)[1]',N'datetime') AS [EndTime]

       --,ExtendedInfo.value(N'(/CleanupInfo/CheckRunnabilityIssues)[1]',N'varchar(2000)') AS [CheckRunnabilityIssues]

       ,ExtendedInfo.query(N'/CleanupInfo/InputParameterValues') AS [InputParameterValues]

       --,ExtendedInfo

  FROM DBA.dbo.CommandLog

 WHERE CommandType = N'SYSCOMMITTAB_POSH_MAINTAINER'

 ORDER BY ID DESC;

Here's a sample of a recent log entry from the query above (this doesn't show all columns as this is just a sample of the important info). As you can see, it had to run the sproc 66 times before it was able to complete the cleanup. And...it encountered 1 deadlock so I guess it technically was able to cleanup twice out of the 66 tries.

Here's a sample of the final output in the "host":

And, here's a sample of the final output in the log file (in cmtrace of course):

Hope this helps!

Comments (0)

Skip to main content