If you are managing a hierarchy for sometime and dealt with Replication issues before; I assume you would have an understanding of the DRS Initialization process. If you feel you need some guidance then refer:
But what is very obscure and often called a Black box is the DRS Message flow and the synchronizations of groups when the site is ACTIVE.
So in short how is the flow in a day to day life when things are good. You create a package on the console on CAS and it replicates back to the Primary Site or vice versa.
How does this happen?
A 100 feet flow that we have mentioned before - When the data gets inserted in the given site we extract the changes via Change Tracking and then convert them into message and send it via Service broker.
Well, All is good when looking it in a high overview. At CSS, there is no high overview, we have to know the skin of the things and the internals.
For each replication groups much more needs to happen for them being to show as ACTIVE individually when the Site is ACTIVE.
So the million dollar question is After Init is finished and the site is ACTIVE, how do we monitor the Synchronization and changes of the groups?
Note: The below concepts are there to clarify the concepts of DRS. Do not change or run any modification queries in the production without any expert guidance. For any recommendations or issues, please open a ticket with Microsoft CSS.
DRS Sync Process – On the initiating site
Part 1 : To decide whether to send changes for a group
- RCMCtrl regularly calls the
THREAD_InvokeDrsSyn()function as a part of regular loop.
- That calls the SP
spDRSInitiateSynchronizationswhich has the logic where it checks the difference between the last send time and the current time is more than the synchronization interval of the group.
- Now if the difference is more, means we have to send the changes for this group which initiates the synchronization by initiating a dialog in the ConfigMgrDRSMsgBuilderQueue.
- The MessageType = 'DRS_StartMsgBuilder' is the Message Type which marks the beginning of the replication and is done for each replication group. Note that the message will only be put in the queue if there is no message.
for that group in the queue already. A message there already means the previous sync has still not completed.
- When the message is placed in the
ConfigMgrDRSMsgBuilderQueue, it invokes its Activation Procedure
- As a part of this procedure, We check if the sys.dm_exec_requests already has the context of the current replication group then we throw a message '<ReplicationGroup> is already being processed.'
- If there is no message present for the group in question (meaning there are no previous syncs pending ) then we call the main extraction routine.
EXEC dbo.spDRSSendChangesForGroup @ReplicationGroup
Part 2 : The main extraction process on the initiating site
Now this procedure sends the changes by extracting the changes and send them as a message. The process starts as follows [expanding on the flow chart steps] -
1. Checking for which groups to not send the changes:
It checks if the last two syncs for this Replication group has completed or not [SyncCompleteTime is NULL] . We will learn more about
SyncCompleteTime later as to how it gets updated. If not it puts it in the temp table
@SitesNotToSend and would not send the changes to those Sites for this replication group.
It will log the below in Vlogs –
'Not sending changes to sites <ReceivingSiteCode> for Replication group <ReplicationGroup> since last 2 syncs to these sites have not completed.'
Here is a sample of the TSQL code that does that –
-- get sites which have not returned a sync complete within the throttle window ;with ReplicationGroupSendHistory (ID, TargetSite, SyncCompleteTime) as ( select ID, TargetSite, SyncCompleteTime from DrsSendHistory with (nolock) where ReplicationGroupID = @ReplicationGroupID ) insert into @SitesNotToSend (SiteCode) select TargetSite from (select *, ROW_NUMBER() over (partition by TargetSite order by ID desc) TopRows from ReplicationGroupSendHistory) as Temp where Temp.TopRows <= @ThrottlingWindow AND SyncCompleteTime is null group by TargetSite having count(Temp.TopRows) >= @ThrottlingWindow if exists (select * from @SitesNotToSend) begin set @LogText= @SyncIDLogString + N'Not sending changes to sites ' + (select SiteCode + N', ' from @SitesNotToSend for xml path(N'')) + N' since last ' + CONVERT(NVARCHAR(12),@ThrottlingWindow) + N' syncs to these sites have not completed.'; exec dbo.spLogEntry @LogText=@LogText, @ProcedureName=@procedureName; end
2. We check if there are any sites for this replication group for which the dialog handle got changed and the previous sync has not completed for it.
If yes, we log the below entry in Vlogs:
‘The dialog handle for sites @PendingSyncSitesWithChangedDialog has changed and previous sync has not completed. Not processing sync for them during this cycle. Will try again in the next cycle.'
Here is the code which checks this:
-- Check for any sites for this replication group which had a new dialog and previous sync is pending insert into @PendingSyncSitesWithChangedDialog select SiteCode from DRS_MessageActivity_Send where LastSendResult=-20 and ReplicationID=@ReplicationGroupID and Active=1 and SiteCode in (select TargetSite from DrsSendHistory with (nolock) where SyncCompleteTime is null and ReplicationGroupID=@ReplicationGroupID); if exists (select * from @PendingSyncSitesWithChangedDialog) begin set @LogText= @SyncIDLogString + N'The dialog handle for sites ' + (select SiteCode + N', ' from @PendingSyncSitesWithChangedDialog for xml path(N'')) + N' has changed and previous sync has not completed. Not processing sync for them during this cycle. Will try again in the next cycle.'; exec dbo.spLogEntry @LogText=@LogText, @ProcedureName=@procedureName; end
LastSendResult -20 means the same that the DialogHandle changed. We have this code later which updates the same.
insert into @TargetSites select SiteCode from DRS_MessageActivity_Send where CurrentSendSyncID = @SyncID; while exists (select * from @TargetSites) begin set @SiteCode = (select top(1) SiteCode from @TargetSites); exec @Rval = dbo.spDRSGetDialogHandle @SiteCode=@SiteCode, @ReplicationGroupID=@ReplicationGroupID, @Handle=@SendDialogHandle output; if (@Rval != 0 and @Rval != 2) begin rollback tran; return 0; end -- If new dialog was created and there are pending sync for this group and target site, then do not sync for this group and site if (@Rval = 2 and exists(select * from DrsSendHistory with (nolock) where SyncCompleteTime is null and ReplicationGroupID=@ReplicationGroupID and TargetSite=@SiteCode)) begin set @LogText= @SyncIDLogString + N'The dialog handle for site ' + @SiteCode + N' has changed and previous sync has not completed. Not processing its sync during this cycle. Will try again in the next cycle.'; exec dbo.spLogEntry @LogText=@LogText, @ProcedureName=@procedureName; -- Store the fact that this happened in the LastSendResult update DRS_MessageActivity_Send set LastSendResult=-20 where ReplicationID=@ReplicationGroupID and SiteCode=@SiteCode; end
Here is the full key for LastSendResult in DRSSendHistory
-99: Snapshot isolation error
-20: Sync handle changed while previous syncs were not done
-10: LastSendVersion changed while sync was in progress
-3: Replication group needs re-init and invalid subscription is sent by message builder
-2: Received exception while sending DRS_SyncStart or DRS_SyncEnd
-1: Any other error during sync
3. Compiling the changes to be send:
We basically call spGet<table>Changes (“Site” replication pattern) or function fnGet<Table>Changes (“global” replication pattern) or function fnGet<Table>ChangesSec (“global_proxy” replication pattern) to extract changes.
Now if we have DebugLogging enabled for this Vlogs we do log useful information for when we extract the changes.
To enable DebugLogging we can alter the function like below –
ALTER FUNCTION [dbo].[fnIsDebugLoggingEnabled]() RETURNS BIT AS BEGIN RETURN 1; END
From Vlogs (Debug Enabled):
SYNC(B3BBC272-BDFC-4047-942F-538F1A0775A7) [Configuration Data] Start extract and send of changes, pattern is global and group id is 3 SYNC(B3BBC272-BDFC-4047-942F-538F1A0775A7) [Configuration Data] Encoding and Compression is enabled for this replication group. SYNC(B3BBC272-BDFC-4047-942F-538F1A0775A7) [Configuration Data] Starting scan, group ID is 3 SYNC(B3BBC272-BDFC-4047-942F-538F1A0775A7) [Configuration Data] - Extracted 1 changes for table CM_SiteConfiguration SYNC(B3BBC272-BDFC-4047-942F-538F1A0775A7) [Configuration Data] Extracted 1 total changes.
4. The extracted changes are saved into #SiteTrackingTable (“Site” replication pattern) or #TrackingTable (“global” or “global_proxy” replication pattern). For e.g. The ‘SoftwareInventory’ table being a site data table has SCCM_DRS.spGetSoftwareInventoryChanges stored procedure that will be called and it will extract the data to the #SiteTrackingTable.
5. Call sproc spDRSSendStartMsg to mark “starting message send”.
From Vlogs (Debug Enabled) on initiator site:
SYNC(B3BBC272-BDFC-4047-942F-538F1A0775A7) [Configuration Data] Sending DRS_SyncStart message to 2 sites.
Once the Sync start message is processed on receiving site we log.
INFO: Received message DRS_SyncStart. INFO: Reading site code and replication group from XML message. INFO: Read site code CAS and replication group Configuration Data. INFO: Updating DRS_MessageActivity_Receive CurrentReceiveDialogHandle: 6f9c1d3c-9464-e911-80f1-001dd8b85882, LastReceiveStartTime: 06/01/2019 20:08:35 and LastReceiveResult: 0. INFO: Updated DRS_MessageActivity_Receive.
6. If there are changes in #TrackingTable, walk through #TrackingTable to build messages and call proc spDRSSendDataMsg to send out. If there are changes in #SiteTrackingTable and current replication group’s replication pattern is “Site”, walk through # SiteTrackingTable to build messages and call proc spDRSSendBinaryDataMsg to send out.
From Vlogs (Debug Enabled) on initiator site:
SYNC(B3BBC272-BDFC-4047-942F-538F1A0775A7) [Configuration Data] Called spDRSSendDataMsg 1 times.
Once the SyncData message is processed on receiving site we log.
INFO: DRS_SyncDataCompressed message processing started. INFO: DRS_SyncDataCompressed needs decompression. INFO: DRS_SyncDataCompressed message decompressed from length 762 to 2376. INFO: Read site code: CAS from the DRS_SyncData. INFO: Parsed data from DRS_SyncData XML: lastVersionReceived: 4099493, lastSyncVersionToSource: 17479330, list of upsert tables: CM_SiteConfiguration, list of delete tables: INFO: Building dependency map for all upserts. INFO: Processed all upserts. INFO: Building dependency map for all deletes. INFO: Processed all deletes. INFO: Updating DRS_MessageActivity_Receive with LastVersionReceived: 4099493 for ReceiveDialogHandle: 6f9c1d3c-9464-e911-80f1-001dd8b85882. INFO: Updated DRS_MessageActivity_Receive. INFO: Processed message DRS_SyncDataCompressed
7. Call proc spDRSSendEndMsg to mark “ending message send”.
From Vlogs (Debug Enabled) on initiator site:
SYNC(B3BBC272-BDFC-4047-942F-538F1A0775A7) [Configuration Data] Sending DRS_SyncPing and DRS_SyncEnd message to 2 sites and updating DRS_MessageActivity. SYNC(B3BBC272-BDFC-4047-942F-538F1A0775A7) [Configuration Data] Completed operation in 1 seconds.
Once the SyncEnd message is processed on receiving site we log.
INFO: Received message DRS_SyncEnd. INFO: Reading last version received from XML message. INFO: Read last version received 4099494. INFO: Updating DRS_MessageActivity_Receive with LastVersionReceived: 4099494 and LastReceivedEndTime: 06/01/2019 20:10:12 for DialogHandle: 6f9c1d3c-9464-e911-80f1-001dd8b85882. INFO: Processed message DRS_SyncEnd.
Now it is important to note that just by sending a SyncEnd message doesn’t mark the completion of the Sync. The sync is not completed until the initiating site receives a SyncComplete message from the receiving site which sends it after it receives the SyncEnd message.
So when we send the SyncEnd message from the Initiating site we update the DRSSendHistory (Leaving the ProcessedTime and SyncComplete as NULL)
-- update sync history update DRS_MessageActivity_Send set LastSendEndTime=@EndTime, LastVersionSent= CASE WHEN @IsSchedulingOn = 0 THEN LastVersionSent ELSE @MaxVersion END, SkipContextCheck=0, SyncDataSize = SyncDataSize + @StartMsgSize + @EndMsgSize where ReplicationID=@ReplicationGroupID and LastSendStartTime=@StartTime and LastSendResult>=0 and SiteCode in (select SiteCode from @TargetSitesAndHandleTable WHERE RtmInterop = 1) update DRS_MessageActivity_Send set LastSendEndTime=@EndTime, LastVersionSent= CASE WHEN @IsSchedulingOn = 0 THEN LastVersionSent ELSE @MaxVersion END, SkipContextCheck=0, SyncDataSize = SyncDataSize + @StartMsgSize + @EndMsgSize + @PingMsgSize where ReplicationID=@ReplicationGroupID and LastSendStartTime=@StartTime and LastSendResult>=0 and SiteCode in (select SiteCode from @TargetSitesAndHandleTable WHERE RtmInterop = 0) IF @ReplicationPattern!=N'cloud' BEGIN exec dbo.spDrsInsertSendHistory @SyncID=@SyncID, @ReplicationGroupId=@ReplicationGroupID, @ChangeCount=@ChangeCount, @StartTime=@StartTime, @EndTime=@EndTime END
Note that the @SyncID for each sync of a replication group is created by the NEWID() inbuild function which creates the GUID. It remains the same for a current sync throughout. For the next sync it would be again created as a different GUID.
Note that the LastProcessedTime and the SyncCompleteTime are not set by the above highlighted procedure in the DRSSendHistory of the Sending table. If you check at the record it would be NULL for those two columns. So how do they get updated?
spDRSSendEndMsg sends the ‘DRS_SyncEnd’ messagetype with the @SyncId to the target site.
- On the Receiving site when the message comes in the ConfigMgrDRSqueue, It invokes the spDRSActivation procedure running as a part of MessageHandlerService which processes DRS_SyncEnd message and updates the DRS_MessageActivity_Receive table with last sync version received.
- The above part we already know but we also send a SyncComplete message as well to the Initiator site which then should mark the completeness of the Sync.
It is in ProcessSyncEnd() as well
spDRSSendSyncCompleteon the receiving site calls
EXEC dbo.spSendRcmServiceBrokerMessage @Msg, 'DRS_SyncComplete', 9, @SiteCode
- Note that the DRS_SyncComplete message is an ConfigMgrRCMQueue message and hence the receiving site sends the message to the sending site ConfigMgrRCMQueue.
ConfigMgrRCMQueuereceives the message and the activation procedure for the queue which is
Here is the logic that goes for processing the DRS_SyncComplete on the Initiating Site.
ELSE IF @ReceivedMessageType = N'DRS_SyncComplete' BEGIN DECLARE @SyncID UNIQUEIDENTIFIER; DECLARE @ProcessedTime DATETIME; SET @MessageXml = CONVERT(xml, @ReceivedMessageBody); SELECT @SiteCode = T2.X.value(N'./@TargetSite', 'NVARCHAR(3)'), @SyncID = T2.X.value(N'./@SyncID', 'UNIQUEIDENTIFIER'), @ProcessedTime = T2.X.value(N'./@ProcessedTime', 'DATETIME') FROM @MessageXml.nodes(N'DRS_SyncComplete') AS T2(X); EXEC dbo.spDRSUpdateSendHistory @SiteCode, @SyncID, @ProcessedTime END
- The final process is that spDRSUpdateSendHistory that simply updates the ProcessedTime and SyncCompletetime on the initiator site for the current sync as below
SET @ReplicationID = (SELECT DISTINCT ReplicationGroupID FROM DrsSendHistory WHERE SyncID=@SyncID); UPDATE DRSSendHistory SET ProcessedTime=@ProcessedTime, SyncCompleteTime=GETUTCDATE() WHERE TargetSite=@TargetSite and SyncID=@SyncID; UPDATE DRS_MessageActivity_Send SET LastSyncProcessedTime=@ProcessedTime WHERE SiteCode=@TargetSite and ReplicationID=@ReplicationID;
- So that finally marks a complete Sync and then a given replication group will be deemed as ACTIVE or else it wont be ACTIVE.
Part 3 : Tracking the Sync Handles in Action
It is very important to know how these handles relate to troubleshoot the issue. Because a mismatch here wouldn’t let a sync to complete.
What can cause the handles to mismatch?
- A snapshot restore to an old date DB
- Someone manually cleaning the SSB_Dialogpool when the current sync is in Progress.
So Lets try to track them for say suppose ConfigMgrDRS queue communications for NormalPriority message for Configuration Data.
On the initiator Primary site I check for the current handles for all conversations for Global data.
select * from SSB_DialogPool where ToService like 'ConfigMgrDRS_SiteCAS' and OnContract = 'NormalPriority' order by CreationTime desc
Now these are the ones equivalent to all the global groups which are allowed to send using Normal priority and we can confirm it here
select * from ReplicationData where ReplicationPattern = 'global' and ReplicationPriority = 5
Now how to find which one handle and groupid is being current used by the Configuration Data ?
select * from DRS_MessageActivity_Send where ReplicationID = 3 And Active =1
That’s the second one from the list above we first shared.
Now note that this GroupID is not unique across instances, meaning that we cannot search for the same group id on the receiving site.
There is in fact a different id which is unique
[Conversation_ID] and we need to find the corresponding
[Conversation_ID] for this group id.
select * from sys.conversation_endpoints where conversation_group_id = 'EB2EE470-A345-4CC2-8F27-73C84A3A7629'
Now we can take this conversation_id and search on the receiving site.
select * from sys.conversation_endpoints where conversation_id ='AD90F806-A1AB-41B9-9130-8DD2473C16DA'
From here we get the conversation_handle and this should match with the handle stored in the DRS_MessageActivity_Receive from what we have learned above for things to work.
select * from DRS_MessageActivity_Receive where replicationid = 3 AND SiteCode = 'PR2'
And there we go it perfectly matches and so it means sync completed just well.
So now it should be clear that when sync starts and we use a conversation handle, we store it in the DRS_MessageActivity_Receive table and if someone flushes the dialogpool (manually or restoring the snaphot of old db) then this breaks this sync and the handle stored here would not match the new handle.
And we will as a result we uncompleted sync forever in DRSSendHistory where ProcessedTime and SyncCompleteTime would never populate.
There are ways to get out of this state which RLA reciprocates with (
exec spFlushSSBDialogPool) which has the logic to get out of this state but don’t run this manually and always use RLA.
If this doesn’t fix the issue then engage Microsoft CSS to deep dive into this who might run some manual queries to get you out of this situation.
Support Escalation Engineer | Microsoft System Center Configuration Manager
Disclaimer: This posting is provided "AS IS" with no warranties and confers no rights.