We have seen the below issues with customers using PCHAT/Group Chat;
Issue 1: LYNC client would time and again display an error stating Persistent Chat lost connection to the Database.
Issue 2: Pchat servers Crash/Hang or loose connection to the SQL database Often
Issue 3: We see Event ID 53530 Fairly regularly on the PCHAT Servers with Error Description "The Persistent Chat database connection was lost"
Issue 4: We also see the following Warning Events on the Persistent Chat servers time and again,
Notice: Info 9/29/2014 10:38:29 AM customer.com domain is no longer busy and can send us incoming messages normally.
Notice: Warning 9/29/2014 10:38:13 AM customer.com domain is too busy to send us incoming messages.
Notice: Info 9/29/2014 10:38:01 AM customer.com domain is no longer busy and can send us incoming messages normally.
We have come across Multiple cases where customers with large AD infrastructure were seeing one of the above Issues/Symptoms on their PCHAT Servers.
Although some of these warnings self-correct themselves within a very short period of time we tried to focus on finding out why we see these warnings/errors. Because it seemed like every time these errors occur they would cause some or the other issue and the behavior was very Intermittent.
While working on the issue with help of SQL engineers we found that there are several Instances where we see SQL Deadlocks.
When we Investigated further we found that a Certain Stored procedure [dbo].[procProcessADUpdates] seems to be trying to read certain Tables in SQL for a span of about 5-10 minute and this causes the deadlocks.
We then tried to correlate to see if the time when the Deadlocks happen matches with the time we see the errors in the Event logs on the Persistent Chat servers.
To find this we ran SQL PSSDIAG only configured to log instances where this specific Stored Procedure [dbo].[procProcessADUpdates] ran and later analyzed the Results.
Our Priority was to establish a confirm cause and effect relationship between the Errors happening on Persistent Chat and Events taking place on SQL at the same time.
We were able to identified a set Pattern on when the Errors appear on the Persistent Chat server.
The errors happen for a continuous interval of 10 minutes and they do not happen after that for about 101-15 minutes, then they reappear and constantly happen for another period of 10 minutes.
We collected Event logs from the Persistent Chat server and Traces for ProcessADUpdate stored procedure on SQL server.
These errors happened on the Persistent Chat server several times.
The Stored Procedure ran on SQL several times.
The Traces on SQL were collected between 10 am and 4 PM
Between 10 am and 4 PM there were several times when the stored procedure ProcessADUpdate on SQL ran but did not complete
First Occurrence of Stored Procedure running on SQL
Start time End time
We saw the Event errors on PERSISTENT CHAT Server several times within this 10 minute window when the stored procedure was running between 10.12 am and 10.22 am.
If you see the Highlighted Section above you will see that the events happened between 10.13 am and stopped at 10.20 am. This is exactly the time when the Stored procedure started and stopped. After 10.20 am we did not have any events till 10.33 am.
The next error event was logged at 10.33 am and we kept getting these errors until 10.41 am
So we saw the errors again between 10.33 and 10.41 am and this is exactly the time when the stored procedure started again.
Start Time End Time
Below are all the times when the Stored procedure Ran on SQL
Duration Start time End time Object name
600022417 2015-01-29 10:12:09.080 2015-01-29 10:22:09.103 procProcessADUpdates
600013203 2015-01-29 10:32:26.223 2015-01-29 10:42:26.237 procProcessADUpdates
600104326 2015-01-29 10:52:44.880 2015-01-29 11:02:44.983 procProcessADUpdates
600015001 2015-01-29 11:12:56.007 2015-01-29 11:22:56.023 procProcessADUpdates
600022547 2015-01-29 11:33:15.580 2015-01-29 11:43:15.603 procProcessADUpdates
600022349 2015-01-29 11:53:29.437 2015-01-29 12:03:29.460 procProcessADUpdates
600024296 2015-01-29 12:13:33.150 2015-01-29 12:23:33.173 procProcessADUpdates
600058078 2015-01-29 12:34:17.830 2015-01-29 12:44:17.887 procProcessADUpdates
600044520 2015-01-29 12:54:58.053 2015-01-29 13:04:58.097 procProcessADUpdates
600045646 2015-01-29 13:15:04.677 2015-01-29 13:25:04.723 procProcessADUpdates
600071117 2015-01-29 13:35:07.493 2015-01-29 13:45:07.563 procProcessADUpdates
600039783 2015-01-29 13:57:46.727 2015-01-29 14:07:46.767 procProcessADUpdates
600014348 2015-01-29 14:18:06.810 2015-01-29 14:28:06.823 procProcessADUpdates
600014660 2015-01-29 14:38:25.600 2015-01-29 14:48:25.613 procProcessADUpdates
600018793 2015-01-29 14:58:44.330 2015-01-29 15:08:44.347 procProcessADUpdates
600008529 2015-01-29 15:18:51.800 2015-01-29 15:28:51.807 procProcessADUpdates
600011609 2015-01-29 15:39:00.673 2015-01-29 15:49:00.683 procProcessADUpdates
And on all these times we saw errors on the Persistent Chat servers exactly.
This gave us a Proof that the running of this stored procedure to some extent is directly proportional to the Error Events on the FE server in OUR case..
Which then led us to find out more information on the Stored Procedure that was causing this issue,
About sproc ‘inprocProcessADUpdates’
In the Persistent Chat server, an ActiveDirectoryWatcher’s job is to periodically (default is every 10 minutes) query the AD for any updates and if there are any, the corresponding principals are then updated in the Lync DB and also updates the AD sync cookie. The sproc ‘inprocProcessADUpdates’ finds the principals that were recorded as out of sync with AD and updates the data in the Lync database. The default timeout given for executing this SPROC is 10 minutes which is hardcoded in the code. By ‘hardcoded’ I mean this value cannot be customized. While working with the Product developers we learnt that “this command could take a long time to process based on the complexity of the data”. This is the Reason why a Time Out value was configured to ensure the SPROC does not keep running forever. The Time Out Value as mentioned is 10 minutes. It is NOT recommended to change the Time Out value of this SPROC. However we do Have the ability to Change the Frequency of its Run, meaning we can change how often we want it to Run. Based on the code this can be done by modifying the AD sync frequency timer by setting the adsynchfreq attribute in the server’s configuration file (.exe.config).
About sproc ‘procGetDirtyPrincipalIds’
The sproc gets a list of all principals that are dirty (principals that are out of sync with AD) in the Lync DB and need to be refreshed from AD. It gets batch of dirty principals from the DB at a time and updates those. Default batch size is 20 but is configurable by setting “adupdate/batchsize” in the config file for the server.
Based on the above information about these Stored procedures we learnt the following,
- There is no way to modify the time out time of the stored procedure ‘inprocProcessADUpdates’
- We do have the ability to modify how often it runs.
If you are Witnessing the same symptoms as described in the beginning of this Article under Symptoms. Then It is possible you are also running into this problem.
In that case We would recommend the following.
Monitor the Run Cycle of the Stored Procedure [dbo].[procProcessADUpdates]
To confirm that this Stored Procedure is creating an issue intermittently in your PCHAT Environment you can engage a SQL engineer and Run PSSDIAG for the Stored Procedure [dbo].[procProcessADUpdates]
You can run this for the entire day. At the end of the day Note down The Total number of times the SPROC ran and then find out How many times out of that did it Time out at the 10 Minute Mark.
You will be able to see the Exact start and Stop times of the SPROC and for the instances where it Ran for entire 10 Minutes and timed out, Check the Application Event logs on the PCHAT server to see if you had any warnings/error events relate to database connectivity being lost/unavailable at the same time.
Even If you don’t see Errors/warning in the Event logs but you do see that the SPROC ran Multiple Times and timed out after 10 minutes on most occasions then I would still consider this to be a problem that can cause database connectivity issues, Performance issues etc.
Chances are that in a Large AD infrastructure you may encounter this scenario where this Particular SPROC does not complete its execution in 10 minutes and hence times out. The problem is That every time this SPROC runs it Locks certain SQL tables and these tables become unavailable for that amount of time until the SPROC finishes. As a result if some other Application/Sproc tries to access these tables at the same time it will NOT be able to and this will cause Intermittent connectivity issues. You may not be able to find a way to speed up the SPROC such that it completes every time within 10 Minutes or Never times out. Hence to Solve this issue one approach you can use is to Modify the Frequency of How Often this SPROC should run.
By Design This SPROC is programmed to Run Every 10 Minutes. Since this is a Very Intensive Operation it is Ideal NOT to run this SPROC every 10 Minutes specially in a Large AD environment.
So in Short, If you find out that This SPROC Runs Multiple times in your Set up and Does NOT complete every time successfully but Times OUT often then It is Advisable to Change the Frequency of how Often it Runs.
To do this you can follow the Below Instructions,
Modify the Frequency of the SPROC [dbo].[procProcessADUpdates]
Stop LYNC Services on the PCHAT Server
Modify the Frequency of the Stored procedure
On the SQL server hosting the Persistent Chat Database (MGC)
Expand the MGC database
Right Click dbo.tblConfig and select edit top 200 Rows
On the results window under ConfigContent you will see a Long XML file
In this section Modify the Attribute (numerical value) between <adsynchfreq> 10 </adsynchfreq> In your Environment this value will be set at 10 change it to 480 this will mean that the Stored Procedure will run every 8 hours.
[NOTE: There is NO Set Guidance on what this Value should be changed to. You have to make this decision based on your Requirement. Every time you make a change in your Environment related to PCHAT (for example - add a user to Room, Create a Room etc.) This Value Decides how soon will that change take effect. If you do not make changes that often to the PCHAT Settings than it is best to keep this Value as high as possible to avoid Unnecessary running of the Stored Procedure as it is a very intensive Operation. In My example I have set the duration to 480 minutes which is 8 Hours]
Once you make this change Exit out of SQL Server Management studio and re-open it and verify the change stays.
Refer Screen Shot below for reference
Start LYNC services on the Persistent Chat Servers
Once the Above changes are made you would need to Restart the Persistent Chat services on the Persistent Chat server that is currently Live in your environment.
You can either Restart the Persistent Chat server itself or just restart the Persistent Chat services by running the following commands from the windows powershell
Note Down the EXACT Time when the Server/Services are restarted. (Very Important)
This is the time that will be used as the Reference point by the Stored Procedure to decide when should it run Next.
For Example If you set the <Adsynchfreq> to 480 (8 hours) and you restarted the Server at 1 PM. Then the Stored Procedure will Run After 8 hours from this point onwards so the Next run will be at 9 PM.
I would recommend you restart the Persistent Chat service at a time that will make sure that the Sproc does not run during normal/busy business hours.
Repeat Steps 1-5 On all PCHAT Servers in your Environment.