We recently came across an issue where calls to only certain Response Groups were failing with the error "500 Internal Server Error" while calls to other Response Groups worked. This article may help you understand one of the causes of this issue and the corresponding solution.
To understand why we see this issue, we have to look at how the Response Group Service (RGS) works and how it gets initialized.
A Response Group (RG) has a few important components related to it.
Every Response Group has a Contact object that gets created for it in Active Directory (AD).
This Contact object has two specific attributes that are very important:
MSRTCSIP-OwnerURN - "urn:Application:RGS" --> This tells a Front End (FE) server that this object is actually a Response Group.
MSRTCSIP-UserEnabled - "TRUE" --> This tells the FE server that this object is enabled and is a valid object.
Every Response Group also has a workflow assigned to it. The workflow dictates what actions to take or what options to present to the caller whenever this RG is called.
So every Response Group in your organization will have a Contact object in Active Directory with the above two attributes populated. In addition, every RG will have a workflow associated with it.
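To check by hand which Contact objects carry these two attributes, an LDAP filter can be pasted into any LDAP query tool. The sketch below only builds such a filter string; the function names are hypothetical, the attribute names and values are the ones listed above, and this is not necessarily the query the FE server itself runs.

```python
def escape_ldap_value(value: str) -> str:
    """Escape characters that are special inside an LDAP filter value (RFC 4515)."""
    replacements = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\0": r"\00"}
    return "".join(replacements.get(ch, ch) for ch in value)


def build_rgs_contact_filter(owner_urn: str = "urn:Application:RGS",
                             enabled_only: bool = True) -> str:
    """Build an LDAP filter matching Contact objects that are marked as Response Groups.

    Assumption: the attribute names are taken from the article text; LDAP
    attribute names are case-insensitive, so the casing does not matter.
    """
    clauses = [f"(MSRTCSIP-OwnerURN={escape_ldap_value(owner_urn)})"]
    if enabled_only:
        clauses.append("(MSRTCSIP-UserEnabled=TRUE)")
    return "(&" + "".join(clauses) + ")"


if __name__ == "__main__":
    print(build_rgs_contact_filter())
```

The number of objects returned by such a query should match the number of Response Groups you expect to have.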
Whenever you restart the RGS service on any FE server, or restart the FE server itself, the RGS service performs the following tasks as it starts:
Task 1 - It first reaches out to Active Directory and finds all the Contact objects that have the MSRTCSIP-OwnerURN attribute set to "urn:Application:RGS". In other words, the FE server tries to find out how many Response Groups you have.
Let's assume your organization has 10 Response Groups. In that case, when the RGS service is starting it should find 10 Contact objects in AD that have the MSRTCSIP-OwnerURN attribute set to "urn:Application:RGS".
Task 2 - Once it finds all the entries, the next thing the FE server does is find the corresponding workflows for these 10 Response Groups in order to create a 1:1 mapping between each Response Group and its workflow. We call this successfully registering an endpoint.
In a perfect world, when the RGS service starts on the FE server it should detect that there are 10 Contact objects in AD representing 10 Response Groups, and it should also find the 10 workflows that map to them.
Every time the RGS service finds a Contact object for an RG in AD together with its corresponding workflow, it considers this a "Successfully Registered Endpoint". So if you have 10 Response Groups in your company, the RGS service should map 10 Contact objects to 10 workflows while starting, giving you a total of 10 successfully registered endpoints.
You can easily see how many Response Groups have been mapped successfully by the RGS service by collecting logs for the RGSHostingFramework component while the service is restarting and then searching the logs for the string "Successfully Registered Endpoint".
In our scenario we knew we had 10 Response Groups. However, when we collected logs we saw only 4 successfully registered endpoints. This meant that when the RGS service was starting it was able to map only 4 Contact objects to their corresponding workflows. Those were the only 4 Response Groups that were working; all the others were failing.
We then focused on finding out why exactly the FE servers were unable to register the remaining 6 Response Groups.
The reason turned out to be very clear.
What we found was that the FE server was not actually failing to register the remaining Response Groups. It was not even trying to register them.
This was happening because, after successfully registering 4 Response Groups, it encountered a Response Group named sip:test@Lync1.com.
We saw that every FE server would successfully load 4 Response Groups and then stop at this specific RG, sip:test@Lync1.com.
It was clear that the RGS service was having trouble mapping this particular RG to its corresponding workflow. The likely reason was that the Response Group workflow was either not created correctly or had been deleted, while the Contact object for it had not been deleted and still existed in the database/AD. As a result, the RGS service was never able to map the Response Group to its workflow.
By design, when an FE server encounters this situation it keeps trying to find a corresponding workflow for sip:test@Lync1.com and does not proceed until it finds one. Since the workflow did not exist, the FE server never found it, kept retrying, and never went on to load the other Response Groups.
So basically the FE server was able to load all RGs up to sip:test@Lync1.com, and it was never able to move past sip:test@Lync1.com.
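The behavior described above can be modeled with a small sketch. This is only an illustration of the article's description, not the actual RGS code: the function name, the rg1 to rg10 URIs and the workflow names are hypothetical, and where the real service keeps retrying forever, the sketch simply stops and reports the blocking Response Group.

```python
from typing import Dict, List, Optional, Tuple


def register_endpoints(contacts: List[str],
                       workflows: Dict[str, str]) -> Tuple[List[str], Optional[str]]:
    """Register contacts in order; stop at the first one without a workflow.

    Returns the list of registered SIP URIs and the URI that blocked
    processing (None if every contact had a workflow).
    """
    registered: List[str] = []
    for sip_uri in contacts:
        if sip_uri not in workflows:
            # The real service would retry here indefinitely and never
            # reach the remaining contacts.
            return registered, sip_uri
        registered.append(sip_uri)
    return registered, None


if __name__ == "__main__":
    # 10 Contact objects in AD; the 5th one has lost its workflow.
    contacts = ([f"sip:rg{i}@Lync1.com" for i in range(1, 5)]
                + ["sip:test@Lync1.com"]
                + [f"sip:rg{i}@Lync1.com" for i in range(6, 11)])
    workflows = {uri: f"workflow-for-{uri}" for uri in contacts
                 if uri != "sip:test@Lync1.com"}

    registered, blocked_by = register_endpoints(contacts, workflows)
    print(f"{len(registered)} of {len(contacts)} registered; blocked by {blocked_by}")
```

Note that the five Response Groups after sip:test@Lync1.com are perfectly healthy in this model; they fail only because the service never gets to them.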
This problem typically happens if you deleted or modified the RG from the Lync Control Panel, PowerShell, etc., but the AD Contact object for this RG was not deleted. The RG then remains active as a Contact object in AD while its workflow is deleted or inaccessible. This can happen if the person who tried to delete, modify, or create the Response Group workflow through the Lync Control Panel did not have permissions on the RTC Service container in AD.
To solve the issue, we needed to delete the Contact object for the RG sip:test@Lync1.com from AD using ADSI Edit.
After we deleted this RG, verified that the corresponding object in AD was also deleted, and restarted the RGS service, all Response Group calls started working fine.
If you run into a scenario where calls to certain Response Groups in Lync are failing with "500 Internal Server Error", the best thing to do is the following:
1. Collect logs for the RGSHostingFramework component using OCS/CLS Logger while restarting the Response Group service.
2. Convert the ETL log file into text format using OCS Logger itself and open the log file in Notepad or a text analyzer.
3. After loading the file, filter for the string "AEP discovered". This should list all the RGS application endpoint names that the RGS application has found in the environment based on the Contact objects. Check how many endpoints were found and make a note of it.
4. Then, in the same logs, filter for the string "Successfully Registered Endpoint". This shows how many of the endpoints discovered in step 3 have been successfully mapped to their corresponding workflows.
5. The total number of "AEP discovered" endpoints should be the same as the total number of successfully registered endpoints.
6. If they are not the same, you have probably deleted the workflows of some Response Groups from the Lync Control Panel while the corresponding AD Contact objects were somehow not deleted. Whenever an FE server finds, while restarting, an RGS Contact object that does not have a corresponding workflow, it keeps trying to find the workflow and stops processing any other Response Groups. As a result, some Response Group calls may work while the rest fail.
7. Find out which Response Group was the last one successfully registered in step 4, and compare it with the list of "AEP discovered" endpoints from step 3. If you have 10 "AEP discovered" endpoints and only 5 successfully registered endpoints, then the 6th "AEP discovered" endpoint is causing the issue. This endpoint belongs to a Response Group whose workflow may have been deleted, so the FE server cannot map the RGS Contact object to its workflow and stops processing at this point.
8. Delete the Contact object of the problem RG from AD.
9. Restart the RGS service.
10. Repeat steps 1 to 9 if the problem continues. (You can have more than one problem Response Group object in your AD.)
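Steps 3 to 7 can be partly automated once the log has been converted to text. The following is a minimal sketch under stated assumptions: it assumes that each matching log line contains the SIP URI of the endpoint, which you should verify against your own logs, and the function names and sample lines are hypothetical. Only the two search strings come from the steps above.

```python
import re
from typing import List

# Assumption: endpoints appear in the log as SIP URIs such as sip:test@Lync1.com.
SIP_URI = re.compile(r"sip:[^\s<>,;]+", re.IGNORECASE)


def extract_endpoints(log_lines: List[str], marker: str) -> List[str]:
    """Return the SIP URIs found on lines containing `marker`, in log order, without duplicates."""
    marker = marker.lower()
    seen = set()
    endpoints: List[str] = []
    for line in log_lines:
        if marker not in line.lower():
            continue
        match = SIP_URI.search(line)
        if match is None:
            continue
        uri = match.group(0).rstrip("\"'.")  # drop trailing quotes or a full stop
        if uri.lower() not in seen:
            seen.add(uri.lower())
            endpoints.append(uri)
    return endpoints


def find_unregistered(log_lines: List[str]) -> List[str]:
    """Return discovered endpoints that never show up as successfully registered."""
    discovered = extract_endpoints(log_lines, "AEP discovered")
    registered = {uri.lower() for uri in
                  extract_endpoints(log_lines, "Successfully Registered Endpoint")}
    return [uri for uri in discovered if uri.lower() not in registered]


if __name__ == "__main__":
    # Hypothetical sample lines; real RGSHostingFramework log lines will look different.
    sample = [
        "AEP discovered: sip:rg1@Lync1.com",
        "AEP discovered: sip:test@Lync1.com",
        "AEP discovered: sip:rg2@Lync1.com",
        "Successfully registered endpoint sip:rg1@Lync1.com",
    ]
    for uri in find_unregistered(sample):
        print(uri)
```

Following the reasoning in step 7, the first endpoint in the returned list is the most likely culprit; the ones after it may be healthy Response Groups that were simply never reached.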