The case of (Un)Explained :ConfigMgr SQL Always ON – Upgrade to 1902 with Certificate backup Error using Listener Name


Hi there!

Here is an interesting one (in case you encounter one), Where the upgrade to 1902 for a SQL AO infra was failing with the below error -

CONFIGURATION_MANAGER_UPDATE    28292 (0x6e84)    Create_BackupSQLCert : SQL server failed to backup cert.
CONFIGURATION_MANAGER_UPDATE    28292 (0x6e84)    CSiteControlSetup::SetupCertificateForSSB : Failed to create/backup SQL SSB certificate.
CONFIGURATION_MANAGER_UPDATE    28292 (0x6e84)    ERROR: Failed to set up SQL Server certificate for service broker on "SCCMListener.abc.xyz" .
CONFIGURATION_MANAGER_UPDATE    28292 (0x6e84)    ERROR: Failed to initialize the site control data.
CONFIGURATION_MANAGER_UPDATE    28292 (0x6e84)    InstallSystem failed.

The interesting part it is it was showing the ListenerName there as we expect a NodeName there and not the ListenerName.

There is no FCI (Failover Cluster Instance) involved and the fact that the thing was already setup from 1802 and upgraded to 1810 means there was no such configuration issue or else the past update would have failed as well.

We can confirm from the ProcMon below that it is trying to use the ListenerName instead of the individual Node name while accessing the cert.

The interesting part is that the Cert is actually created on the Node.

The thing is we check the File attributes of the CERT after the creation and that’s where we see it is using the ListenerName in the UNC path.

Note that there could be a misconception that \\ListenerName\$ share should point to the ACTIVE node. But this is not true and ListenerName are just dummy Windows Machine records like Cluster, and it has a DNS record and an IP and Windows doesn’t inherently know it how to resolve that name to any ACTIVE Node (This is a SQL feature to be precise)

So hence when you use ListenerName to run a command it still connects to the ACTIVE Node and runs the SP and creates the cert, but when accessing the Windows share via Listener it won't work.

OK enuf said,

Why are we using the Listener Name altogether ? Time for source code check :)

The first hint that was obtained was very significant

So, Here it is – If we actually fail for a SQL Always On Configuration the error in the logs should be below

ERROR: Failed to set up SQL Server certificate for service broker on replica node  XYZ

Note the extra “on replica node”

The error we are failing with is in the ELSE part, if we don’t have SQL Always ON. Something Like

 

IF (SQLAlwaysONConfigDetected)
{
Result = CreateAndBackupSQLCert();
IF RESULT = FAILED THEN
LogMessage ERROR: Failed to set up SQL Server certificate for service broker on replica node XYZ
}

ELSE
{
Result = CreateAndBackupSQLCert();
IF RESULT = FAILED THEN
LogMessage ERROR: Failed to set up SQL Server certificate for service broker on XYZ  <-- And we fail like this entry
}

 

Now this hint was enuf to back track this issue. We can then go straight to the setup log and find the entries where we detect the SQL Always ON configuration and its nodes.
And fair enuf we see the below errors on the top where we check the config.

05-08-2019 17:27:02.822    CONFIGURATION_MANAGER_UPDATE    28292 (0x6e84)    ERROR: All availability replicas must have the same DB seeding mode.
05-08-2019 17:27:02.822    CONFIGURATION_MANAGER_UPDATE    28292 (0x6e84)    ERROR: Failed to retrieve SQL always on nodes 

Now things are more clear that we were not able to retrieve SQL AO nodes and instead of failing here we continued the SETUP thinking it   as a normal SQL instance with ListenerName as SQL server name 😊

The error is also pretty clear that the seeding mode is not the same between nodes. We can run the below query which ConfigMgr uses -

SELECT R.replica_server_name, S.role, CAST(R.seeding_mode as int) AS seeding_mode FROM sys.availability_replicas R
INNER JOIN sys.dm_hadr_availability_replica_states S ON R.replica_id = S.replica_id
INNER JOIN sys.availability_group_listeners AGL on AGL.group_id = R.group_id
INNER JOIN sys.availability_group_listener_ip_addresses AGLIA ON AGL.listener_id = AGLIA.listener_id
WHERE  (R.availability_mode = 1 OR S.role = 1)

In my lab this shows as below

In our case one node was zero and the other was one. So what are these 1 and 0.

1 = AUTOMATIC
0= MANUAL

Read more on seeding here: https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/automatic-seeding-secondary-replicas?view=sql-server-2017

 

So how do you change it to match the seeding?

 

Simply from the UI by clicking on Availability group properties

OR  thru a TSQL way -

ALTER AVAILABILITY GROUP [AGName]
MODIFY REPLICA ON 'Replica_Name'
WITH (SEEDING_MODE = AUTOMATIC)

So once that was done we were good.

So what are we doing about the fix ?

The good part is we have already fixed this now as part of 1906 in two folds:

  1. We will have the check for the incorrect seeding in the PreReq check itself.
  2. When the seeding is inconsistent then FAIL it there itself instead of continuing the setup and fail much later with a confusing error.


Hope it helps!

Umair Khan

Support Escalation Engineer |Microsoft System Center Configuration Manager 

Disclaimer: This posting is provided "AS IS" with no warranties and confers no rights.

 


Skip to main content