SCOM Certificate Monitoring – does the Health Service really use the correct Certificate?

Unfortunately this blog has been quiet for a while because of the Christmas season and a lot of other tasks I had to take care of, but I promise to do better in 2016… :-)

Today I want to talk about an issue that I bet almost every SCOM Admin who uses X.509 Certificates for authentication has stumbled upon, and that I see quite often in the field:
The Certificate used for SCOM authentication has been changed and you forgot to reimport this Certificate into SCOM with momcertimport.exe.

Most of you might say: Oh man, that’s old hat! That’s what we have e.g. Raphael Burri’s excellent PKI MP for.
And yes, you are absolutely right! But a Certificate monitoring MP only warns you if a Certificate is about to expire or has already expired, but it usually does not warn you that the SCOM health service still uses a different Certificate!

Scenario

Let me give you an example:
The SCOM Management Server and Gateways are using Certs for authentication. The Cert is about to expire (that can be monitored e.g. with Raphael’s PKI MP) and will be updated either manually or automatically (e.g. via AD CA auto-enrollment).
Changing Certs automatically is very efficient and lean, but can be extremely dangerous to SCOM because:

Updating a Cert usually means changing the serial number of the Cert (e.g. from SNR 1 to SNR 2). So the serial number of the Cert in the Certificate store of the local computer and the serial number of the Cert used by the SCOM health service (stored in HKLM\Software\Microsoft\Microsoft Operations Manager\3.0\Machine Settings\ChannelCertificateSerialNumber) no longer match.
In the screenshot below the serial number in the registry does not match the one on the Cert (I deleted the first byte):
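
The comparison between the registry value and the serial number on the Cert can be sketched in a few lines of Python. This is only an illustration and not code from the MP attached to this post: the function names are made up, and the reversed_order switch reflects the commonly reported behaviour that the registry holds the serial number bytes in the opposite order to the one shown on the Cert. Please treat that as an assumption and verify it on your own systems.

```python
KEY_PATH = r"SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Machine Settings"
VALUE_NAME = "ChannelCertificateSerialNumber"


def registry_serial_to_hex(raw: bytes, reversed_order: bool = True) -> str:
    """Turn the raw registry bytes into an upper-case hex string.

    reversed_order=True flips the byte order first (assumption: the
    registry stores the serial number back to front).
    """
    data = raw[::-1] if reversed_order else raw
    return data.hex().upper()


def normalize_serial(text: str) -> str:
    """Remove spaces and colons from a serial number copied from the Cert."""
    return text.replace(" ", "").replace(":", "").upper()


def read_channel_serial() -> bytes:
    """Read the serial number the health service is configured to use.

    Windows only. winreg is imported here so that the two helpers above
    can also be used (and tested) on other platforms.
    """
    import winreg

    access = winreg.KEY_READ | winreg.KEY_WOW64_64KEY
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0, access) as key:
        value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
    return bytes(value)
```

On a Management Server or Gateway you would call registry_serial_to_hex(read_channel_serial()) and compare the result with normalize_serial() of the serial number shown on the Cert in the local computer store.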

Will updating the Cert cause any immediate trouble? Not necessarily. Everything will continue to work as long as you do not restart the SCOM services, because SCOM caches the specified Cert at startup time. Once you restart the services, SCOM will notice that the Cert with the serial number 1 is not available anymore. SCOM will then write an Error Event and immediately close all connections authenticated by the Cert (mainly the Gateways).

This scenario (SCOM Cert has been changed, everything is running but SCOM uses an invalid cached Cert) is unfortunately not covered by any Management Pack known to me.

A possible solution

To get an actionable alert when this scenario occurs, I wrote a small demo MP with only one monitor that does this job:

  • It runs on all Management Servers (incl. Gateways)
  • Monitor affects the health state of the Windows Computer (to get a visible state change as well)
  • Monitor checks on a regular basis (e.g. every 1h) whether the serial number found in HKLM can be found in the Certificate store of the local computer. If not, it creates an actionable alert for the operations team.
  • I know that this might not be the best way, because the matching Cert in the store might itself be invalid. But invalid Certs should already be monitored by the PKI MP.
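
The decision the monitor makes can be sketched roughly like this in Python. This is not the implementation used in the demo MP, just an illustration of the logic: CheckResult and check_channel_cert are hypothetical names, and collecting the serial numbers from the local computer store is left out and assumed to happen elsewhere.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CheckResult:
    healthy: bool
    message: str


def _clean(serial: str) -> str:
    """Normalise a serial number so that formatting differences do not matter."""
    return serial.replace(" ", "").replace(":", "").upper()


def check_channel_cert(registry_serial: str, store_serials: Iterable[str]) -> CheckResult:
    """Report whether the serial number configured for SCOM exists in the store.

    registry_serial: serial number read from ChannelCertificateSerialNumber,
                     already converted to the same notation as the store.
    store_serials:   serial numbers of the Certs in the local computer store.
    """
    wanted = _clean(registry_serial)
    known = {_clean(serial) for serial in store_serials}
    if wanted in known:
        return CheckResult(True, f"Cert with serial number {wanted} found in the store.")
    return CheckResult(
        False,
        f"No Cert with serial number {wanted} in the store; "
        "reimport the current Cert with momcertimport.exe.",
    )
```

Only an unhealthy result would raise the alert. As noted in the list above, a healthy result says nothing about whether the Cert itself is still valid.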

Monitor in an error state (the mismatch is detected, the serial number is shown and the Operator gets the hint to reimport the Cert with momcertimport.exe):

Monitor in a healthy state:

You can find the unsealed MP attached to this post.

Summary

Even if you monitor your Certs closely, you might run into the situation that you update a SCOM Cert but forget to register the new one in SCOM as well. My demo MP monitors this situation and creates an actionable alert. Based on this alert a SCOM Admin should then take appropriate action and import the new Cert into SCOM with the help of momcertimport.exe.

Custom.SCOM.CertCheck.xml