I first want to thank one of my customers for providing the screenshots for this particular issue. I’ve run into it a couple of times, but never had screenshots that I could use to demonstrate the behavior. That said, gray agents are generally fairly easy to troubleshoot. They have a wide range of causes, and Microsoft has provided a nice article that can be used as a starting point for figuring them out. However, there are a couple of oddities I’ve run into when troubleshooting them that are worth documenting.
First, the easy stuff, in case you are new to SCOM:
- Bad process: The server was decommissioned and no one updated SCOM.
- The server is actually down: Letting us know about this is SCOM’s job, right?
- Microsoft Monitoring Agent (or System Center Management Service if you are running an older version of SCOM) is not started.
- Health service cache is corrupt.
- Problem with your certificates if you are using certificate authentication.
I’m sure there are a few more, but these are the typical ones I see.
Anyway, once you get through the basic items and the server is still gray, it gets a bit more complicated. In this case, we observed the following errors:
Event ID 1220 Operations Manager Log: Health Service – Not enough storage space (8L)
Event ID 7022 Operations Manager Log: “The Health Service has downloaded secure configuration for management group <INSERT MG NAME>, and processing the configuration failed with error code Cannot find certificate and private key for decryption. (0x8009200B).”
Also Event 7029 Operations Manager Log:
Finally, we saw some Schannel errors in the Windows logs, Event ID 36871. For what it’s worth, in this particular case the Schannel errors did not go away after correcting the issue.
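To keep the symptoms straight while troubleshooting, the event IDs above can be collected in a small lookup. The Python sketch below is only an illustration: `GRAY_AGENT_EVENTS` and `triage` are hypothetical names of my own, the descriptions are paraphrased from this post rather than from official documentation, and the event IDs are passed in as plain integers, so how you export them from the server is left open.

```python
# Event IDs observed in this case, keyed to the log they appeared in and a
# short description paraphrased from this post (not an official reference).
GRAY_AGENT_EVENTS = {
    1220: ("Operations Manager", "Health Service: not enough storage space"),
    7022: ("Operations Manager",
           "Secure configuration failed: cannot find certificate and "
           "private key for decryption (0x8009200B)"),
    7029: ("Operations Manager", "Logged alongside event 7022 in this case"),
    36871: ("Windows logs", "Schannel error"),
}


def triage(event_ids):
    """Split observed event IDs into those discussed here and everything else."""
    known = {}
    unknown = []
    for event_id in sorted(set(event_ids)):
        if event_id in GRAY_AGENT_EVENTS:
            known[event_id] = GRAY_AGENT_EVENTS[event_id]
        else:
            unknown.append(event_id)
    return known, unknown


if __name__ == "__main__":
    known, unknown = triage([7022, 1220, 36871, 7029, 1000])
    for event_id, (log_name, description) in known.items():
        print(f"{event_id} [{log_name}]: {description}")
    print("Not covered here:", unknown)
```

If all four known IDs show up together on a gray agent, that matches the pattern described in this post and is a hint to look past the usual causes.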
There’s not a lot of SCOM documentation on Schannel, but SCOM depends on Schannel working properly for authentication. I have run into cases where certificate authentication breaks because of incorrect Schannel registry configuration. The fix in those cases is making sure the SSL and TLS protocols are turned on. You can observe that in the registry here:
That said, this was not the issue at hand either. As it turns out, it was a permissions issue, but not a commonly known one. Something had stripped the permissions for the Local System account and the local Administrators group from a hidden directory. That directory is as follows:
Both System and Administrators need access to that folder, and if their permissions are taken away, your agent will gray out.
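A check like this can be scripted. The sketch below is a rough Python example under a few assumptions: the folder path is supplied by you (it is deliberately not hard-coded here), the principal names are the English ones and will differ on localized systems, and the helper names `missing_principals` and `check_folder` are hypothetical. It only confirms that an entry exists for each principal in the output of the built-in `icacls` tool; it does not judge which rights that entry grants.

```python
import subprocess

# English-language principal names; an assumption that will not hold on
# localized Windows installations.
REQUIRED = ("NT AUTHORITY\\SYSTEM", "BUILTIN\\Administrators")


def missing_principals(icacls_output, required=REQUIRED):
    """Return the required principals that have no entry in icacls output.

    icacls prints each entry as "<principal>:(<flags>)", so a simple
    case-insensitive search for "<principal>:" is enough for this sketch.
    """
    text = icacls_output.lower()
    return [name for name in required if (name.lower() + ":") not in text]


def check_folder(path):
    """Run icacls against the given folder (Windows only) and report gaps."""
    result = subprocess.run(
        ["icacls", path], capture_output=True, text=True, check=True
    )
    return missing_principals(result.stdout)
```

Run `check_folder` from an elevated prompt on the affected server with the folder path as its argument. An empty list means both principals still have an entry; anything else names the account whose permissions were stripped.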