Quick note from the field: AD-integrated SCOM agent might lose its Management Group information after a reboot

A few days ago I stumbled upon an issue that might be interesting for you:
In a SCOM environment with AD integration enabled, some agents can turn grey and become unmanaged after a reboot (e.g. after a patch day). Only a manual restart of the HealthService fixes the issue.

Checking the Operations Manager event log on an affected agent (newest entry first), we can see:

Warning 5/1/2015 2:40:11 PM HealthService 2120 Health Service
-> The Health Service has deleted one or more items for management group "Unknown" which could not be sent in 1440 minutes.
Error 4/30/2015 5:38:06 PM HealthService 2000 Health Service
-> The Management Group XXX failed to start. The error message is The environment is incorrect.(0x8007000A). A previous message with more detail may have been logged.
Error 4/30/2015 5:38:06 PM OpsMgr Connector 20100 None
Error 4/30/2015 5:38:06 PM OpsMgr Connector 20100 None
-> The OpsMgr Connector for management group XXX cannot connect to Active Directory to retrieve connection policy. The error is Unspecified error (0x80004005)
Information 4/30/2015 5:38:06 PM OpsMgr Connector 20062 None
Error 4/30/2015 5:38:06 PM HealthService 2010 Health Service
-> The Health Service cannot connect to Active Directory to retrieve management group policy. The error is Unspecified error (0x80004005)
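
To spot this pattern on other agents, the event sequence above can be checked programmatically. The following is only a minimal sketch in Python, assuming the log has been exported as plain text in the same layout as the excerpt above. The function names (parse_events, lost_management_group) are hypothetical and not part of any SCOM tooling; only the event IDs are taken from the excerpt.

```python
import re
from datetime import datetime

# Event IDs taken from the log excerpt above.
AD_FAILURE_IDS = {2010, 20100}  # agent cannot reach AD to retrieve policy
MG_START_FAILED_ID = 2000       # management group failed to start
ITEMS_DELETED_ID = 2120         # queued items dropped after 1440 minutes

# Assumed layout of a header line: Level, date, time, AM/PM, source, ID, category.
HEADER = re.compile(
    r"^(Warning|Error|Information) "
    r"(\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M) "
    r"(.+?) (\d+) (.+)$"
)


def parse_events(text):
    """Return (timestamp, level, source, event_id) for every header line."""
    events = []
    for line in text.splitlines():
        match = HEADER.match(line.strip())
        if match:  # description lines starting with "->" are skipped
            level, stamp, source, event_id, _category = match.groups()
            when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
            events.append((when, level, source, int(event_id)))
    return events


def lost_management_group(events):
    """True if AD and management group start failures are later followed by 2120."""
    ad_failures = [e for e in events if e[3] in AD_FAILURE_IDS]
    mg_failures = [e for e in events if e[3] == MG_START_FAILED_ID]
    if not ad_failures or not mg_failures:
        return False
    first_failure = min(e[0] for e in ad_failures + mg_failures)
    return any(
        e[3] == ITEMS_DELETED_ID and e[0] > first_failure for e in events
    )


SAMPLE = """\
Warning 5/1/2015 2:40:11 PM HealthService 2120 Health Service
Error 4/30/2015 5:38:06 PM HealthService 2000 Health Service
Error 4/30/2015 5:38:06 PM OpsMgr Connector 20100 None
Error 4/30/2015 5:38:06 PM OpsMgr Connector 20100 None
Information 4/30/2015 5:38:06 PM OpsMgr Connector 20062 None
Error 4/30/2015 5:38:06 PM HealthService 2010 Health Service
"""

if __name__ == "__main__":
    print(lost_management_group(parse_events(SAMPLE)))
```

The check is deliberately loose: it only looks for the combination of event IDs and their order, not for the exact error codes in the descriptions.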

So, what happened?

The server was restarted after patch day. The System event log of the server shows that, for some reason (presumably network issues), the server could not contact a Domain Controller for about 10-20 seconds after the reboot.
During this window, the Microsoft Monitoring Agent (MMA) starts and tries to contact AD to retrieve its Management Group information. Because of the temporary network issue this fails, and the agent treats it as a critical failure (see the Error events 2010, 20100 and 2000 at 5:38:06 PM).

Explanation

In this rare situation, the agent does not retry the AD connection later. It retries only once the HealthService is fully initialized, which is not the case here. About 24 hours later (1440 minutes, see the latest Warning event, ID 2120), the agent unloads all Management Group information and then sits idle until the service is restarted.

Possible Workarounds

  • One possible workaround is, of course, to restart the HealthService manually once the OS is fully initialized and connected to AD.
  • Another possible workaround might be to configure the HealthService for delayed start (startup type "Automatic (Delayed Start)").
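
Both workarounds can be scripted. The following is only a rough sketch in Python, not a tested fix: it assumes the agent service is named HealthService, uses the standard Windows commands sc.exe and net, and relies on a caller-supplied ad_reachable probe, which is hypothetical and has to be implemented for your environment (for example, a connection attempt to a Domain Controller).

```python
import subprocess
import time

SERVICE = "HealthService"  # assumed service name of the agent


def delayed_start_command(service=SERVICE):
    """Command that sets the startup type to Automatic (Delayed Start)."""
    # sc.exe expects "start=" and its value as separate tokens.
    return ["sc.exe", "config", service, "start=", "delayed-auto"]


def restart_commands(service=SERVICE):
    """Commands for a plain stop/start cycle of the service."""
    return [["net", "stop", service], ["net", "start", service]]


def restart_when_ad_reachable(ad_reachable, run=subprocess.run,
                              attempts=30, wait_seconds=10, sleep=time.sleep):
    """Poll ad_reachable() and restart the service once it returns True.

    ad_reachable is a hypothetical, caller-supplied function without
    arguments. Returns True if the restart was issued, False if AD never
    became reachable within the given number of attempts.
    """
    for _ in range(attempts):
        if ad_reachable():
            for command in restart_commands():
                run(command, check=True)  # raises if a command fails
            return True
        sleep(wait_seconds)
    return False


if __name__ == "__main__":
    # Only print the command here; running it needs an elevated prompt.
    print(" ".join(delayed_start_command()))
```

Run as a startup task, restart_when_ad_reachable would cover the first workaround, while delayed_start_command covers the second. Neither removes the underlying network delay.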

The best solution, of course, is to find and fix the root cause of the temporary network issues.