Fixing troubled agents


Sometimes agents will not “talk” to the management server upon initial installation, and sometimes an agent can become unhealthy long after working fine.  Keeping agents healthy is an ongoing task in any OpsMgr admin’s life.

This post is NOT an “end to end” manual of all the factors that influence agent health, but that is something I am working on for a later time.  There are so many factors in an agent’s ability to communicate and work as expected.  A few key areas that commonly affect this are:

  • DNS name resolution (Agent to MS, and MS to Agent)
  • DNS domain membership (disjointed)
  • DNS suffix search order
  • Kerberos connectivity
  • Kerberos SPNs accessible
  • Firewalls blocking 5723
  • Firewalls blocking access to AD for authentication
  • Packet loss
  • Invalid or old registry entries
  • Missing registry entries
  • Corrupt registry
  • Default agent action accounts locked down/out (HSLockdown)
  • HealthService Certificate configuration issues.
  • Hotfixes required for OS Compatibility
  • Management Server rejecting the agent
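
Two of the factors above, name resolution from the agent to the management server and firewalls blocking 5723, are easy to rule in or out from the agent machine.  The following is only a minimal sketch in Python, not anything that ships with OpsMgr: the function name check_management_server is made up for illustration, and it tests nothing beyond DNS resolution and a plain TCP connect.

```python
import socket


def check_management_server(host, port=5723, timeout=3.0):
    """Resolve host, then try a TCP connection to the agent port.

    Returns a dict describing what was found. This checks name resolution
    and TCP reachability only; it says nothing about Kerberos, SPNs,
    certificates, or whether the management server accepts the agent.
    """
    result = {"host": host, "port": port, "addresses": [],
              "port_open": False, "error": None}

    # Step 1: DNS (or hosts file) resolution of the management server name.
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["error"] = f"name resolution failed: {exc}"
        return result
    result["addresses"] = sorted({info[4][0] for info in infos})

    # Step 2: plain TCP connect to the port (5723 by default).
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["port_open"] = True
    except OSError as exc:
        result["error"] = f"TCP connect failed: {exc}"
    return result
```

Calling check_management_server("ms01.example.com") (a placeholder server name) returns a dictionary.  An empty addresses list points at DNS; resolved addresses with port_open set to False point at a firewall, or at a management server that is not listening.  A successful connect does not prove that authentication will work.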

 

How do you detect agent issues from the console?  The problem might be that they are not showing up in the console at all!  Perhaps it is a manual install that never shows up in Pending Actions, or a push deployment that stays stuck in Pending Actions and never shows up under “Agent Managed”.  Or even one that does show up under “Agent Managed” but never shows as being monitored, never returns agent version data, etc.

 

One of the BEST things you can do when faced with an agent health issue is to look on the agent, in the OperationsManager event log.  This is a fairly verbose log that will almost always give you a good hint as to the trouble with the agent.  That is ALWAYS one of my first steps in troubleshooting.
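
If you want to pull the newest entries from that log without opening Event Viewer, one option is the built-in wevtutil command.  The sketch below wraps it in Python.  It assumes the log’s channel name is "Operations Manager" (verify this on your agent), and the helper names build_event_query and read_recent_events are made up for illustration.

```python
import subprocess


def build_event_query(log_name="Operations Manager", count=20, computer=None):
    """Build a wevtutil command line that returns the newest events as text.

    log_name is an assumption: check the actual channel name on your agent.
    """
    cmd = ["wevtutil", "qe", log_name,
           f"/c:{count}",   # maximum number of events to return
           "/rd:true",      # reverse direction: newest events first
           "/f:text"]       # plain-text output instead of XML
    if computer:
        cmd.append(f"/r:{computer}")  # query a remote computer
    return cmd


def read_recent_events(log_name="Operations Manager", count=20,
                       computer=None, run=subprocess.run):
    """Run the query and return wevtutil's text output.

    run is injectable so the function can be exercised off a Windows box.
    """
    completed = run(build_event_query(log_name, count, computer),
                    capture_output=True, text=True, check=True)
    return completed.stdout
```

On the agent itself, print(read_recent_events(count=10)) would show the ten most recent events.  Passing computer="agent01" (a placeholder name) queries a remote machine, provided remote event log access is allowed.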

 

Another way of examining agent health is with the built-in views in OpsMgr.  In the console, there is a view located at the following:

 

[image: screenshot showing where the view is located in the console]

 

 

This view is important – because it gives us a perspective of the agent from two different points:

1.  The perspective of the agent monitors running on the agent, measuring its own “health”.

2.  The perspective of the “Health Service Watcher”, which is the agent being monitored from a Management Server.

 

If any of these are red or yellow, that is an excellent place to start.  This should be an area that your level 1 support for Operations Manager checks DAILY.  We should never have a high number of agents that are not green here.  If we do, this is indicative of an unhealthy environment, or of the admin team not adhering to best practices (such as keeping up with hotfixes, using maintenance mode correctly, etc.).

Use Health Explorer on these views – to drill down into exactly what is causing the Agent, or Health Service Watcher state to be unhealthy.

 

Now…. the following are some general steps to take to “fix” broken agents.  These are not in definitive order.  The order of steps really comes down to what you find when looking at the logs after taking these steps.

 

  • Start the HealthService on the agent.  You might find the HealthService is just not running.  This should not be common or systemic.  Consider enabling the recovery for this condition to restart the HealthService on Heartbeat failure.  However – if this is systemic – it is indicative of something causing your HealthService to restart too frequently, or administrators stopping SCOM.  Look in the OpsMgr event log for verification.

 

  • Bounce the HealthService on the agent.  Sometimes this is all that is needed to resolve an agent issue.  Look in the OpsMgr event log after a HealthService restart, to make sure it is clean with no errors.

 

  • Clear the HealthService queue and config (manually).  This is done by stopping the HealthService, deleting the “\Program Files\System Center Operations Manager 2007\Health Service State” folder, and then starting the HealthService again.  This removes the agent config file and the agent queue files.  The agent starts up with no configuration, so it will resort to the registry to determine which management server to talk to: either it is AD integrated, or it has a fixed management server assigned.  This is stored under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Agent Management Groups\PROD1\Parent Health Services\ (PROD1 being the management group name in this example), in the \<#>\NetworkName string value.  The agent will contact the management server, request config, receive config, download the appropriate management packs, apply them, run the discoveries, send up discovery data, and repeat the cycle for a little while.  This is very much what happens on a new agent during initial deployment.

 

  • Clear the HealthService queue and config (from the console).  When looking at the above view (or any state view or discovered inventory view which targets the HealthService or Agent class) there is a task in the actions pane: “Flush Health Service State and Cache”.  This performs a very similar action to the manual method, as a console task.  It will only work on an agent that is somewhat responsive; if it does not work, the agent is too broken to communicate with the management server and you need to perform the flush manually.  The task will never complete and will not return success, because the task cuts itself off as the queue is flushed.

 

  • “Repair” the agent from the console.  This is done from the Administration pane – Agent Managed.  You should not run a repair on any AD-integrated agent – as this will break the AD integration and assign it to the management server that ran the repair action.  A “repair” technically just reinstalls the agent in a push fashion, just like an initial agent deployment.  It will also apply/reapply any agent related hotfixes in the management server’s \Program Files\System Center Operations Manager 2007\AgentManagement\ directories.

 

  • Reinstall the agent (manually).  This would be for manual installs or when push/repair is not possible.  This section is where the combination of options gets a little tricky.  When you are at this point… where you have given up, I find just going all the way with a brute force reinstall is the best way.  This means performing the following steps:
    • Uninstall the agent via add/remove programs.
    • Run the Operations Manager Cleanup Tool CleanMom.exe or CleanMOM64.exe.  This is designed to make sure that the service, files, and all registry entries are removed.
    • Ensure that the agent’s folder is removed at:  \Program Files\System Center Operations Manager 2007\
    • Ensure that the following registry keys are deleted:
      • HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager
      • HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HealthService
    • Reboot the agent machine (if possible)
    • Delete the agent from Agent Managed in the OpsMgr console.  This will allow a new HealthService ID to be detected and is sometimes a required step to get an agent to work properly, although not always required.
    • Now that the agent is gone cleanly from both the OpsMgr console and the agent operating system, manually reinstall the agent.  Keep it simple: install it using a named management server/management group, and use Local System for the agent action account (this removes any common issues with a low priv domain account, and with AD integration if used).  If it works correctly, you can always reinstall again using low priv or AD integration.
    • Remember to import certificates at this point if you are using those on the individual agent.
    • As always – look in the OperationsManager event log…. this will tell you if it connected, and is working, or if there is a connectivity issue.
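
The manual “clear the HealthService queue and config” step above (stop the service, delete the Health Service State folder, start the service) is easy to script when you have more than a handful of agents to fix.  Here is a minimal sketch in Python, assuming it runs locally on the agent with administrative rights.  The function name flush_health_service_state is made up for illustration, and the folder path is passed in by the caller because the install drive and directory vary.

```python
import shutil
import subprocess
from pathlib import Path


def flush_health_service_state(state_dir, run=subprocess.run):
    """Stop HealthService, delete the state folder, start HealthService.

    state_dir is the full path to the "Health Service State" folder.
    run is injectable so the sequence can be exercised without touching
    a real service.
    """
    state_path = Path(state_dir)
    if state_path.name != "Health Service State":
        # Guard against deleting the wrong directory by mistake.
        raise ValueError(f"refusing to delete unexpected folder: {state_path}")

    # 'net stop' can return a non-zero exit code when the service is
    # already stopped, so a failure here is not treated as fatal.
    run(["net", "stop", "HealthService"], check=False)
    try:
        if state_path.exists():
            # Removes the agent config file and the agent queue files.
            shutil.rmtree(state_path)
    finally:
        # Always try to bring the service back, even if the delete failed.
        run(["net", "start", "HealthService"], check=True)
```

On a default install to the C: drive (an assumption, so check your own path) the call would be flush_health_service_state(r"C:\Program Files\System Center Operations Manager 2007\Health Service State").  Afterwards, look in the OpsMgr event log to confirm the agent requested and received its config.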

 

To summarize, there are many things that can cause an agent issue, and many methods to troubleshoot.  At a very general level, my typical steps are:

  1. Review OpsMgr event log on agent
  2. Bounce HealthService
  3. Bounce HealthService while clearing the \Health Service State folder.
  4. Complete brute force reinstall of the agent.

If an external issue is the cause (DNS, Kerberos, Firewall), then these steps likely will not help you, but evidence of those issues should be available in the OpsMgr event log.

 

Also – make sure you see my other posts on agent health and troubleshooting during deployment:

Console based Agent Deployment Troubleshooting table

Agent discovery and push troubleshooting in OpsMgr 2007

Getting lots of Script Failed To Run alerts- WMI Probe Failed Execution- Backward Compatibility

Agent Pending Actions can get out of synch between the Console, and the database

Which hotfixes should I apply?

Comments (19)

  1. Kevin Holman says:

    We really try hard to come up with ways to solve the problem without resorting to editing a SQL table directly.  Doing so is really unsupported and should only be done under the direct guidance (or should I say order) of PSS in a case with Microsoft.  There are a few circumstances where that seems to be the only recourse, but we should exhaust all other options first.

  2. Anonymous says:

    Useful information Kevin,

    I'm also in a situation where 1-2 agents are not healthy. While checking, I found the config.xml file is not updated even though I cleared the cache, and I even let the system recreate the Health Service State folder, but that also failed to update the xml file. I've also noticed Temp folders are not getting created on these agents. I've reinstalled the agent as well. The agent goes into a greyed state after a while, even if I restart the service.

    In the event log I see a lot of entries such as: Rule/Monitor "Microsoft.SystemCenter.LearningModule.FailedInitialization.Alert" cannot be initialized and will not be loaded, and many more similar to this.

    Any help is appreciated.

  3. Sameer Dave says:

    That's a very good article, Kevin.

    I have seen one more problem, where agents are hung in one of the SQL tables, especially during new installations.

    I have seen that once you delete that information from the tables, then you could install the agent again fine.

    Thanks for the great article once again

  4. TechJet 2010 says:

    Thanks Kevin, very good blog.  Can you provide any additional advice or reasons why an agent health turns grey, we get this a lot ?  the agents are multi-homed, could this be impacting ?

  5. DJ says:

    This Blog is very useful.  As to other potential issues with grey agents, check out this kb  support.microsoft.com/…/2288515.

  6. Muhammad Saad says:

    Simply Log on to DC and run the following commands

    1. hslockdown /L

    you will see NT AUTHORITY\SYSTEM is in denied state

    Then run the command to bring it in allowed state

    hslockdown /A "NT AUTHORITY\SYSTEM"

    Cheers

    Saad

  7. Coolz203 says:

    hi Kevin,

    Great article.  I have used this advice a few times to help with agent issues.  However, I have come across an issue I am having a hard time with.  I have an agent deployed and the agent is showing healthy in the Agent State view.  This particular agent is on a Windows 2008 R2 server.  For some reason the discovery of this Windows 2008 server is not working.  I have other Windows 2008 servers that are working fine.  The agent knows enough that it is on a Windows server, but all of the OS-specific monitors are not active.  The logs show nothing.  I am at a loss here.  I have cleared the cache and repaired the agent.  Any help is appreciated.  Thanks.

  8. jayson says:

    Kevin,

    My problem lies on the Root Management Server. Absolutely everything is running with no issues, but for some reason the server is greyed out. I can restart the service and it is okay for a few minutes, then it goes right back into the greyed out status. Operationally everything functions correctly, but it just never looks good to see the RMS greyed out. Any ideas?

  9. shahar says:

    After upgrading System Center Essentials 2007 with the latest OS Management Pack, the owner’s agent of the Hyper-V cluster became grayed out.

    If I change the cluster current host server to other server, it becomes grayed out and the previous one (which was the current host server before) becomes healthy again.

  10. Hemant says:

    How to Flush the Health Service State and Cache on multiple machines at a time? any command line utility available?

  11. zahurulislam says:

    It will also apply/reapply any agent related hotfixes in the management server’s \Program Files\System Center Operations Manager 2007\AgentManagement\ directories.

  13. charlie says:

    "The problem might be that they are not showing up in the console at all!" Any suggestions for diagnosing this problem? This particular agent also has no "OperationsManager" event log. A few of the other logs are there, but many that I typically see in a client are missing. This was a manual installation of the ccm client.

  14. charlis says:

    I have this problem and I have tried all the steps that I know, but the problem still exists. Can anyone help me? I am using SCOM 2012:
    "The System Center Management Health Service 1EC09CB7-1B1E-EAC9-D15A-D2C927046DE2 running on host xxx-xx-xxx.Root.net and serving management group with id {0407FB6F-896A-7389-EA01-D60C72ABBD5A} is not healthy. Some system rules failed to load."

  15. Dominique says:

    Hello Kevin,

    Excellent article for the machine, but do you have something similar for Virtual Machine working through collectors and not reporting…

    Thanks,
    Dom

  16. khalid khan says:

    Helpful tips.
    I have an issue. I have SCOM 2012 SP1. When I check the Windows Server computer group, it shows me Windows 7 computers as well.
    When I create a new group for servers, it also shows Windows 7 computers mixed in with the Windows servers.

  17. Nirmal says:

    Hi Kevin,

    You have been an inspiration since I started working on SCOM. Your blog has helped me a lot. Thanks a ton!

    I have been encountering an issue in my environment. We have a domain say ‘A’ on which we have our MS and we have a gateway server on our another domain B which has trust relationship with domain ‘A’. We have an agent in domain ‘C’ which has two way trust with domain ‘B’.

    I tried to install the agent on the server in domain ‘C’ and make it communicate with the gateway in domain ‘B’.
    The agent is not communicating with the gateway; we see event 20002 on the gateway and event 20070 on the agent machine.
    It cannot be authenticated and is getting rejected.

    Could you please help me on this issue ?

    1. Kevin Holman says:

      Just because you have a trust between B and C, does not mean Kerberos is supported. If agent in C is rejected by GW in B, it most likely means you need to use certificates between C and B.