Agents that never connect to management server


 

Was working with a customer on this issue:

The agent would install correctly. A push install worked (although it took forever), and a manual installation made the agent show up in pending, but after approval it would never communicate with a management server.

The logs on the management server didn’t show anything interesting.

The agent was logging this specific event – with the unique part highlighted:

Log Name:      Operations Manager
Source:        OpsMgr Connector
Date:          10/27/2014 10:07:37 AM
Event ID:      20071
Computer:      foo.contoso.com
Description:
The OpsMgr Connector connected to MS1.contoso.com, but the connection was closed immediately without authentication taking place.  The most likely cause of this error is a failure to authenticate either this agent or the server .  Check the event log on the server and on the agent for events which indicate a failure to authenticate.

Normally, we see the agent getting “rejected” by the management server.  In this case, the management server just didn’t respond.  We ran a verbose ETL trace of the agent, and captured an agent startup, which includes the attempt to communicate with the primary assigned MS:

[MOMChannel] [] [Information] :MOMChannel::ChannelTimeoutManagerImpl::OnTimerCallback{ChannelTimeoutManager_cpp117}Channel has timed out after 1498ms
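
If the trace has been formatted to plain text, a short script can pull out every channel timeout so you can see how often it happens and how long each attempt waited. This is only a sketch: it assumes the message text matches the single line quoted above, and the function name is made up for illustration.

```python
import re

# Matches the timeout message from the trace line quoted above.
# Assumption: other timeout entries in the trace use the same wording.
TIMEOUT_RE = re.compile(r"Channel has timed out after (\d+)ms")


def channel_timeouts_ms(trace_text):
    """Return every channel timeout duration (in ms) found in the trace text."""
    return [int(m.group(1)) for m in TIMEOUT_RE.finditer(trace_text)]


sample = (
    "[MOMChannel] [] [Information] :MOMChannel::ChannelTimeoutManagerImpl::"
    "OnTimerCallback{ChannelTimeoutManager_cpp117}"
    "Channel has timed out after 1498ms"
)
print(channel_timeouts_ms(sample))  # [1498]
```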

There are a few possibilities.

First, there was a fix put in UR3 for SCOM 2012 R2 that changed some of the default communication timeouts from 1 second to 20 seconds.  This helps resolve issues when agents are a long distance away, network-wise, and Kerberos authentication takes a long time.  So my first recommendation would be to apply UR3 to both management servers and agents and attempt a repro.

However, this was not the case for us.  These were in the same datacenter, on the same subnet even!

To rule out a network issue, we tried to copy a large zipped file across the network.  The copy took a very long time and then failed.

Next, we performed a ping test:

ping servername -t -l 65500

The -l (lowercase L) switch in ping controls the size of the packet sent, and we saw the server either return extraordinarily long ping times or time out altogether.  This all points to a failure in the network card.  Sure enough, this was a physical server and not a VM.  As reliable as today’s hardware is, you just can’t rule out an old-school issue like this.
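
If you save the ping output to a file, a quick script can summarize it instead of eyeballing the scroll. The sketch below assumes the English-language Windows ping output format ("Reply from ... time=NNms" and "Request timed out."). The function name, the address, and the numbers in the sample are invented for illustration and are not from the customer's server.

```python
import re

# Windows ping prints "time=12ms" or "time<1ms" on each reply line.
TIME_RE = re.compile(r"time[=<](\d+)ms")


def summarize_ping(output):
    """Count replies and timeouts and report the worst and average reply time."""
    times = []
    timeouts = 0
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Reply from"):
            m = TIME_RE.search(line)
            if m:  # skips replies with no time, e.g. "Destination host unreachable."
                times.append(int(m.group(1)))
        elif line.startswith("Request timed out"):
            timeouts += 1
    return {
        "replies": len(times),
        "timeouts": timeouts,
        "max_ms": max(times) if times else None,
        "avg_ms": sum(times) / len(times) if times else None,
    }


# Made-up sample output, for illustration only.
sample = """\
Reply from 10.0.0.5: bytes=65500 time=842ms TTL=128
Request timed out.
Reply from 10.0.0.5: bytes=65500 time=1210ms TTL=128
Request timed out.
"""
print(summarize_ping(sample))
```

On a healthy server in the same subnet you would expect no timeouts and reply times of a few milliseconds at most, so a high timeout count or a large maximum is the signal to look at the hardware.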


Comments (3)

  1. Taha Ansari says:

    Hi Kevin, first of all thanks for writing on this issue, as I was not able to find anything on it.

    We have faced this issue 3 to 4 times in our environment, but most of the agents were on VMs and the agents on its core server were working fine.

    I was not able to resolve this issue. I tried everything from renaming the Health Service State folder to restarting the server.
    In the end it would resolve on its own after 5-6 days.

  2. Rajul says:

    Thanks Kevin for this info. Really helped us in resolving the mystery.

  3. Leaven Hoo says:

    Hi Kevin! I recently installed a SCOM server in a domain successfully. But when I use a domain admin account to push agents to client machines, I also get this error: Event ID 20071. The SCOM console keeps showing the agents as installing, and I can only click the “Reject” and “Copy” buttons (though the agents are not actually rejected); the other buttons are grayed out. When I checked the clients, I found the agents had all installed successfully. I use the Chinese version; you can see the details of my issue at http://partnersupport.microsoft.com/thread/104e466e-7555-4b33-9227-d178308c7b5f . I have tried many things for weeks with no success. Could you help me?
