Receiving alerts "The agent was unable to send data to the MOM Server at {MOM ServerFQDN}. The error code is 10054, An existing connection was forcibly closed by the remote host"?

If you're receiving the alert "The agent was unable to send data to the MOM Server at {MOM ServerFQDN}. The error code is 10054, An existing connection was forcibly closed by the remote host", this is typical of MOM agents connected to their management server over a slow link.

Although more commonly received in MOM 2005 (pre-SP1) due to https://support.microsoft.com/?id=885416 (MOM 2005 agent logs an "existing connection was forcibly closed by the remote host" event), this can and will still be received on MOM 2005 Service Pack 1 when agents attempt to communicate with a management server over a slow link.

The hotfix introduced in KB885416 is a management server fix which increases the timeout after which a management server will disconnect an agent's active session if there is no communication between the two. Applying the hotfix changes the value from 1 second to 15 seconds. The fix was rolled into Service Pack 1, which is why this is less of an issue for those running SP1.

Here comes the however :)

If an agent is over a particularly slow link, 15 seconds may simply not be enough before the management server times out an active session with that agent. The typical case is when a rule update payload is fairly large (i.e. you have many management packs installed and the agent is a member of a large number of computer groups). In this case, the agent can initiate the session, attempt to download rules (which I have seen take 30-40 minutes over some very slow links) and still not have indicated success by the time the management server cancels the session.

The hotfix for KB885416, and subsequently MOM 2005 Service Pack 1, introduced a registry value (it does NOT exist by default):

HKEY_LOCAL_MACHINE\Software\Mission Critical Software\OnePoint\Configurations\<configuration group name>\Operations\Consolidator\ServerIOTimeoutMS:REG_DWORD

By adding this value on any management server that manages remote agents, you can overcome these errors.
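
As a rough illustration, here is a minimal Python sketch that builds the `reg add` command line for this value. The configuration group name "ExampleGroup" and the helper names are hypothetical, and the sketch only constructs the arguments rather than touching the registry, so treat it as a starting point and check the path against your own management server before running anything.

```python
# Registry path as given in the post; the configuration group name is
# specific to your environment ("ExampleGroup" below is a made-up name).
BASE = r"HKLM\Software\Mission Critical Software\OnePoint\Configurations"
VALUE_NAME = "ServerIOTimeoutMS"


def consolidator_key(config_group):
    """Return the Consolidator key path for one configuration group."""
    return "\\".join([BASE, config_group, "Operations", "Consolidator"])


def reg_add_args(config_group, timeout_ms):
    """Build the argument list for `reg add` (hypothetical helper).

    The value is a REG_DWORD, so it must fit in an unsigned 32-bit integer.
    """
    if not 0 < timeout_ms <= 0xFFFFFFFF:
        raise ValueError("timeout_ms must fit in a REG_DWORD")
    return [
        "reg", "add", consolidator_key(config_group),
        "/v", VALUE_NAME,
        "/t", "REG_DWORD",
        "/d", str(timeout_ms),  # reg.exe takes the data as a decimal string here
        "/f",                   # overwrite without prompting
    ]


if __name__ == "__main__":
    # Print the arguments for review instead of executing them.
    print(reg_add_args("ExampleGroup", 60000))
```

On the management server itself, the returned list could be passed to `subprocess.run(...)` from an elevated prompt; printing it first makes it easy to check the path before changing anything.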

So, what value to use?

Well, the value is in milliseconds. Remember that the default, when the key doesn't exist, is now 15000 (dec), i.e. 15 seconds, in MOM 2005 Service Pack 1. As the KB article implies, it's recommended to keep the value as low as possible while still maintaining successful communication. This is where you need to tune the value. I personally recommend increasing it in 60000 (dec) increments, i.e. one minute at a time. However, for very slow links you may decide to approach from the other end, e.g. starting at 1800000 (30 minutes) and working down.
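
To make the arithmetic concrete, here is a small Python sketch that lists candidate values in one-minute steps and converts a millisecond value into minutes and seconds. The function names are hypothetical, and it assumes the ladder starts at 60000 and tops out at the 1800000 example above; neither bound is a MOM-imposed limit.

```python
STEP_MS = 60_000        # one-minute increment suggested in the post
CEILING_MS = 1_800_000  # the 30-minute example for very slow links


def tuning_ladder(step_ms=STEP_MS, ceiling_ms=CEILING_MS, from_top=False):
    """Return candidate ServerIOTimeoutMS values to try, in order.

    from_top=True reverses the list, for approaching "from the other end".
    """
    values = list(range(step_ms, ceiling_ms + 1, step_ms))
    return values[::-1] if from_top else values


def describe(ms):
    """Render a millisecond value as minutes and seconds for sanity checking."""
    minutes, rem_ms = divmod(ms, 60_000)
    return f"{ms} ms = {minutes} min {rem_ms / 1000:g} s"


if __name__ == "__main__":
    for value in tuning_ladder(ceiling_ms=300_000):
        print(describe(value))
```

Working up from the bottom keeps the value as low as possible; passing `from_top=True` gives the same values in descending order for links you already know to be very slow.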

Having a larger value extends the timeout so that configuration and rule updates can complete successfully. Keeping it within sensible limits is of course advised: this limits the impact of occurrences where a MOM agent initiates a session and then dies, leaving the session open until ServerIOTimeoutMS expires.