LCS - Hardware Load Balancers and time out values

Customers often ask how to configure timeout values for LCS and their
Hardware Load Balancers. There are two partner load balancer solutions, and
we have documentation for both at the https://office.microsoft.com/livecomm site.

Some customers read the following blurb in the Planning Guide and end up
with questions.

Must provide a configurable TCP idle-timeout interval with a maximum value
greater than or equal to the minimum of the REGISTER refresh / SIP Keep-Alive
interval. Attribute: msRTCSIP-DefRegistrationTimeout

The customer question in this internal email discussion was:

The concern
I have here is that when a failover takes place it typically takes 90 seconds.
Sometimes it's a little faster sometimes a little slower. After talking to our
Microsoft rep he said that failover should be fairly seamless to the user. Our
load balancer (BigIP) has a TCP(5060) monitor set up to check every 5 seconds.
I have noticed during testing that the load balancer does detect the server is
unavailable within that time but it appears the Messenger client doesn't
failover until a much later time. The default reg expiry value is set to 600.
If I set the load balancer to that it will only check every 10 minutes to make
sure the server(service) is up. Do you have any thoughts on what needs to be
changed? Is 90 seconds accurate or should failover be much more seamless?



<Product Group member responses>



The planning guide recommends adjusting the TCP idle-timeout interval on the
load balancer based on the default registration expiry setting. Adjusting
the default registration expiry based on the load balancer setting is not
recommended.


The LB setting you mentioned is the heartbeat interval between the LB and
the front-end. It is fine for it to be 5 seconds. The corresponding blurb from
the planning guide is:

The Load Balancer must be able to detect Live Communications Server
availability by establishing TCP connections to ports 5060, 5061 or both
(often called a ‘heartbeat’ or ‘monitor’). The polling interval must be a
configurable value, with a minimum value of at least five seconds. The Load
Balancer must not select a Live Communications Server that shuts down until a
successful TCP connection (heartbeat) can be established again.
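
To make the heartbeat idea concrete, here is a minimal Python sketch of the kind of check a monitor performs. It is not how BigIP or any other load balancer is implemented; the function names, the 2-second connect timeout, and the `pool_state` dictionary are assumptions made for illustration only.

```python
import socket


def is_server_up(host: str, port: int = 5060, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s.

    Hypothetical helper: a plain TCP connect stands in for the 'heartbeat'.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat the server as down.
        return False


def record_heartbeat(pool_state: dict, host: str, port: int = 5060) -> None:
    """Store the latest heartbeat result for one front-end (hypothetical)."""
    pool_state[host] = is_server_up(host, port)


def eligible_servers(pool_state: dict) -> list:
    """Only servers whose last heartbeat succeeded may receive new clients."""
    return sorted(host for host, up in pool_state.items() if up)
```

A monitor would call `record_heartbeat` for each front-end on every polling interval (for example every 5 seconds) and hand new connections only to `eligible_servers`.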

The other LB setting is the TCP idle-timeout, which must be configured
according to the following. This is not related to the heartbeat interval
mentioned above. The Load Balancer must provide a configurable TCP
idle-timeout interval with a maximum value greater than or equal to the
minimum of the REGISTER refresh / SIP Keep-Alive interval.

Yes, failover will be fairly seamless to the user, and it is normal for it to
take about 90 seconds. The client has built-in randomization for sign-in
retries. This avoids stressful spikes in server load when thousands of
clients are connected to the server, which might adversely affect the client
experience. It is OK for the TCP monitor setting to be set at 5 seconds, as
this will help the LB mark the server down more quickly. No new clients will
be load balanced to this server. Existing clients, on the other hand, will
use the built-in keep-alive mechanisms and retry randomization to log back in
seamlessly to another server.
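
The effect of randomized retries can be sketched in a few lines of Python. This is not the Messenger client's actual algorithm, which is not described here; the function name and the 30-second base and 60-second spread are made-up values chosen only so that the delays land in the same range as the roughly 90 seconds discussed above.

```python
import random


def retry_delay_s(base_s: float = 30.0, spread_s: float = 60.0,
                  rng: random.Random = None) -> float:
    """Pick a sign-in retry delay between base_s and base_s + spread_s.

    Hypothetical values: each client draws its own delay, so reconnects
    are spread over the window instead of arriving at the same instant.
    """
    rng = rng or random.Random()
    return base_s + rng.uniform(0.0, spread_s)


if __name__ == "__main__":
    # Five simulated clients each choose a different moment to retry.
    rng = random.Random()
    for client_id in range(5):
        print(f"client {client_id} retries after {retry_delay_s(rng=rng):.1f}s")
```

With thousands of clients, the surviving servers see sign-ins spread across the window rather than all at once.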


For every TCP connection, the load balancer maintains state associating that
connection with a particular target server. This state has an associated
timer that tracks how long the connection has been idle (a.k.a. inactive).
The limit on that timer is the TCP idle-timeout interval. If this setting is
smaller than the REGISTER refresh interval or the SIP Keep-Alive interval
(the SIP Keep-Alive interval is fixed at 5 minutes), then the load balancer's
TCP idle-timeout will expire and the load balancer will reap the connection,
removing its state. A subsequent data packet from the client will fail, with
the load balancer indicating that the connection was closed. The client then
has to retry and establish a new connection. This is expensive for the server
and will cause intermittent failures for the client during the retry period.
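
As a worked example of the rule quoted from the planning guide, the short Python sketch below checks a proposed idle-timeout against the two intervals. The function and constant names are invented for illustration; the 600-second registration expiry and the 5-minute SIP Keep-Alive are the values mentioned in this post.

```python
DEFAULT_REGISTER_REFRESH_S = 600  # default registration expiry from the question
SIP_KEEPALIVE_S = 300             # fixed at 5 minutes, per the explanation above


def required_idle_timeout_s(register_refresh_s: int = DEFAULT_REGISTER_REFRESH_S,
                            keepalive_s: int = SIP_KEEPALIVE_S) -> int:
    """Smallest idle-timeout that satisfies the quoted planning guide wording:
    greater than or equal to the minimum of the two intervals."""
    return min(register_refresh_s, keepalive_s)


def idle_timeout_is_safe(idle_timeout_s: int,
                         register_refresh_s: int = DEFAULT_REGISTER_REFRESH_S,
                         keepalive_s: int = SIP_KEEPALIVE_S) -> bool:
    """True if the load balancer would not reap a healthy, quiet connection."""
    return idle_timeout_s >= required_idle_timeout_s(register_refresh_s, keepalive_s)


if __name__ == "__main__":
    for candidate in (5, 300, 600):
        print(candidate, idle_timeout_is_safe(candidate))
```

Under these assumptions a 5-second idle-timeout fails the check, while 300 or 600 seconds passes. The 5-second value belongs on the heartbeat monitor, not on the idle-timeout.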


I hope this
provides some helpful background on the timeout values and their
relationship. One of our challenges in support is that not all vendors
have specific information on the configuration of their solution and
they use different/proprietary terms. We will do our best to help as
always.

Toml LCSKid