High Availability and Client Retry

There hasn't been much discussion about high availability of the server and the failover capabilities of the client, but it's something that I think is pretty cool, and it JUST WORKS. I've tested it, it works well, and the Dev team deserves kudos for it. So let's blog about the Office Communicator client retry logic as it relates to the Live Communications Server 2005 high availability feature. To start this discussion off, let me explain the behavior a user can expect from the client's standpoint when a front end server in an Enterprise Edition pool becomes unavailable, whether because of a reboot, deactivation of the server, a power failure on the box, loss of network connectivity to that machine, or any other event that causes Office Communicator clients to detect that the server they are connected to has become unavailable.

The behavior I describe below does not cover cases where CPU or RAM usage climbs very high on one front end server, where the server process (rtcsrv.exe) crashes, or where the client process (communicator.exe) crashes and is re-launched and registers again. In short, clients use their built-in keep-alive mechanism, retry logic, and reconnection randomization to detect when their front end server goes down and to connect seamlessly to another front end server.

This is called server "failover". It is "fairly seamless" to the end user and normally takes about 90 seconds to complete; "fairly" seamless because some IM messages may have to be sent again manually. While the client's connection retry logic is running, beginning when the client senses the server is down via a whitespace keep-alive mechanism, messaging is interrupted, but the IM window indicates which messages could not be delivered. During that time the hardware load balancer marks the front end server as unavailable, and presence updates cease from client endpoints while they retry a quick connect to the same front end, attempt to reconnect, and finally reconnect to a different front end server.
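
To make the detection step a little more concrete, here is a minimal Python sketch of how a whitespace keep-alive can work. It is illustrative only: the actual client sends its keep-alives over its existing SIP connection, and the 30-second interval, 90-second timeout, and socket-level details below are assumptions, not the product's real values.

```python
import socket
import time

# A minimal sketch of the idea, not the actual Communicator implementation:
# send a "whitespace" (CRLF) keep-alive over the connection and treat a
# failed send, a closed socket, or a long silence as a sign that the front
# end server is down. The interval and timeout values are assumptions.
KEEPALIVE_INTERVAL = 30   # seconds between CRLF keep-alives (illustrative)
SERVER_TIMEOUT = 90       # declare the server down after this long (illustrative)

def watch_connection(sock: socket.socket) -> None:
    """Raise ConnectionError when the front end server looks unreachable."""
    last_seen = time.monotonic()
    sock.settimeout(KEEPALIVE_INTERVAL)
    while True:
        try:
            sock.sendall(b"\r\n\r\n")      # whitespace keep-alive
            data = sock.recv(4096)         # wait up to the interval for any traffic
        except socket.timeout:
            data = None                    # silence this interval; keep counting
        except OSError:
            raise ConnectionError("front end unreachable, start failover")
        if data == b"":
            # an empty read means the server closed the connection
            raise ConnectionError("connection closed, start failover")
        if data:
            last_seen = time.monotonic()
        if time.monotonic() - last_seen > SERVER_TIMEOUT:
            raise ConnectionError("keep-alive timed out, start failover")
```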

Once an available server is detected, the client applies a built-in randomization to its sign-in retry to avoid a spike of load on the server when thousands of clients reconnect to the pool at the same time after a front end goes down. Such a spike would be especially harmful if the Enterprise Edition pool were already operating near its planned maximum capacity or at peak load. Without this randomized ("retry-after") throttling logic, the user experience would suffer whenever many clients reconnect to the EE pool at once.
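
Here is a small Python sketch of the general idea behind that randomization. It is not the product's actual algorithm (the function names, the jittered exponential backoff, and the delay values are all assumptions); it just shows how spreading retries over a random window avoids a reconnect spike.

```python
import random
import time

# Illustrative sketch of a randomized ("retry-after" style) sign-in retry:
# each client waits a random, growing delay before its next reconnect attempt
# so a recovering pool isn't hit by thousands of simultaneous sign-ins.
BASE_DELAY = 5.0      # seconds before the first retry (assumed)
MAX_DELAY = 120.0     # upper bound on any single wait (assumed)

def retry_sign_in(try_connect, max_attempts: int = 10) -> bool:
    """try_connect() is any callable that returns True once sign-in succeeds."""
    for attempt in range(max_attempts):
        if try_connect():
            return True
        # Exponential backoff with full jitter spreads the reconnect storm
        # over time instead of concentrating it into one spike of load.
        delay = random.uniform(0, min(MAX_DELAY, BASE_DELAY * (2 ** attempt)))
        time.sleep(delay)
    return False
```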

Once failover is complete and the client is reconnected to another available front end server in the pool, messaging can continue in the same IM conversation window already open between the two users, and presence updates resume. For multi-party IM conversations, which by design span multiple front end servers, the IM sessions are torn down and new conversation windows are opened after failover.

The bottom line, and the KILLER aspect of this feature, is this: as long as failover is working properly and at least one server in the pool stays up, clients should never be signed off and forced to sign in again manually. It feels like this: you're typing in a window, you get a few messages back saying "can't deliver", then those errors stop and the other person starts getting your IMs again. Other users see your presence freeze and can't send you messages while they see those errors, and after a minute or so your presence starts working again and you can receive IMs.

- Stu Osborn

Program Manager