This is the first in a series of posts on the performance of Live Communications Server. Performance is a complex topic, so I thought I would step back and describe some of the basic principles behind what we do in the product group to analyze and optimize the performance of the server. I’ll start by describing what’s unique about performance for LCS and what makes it a tough problem to solve.
LCS is a real-time communications and collaboration server. Real-time in this sense means that the latency of delivering a piece of information is almost as important as the information itself. It’s called instant messaging for a reason, after all. If it takes the server over a minute to deliver an IM, then what was once a productivity enhancer becomes a productivity drain. Sort of like watching an interview conducted over a slow satellite link. The communications aspect is important as well. The fundamental job of the server is to deliver a packet from one user or device to one or more other users or devices. Most of what the server does is react to a request from a client and inform other clients about that request. This is different, of course, from a typical web server application, which responds to a request based on some backend business logic. And it’s different from the store-and-forward problem that a mail server such as Exchange deals with.
Presence is Hard
By far the most difficult job of the server is handling presence. The reason this is tough becomes obvious once you think about it. The server has to take a presence state change from one client (say I change my presence to “Out to Lunch”) and then turn around and inform everyone watching that client’s presence about the change. One request from the client has now turned into potentially a hundred messages that must be fanned out by the server. This amplification effect of presence is the primary performance problem we work to address. Latency is critical here as well. If the server takes 15 minutes to let you know that someone else is now offline, that presence information becomes far less meaningful.
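To make the amplification effect concrete, here is a minimal sketch of the fan-out pattern. The class and method names (`PresenceServer`, `subscribe`, `publish`) are purely illustrative, not LCS internals: the point is just that one inbound request produces as many outbound notifications as there are watchers.

```python
# Illustrative sketch of presence fan-out (not LCS code): one state
# change from a user is turned into a notification for every watcher.

class PresenceServer:
    def __init__(self):
        # watcher sets keyed by the user being watched
        self.watchers = {}

    def subscribe(self, watcher, target):
        """Register `watcher` as interested in `target`'s presence."""
        self.watchers.setdefault(target, set()).add(watcher)

    def publish(self, user, state):
        """One inbound update fans out into len(watchers) outbound messages."""
        return [(w, user, state) for w in self.watchers.get(user, ())]

server = PresenceServer()
for i in range(100):
    server.subscribe(f"watcher-{i}", "sean")

# A single "Out to Lunch" update becomes 100 outbound notifications.
notifications = server.publish("sean", "Out to Lunch")
print(len(notifications))  # → 100
```

The server-side cost of a presence change is therefore driven by the size of the watcher list, not by the size of the request itself.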
Scaling to Thousands of Users
If the server only had to handle requests for a single client, it wouldn’t be much of a problem, would it? A growing trend in IT orgs everywhere is server consolidation. You want to be able to support more users with fewer servers. This translates into us needing to support tens of thousands of users in a single server cluster. Take any operation that a single client might perform, then multiply it by 100,000. Suddenly something as simple as sending an IM between two users becomes much more demanding.
To summarize, the server has to take a message from a client, process it, potentially fan it out to hundreds of other clients, and repeat that for 100,000 users on a pool. Oh, and it has to do all that in real-time. Not an easy task.
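A quick back-of-envelope calculation shows why the combination of fan-out and user count is what hurts. All of the numbers below are assumptions for illustration (changes per hour, average watcher count), not measured LCS figures:

```python
# Back-of-envelope load estimate under assumed, illustrative numbers.
users = 100_000
changes_per_user_per_hour = 4   # assumption: presence changes per user
avg_watchers = 100              # assumption: average watchers per user

# Inbound presence updates arriving at the pool per second.
inbound_per_sec = users * changes_per_user_per_hour / 3600

# Each inbound update fans out to every watcher.
outbound_per_sec = inbound_per_sec * avg_watchers

print(round(inbound_per_sec, 1))  # → 111.1 inbound updates/sec
print(round(outbound_per_sec))    # → 11111 outbound notifications/sec
```

Even a modest per-user rate turns into a five-figure outbound message rate once the fan-out multiplier is applied, and that is before counting IM traffic itself.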
Performance on LCS is defined by several key metrics:
· How many users per server can we support?
· How many messages/second can that server support?
· What happens to CPU, memory, and network bandwidth under load?
· Can the pool recover when a server fails or becomes temporarily overloaded?
· What percentage of messages are failing?
Since the server is all about communications, performance ultimately comes down to how fast we can turn incoming messages around into outgoing responses, and what kind of sustained message rate we can support while still leaving enough headroom to handle sudden spikes in traffic. While testing performance, we look carefully at each of these metrics.
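The headroom idea can be sketched in a few lines. The capacity number and headroom factor below are invented for illustration, not measured LCS figures; the point is that the sustained rate you plan for has to sit well below the peak rate the server can actually process:

```python
# Illustrative headroom calculation (assumed numbers, not measurements).
peak_capacity_mps = 10_000   # assumption: max messages/sec the server can turn around
utilization_target = 0.6     # assumption: plan to run at 60% of peak

sustained_target_mps = peak_capacity_mps * utilization_target
spike_budget_mps = peak_capacity_mps - sustained_target_mps

print(int(sustained_target_mps))  # → 6000 messages/sec sustained
print(int(spike_budget_mps))      # → 4000 messages/sec spare for spikes
```

Run the sustained rate too close to peak and any burst of traffic, or the loss of one server in the pool, pushes the rest over the edge.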
Next time I’ll go into more depth about how we use the SIP protocol and the performance implications of doing so. I’ll also give some more insight into our 2-tier architecture and how that allows us to scale-out.
– Sean Olson
Lead Program Manager