Dedicating Servers to Distributed Cache in SharePoint 2013

I think I’ve been asked questions about this a half dozen times in the last month, so I decided to just capture my thoughts in a blog post and, as always, let you – the viewing public – swing away at it. The question posed has been whether you should dedicate a separate set of servers to the distributed cache (DC) service in SharePoint 2013. To get right to the point: in my opinion, yes, you should.

I’ll state up front that this is something of a turn from my original position before the product was released. I had originally intended to co-locate the DC service on WFE machines. As I continued to use the product and work with other folks who had deployed it, I began to see more cases where the DC service periodically behaved in a “flaky” way. I experienced this myself multiple times in multiple farms, and actually spent a fair amount of time trying to troubleshoot the issue, with fairly limited success. What I found in my own environments is that wherever I had moved the DC service onto its own dedicated servers, those “flaky” issues disappeared.

Having my own anecdotal evidence in hand, a few of us started asking around to try to get some thoughts from other folks on why this might be happening. Eric F. ended up providing a really great explanation of why you might see “flaky” or inconsistent behavior with the DC service when it’s co-located with other services, so I’ll paraphrase his comments here:

  • The DC service needs to deliver 1 to 2 ms response times. It’s going to struggle to respond that fast if it’s sharing the server with other applications that can occupy the CPU at inopportune moments.
  • The job of the DC service is to cache stuff for you, so it wants to grab a bunch of memory and hold onto it. But to deliver great performance (see the first bullet), it needs to avoid swapping memory pages out to disk; if the machine runs low on memory, it has to evict content from its cache. When the DC service is co-located, it has to live with other applications that have their own patterns of memory usage. Those patterns will have periods where they consume a lot of memory, and then periods where (either due to GC, an IIS restart, or their own cleanup) they shrink back down. That’s difficult for the DC service to manage, because whenever memory usage spikes it has to empty its caches.
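
As an aside, the amount of memory the DC service reserves is configurable, and on a dedicated server you can give it a larger share of RAM than the default. A minimal sketch of what that might look like using the SharePoint 2013 Management Shell cmdlets – the 2048 MB figure is purely illustrative, not a recommendation:

```powershell
# Run in the SharePoint 2013 Management Shell on the cache host.

# Gracefully stop the Distributed Cache service instance on this server
# so the cache size can be changed safely.
Stop-SPDistributedCacheServiceInstance -Graceful

# Resize the cache (value is in MB; 2048 is only an illustrative figure –
# size it for your own hardware and workload).
Update-SPDistributedCacheSize -CacheSizeInMB 2048
```

After resizing, start the Distributed Cache service instance again (for example, from Services on Server in Central Administration) on each cache host.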

So, based on my own experiences and that great explanation of why we would see inconsistent behavior, I now recommend putting the DC service on its own dedicated set of servers. Someone astutely asked, “well, why can’t I just put a lot more RAM and CPU on my WFEs and co-locate?” Well, you can, but there are a couple of things you need to consider. First, given the explanation above, there’s no magic “limit” or “boundary” at which you can 100% guarantee that the DC service will not be impacted. Second, by the time you add “a lot more” RAM and CPU to a WFE, chances are it would have been more effective to take that extra capacity and carve out two or three additional virtual machines to host the DC service.
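For what it’s worth, shuffling the DC service around the farm is done with a pair of SharePoint 2013 PowerShell cmdlets. A rough sketch of the move described above:

```powershell
# Run in the SharePoint 2013 Management Shell.

# On each WFE that should no longer host Distributed Cache:
# removes this server from the cache cluster.
Remove-SPDistributedCacheServiceInstance

# Then, on each dedicated cache server: adds this server to the
# cache cluster and starts the service instance on it.
Add-SPDistributedCacheServiceInstance
```

Remove the service from the old hosts one at a time before adding the new ones, so the cache cluster stays consistent throughout the move.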

Yes, there are some extra costs beyond that, like software licensing and the operation of those servers. However, there’s also a cost if you get stuck trying to troubleshoot results that are inconsistent and nearly impossible to reproduce. There’s also a cost if your users lose confidence in the stability of the farm because of random issues caused by features flaking out when the DC service is not working as needed. Ultimately you’ll need to decide what’s best for you, your budget, and your customers. For me, though, I recommend hosting the DC service on a set of dedicated servers.