To Cluster or Not to Cluster CAs

One of the many enhancements in Active Directory Certificate Services in Windows Server 2008 is support for two-node active/passive clustering. We have a great whitepaper, Configuring and Troubleshooting Certification Authority Clustering in Windows Server 2008, which walks you through the setup process. Because we just leverage the Failover Clustering already in Windows, the supported hardware and software configurations for running a highly available CA are the same as for running other applications on a cluster. Many of the customers I work with have recently asked whether they should implement clustered CAs, and the answer really depends on what you're trying to achieve.

The first thing to understand is that having a highly available CA does not mean the same thing as having a highly available PKI. While it performs a critical role, the CA itself is only one part of the overall PKI, and it could be argued that other components, such as CRL Distribution Points, are actually more sensitive to outages. In most PKIs, end entities talk directly to a CA only to enroll for or renew certificates. If a computer enrolls for a certificate with a two-year validity period, that computer will talk to the CA once to get the initial certificate and then not again until 98 weeks later (assuming a 6-week re-enrollment window). During that long interval, the client doesn't know or care whether the CA is online, only that it can find and download a fresh revocation list. Thus, clustering CAs solely to support continuous enrollment services in the case of an outage is often inefficient; it would likely be cheaper and simpler to have two separate issuing CAs instead.
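To make the arithmetic behind that 98-week figure concrete, here is a minimal Python sketch. The function name and the 52-weeks-per-year simplification are my own assumptions for illustration; nothing here is part of ADCS or any Windows tooling.

```python
# Illustrative only: how long a client goes without contacting the CA.
# Assumes a year is treated as 52 weeks, as in the example above.
WEEKS_PER_YEAR = 52


def weeks_until_renewal(validity_years, renewal_window_weeks):
    """Weeks between initial enrollment and the start of the renewal window."""
    validity_weeks = validity_years * WEEKS_PER_YEAR
    return validity_weeks - renewal_window_weeks


if __name__ == "__main__":
    # Two-year certificate, 6-week re-enrollment window.
    print(weeks_until_renewal(2, 6))
```

With the values from the example (2 years, 6 weeks) this gives 104 - 6 = 98 weeks during which the client never needs the CA to be online.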

During an outage, the most critical capability to restore is that of the Certificate Revocation List (CRL). CRLs are used to ensure that certificates used by end entities are still valid and, depending on the application, the inability to retrieve a CRL with a current validity period can cause significant problems. For example, CRL retrieval issues are by far the most common root cause of smart card logon issues. Fortunately, there is no need to rely on clustering to keep CRLs fresh during an outage. So long as you have access to the CA's private key material, you can manually sign and publish CRLs while your CA is offline and ensure service continuity to your users.
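What "a CRL with a current validity period" means can be sketched as a simple date comparison. The Python below is a hedged illustration, not how Windows actually evaluates revocation: the function names are hypothetical, and it assumes you already have the CRL's thisUpdate and nextUpdate timestamps in hand (parsing the CRL itself is out of scope).

```python
from datetime import datetime, timedelta, timezone


def crl_is_current(this_update, next_update, now):
    """True if 'now' falls inside the CRL's validity period."""
    return this_update <= now < next_update


def time_until_stale(next_update, now):
    """How long until the CRL expires; zero if it already has."""
    return max(next_update - now, timedelta(0))


if __name__ == "__main__":
    # Made-up timestamps, purely for illustration.
    issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
    expires = issued + timedelta(days=7)
    now = datetime(2024, 1, 5, tzinfo=timezone.utc)
    print(crl_is_current(issued, expires, now))
    print(time_until_stale(expires, now))
```

The second function is the number that matters during an outage: it tells you how long you have to either bring the CA back or manually sign and publish a replacement CRL before clients start failing revocation checks.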

None of this is meant to dissuade customers from deploying ADCS clusters, but rather to provide some context about the scenarios in which they make sense. The two primary needs for which I recommend clusters are automatic failover and geo-dispersion. While manual CRL signing and multiple issuing CAs can ensure that your PKI continues to work during the outage of a CA, some customers prefer failover to happen automatically. In other words, rather than having to manually re-sign and republish the CRLs, they'd prefer for one CA to just take over for the other with no administrator interaction required. This is a great use case for Failover Clustering, and many customers find that automatic recovery is worth the investment.

The other major use case is geo-dispersion of CAs to increase survivability in the case of a major disaster. Consider an organization that has multiple datacenters around the world. They may be pursuing a strategy such that one of these datacenters is able to take over for another in the case of a major disaster. Or, the organization may have a dedicated 'hot site' whose sole purpose is to take over operations in the case of the loss of the primary site. In both of these cases, CA clustering provides a great way to ensure that a failure of one site will not interrupt enrollment or CRL signing services for the clustered CA. Typically this style of clustering, known as Multi-Site Clustering, leverages partner solutions to replicate the data between sites.