Note: I have recently changed roles to become part of the Information Protection Team at Microsoft (the group responsible for building AD RMS and related technologies) where I will be acting as a Sr. Program Manager. Since the team already has a blog on AD RMS I have decided to concentrate my efforts in that blog, which you can find at http://blogs.technet.com/rms. My previous blog posts have already been moved there and in the future you should go to that blog for updates and news (quite a few of them are coming!).
You can find this particular post at http://blogs.technet.com/b/rms/archive/2012/04/16/ad-rms-redundancy-and-fault-tolerance-part-1.aspx.
A very common question when deploying a service in the enterprise is how to make it resilient. In the case of AD RMS this typically means setting up the infrastructure so server failures don’t affect the user’s ability to consume content. In this post we will cover the back-end side of the solution, and in a future post we will discuss making the AD RMS servers themselves resilient.
The first thing to consider when designing AD RMS for high availability is that, once activated, clients in many cases don’t need to access the AD RMS server infrastructure for some of the RMS-related operations they might want to perform. Content protection with Office applications is always performed offline at the clients, so even if a client can’t reach an AD RMS server, the user will be able to protect a document or email without problems. When protecting a document, a client will also issue itself an Author license, which will allow the user that has just protected the document to continue consuming the document without having to contact the server, even after closing it and reopening it.
Users consuming content protected by others might also be able to make use of the content without having to first contact the server. That depends on several factors: whether they have acquired a license before, whether the use license is cacheable and whether the license is still within its validity period. Whether a license can be cached is defined either by the template used to protect the document or by the “Require a connection to verify a user’s permission” setting that can be used when manually protecting a document (this setting can also be pre-set at the author’s machine via registry overrides). Cacheable licenses are valid for up to one year but the maximum duration can be specified in a template.
A user might have acquired a license before if the document (email or attachment) was pre-licensed by Microsoft Exchange. In these cases, Exchange will acquire a license on behalf of the user before delivering the email and any attachments to the client, so the user won’t have to get a license at the time the content needs to be consumed.
In any of these cases, when the user has a valid license on the machine to consume the document, the client won’t need a connection to the AD RMS server in order to use the document, so whether or not the server is working and accessible won’t have an impact on the user’s ability to consume the content.
But for the initial consumption of a non-prelicensed piece of content, or for consuming a document after any previously acquired license has expired, the client will have to contact the AD RMS server. A client will also need to contact the AD RMS server the first time it is used on a machine, or when it is time to renew the client machine’s or the user’s certificates, which typically last one year.
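The rules above for when a client must reach the server can be summarized in a short sketch. This is only an illustrative model, not how the RMS client is implemented: the `CachedUseLicense` class, its fields and the `needs_server` function are hypothetical names, and the model deliberately ignores machine activation and certificate renewal, which also require the server.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class CachedUseLicense:
    """Hypothetical model of a use license already stored on the client."""
    cacheable: bool        # False when a connection is required on every open
    issued_at: datetime    # when the license was acquired (or pre-licensed)
    validity: timedelta    # validity period set by the template or the author


def needs_server(is_author: bool,
                 cached: Optional[CachedUseLicense],
                 now: datetime) -> bool:
    """Return True if opening the content requires contacting AD RMS."""
    if is_author:
        # The author issued itself a license when protecting the content.
        return False
    if cached is None:
        # First consumption of content that was not pre-licensed.
        return True
    if not cached.cacheable:
        # The license may not be reused without asking the server again.
        return True
    # A cacheable license only helps while it is within its validity period.
    return now > cached.issued_at + cached.validity
```

For example, a recipient holding a cacheable license acquired ten days ago with a thirty-day validity would not need the server, while the same recipient with no license at all would.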
Since those situations are not all that uncommon, in most cases you should ensure your AD RMS infrastructure is highly available.
Let’s review which components need to be made highly available for AD RMS itself to be highly available.
Besides the client components, the key elements that make AD RMS tick are the AD RMS servers themselves, the RMS Database and Active Directory. It should be fairly obvious that any AD infrastructure needs to be made highly available, and most AD deployments are made with that in mind. One thing that is often missed though is that an AD RMS server will only talk to an AD Global Catalog server for tasks such as group expansion, and that means that the AD RMS servers must always have access to a GC in order to work effectively. While the AD caching database that’s part of the AD RMS database infrastructure can provide some ability to continue operating when a GC is not available, it is not a substitute for a GC located close to the AD RMS servers, and since any DC might be out of service at some point, you should have at least two GCs in the same network or site as the AD RMS servers.
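One way to sanity-check the "at least two GCs" recommendation is to probe the Global Catalog LDAP port (TCP 3268) on the candidate servers from the AD RMS host. The following is a minimal sketch under that assumption: the function names are made up for illustration, the host names you pass in are your own, and a successful TCP connection only shows that the port answers, not that the GC is healthy.

```python
import socket
from typing import Callable, Iterable, List

GC_LDAP_PORT = 3268  # standard Global Catalog LDAP port


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def reachable_gcs(hosts: Iterable[str],
                  probe: Callable[[str, int], bool] = tcp_probe) -> List[str]:
    """Return the hosts that answer on the Global Catalog port."""
    return [h for h in hosts if probe(h, GC_LDAP_PORT)]


def has_gc_redundancy(hosts: Iterable[str],
                      probe: Callable[[str, int], bool] = tcp_probe,
                      minimum: int = 2) -> bool:
    """Return True if at least `minimum` GCs are reachable right now."""
    return len(reachable_gcs(hosts, probe)) >= minimum
```

The `probe` parameter exists so the check can be exercised without a network, or swapped for a real LDAP bind if a deeper test is wanted.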
The next element to consider is the AD RMS Database. AD RMS uses the database for multiple tasks, including:
- Storing the server/cluster keys and configurations, including Rights Policy Templates.
- Caching information from AD.
- Storing logging information about activities performed on the cluster, such as issuance of licenses.
- Storing and retrieving copies of the user’s Rights Account Certificates (RACs).
The first thing to note is that AD RMS retrieves the server’s configuration and the cluster keys when the service is started. So once an AD RMS server is up and running it will continue to work even when the database is unreachable.
Also, most of the write operations between an AD RMS server and its database server go through Message Queuing running on the AD RMS host. That means that if the DB is unavailable at some point, AD RMS will continue to perform these operations while keeping the information in a queue in local memory, and it will write the data into the database as soon as it becomes available again. This means that AD RMS can continue to work for long periods of time without access to the AD RMS database, and logging information will continue to be gathered.
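The store-and-forward behavior described above can be illustrated with a toy model. This is not the Message Queuing API and not AD RMS code; `BufferedLogWriter` and the `db_write` callable are hypothetical, and the sketch only shows the general idea of accepting writes locally and delivering them in order once the database is reachable.

```python
from collections import deque


class BufferedLogWriter:
    """Toy stand-in for a queue sitting between a service and its database."""

    def __init__(self, db_write):
        # db_write is any callable that raises ConnectionError when the
        # database cannot be reached (an assumption of this sketch).
        self._db_write = db_write
        self._pending = deque()

    def log(self, record):
        """Accept a record immediately, then try to drain the queue."""
        self._pending.append(record)
        self.flush()

    def flush(self):
        """Deliver queued records in order; stop at the first failure."""
        while self._pending:
            try:
                self._db_write(self._pending[0])
            except ConnectionError:
                return False  # keep the record queued for a later attempt
            self._pending.popleft()
        return True

    @property
    def backlog(self):
        """Number of records still waiting for the database."""
        return len(self._pending)
```

The caller of `log` never sees the outage; records simply accumulate in the backlog and are written in their original order on the first successful flush.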
Regarding the caching DB, AD RMS will only use it if it is available. If the RMS DB is unreachable, AD RMS will try to contact a Global Catalog for fresh information about users and groups, and other than a potential load increase in AD this shouldn’t affect performance of the service.
Which leaves us with the last point in the list above: storing and retrieving copies of the user’s RACs. This is the only frequently recurring operation that requires access to the RMS Database. AD RMS needs access to the RMS DB every time a user is activated on a new machine, in order to check for a pre-existing RAC and to save a RAC when one is created. It also needs access to pre-existing RACs when Exchange needs to pre-license content, since in order for Exchange to request a use license on behalf of a user, the server needs to know the public key of the user the license is to be issued to.
So what does this all mean?
It means that if the AD RMS database fails, AD RMS will continue to operate almost normally, with only partial loss of functionality. What functionality will be lost?
- AD RMS servers won’t be able to be booted or rebooted while the RMS Database is unreachable.
- New user activation won’t work until the database is back in service. That is, a user that has never used AD RMS on a machine won’t be able to activate the machine and use AD RMS until the RMS Database is reachable.
- Exchange pre-licensing will not work during the DB’s downtime. Still, email content will be accessible with a normal licensing request when opened, as long as the AD RMS servers are available at consumption time to issue a license.
- Changes in configuration such as creating or modifying RMS templates or changing exclusions won’t be possible with the DB inaccessible.
- Some operations, such as adding users to a revocation list, can be hampered in some cases by the DB not being available, since revoking a user’s RAC would need a copy of the RAC’s GUID, which is most easily obtained from the database.
Of these, the first two should normally be the limiting factors with the other issues being generally tolerable in most cases for short periods of time. Which means that if an AD RMS Database is down or inaccessible for a few minutes, or in many cases even for a few hours, most users won’t notice a problem and will continue to work mostly unaffected.
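The list above can be restated as a small lookup, which may be handy when writing a runbook or a monitoring rule. The operation names and the `available` helper below are illustrative only; revocation is left out because, as noted, it is only hampered in some cases rather than blocked.

```python
from enum import Enum


class Operation(Enum):
    PROTECT_CONTENT = "protect content (done offline at the client)"
    ISSUE_USE_LICENSE = "issue a use license on request"
    LOG_ACTIVITY = "log activity (queued until the DB returns)"
    START_SERVER = "start or restart an AD RMS server"
    ACTIVATE_NEW_USER = "activate a user on a new machine"
    EXCHANGE_PRELICENSE = "Exchange pre-licensing"
    CHANGE_CONFIGURATION = "create or modify templates and exclusions"


# Operations the post describes as unavailable while the RMS DB is down.
NEEDS_DATABASE = frozenset({
    Operation.START_SERVER,
    Operation.ACTIVATE_NEW_USER,
    Operation.EXCHANGE_PRELICENSE,
    Operation.CHANGE_CONFIGURATION,
})


def available(op: Operation, db_reachable: bool) -> bool:
    """Return True if the operation is expected to work in this state."""
    return db_reachable or op not in NEEDS_DATABASE


def lost_during_outage():
    """List the descriptions of operations lost while the DB is unreachable."""
    return [op.value for op in Operation if not available(op, False)]
```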
Does this mean that the AD RMS Database is not important? No! It is of the utmost importance to the AD RMS system. If the database is gone for good, so is your protected data, so the RMS DB needs to have a reasonable level of protection from failure. But it does mean that the database’s continuous availability is not as essential as it would be for many other applications relying on a database.
So let’s review some of the most obvious alternatives for protecting the RMS DB against failure.
- Setting up the DB on a Failover Cluster
- Providing a “warm standby” database via Log Shipping or some other similar mechanism
- Having a good (and functional) backup and restore policy
A failover cluster seems like a good option at first glance, but when we look at it in more depth we can see it provides the sort of protection that AD RMS doesn’t really need. The strength of a failover cluster is that it provides almost immediate (sub-minute) recovery from hardware, operational or application failures, and it provides the ability to perform some types of server maintenance, such as server patching, without affecting the service. But a failover cluster doesn’t provide much protection against data-centric failures, such as a storage unit failure or an operational or software error that could corrupt the data. Granted, these events are very infrequent, or at least they should be, but that is the sort of event that would take a long time to recover from. Since AD RMS isn’t seriously affected by brief interruptions in the database, having instantaneous recovery of the DB provides only marginal value, whereas something that protects the service against longer interruptions would be highly recommended. So we conclude that using a failover cluster for hosting the AD RMS DB is not the most efficient use of resources, as it is an expensive configuration that adds little value and doesn’t protect us against the type of events that should concern us most.
A warm standby provides a different type of protection. While such solutions can in theory recover the DB to a working state in a few minutes, most of them require some manual intervention, so it is a good assumption that up to one hour, sometimes more, could pass before the DB is fully recovered from a serious failure. But that’s generally not a problem for AD RMS. Users can typically go for one hour without Exchange pre-licensing as long as they can acquire a license when they want to consume the content. There shouldn’t be many new users, or users setting up a new machine, during the specific hour when the service is operating in contingency mode, and they will typically have other issues during the first few hours after installing the machine, so not being able to activate AD RMS at the first try, while a nuisance, shouldn’t be a big problem. And during a period when you are trying to recover AD RMS from a catastrophic database failure, not being able to create AD RMS consumption reports or modify policy templates should be the least of your concerns.
So we can see that this is a valid and very useful configuration that can protect us against the type of problem we should be concerned about at a generally lower cost than a failover cluster.
But so will a good backup strategy. Unless you have a backup solution that cannot recover a server within a few hours, you might be able to do quite well without a hot or warm standby solution. Of course, your backup system needs to be a well-oiled process, your backups need to be tested and you have to have some sort of hardware spares to rebuild the DB when needed. It is troubling to see how many companies have a thorough backup solution in place but don’t have a good recovery plan for when things fail. But as long as you can trust that your backups work and that you will be able to recover them to working hardware within a few hours (and maybe even apply your last transaction logs from the original database if the data is still accessible and you have the DB set to full recovery model) you should be good with that. AD RMS will continue to run after reconnecting to the DB and most users shouldn’t even notice the interruption.
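The comparison of the three alternatives boils down to two questions: does the option survive a data-centric failure, and can it recover within the outage you can tolerate? The sketch below encodes that reasoning. The numbers are rough assumptions taken from the discussion (one minute for "sub-minute", 60 minutes for "up to one hour", and 240 minutes as a stand-in for "a few hours"), and should be replaced with figures measured in your own environment.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Option:
    name: str
    recovery_minutes: int     # rough, assumed figure; measure your own
    survives_data_loss: bool  # protects against storage failure or corruption


OPTIONS = [
    Option("failover cluster", 1, False),
    Option("warm standby", 60, True),
    Option("backup and restore", 240, True),
]


def suitable(options: Iterable[Option],
             tolerable_outage_minutes: int) -> List[str]:
    """Names of options that survive data loss and recover in time."""
    return [
        o.name
        for o in options
        if o.survives_data_loss
        and o.recovery_minutes <= tolerable_outage_minutes
    ]
```

With these assumed figures, an organization that can tolerate a few hours without new activations or pre-licensing can choose either a warm standby or plain backup and restore, while one that can only tolerate about an hour is left with the warm standby.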
One caveat here: in order to be able to recover the AD RMS DB on another system, you will need to have configured AD RMS to use a DNS alias to refer to the database server. This is a general recommendation that should always be followed with AD RMS: don’t refer to the database server by its actual name; use an alias (a CNAME or a manual A record in DNS) during setup to point AD RMS to the DB server. If you don’t, you will regret it when recovering from a server failure, and maybe even during future upgrade and migration processes. And the same applies to the AD RMS servers themselves: always use an alias, and not the servers’ own names, for the AD RMS server URLs. This will avoid some grief in the future, guaranteed.
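After repointing an alias to a recovered database server, it is worth confirming that the alias and the new server actually resolve to the same address before restarting anything. A minimal sketch, assuming ordinary DNS resolution from the AD RMS host; the function names are invented for this example, and any host names you pass in are your own.

```python
import socket


def addresses(name, resolver=socket.gethostbyname_ex):
    """Return the set of IPv4 addresses that `name` resolves to."""
    _canonical, _aliases, ips = resolver(name)
    return set(ips)


def alias_points_to(alias, server, resolver=socket.gethostbyname_ex):
    """Return True if the alias and the server share at least one address.

    Works for both CNAME records and manually created A records, because
    it compares resolved addresses rather than canonical names.
    """
    return bool(addresses(alias, resolver) & addresses(server, resolver))
```

The `resolver` parameter defaults to the standard library lookup and can be replaced with a stub for testing. Keep in mind that DNS caching on the AD RMS host may delay the moment the new record becomes visible.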
So now we know what we need to do to have an AD RMS database and a directory server that is resilient enough to provide support to an AD RMS server infrastructure that’s highly available. So what about the AD RMS servers themselves? Well, this post is long enough already, so let’s leave that for my next post.