Nobody likes disasters, least of all large companies that rely heavily on their servers and cannot afford even a short downtime of most services. When UAG was released, one of the most important new changes was the ability to deploy the server in arrays. With a server array, the services UAG offers can provide higher availability, because if one server becomes inoperative, the other (or others) can still provide service. To make the best of this, it’s important to understand how UAG behaves in a situation like this, and how that behavior affects various failure scenarios.
Let’s start by defining a few key terms related to this topic (definitions from the PC Magazine encyclopedia of computer terms):
- Load Balancing (LB)
The even distribution of processing across available resources such as servers in a network or disks in a storage area network (SAN). Load balancing might split incoming transactions evenly to all servers, or it may redirect transactions to the next available server as needed.
- High Availability (HA)
Also called "RAS" (reliability, availability, serviceability) or "fault resilient," it refers to a multiprocessing system that can quickly recover from a failure. There may be a minute or two of downtime while one system switches over to another, but processing will continue. This is not the same as fault tolerant, in which redundant components are designed for continuous processing without skipping a heartbeat.
- Fault Tolerant (FT)
The ability to continue non-stop when a hardware failure occurs. A fault-tolerant system is designed from the ground up for reliability by building multiples of all critical components, such as CPUs, memories, disks and power supplies into the same computer. In the event one component fails, another takes over without skipping a beat.
- Disaster Recovery (DR)
A plan for duplicating computer operations after a catastrophe occurs, such as a fire or earthquake. It includes routine off-site backup as well as a procedure for activating vital information systems in a new location.
As you can see, there are some differences between these terms, and so a system that offers “High Availability” may not be fault tolerant or allow for disaster recovery. On a similar note, a system designed for Disaster Recovery may not have fault tolerance.
The array function that UAG includes was designed primarily with the goal of providing Load Balancing. This allows an organization to deploy multiple UAG servers, so they can service a number of users that is higher than what a single UAG server can handle. High Availability, Fault Tolerance and Disaster Recovery, though, are not part of that design, and this is important to understand so that planning can match the product capabilities.
The challenge with High Availability, Fault Tolerance and Disaster Recovery stems from the fact that UAG runs a tight ship, part of which is session management. When a client connects to a UAG server, UAG establishes a unique session with it. That session is specific to that server and that client, and it does not apply to other servers within the same array. This is not unique to UAG, of course; most server products are designed like that.
If a UAG server fails, the sessions it had are gone, and even though Load Balancing automatically switches users to another member of the array, they need to establish new sessions. The user experience in such a situation is that once the user posts a request to UAG after the server failure, the load balancing mechanism sends the request to the “new” server, but that server doesn’t have an existing session. It then redirects the user to initiate a new session and log in, and so the user lands on the login page. Depending on the design of the backend application, the user might return to the same point they were at before the switch, but they may also end up on the application’s initial page, or maybe even on an error page.
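To make that flow concrete, here is a minimal sketch of it in Python. It is a toy model, not UAG code: the `ArrayMember` class, the `route` function and the status strings are all names I made up for illustration. The only point it shows is that each member keeps its own session table, so a session created on one member means nothing to another.

```python
import uuid


class ArrayMember:
    """Hypothetical array member that keeps its sessions in local memory only."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.sessions = {}  # session id -> user name, never shared with peers

    def login(self, user):
        session_id = uuid.uuid4().hex
        self.sessions[session_id] = user
        return session_id

    def handle(self, session_id):
        # A member only recognizes sessions that it created itself.
        if session_id in self.sessions:
            return "200 OK"
        return "302 redirect to login page"


def route(members, preferred):
    """Toy load balancer: stick to the preferred member unless it has failed."""
    if preferred.alive:
        return preferred
    return next(m for m in members if m.alive)


uag1, uag2 = ArrayMember("uag1"), ArrayMember("uag2")
array = [uag1, uag2]

sid = uag1.login("alice")
before = route(array, uag1).handle(sid)  # served by uag1, session is known

uag1.alive = False                       # simulate a server failure
after = route(array, uag1).handle(sid)   # served by uag2, session is unknown

print(before)
print(after)
```

The second request is the one the user notices: the surviving member has never seen the session ID, so all it can do is send the user back to the login page.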
The design I outlined above applies when your UAG servers are part of an array (no matter if the load balancing is done using Windows NLB or an external Load Balancer). However, if your UAG servers are not part of an array, then things may get more complicated. If one server fails over to another server, which is not part of the same array, this could lead to a cookie decryption error, which can happen when UAG attempts to decode a cookie that was encoded by a foreign server.
Cookie encryption and decryption are a normal part of UAG’s operation. When UAG receives pages from backend servers, they often contain cookies that the backend server creates as part of the application (virtually all web applications use cookies to some degree). UAG then delivers the cookies to the client, which sends them back to UAG with subsequent requests to the same application. For those subsequent requests, UAG needs to know which server the cookies originated from. This is important, because UAG usually publishes multiple servers. If it mistakenly delivered a cookie from Server A to Server B, it could cause problems: if the cookie contains application-specific data, the application code could choke on it or confuse one user for another.
To be able to identify which server a cookie belongs to, UAG encodes each cookie with a unique name that helps it identify the backend server. However, if the cookie was encoded by a UAG server that is not a member of the same array, the decoding process might fail and produce a deformed cookie name. Here’s an example of such a situation from a UAG trace:
10a0.fb8 10/16/2012-18:11:24.692 [whlfiltsecureremote CParserRequestHeader::AnalyzeCookieElements ParserRequestHeader.cpp@1970] Info:AnalyzeCookieElements(localhost, /InternalSite/InitParams.aspx?referrer=/InternalSite/Login.asp&resource%5Fid=65EA7D5055F1446FA96DFC12323352BD&login%5Ftype=2&site%5Fname=store&secure=1&orig%5Furl=https%3A%2F%2Forder%2ecreatehive%2ecom/OA%5FHTML%2FibeCZzpHome%2Ejsp) cookie elements = Name = uniquesig6052BEDBEC0A1F78BC1A14E2A48D3901D99C6D86F7F20CB4FE3C45ED78A52189EB90C0CE8FDA30159046F82EBAF77128, value = AEGGAGEAAJIIPIECBPDGNMGC
What happens here is that UAG sees the cookie and identifies it as a UAG signed cookie, and then tries to decrypt it (this is an ASP session cookie that was generated by an IIS site, by the way). The decryption ends up with a malformed cookie name Z A-A™Tc/WŠ"8PQDSB. UAG then inserts the cookie back into the data stream and continues. However, later on, when the IIS server that UAG runs on tries to process the request, it fails, because the cookie name is invalid (it contains characters, such as spaces and non-ASCII bytes, that are not allowed in a cookie name). This causes IIS to throw a 400 error (“Bad request”). Depending on what type of request is being serviced, it could lead to other errors. Other possible symptoms are 500 errors and 404.15 errors thrown by IIS, as well as error 152 thrown by UAG itself (“You have authenticated successfully using Active Directory Federated Services (ADFS), but your user name or group cannot be located in a required Forefront UAG local group."). It could also manifest itself as bad page rendering (which happens because the client fails to retrieve some of the files required to display the page from UAG). Another side effect is that, because this situation generates a large number of errors, it puts a higher-than-normal load on the UAG servers, which could cause performance degradation even with fewer users than the servers would normally be able to handle.
One situation where session failover is smooth is where the connection to UAG is not made via a browser, as with ActiveSync, Outlook Anywhere and DirectAccess. These types of applications work differently, and have built-in mechanisms to handle session failover in a way that is virtually transparent to the user.
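If it helps to see why a “foreign” cookie decodes into garbage, here is a small Python sketch. To be clear, this is not UAG’s actual algorithm, which I am not describing here. It is a toy XOR scheme with a made-up per-array key, and the function names and keys are all hypothetical. The `uniquesig` prefix is borrowed from the trace above purely so the output looks familiar.

```python
import hashlib

PREFIX = "uniquesig"  # taken from the trace above, for illustration only


def _keystream(key: bytes, length: int) -> bytes:
    """Derive a repeatable byte stream from a key (toy construction)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def encode_name(name: str, array_key: bytes) -> str:
    """Hide a backend cookie name behind a hex-encoded name."""
    raw = name.encode("ascii")
    stream = _keystream(array_key, len(raw))
    return PREFIX + bytes(a ^ b for a, b in zip(raw, stream)).hex().upper()


def decode_name(wire_name: str, array_key: bytes) -> str:
    """Reverse encode_name; only meaningful with the key that encoded it."""
    raw = bytes.fromhex(wire_name[len(PREFIX):])
    stream = _keystream(array_key, len(raw))
    return bytes(a ^ b for a, b in zip(raw, stream)).decode("latin-1")


def is_plain_name(name: str) -> bool:
    """True when the name contains only ASCII letters and digits."""
    return name.isascii() and name.isalnum()


wire = encode_name("ASPSESSIONIDQACB", b"key-of-array-A")

# Decoding inside the same array returns the original name.
print(decode_name(wire, b"key-of-array-A"))

# Decoding with another array's key yields different, arbitrary bytes.
foreign = decode_name(wire, b"key-of-array-B")
print(ascii(foreign), is_plain_name(foreign))
```

The decoding step itself never fails. It simply produces whatever bytes fall out, and it is a later component (IIS, in the case above) that rejects the result.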
Diagnosing a session failover problem
As I said, in case of an array member failure, users are sent to a login page, and this is perfectly normal and should be expected. The application in use may have its own challenges handling this, so we recommend guiding users to close and re-open their browser when it happens. Doing so clears all session cookies, which prevents application-level issues in case an application doesn’t handle the situation well.
If you are running into symptoms such as those described above (seemingly random HTTP errors 400, 404.15 and 500, UAG error 152, bad page rendering), the first step is to take a trace on the UAG servers. It is sufficient to trace the whlfiltsecureremote component. Within the traces, look for the text “Info:Decrypted cookie name is”. If the results show cookie names that are gibberish (contain non-alphanumeric characters), then this is a clear indication of a failover problem.
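Traces get large, so a small script can do the searching for you. The Python sketch below scans trace lines that have already been converted to text and collects the decrypted cookie names that contain anything other than ASCII letters and digits. I am assuming that the cookie name is whatever follows the marker text up to the end of the line; check that against your own traces. The sample lines are made up for the demonstration.

```python
MARKER = "Info:Decrypted cookie name is"


def suspicious_cookie_names(lines):
    """Return decrypted cookie names that are not purely alphanumeric.

    Assumption: the name is the rest of the line after MARKER.
    """
    hits = []
    for line in lines:
        _, found, rest = line.partition(MARKER)
        if not found:
            continue  # not a decrypted-cookie-name line
        name = rest.strip()
        if name and not (name.isascii() and name.isalnum()):
            hits.append(name)
    return hits


# Made-up sample lines; a real run would read the text trace file instead.
sample = [
    "[whlfiltsecureremote] Info:Decrypted cookie name is ASPSESSIONIDQACB",
    "[whlfiltsecureremote] Info:Decrypted cookie name is Z A-A\u2122Tc/W",
    "[whlfiltsecureremote] some unrelated line",
]

hits = suspicious_cookie_names(sample)
for name in hits:
    print(ascii(name))
```

Treat the hits as candidates rather than proof. Some legitimate cookie names, such as ASP.NET_SessionId, contain punctuation, so eyeball the results before concluding that you have a failover problem.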
How to handle session failover
From a planning perspective, if you need to use multiple UAG servers, they should be members of one array. A UAG array requires that all servers be on the same IP subnet, and share a fast and reliable connection for array synchronization. This means, of course, that you should not deploy UAG array members in multiple locations. Doing so would interrupt intra-array communications and could lead to problems with array synchronization. Deploying UAGs that are not members of the same array in a way that might lead to sessions being moved from one to another could lead to cookie decryption errors, and so it is unsupported and should be avoided.
Another thing to consider with regard to deploying arrays is that preserving session integrity is important for reliable operation. If you are using an external load balancer, you should make sure it doesn’t move users from one UAG to another unless the server has actually failed. Load balancers have various options for affinity, and the affinity timeout should be configured to be equal to or greater than UAG’s maximum session time (by default it is 24 hours). Another thing that might require configuration on your load balancer is connection optimization. Many load balancers try to optimize the network by severing connections that seem to be running too long (the default would typically be to reset a connection after 20 to 30 minutes). Since UAG sessions are often longer, this could lead to various errors, as the connection might reset in the middle of a page load. We recommend disabling such features, or setting their timeout to be equal to or greater than UAG’s maximum session time as well.
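As a quick sanity check, the two rules above can be written down as a few lines of Python. This is only a sketch: the setting names are hypothetical (every load balancer vendor calls them something different), and the 24-hour value is the default mentioned above, so substitute your own maximum session time if you have changed it.

```python
from dataclasses import dataclass
from typing import List, Optional

# Default maximum session time quoted in the text above, in seconds.
UAG_MAX_SESSION_S = 24 * 60 * 60


@dataclass
class LoadBalancerSettings:
    """Hypothetical settings; real products name these differently."""
    affinity_timeout_s: int
    connection_reset_s: Optional[int]  # None means the feature is disabled


def find_problems(lb: LoadBalancerSettings,
                  max_session_s: int = UAG_MAX_SESSION_S) -> List[str]:
    """List the settings that are shorter than the maximum session time."""
    problems = []
    if lb.affinity_timeout_s < max_session_s:
        problems.append("affinity timeout is shorter than the maximum session time")
    if lb.connection_reset_s is not None and lb.connection_reset_s < max_session_s:
        problems.append("connection reset timeout is shorter than the maximum session time")
    return problems


# One-hour affinity and a 30-minute connection reset: both are too short.
risky = LoadBalancerSettings(affinity_timeout_s=3600, connection_reset_s=1800)
print(find_problems(risky))

# 24-hour affinity with connection resets disabled: nothing to report.
safe = LoadBalancerSettings(affinity_timeout_s=86400, connection_reset_s=None)
print(find_problems(safe))
```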