Concerning Trends Discovered During Several Critical Escalations

Ross Smith IV · ‎Jan 08 2015

Over the last several months, I have been involved in several critical customer escalations (what we refer to as critsits) for Exchange 2010 and Exchange 2013. As a result of my involvement, I have noticed several common themes and trends. The intent of this blog post is to describe some of these common issues and problems, and hopefully this post will lead you to come to the same conclusion that I have – that many of these issues could have been avoided by taking sensible, proactive steps.

Software Patching

By far, the most common issue was that almost every customer was running out-of-date software. This included OS patches, Exchange patches, Outlook client patches, drivers, and firmware. One might think that being out-of-date is not such a bad thing, but in almost every case, the customer was experiencing known issues that were resolved in current releases. Maintaining currency also ensures an environment is protected from known security defects. In addition, as the software version ages, it eventually goes out of support (e.g., Exchange Server 2010 Service Pack 2).

Software patching is not simply an issue for Microsoft software. You must also ensure that all inter-dependent solutions (e.g., Blackberry Enterprise Server, backup software, etc.) are kept up-to-date for a specific release as this ensures optimal reliability and compatibility.

Microsoft recommends adopting a software update strategy that ensures all software follows N to N-1 policy, where N is a service pack, update rollup, cumulative update, maintenance release, or whatever terminology is used by the software vendor. We strongly recommend that our customers also adopt a similar strategy with respect to hardware firmware and drivers ensuring that network cards, BIOS, and storage controllers/interfaces are kept up to date.

Customers must also follow the software vendor’s Software Lifecycle and appropriately plan on upgrading to a supported version in the event that support for a specific version is about to expire or is already out of support.

For Exchange 2010, this means having all servers deployed with Service Pack 3 and either Rollup 7 or Rollup 8 (at the time of this writing). For Exchange 2013, this means having all servers deployed with Cumulative Update 6 or Cumulative Update 7 (at the time of this writing).

For environments that have a hybrid configuration with Office 365, the servers participating in the hybrid configuration must be running the latest version (e.g., Exchange 2010 SP3 RU8 or Exchange 2013 CU7) or the prior version (e.g., Exchange 2010 SP3 RU7 or Exchange 2013 CU6) in order to maintain and ensure compatibility with Office 365. There are some required dependencies for hybrid deployments, so it’s even more critical you keep your software up to date if you choose to go hybrid.

Change Control

Change control is a critical process that is used to ensure an environment remains healthy. Change control enables you to build a process by which you can identify, approve, and reject proposed changes. It also provides a means by which you can develop a historical accounting of changes that occur. Often times I find that customers only leverage a change control process for “big ticket” items, and forego the change control process for what are deemed as “simple changes.”

In addition to building a change control process, it is also critical to ensure that all proposed changes are vetted in a lab environment that closely mirrors production, and includes any 3^rdparty applications you have integrated (the number of times I have seen Exchange get updated and heard the integrated app has failed is non-zero, to use a developer’s phrase).

While lab environments provide a great means to validate the functionality of a proposed change, they often do not provide a view on the scalability impact of a change. One way to address this is to leverage a “slice in production” where a change is deployed to a subset of the user population. This subset of the user population can be isolated using a variety of means, depending on the technology (e.g., dedicated forests, dedicated hardware, etc.). Within Office 365, we use slices in productions a variety of different ways; for example, we leverage them to test (or what we call dogfood) new functionality prior to customer release and we use it as a First Release mechanism so that customers can experience new functionality prior to worldwide deployment.

If you can’t build a scale impact lab, you should at a minimum build an environment that includes all of the component pieces you have in place, and make sure you keep it updated so you can validate changes within your core usage scenarios.

The other common theme I saw is bundling multiple changes together in a single change control request. While bundling multiple changes together may seem innocuous, when you are troubleshooting an issue, the last thing you want to do is make multiple changes. First, if the issue gets resolved, you do not know which particular change resolved the issue. Second, it is entirely possible the changes may exacerbate the current issue.

Complexity

Failure happens. There is no technology that can change that fact. Disks, servers, racks, network appliances, cables, power substations, pumps, generators, operating systems, applications, drivers, and other services – there is simply no part of an IT service that is not subject to failure.

This is why we use built-in redundancy to mitigate failures. Where one entity is likely to fail, two or more entities are used. This pattern can be observed in Web server arrays, disk arrays, front-end and back-end pools, and the like. But redundancy can be prohibitively expensive (as a simple multiplication of cost). For example, the cost and complexity of the SAN-based storage system that was at the heart of Exchange until the 2007 release, drove the Exchange Team to evolve Exchange to integrate key elements of storage directly into its architecture. Every SAN system and every disk will ultimately fail, and implementing a highly-redundant system using SAN technology is cost-prohibitive, so Exchange evolved from requiring expensive, scaled-up, high-performance storage systems, to being optimized for commodity scaled-out servers with commodity low-performance SAS/SATA drives in a JBOD configuration with commodity disk controllers. This architecture enables Exchange to be resilient to any storage failure.

By building a replication architecture into Exchange and optimizing Exchange for commodity hardware, failure modes are predictable from a hardware perspective, and that redundancy can removed from other hardware layers, as well. Redundant NICs, redundant power supplies, etc., can also be removed from the server hardware. Whether it is a disk, a controller, or a motherboard that fails, the end result is the same: another database copy is activated on another server.

The more complex the hardware or software architecture, the more unpredictable failure events can be. Managing failure at scale requires making recovery predictable, which drives the necessity for predictable failure modes. Examples of complex redundancy are active/passive network appliance pairs, aggregation points on a network with complex routing configurations, network teaming, RAID, multiple fiber pathways, and so forth.

Removing complex redundancy seems counter-intuitive – how can removing hardware redundancy increase availability? Moving away from complex redundancy models to a software-based redundancy model creates a predictable failure mode.

Several of my critsit escalations involved customers with complex architectures where components within the architecture were part of the syst emic issue trying to be resolved:

Load balancers were not configured to use round robin or least connection management for Exchange 2013. Customers that did implement least connection management, did not have the “slow start” feature enabled. Slow start ensures that when a server is returned to a load-balanced pool, it is not immediately flooded with connections. Instead, the connections are slowly ramped up on that server. If your load balancer does not provide a slow start function for least connection management, we strongly recommend using round robin connection management.
Hypervisor hosts were not configured in accordance with vendor recommendations for large socket/pCPU machines.
Firewalls between Exchange servers, Active Directory servers, or Lync servers. As discussed in Exchange, Firewalls, and Support…Oh, my!, Microsoft does not support configurations when Exchange servers have network port restrictions that interfere with communicating with other Exchange servers, Active Directory servers, or Lync servers.
Ensuring the correct file-based anti-virus exclusions are in place.
Deploying asymmetric designs in a “failover datacenter.” In all instances, there were fewer servers in the failover datacenter than the primary datacenter. The logic used in designing these architectures was that the failover datacenter would only be used during maintenance activities or during catastrophic events. The fundamental flaw in this logic is that it assumes there will not be 100% user activity. As a result, users are affected by higher response latencies, slower mail delivery, and other performance issues when the failover datacenter is activated.
SSL offloading (another supported, but rarely recommended scenario) was not configured per our guidance.
Storage area networks were not designed to deliver the capacity and IO requirements necessary to support the messaging environment. We have seen customers invest in tiered storage to help Exchange and other applications; however, due to the way the Extensible Storage Engine and the Managed Store work and the random nature of the requests being made, tiered storage is not beneficial for Exchange. The IO is simply not available when needed.

How can the complexity be reduced? For Exchange, we use predictable recovery models (for example, activation of a database copy). Our Preferred Architecture is designed to reduce complexity and deliver a symmetrical design that ensures that the user experience is maintained when failures occur.

Ignoring Recommendations

Another concerning trend I witnessed is that customers repeatedly ignored recommendations from their product vendors. There are many reasons I’ve heard to explain away why a vendor’s advice about configuring or managing their own product was ignored, but it’s rare to see a case where a customer honestly knows more about how a vendor’s product works than does the vendor. If the vendor tells you to configure X or update to version Y, chances are they are telling you for a reason, and you would be wise to follow that advice and not ignore it.

Microsoft’s recommendations are grounded upon data- the data we collect during a support call, the data we collect during a Risk Assessment, and the data we get from you. All of this data is analyzed before recommendations are made. And because we have a lot of customers, the collective learnings we get from you plays a big part.

Deployment Practices

When deploying a new version of software, whether it's Exchange or another product, it's important to follow an appropriate deployment plan. Customers that don't take on the unnecessary risk of running into unexpected issues during the deployment.

Proper planning of an Exchange deployment is imperative. At a minimum, any deployment plan you use should include the following steps:

Identify the business and technical requirements that need to be solved.
You'll need to know your peak usage time(s) and you will collect IO and message profile data during your peak usage time(s).
Design a solution based on the requirements and data collected.
Then, you use the Exchange Server Role Requirements Calculator to model the design based on this collected data and any extrapolations required for your design.
Then, you'll procure the necessary hardware based on the calculator output, design choices, and leverage the advice of your hardware vendor.
Next, you'll configure the hardware according to your design.
Before going into production, you'll validate the storage system with Jetstress (following the recommendations in the Jetstress Field Guide) to verify that your storage configuration can meet the requirements defined in the calculator.
Once the hardware has been validated you can deploy a pilot that mirrors your expected production load.
Be sure to collect performance data and analyze it. Verify that the data matches your theoretical projections. If the pilot requires additional hardware to meet the demands of the user base, optimize the design accordingly.
Deploy the optimized design and start onboarding the remainder of your users.
Continue collecting data and analyzing it, and adjust if changes occur.

The last step is important. Far too often, I see customers implement an architecture and then question why the system is overloaded. The landscape is constantly evolving. Years ago, bring your own device (BYOD) was not an option in many customer environments, whereas, now it is becoming the norm. As a result, your messaging environment is constantly changing – users are adapting to the larger mailbox quotas, the proliferation of devices, the capabilities within the devices, etc. These changes affect your design and can consume more resources. In order to account for this, you must baseline, monitor, and evaluate how the system is performing and make changes, if necessary.

Historical Data

To run a successful service at any scale, you must be able to monitor the solution to not only identify issues as they occur in real-time, but to also proactively predict and trend how the user base or user base activity is growing. Performance, event log and protocol logging data provides two valuable functions:

It allows you to trend and determine how your users’ message profile evolves over time.
When an issue occurs, it allows you to go back in time and see whether there were indicators that were missed.

The data collected can also be used to build intelligent reports that expose the overall health of the environment. These reports can then be shared at monthly service reviews that outline the health and metrics, actions taken within the last month, plans for the next month, issues occurring within the environment and steps being taken to resolve the issues.

If you do not have a monitoring solution capable of collecting and storing historical data, you can still collect the data you need.

Exchange 2013 captures performance data automatically and stores it in the Microsoft\Exchange Server \V15\Logging\Diagnostics\DailyPerformanceLogs folder. If you are not running Exchange 2013, you can use Experfwiz to capture the data.
Event logs capture all relevant events that Exchange writes natively. Unfortunately, I often see customers configure Event logs to flush after a short period of time (one day). Event logs should collect and retain information for one week at a minimum.
Exchange automatically writes a ton of useful information into protocol logs that can tell you how your users and their devices behave. Log Parser Studio 2.2 provides means to interact with this data easily.
Message tracking data is stored on Hub Transport servers and/or Mailbox servers and provides a wealth of information on the message flow in an environment.

Summary

As I said at the beginning of this article, many of these customer issues could have been avoided by taking sensible, proactive steps. I hope this article inspires you to investigate how many of these might affect your environments, and more importantly, to take steps to resolve them, before you are my next critsit escalation.

Ross Smith IV
Principal Program Manager
Office 365 Customer Experience

Products (50)

Special Topics (27)

Video Hub (462)

Most Active Hubs