Concerning Trends Discovered During Several Critical Escalations


Over the last several months, I have been involved in several critical customer escalations (what we refer to as critsits) for Exchange 2010 and Exchange 2013. As a result of my involvement, I have noticed several common themes and trends. The intent of this blog post is to describe some of these common issues and problems, and hopefully this post will lead you to come to the same conclusion that I have – that many of these issues could have been avoided by taking sensible, proactive steps.

Software Patching

By far, the most common issue was that almost every customer was running out-of-date software. This included OS patches, Exchange patches, Outlook client patches, drivers, and firmware. One might think that being out-of-date is not such a bad thing, but in almost every case, the customer was experiencing known issues that were resolved in current releases. Maintaining currency also ensures an environment is protected from known security defects. In addition, as the software version ages, it eventually goes out of support (e.g., Exchange Server 2010 Service Pack 2).

Software patching is not simply an issue for Microsoft software. You must also ensure that all inter-dependent solutions (e.g., BlackBerry Enterprise Server, backup software, etc.) are kept up-to-date for a specific release, as this ensures optimal reliability and compatibility.

Microsoft recommends adopting a software update strategy that ensures all software follows an N to N-1 policy, where N is a service pack, update rollup, cumulative update, maintenance release, or whatever terminology is used by the software vendor. We strongly recommend that our customers also adopt a similar strategy with respect to hardware firmware and drivers, ensuring that network cards, BIOS, and storage controllers/interfaces are kept up to date.

Customers must also follow the software vendor’s Software Lifecycle and appropriately plan on upgrading to a supported version in the event that support for a specific version is about to expire or is already out of support.

For Exchange 2010, this means having all servers deployed with Service Pack 3 and either Rollup 7 or Rollup 8 (at the time of this writing). For Exchange 2013, this means having all servers deployed with Cumulative Update 6 or Cumulative Update 7 (at the time of this writing).
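As a rough illustration of what an N to N-1 check could look like, here is a minimal Python sketch. The release list and the server names are hypothetical placeholders (the list only reflects the Exchange 2013 examples mentioned in this post), and the inventory would have to come from your own tooling.

```python
from typing import Dict, List

# Hypothetical release list, ordered oldest to newest (N is the last entry).
EXCHANGE_2013_RELEASES = ["CU5", "CU6", "CU7"]


def is_current(installed: str, releases: List[str]) -> bool:
    """Return True if `installed` is the latest release (N) or the prior one (N-1)."""
    return installed in releases[-2:]


def servers_out_of_policy(inventory: Dict[str, str], releases: List[str]) -> List[str]:
    """Return the names of servers running something older than N-1 (or unknown)."""
    return sorted(name for name, release in inventory.items()
                  if not is_current(release, releases))


# Hypothetical inventory: server name -> installed release.
inventory = {"mbx01": "CU7", "mbx02": "CU5", "mbx03": "CU6"}
print(servers_out_of_policy(inventory, EXCHANGE_2013_RELEASES))
```

The same check applies to firmware and drivers if you keep an ordered release list per component.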

For environments that have a hybrid configuration with Office 365, the servers participating in the hybrid configuration must be running the latest version (e.g., Exchange 2010 SP3 RU8 or Exchange 2013 CU7) or the prior version (e.g., Exchange 2010 SP3 RU7 or Exchange 2013 CU6) in order to maintain and ensure compatibility with Office 365. There are some required dependencies for hybrid deployments, so it’s even more critical you keep your software up to date if you choose to go hybrid.

Change Control

Change control is a critical process that is used to ensure an environment remains healthy. Change control enables you to build a process by which you can identify, approve, and reject proposed changes. It also provides a means by which you can develop a historical accounting of changes that occur. Oftentimes I find that customers only leverage a change control process for “big ticket” items, and forgo the change control process for what are deemed “simple changes.”

In addition to building a change control process, it is also critical to ensure that all proposed changes are vetted in a lab environment that closely mirrors production and includes any third-party applications you have integrated (the number of times I have seen Exchange get updated and heard the integrated app has failed is non-zero, to use a developer’s phrase).

While lab environments provide a great means to validate the functionality of a proposed change, they often do not provide a view on the scalability impact of a change. One way to address this is to leverage a “slice in production” where a change is deployed to a subset of the user population. This subset of the user population can be isolated using a variety of means, depending on the technology (e.g., dedicated forests, dedicated hardware, etc.). Within Office 365, we use slices in production in a variety of ways; for example, we leverage them to test (or what we call dogfood) new functionality prior to customer release, and we use them as a First Release mechanism so that customers can experience new functionality prior to worldwide deployment.

If you can’t build a scale impact lab, you should at a minimum build an environment that includes all of the component pieces you have in place, and make sure you keep it updated so you can validate changes within your core usage scenarios.

The other common theme I saw is bundling multiple changes together in a single change control request. While bundling multiple changes together may seem innocuous, when you are troubleshooting an issue, the last thing you want to do is make multiple changes. First, if the issue gets resolved, you do not know which particular change resolved the issue. Second, it is entirely possible the changes may exacerbate the current issue.

Complexity

Failure happens. There is no technology that can change that fact. Disks, servers, racks, network appliances, cables, power substations, pumps, generators, operating systems, applications, drivers, and other services – there is simply no part of an IT service that is not subject to failure.

This is why we use built-in redundancy to mitigate failures. Where one entity is likely to fail, two or more entities are used. This pattern can be observed in Web server arrays, disk arrays, front-end and back-end pools, and the like. But redundancy can be prohibitively expensive (as a simple multiplication of cost). For example, the cost and complexity of the SAN-based storage system that was at the heart of Exchange until the 2007 release drove the Exchange Team to evolve Exchange to integrate key elements of storage directly into its architecture. Every SAN system and every disk will ultimately fail, and implementing a highly redundant system using SAN technology is cost-prohibitive, so Exchange evolved from requiring expensive, scaled-up, high-performance storage systems, to being optimized for commodity scaled-out servers with commodity low-performance SAS/SATA drives in a JBOD configuration with commodity disk controllers. This architecture enables Exchange to be resilient to any storage failure.

By building a replication architecture into Exchange and optimizing Exchange for commodity hardware, failure modes are predictable from a hardware perspective, and that redundancy can be removed from other hardware layers, as well. Redundant NICs, redundant power supplies, etc., can also be removed from the server hardware. Whether it is a disk, a controller, or a motherboard that fails, the end result is the same: another database copy is activated on another server.

The more complex the hardware or software architecture, the more unpredictable failure events can be. Managing failure at scale requires making recovery predictable, which drives the necessity for predictable failure modes. Examples of complex redundancy are active/passive network appliance pairs, aggregation points on a network with complex routing configurations, network teaming, RAID, multiple fiber pathways, and so forth.

Removing complex redundancy seems counter-intuitive – how can removing hardware redundancy increase availability? Moving away from complex redundancy models to a software-based redundancy model creates a predictable failure mode.

Several of my critsit escalations involved customers with complex architectures where components within the architecture were part of the systemic issue we were trying to resolve:

  1. Load balancers were not configured to use round robin or least connection management for Exchange 2013. Customers that did implement least connection management did not have the “slow start” feature enabled. Slow start ensures that when a server is returned to a load-balanced pool, it is not immediately flooded with connections. Instead, the connections are slowly ramped up on that server. If your load balancer does not provide a slow start function for least connection management, we strongly recommend using round robin connection management.
  2. Hypervisor hosts were not configured in accordance with vendor recommendations for large socket/pCPU machines.
  3. Firewalls were deployed between Exchange servers, Active Directory servers, or Lync servers. As discussed in Exchange, Firewalls, and Support…Oh, my!, Microsoft does not support configurations in which Exchange servers have network port restrictions that interfere with communicating with other Exchange servers, Active Directory servers, or Lync servers.
  4. The correct file-based anti-virus exclusions were not in place.
  5. Asymmetric designs were deployed in a “failover datacenter.” In all instances, there were fewer servers in the failover datacenter than in the primary datacenter. The logic used in designing these architectures was that the failover datacenter would only be used during maintenance activities or during catastrophic events. The fundamental flaw in this logic is that it assumes there will not be 100% user activity while the failover datacenter is active. As a result, users are affected by higher response latencies, slower mail delivery, and other performance issues when the failover datacenter is activated.
  6. SSL offloading (another supported, but rarely recommended scenario) was not configured per our guidance.
  7. Storage area networks were not designed to deliver the capacity and IO requirements necessary to support the messaging environment. We have seen customers invest in tiered storage to help Exchange and other applications; however, due to the way the Extensible Storage Engine and the Managed Store work and the random nature of the requests being made, tiered storage is not beneficial for Exchange. The IO is simply not available when needed.
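To make the slow start behavior from item 1 more concrete, here is a minimal Python sketch of least connection selection with a linear ramp. This is not how any particular load balancer implements the feature; the ramp formula, the 60-second warm-up, and the server names are all assumptions made for illustration.

```python
def pick_server(active, returned_at, now, warmup=60.0):
    """Least-connection selection with a linear slow-start ramp.

    active      -- dict of server name -> current connection count
    returned_at -- dict of server name -> time (seconds) it rejoined the pool
    now         -- current time in seconds
    warmup      -- hypothetical ramp duration in seconds
    """
    def effective_load(server):
        age = now - returned_at.get(server, float("-inf"))
        if age >= warmup:
            return active[server]  # fully warmed up: plain least connection
        # During warm-up the server's load is inflated, so it is only chosen
        # while its count stays below roughly `ramp` times its peers' counts.
        ramp = max(age / warmup, 0.01)
        return (active[server] + 1) / ramp

    # sorted() makes tie-breaking deterministic (alphabetical order).
    return min(sorted(active), key=effective_load)


# Hypothetical pool: cas3 has just been returned to service with no connections.
pool = {"cas1": 50, "cas2": 50, "cas3": 0}
print(pick_server(pool, {}, now=0.0))             # no slow start: cas3 takes everything
print(pick_server(pool, {"cas3": 0.0}, now=1.0))  # slow start: cas3 is held back
```

Without the ramp, plain least connection keeps choosing the freshly returned server until it catches up with its peers, which is exactly the flood that slow start is meant to prevent.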

How can the complexity be reduced? For Exchange, we use predictable recovery models (for example, activation of a database copy). Our Preferred Architecture is designed to reduce complexity and deliver a symmetrical design that ensures that the user experience is maintained when failures occur.
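The arithmetic behind the asymmetric design problem is simple enough to sketch in a few lines of Python. The user and server counts below are made-up numbers, and the model assumes load spreads evenly across servers, which is a simplification.

```python
def users_per_server(total_users, servers):
    """Average number of users each server carries when all users are active."""
    return total_users / servers


def failover_load_factor(primary_servers, failover_servers):
    """How many times more load a failover server carries than a primary server."""
    return primary_servers / failover_servers


# Hypothetical environment: 20,000 users, 8 primary servers, 4 failover servers.
print(users_per_server(20000, 8))   # per-server load in the primary datacenter
print(users_per_server(20000, 4))   # per-server load after a datacenter failover
print(failover_load_factor(8, 4))   # 1.0 would indicate a symmetrical design
```

A load factor above 1.0 means each server in the failover datacenter must absorb proportionally more activity than it would in the primary datacenter, which is where the higher latencies come from.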

Ignoring Recommendations

Another concerning trend I witnessed is that customers repeatedly ignored recommendations from their product vendors. There are many reasons I’ve heard to explain away why a vendor’s advice about configuring or managing their own product was ignored, but it’s rare to see a case where a customer honestly knows more about how a vendor’s product works than does the vendor. If the vendor tells you to configure X or update to version Y, chances are they are telling you for a reason, and you would be wise to follow that advice and not ignore it.

Microsoft’s recommendations are grounded upon data – the data we collect during a support call, the data we collect during a Risk Assessment, and the data we get from you. All of this data is analyzed before recommendations are made. And because we have a lot of customers, the collective learnings we get from you play a big part.

Deployment Practices

When deploying a new version of software, whether it's Exchange or another product, it's important to follow an appropriate deployment plan. Customers that don't follow one take on the unnecessary risk of running into unexpected issues during the deployment.

Proper planning of an Exchange deployment is imperative. At a minimum, any deployment plan you use should include the following steps:

  1. Identify the business and technical requirements that need to be solved.
  2. Determine your peak usage time(s), and collect IO and message profile data during those times.
  3. Design a solution based on the requirements and data collected.
  4. Use the Exchange Server Role Requirements Calculator to model the design based on this collected data and any extrapolations required for your design.
  5. Procure the necessary hardware based on the calculator output, your design choices, and the advice of your hardware vendor.
  6. Configure the hardware according to your design.
  7. Before going into production, validate the storage system with Jetstress (following the recommendations in the Jetstress Field Guide) to verify that your storage configuration can meet the requirements defined in the calculator.
  8. Once the hardware has been validated, deploy a pilot that mirrors your expected production load.
  9. Collect performance data and analyze it. Verify that the data matches your theoretical projections. If the pilot requires additional hardware to meet the demands of the user base, optimize the design accordingly.
  10. Deploy the optimized design and start onboarding the remainder of your users.
  11. Continue collecting data and analyzing it, and adjust if changes occur.

The last step is important. Far too often, I see customers implement an architecture and then question why the system is overloaded. The landscape is constantly evolving. Years ago, bring your own device (BYOD) was not an option in many customer environments, whereas, now it is becoming the norm. As a result, your messaging environment is constantly changing – users are adapting to the larger mailbox quotas, the proliferation of devices, the capabilities within the devices, etc. These changes affect your design and can consume more resources. In order to account for this, you must baseline, monitor, and evaluate how the system is performing and make changes, if necessary.
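Steps 9 and 11 come down to comparing what you measure against what you modeled. Here is a minimal Python sketch of that comparison; the metric names, the sample values, and the 10% tolerance are hypothetical and are not taken from the calculator.

```python
def compare_to_projection(projected, measured, tolerance=0.10):
    """Return metrics whose measured value exceeds the projection by more
    than `tolerance`, mapped to the relative overshoot (0.3 means 30% over)."""
    deviations = {}
    for metric, expected in projected.items():
        overshoot = (measured[metric] - expected) / expected
        if overshoot > tolerance:
            deviations[metric] = round(overshoot, 3)
    return deviations


# Hypothetical projections from the design and measurements from the pilot.
projected = {"iops_per_mailbox": 0.10, "cpu_percent": 55.0}
measured = {"iops_per_mailbox": 0.13, "cpu_percent": 57.0}
print(compare_to_projection(projected, measured))
```

Running the same comparison on a schedule after go-live is one way to notice that the message profile has drifted away from the original design assumptions.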

Historical Data

To run a successful service at any scale, you must be able to monitor the solution not only to identify issues as they occur in real time, but also to proactively predict and trend how the user base or user base activity is growing. Performance, event log, and protocol logging data provide two valuable functions:

  1. It allows you to trend and determine how your users’ message profile evolves over time.
  2. When an issue occurs, it allows you to go back in time and see whether there were indicators that were missed.

The data collected can also be used to build intelligent reports that expose the overall health of the environment. These reports can then be shared at monthly service reviews that outline the health and metrics, actions taken within the last month, plans for the next month, issues occurring within the environment and steps being taken to resolve the issues.
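As one example of the trending described above, the following Python sketch fits a straight line to equally spaced samples (say, monthly averages) and estimates how many periods remain before a threshold is crossed. The sample values and the threshold are invented, and a linear fit is only an assumption; real growth may not be linear.

```python
def linear_trend(values):
    """Least-squares slope and intercept for equally spaced samples (needs >= 2)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(values, threshold):
    """Periods after the last sample until the fitted line reaches `threshold`.
    Returns None if the trend is flat or decreasing."""
    slope, intercept = linear_trend(values)
    if slope <= 0:
        return None
    return max((threshold - intercept) / slope - (len(values) - 1), 0.0)


# Hypothetical monthly averages of messages per mailbox per day.
monthly = [100, 110, 120, 130]
print(linear_trend(monthly))
print(periods_until(monthly, 150))
```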

If you do not have a monitoring solution capable of collecting and storing historical data, you can still collect the data you need.

  • Exchange 2013 captures performance data automatically and stores it in the Microsoft\Exchange Server\V15\Logging\Diagnostics\DailyPerformanceLogs folder. If you are not running Exchange 2013, you can use Experfwiz to capture the data.
  • Event logs capture all relevant events that Exchange writes natively. Unfortunately, I often see customers configure Event logs to flush after a short period of time (one day). Event logs should collect and retain information for one week at a minimum.
  • Exchange automatically writes a ton of useful information into protocol logs that can tell you how your users and their devices behave. Log Parser Studio 2.2 provides means to interact with this data easily.
  • Message tracking data is stored on Hub Transport servers and/or Mailbox servers and provides a wealth of information on the message flow in an environment.
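For the event log retention point, a small Python sketch shows the kind of check you could run once you know the timestamp of the oldest event still in a log. How you obtain that timestamp is left to your own tooling; the dates below are made up, and note that a log that was only recently created will also report a shortfall.

```python
from datetime import datetime, timedelta


def retention_shortfall(oldest_event, now, minimum=timedelta(days=7)):
    """Return how far the log's coverage falls short of `minimum`
    (a zero timedelta means the retention target is met)."""
    covered = now - oldest_event
    return max(minimum - covered, timedelta(0))


# Hypothetical log whose oldest remaining event is only one day old.
oldest = datetime(2015, 1, 6, 8, 0)
now = datetime(2015, 1, 7, 8, 0)
print(retention_shortfall(oldest, now))
```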

Summary

As I said at the beginning of this article, many of these customer issues could have been avoided by taking sensible, proactive steps. I hope this article inspires you to investigate how many of these might affect your environments, and more importantly, to take steps to resolve them, before you are my next critsit escalation.

Ross Smith IV
Principal Program Manager
Office 365 Customer Experience

Comments (22)
  1. Yuhong Bao says:

    Not that it is a good idea, but I do wonder if Custom Support for older Exchange 2010 Service Packs exists.

  2. Anon-1 says:

    @Ross, Thank you for the Article.
    As an Exchange Server customer, it would be great if Microsoft could focus on Exchange Server CUs & RUs QA. If Microsoft can focus on Exchange Server CUs & RUs QA, us Exchange Server customers will not be behind in CUs & RUs. I hope you can run this by "Upper Management".

  3. Anonymous says:

    Labs can be very cost prohibitive especially if you run an Exchange hosting business based on private cloud and resource forest model. Also, labs don’t always allow you to simulate traffic and load.

    While upgrading a 2007 customer to 2013 we hit an issue where the 2013 servers would send almost all of the traffic to one 2007 CAS for the un-migrated mailboxes. The legacy CAS became overloaded and started refusing connections leading to a mass outage. After a 20 hour critsit with MS support we were told there was an "undocumented fix" that was included in 2013 SP1 which had been released 2 weeks prior.

    After installing CU6 for 2013 in November we hit a problem where mailbox databases failed over multiple times during the day. We had to install an Interim Update to fix it (http://support.microsoft.com/kb/2997209). The Interim Update caused problems with the OWA UI when users would select Options. We now have to uninstall the interim update before CU7 can be applied.

    Both these problems could have been avoided if Microsoft had tested the 2013 CU more thoroughly before releasing. It appears that because you don’t have any co-existence in O365 you don’t test it so on-premise customers are left to test in the field.

    There is a fine balance between being on the latest and greatest release and a tried and trusted release!

    Have to admit I’m apprehensive about installing CU7 for 2013, but I have a group of Russian users who are becoming more vocal about their DST update.

    Mitchell.

  4. @Nino Thanks for that link. I remember hearing about the session but I wasn’t at TechEd & didn’t watch it afterwards. Really good stuff! Would be great to eventually have the teched article for reference during perf cases.

  5. @Ross Thanks for your reply.

    1. I understand. Unfortunately this makes it harder for customers to keep track of all the changes and latest recommendations. Maybe not for a dedicated Exchange consultant, but I preferred the previous model with all information on TechNet and using the blog to announce changes.

    2. Thanks for rewording the article, good to see that there’s no change in the recommendation here.

    3. I fully agree and see the exact same issues with customers I visit. I’m pleasantly surprised when customer is on N-2 or N-3. :-)

  6. Nino Bilic says:

    @ ASHigginbotham: actually, there is some work ongoing to address some of this, you are not the only one asking :). I am still not totally sure which format this is going to take, but there are people working to consolidate some of that information for Exchange 2013.

    In the mean time – just wanted to make sure that you have seen this:
    http://channel9.msdn.com/Events/TechEd/NorthAmerica/2014/OFC-B321

  7. Paul Newell says:

    This is a good blog, Ross. Unfortunately I fear the people that really need this information are the people who won’t be reading it (e.g. the "Generalist" network administrator who doesn’t focus on Exchange), though this is a great starting point for people to read when starting the troubleshooting processes.

    Additionally, you learn something new every day! Specifically, that SSL offloading is generally not recommended. Out of curiosity, why is that? Potential security holes? Issues with implementation from one load balancer make/model to another?

  8. @Paul – more often than not, you have more than enough CPU to handle the SSL processing on the Exchange servers (it’s baked into our guidance). There are load balancing solutions that terminate SSL at the load balancer, but require an extra device/hardware to re-encrypt the SSL before sending it to the destination server. So in this scenario, the customer either has to purchase another device, or leave it in the clear.

    SSL bridging is a far simpler solution (and more secure as it ensures no nefarious individual within the environment is sniffing the network and seeing unencrypted data).

    Ross

  9. Yuhong Bao says:

    @MitchMG: I think CU7 is going well, after being delayed by a month, though it is unfortunate they forgot about the Exchange 2010 update rollup.

  10. Very interesting article, thanks. For capacity planning it’s a shame that the "Understanding Exchange Performance" section on TechNet never was written for Exchange 2013. This was a very comprehensible section to understand processor, memory and storage configurations as well as multi-role deployments for Exchange 2007 and 2010. Customers could use this information to design their solution. Historically the Exchange Calculator sheet was meant to validate the design.

    For Exchange 2013 this information was replaced by a single blog post (http://blogs.technet.com/b/exchange/archive/2013/05/06/ask-the-perf-guy-sizing-exchange-2013-deployments.aspx) and now the recommendation is to use the Calculator sheet to *design* your environment. (bullet #2 under the Deployment Practices section of this article).

    So one could say that the Exchange team contributed to badly designed Exchange 2013 environments by failing to supply the required information in a similar way to the Exchange 2007 and 2010 documentation.

    Another thing is the N-1 updating policy. This is a complex area and there are many reasons why organizations currently cannot implement such a policy. One of the reasons is that every Exchange 2013 CU is an SP and cannot be rolled back. The second one is of course that the QA of released updates still keeps failing, most recently with Exchange 2013 CU6 and UR8 for 2010 SP3. So if you want customers to follow that policy, keep investing in the quality of the updates and try to make the update process less impactful for customers, for instance by providing a roll-back option.

    @Paul Newell: Not often recommended is not the same as generally not recommended.

  11. @Jetze –

    1. We chose to release the performance guidance via the blog to enable us to communicate quicker and adopt changes at a faster pace (TechNet articles now link to it). In fact, we just updated the article the other day to add clarity with hyperthreading and virtualization. We have taken an approach with this release to be more explicit with design recommendations (the Preferred Architecture), as opposed to how we provided guidance in previous releases. You can expect this to continue in the future, as well.

    2. The calculator is and has always been a modeling tool to help determine the proper layout and configuration based on chosen design parameters. I’ve updated the above section to make this clearer, thanks for the feedback.

    3. The reason I touched on software patching is because there are customers still running on really old releases. I know this because I had to deal with several of these customers, recently. In one instance, the outage was due to two code defects in the E2010 SP2 codebase that were corrected in SP3 and were publicly documented. Adding further complication was that this customer was not even running the most current rollup release for the service pack (RU4 instead of RU8). In other words, this customer was running a version of software from August 2012 in 2014. Their reasoning was that they wait one year after the release of a new service pack. Unfortunately, they couldn’t even follow their own protocol (SP3 was released February 2013). This was simply poor operational practice.

    Ross

  12. Great post. To piggy back on Jetze’s comments, I would really like a 2013 version of these articles:

    http://technet.microsoft.com/en-us/library/dd335215(v=exchg.141).aspx

    http://technet.microsoft.com/en-us/library/ff367871(v=exchg.141).aspx

    Namely the relevant Exchange 2013 counters, along with their expected/acceptable values. This would greatly help many of us analyze & diagnose performance issues.

  13. Nino Bilic says:

    @MitchMG and Anon-1: I don’t think you will get any argument from us that WE must ensure Exchange releases are trouble-free. Having been close to support for years though – I think the problem Ross is describing is much more than "being gun-shy on latest Exchange releases". In some cases we are talking being a year or more behind on Windows updates, clients or yes – Exchange. While on a certain level, a very good argument for caution and careful testing can be made for any of those ‘levels’ of patching, all we are saying is that we have seen cases where that has been taken too far. The number of support cases I have seen because network card drivers were 2-3 years out of date… (just picking on an example).

    Again, you both definitely have a point.

  14. shudson says:

    Nice article Ross, I guess most of us are guilty of one or two ‘oversights’ during design, deployment and on-going support.

  15. Joe Palarchio says:

    Agreed, we see the same concerns in the field. The rate of change across all systems has in many cases overwhelmed the organization’s IT staff. Fortunately, with the clients I see, we’re moving them to Office 365 and these management tasks are largely reduced for them.

    For organizations not willing or staffed to properly manage an enterprise mail platform, this is one of the large benefits of moving to a service-based mail platform.

    It’s not that Office 365 alleviates all of these tasks or is the answer to all scenarios but it should certainly be considered by those not doing the above.

  16. filipp says:

    Somehow this post makes me angry.
    The Idea to publish a collection of frequent errors is good. But _my_ feeling is, that "By far, the most common issue"(Cite) is that the Quality of Microsoft Software gets worse with every Release. So it’s not a good idea to point at the customers who are "too stupid" to apply updates or track Changes or build a Test-Lab (in former times, MS tested the Software before release…)

    Regards

    Filipp

  17. mike says:

    @Nino – Staying up-to-date with drivers and firmware alongside adequately testing can be a real challenge for some organizations, especially smaller ones. It’s much easier to test software updates because a test environment can be built relatively easily with VMs or spare hardware. For drivers and especially firmware, some organizations might not have an exact duplicate spare box around that they can wrench on.

    Yes, yes, I know some will say that small organizations should go fully cloud-based, but we all know that isn’t the current day-to-day reality.

    By the way, I’m not saying drivers and firmware updates should be ignored, but most IT pros are wary of any updates these days to begin with. A firmware update that can literally brick a box that would then have to be fixed during the same maintenance window that’s rapidly closing combined with an IT team working the problem (sometimes after working a full day) isn’t an experience anyone wants to repeat.

  18. Ali Yussuf says:

    Another awesome, well written blog Ross!

  19. I’ll usually ignore recommendations that I disagree with. If your recommendation makes something more complicated, more expensive, or is just abnormal (e.g. the ongoing narrative regarding SSL certificate naming, which eventually made it to a fairly logical place after the CAs put their foot down), I need to know exactly why the simpler, cheaper, or industry-standard method won’t work. If I don’t get that information, I’ll try it my way first. If there’s a problem, then I get my answer. (And when the customer asks me why we didn’t do it the cheaper way, now I can actually tell them why.) If there’s no problem, great! Sometimes the vendor will eventually acknowledge that the simpler way is fine.

    But I’ve also never had a critical issue caused by my antics. I take full responsibility for anything I do, and I don’t recommend others do things my way. But my point is, I think people would follow recommendations more closely if the repercussions were explained in more concrete terms. Saying "trust us, we have data" or "trust us, we’re Microsoft" isn’t quite enough for me. :)

  20. DJ says:

    While it would be wonderful to have the nearly unlimited time and resources required to follow all of your recommendations, I know of very, very few IT organizations that do. I would challenge you to visit many of your customers and see where this happens. Many of them can’t keep their systems on supported OSes, to say nothing of keeping every system on the latest OS updates, application updates, BIOS/video card/NIC/HBA firmware, etc.

  21. Bill Witten says:

    Ross, we all know history repeats itself. These were the same trends we saw in 1999-2004–different issues but same trends. What has happened is the additional engagement of the PG in driving good designs and adherence to recommendations is now a shadow of what it once was. Support costs will rise due to lack of investment in this area until they get high-enough to require a response. Then the PG will involve itself for a few years and then stop again. I think of it as organizational Alzheimer’s. :-)

    I have been helping my current customer interview Exchange candidates and it is very, very reminiscent of 1999. The lack of a "buzz" regarding doing things the right way has a price and the only place that "buzz" can come from is the PG. I really hope you guys return to pursuing good planning, design and execution for Exchange deployments before it gets as bad as it did last time.

    If you need any ideas on how to address it, I *may* have a few. :-)

    –billw

Comments are closed.