Reasons why we ask for removal of third party software when troubleshooting Exchange

Over the last few months, we have seen an increase in the number of critical situations (referred to as CritSits) where third party software is causing issues within Exchange 2013 and 2016.  As Exchange engineers, we have seen several trends across many of these cases, and this blog post is meant to ease the concerns of customers who are faced with a decision to disable or remove their third party software.

When troubleshooting something as complex as Exchange, we often ask the following questions:

  • What version of Exchange are you on?
  • What is your server/DAG configuration?
  • Has there been any recent change in the environment?
  • Can we remove any third party add-ins?

These questions come up especially when we start troubleshooting client connectivity issues, otherwise known as XCSI.

As Exchange engineers, we understand that removing third party add-ins can be difficult or nearly impossible for some organizations.  Please understand that we do not ask for the removal of this software lightly.  Many of the support engineers you talk to, especially once you get to the tier 3 level, worked with Exchange prior to joining Microsoft.  We know it's not easy, but it's a valid and important part of the troubleshooting process.

Some of the pushback we get often involves these questions and objections:

  • Can't you just troubleshoot around it?
  • It's never been an issue in the past!
  • Our internal processes will not allow for removal of the product.

While we appreciate the difficulty of our request, please understand that for many advanced forms of troubleshooting, removal of the product is going to be required.

For example, in the last few months, a number of customers experienced the following:

  • A third party product was directly interfacing with store.exe, inhibiting the Exchange information store from functioning correctly.  This caused random failovers, because each time the store crashed, a failover was initiated.
  • A third party product created an additional transport agent, which was not able to handle the amount of incoming mail for a customer's environment, causing significant Edge Transport delays.
  • A third party product was installed to help monitor the network traffic on the Exchange servers.  However, this product was not capable of processing the amount of traffic Exchange handles, which resulted in high CPU usage and insufficient system resources to monitor the volume of incoming ActiveSync and Outlook requests.
  • A third party product was installed, but it was not honoring the antivirus exclusions and instead was scanning the customer's databases, resulting in database locks and corruption.
  • A third party product was interfering with HTTPS traffic, causing significant handle and memory usage and resulting in an Exchange outage.  The customer's internal security team would not allow removal of the product, which prolonged the outage for the customer.

In many of these cases, the customer was adamant that we could not uninstall or even disable their third party products, resulting in some very long critical situations and loss of revenue for these customers.  Our goal as engineers is to work with you and your organization to get email resources restored as quickly as we can.

In some cases, we can do troubleshooting around the third party products, such as when a third party transport agent is installed.  We can take a pipeline trace, as outlined at https://technet.microsoft.com/en-us/library/bb125018(v=exchg.150).aspx.  Then, we can see which transport agent is misbehaving.  However, in many cases, we also end up having to disable the transport agent, especially if we see something like this:

Time:     1/01/2017 1:01:01 AM
ID:       1050
Level:    Warning
Source: MSExchange Extensibility
Machine:  mail.contoso.com
Message:  The execution time of agent 'Third Party Agent' exceeded 90000 milliseconds while handling event 'OnRoutedMessage' for message with InternetMessageId: 'XXXXXXXXXX'. This is an unusual amount of time for an agent to process a single event. However, Transport will continue processing this message.

However, it is not always possible for us to do that in more complex environments.
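When many of these warnings have been exported as text, a small script can show which agent and event account for them.  The following is only a minimal sketch in Python, assuming the messages follow the wording of the event 1050 warning quoted above; the function name summarize_slow_agents and the sample list are hypothetical and not part of any Exchange tooling.

```python
import re
from collections import defaultdict

# Pattern derived from the wording of the event 1050 warning quoted above.
# Assumption: exported messages keep this exact phrasing.
AGENT_WARNING = re.compile(
    r"The execution time of agent '(?P<agent>[^']+)' exceeded "
    r"(?P<ms>\d+) milliseconds while handling event '(?P<event>[^']+)'"
)


def summarize_slow_agents(messages):
    """Count slow-agent warnings per (agent, event) pair.

    Also records the largest "exceeded N milliseconds" value seen for each pair.
    """
    summary = defaultdict(lambda: {"count": 0, "max_ms": 0})
    for message in messages:
        match = AGENT_WARNING.search(message)
        if match is None:
            continue  # not a slow-agent warning, skip it
        entry = summary[(match["agent"], match["event"])]
        entry["count"] += 1
        entry["max_ms"] = max(entry["max_ms"], int(match["ms"]))
    return dict(summary)


# Hypothetical input: the warning from this post plus an unrelated line.
sample = [
    "The execution time of agent 'Third Party Agent' exceeded 90000 milliseconds "
    "while handling event 'OnRoutedMessage' for message with InternetMessageId: 'XXXXXXXXXX'.",
    "Some unrelated message",
]

for (agent, event), stats in summarize_slow_agents(sample).items():
    print(f"{agent} / {event}: {stats['count']} warning(s), up to {stats['max_ms']} ms")
```

A summary like this only tells us which agent to look at first; it does not replace a pipeline trace or the step of disabling the agent.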

A question we often get is, "Can we just disable the software?"  In many cases, this is not possible because of file system filter drivers.  The documentation at https://msdn.microsoft.com/en-us/windows/hardware/drivers/ifs/what-is-a-file-system-filter-driver- defines these drivers as follows:

A file system filter driver is an optional driver that adds value to or modifies the behavior of a file system. A file system filter driver is a kernel-mode component that runs as part of the Windows executive.

A file system filter driver can filter I/O operations for one or more file systems or file system volumes. Depending on the nature of the driver, filter can mean log, observe, modify, or even prevent. Typical applications for file system filter drivers include antivirus utilities, encryption programs, and hierarchical storage management systems.

Because of the way these file system filter drivers insert themselves into the system, we need to uninstall the product in order to truly disable them.  Again, this is not the fault of the software, but simply the way these drivers are designed to work.  To borrow a line from Obi-Wan Kenobi in the movie "Star Wars: Episode III - Revenge of the Sith": "They are doing their job so we can do ours."  However, we have seen that when an older, out-of-date version or a misconfiguration, such as missing or incorrect antivirus exceptions, is in place, Exchange performance suffers or Exchange fails outright.
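One way to catch missing antivirus exceptions early is to compare the paths Exchange uses against the folder exclusions configured in the antivirus product.  The sketch below is a hedged illustration in Python only: the function name uncovered_paths, the E:\Logs\DB01 path, and the exclusion list are hypothetical, and it handles plain folder or file paths only, not wildcards, process exclusions, or extension exclusions.

```python
from pathlib import PureWindowsPath


def uncovered_paths(exchange_paths, exclusions):
    """Return the Exchange paths that no exclusion entry covers.

    A path counts as covered when an exclusion is the path itself or one of
    its parent folders. Everything is lowercased first because Windows paths
    are case-insensitive.
    """
    excluded = [PureWindowsPath(entry.lower()) for entry in exclusions]
    missing = []
    for raw in exchange_paths:
        path = PureWindowsPath(raw.lower())
        covered = any(entry == path or entry in path.parents for entry in excluded)
        if not covered:
            missing.append(raw)
    return missing


# Hypothetical inputs: one database file and one log folder, with a single
# folder exclusion configured in the antivirus product.
exchange_paths = [r"D:\Databases\DB01.EDB", r"E:\Logs\DB01"]
exclusions = [r"d:\databases"]

for path in uncovered_paths(exchange_paths, exclusions):
    print(f"No antivirus exclusion covers: {path}")
```

A check like this does not prove the product honors its exclusions; it only confirms that the configured list covers the paths you expect.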

When third party software is misconfigured, it can cause files to lock, in a similar fashion to the event below:

Time:     1/1/2017 1:01:10 AM
ID:       489
Level:    Error
Source: ESE
Machine:  mail.contoso.com
Message:  msexchangerepl (4456) An attempt to open the file "D:\Databases\DB01.EDB" for read only access failed with system error 32 (0x00000020): "The process cannot access the file because it is being used by another process. ".  The open file operation will fail with error -1032 (0xfffffbf8).

If these files are locked during a crucial process, this can cause Exchange to fail over, the information store to crash, or any number of other issues.  There are several ways we can attempt to track this, including using Process Explorer.  However, this is not applicable in every case, and in many cases we need to exclude the products which we know can cause this issue.
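When reviewing exported ESE events, it can help to pull out which file was locked and which process failed to open it.  The following is a minimal Python sketch, assuming the message follows the wording of the event 489 text quoted above; the function name parse_open_failure is hypothetical.  System error 32 is the Windows sharing violation code, meaning the file is in use by another process.

```python
import re

# Pattern derived from the wording of the ESE event 489 message quoted above.
# Assumption: exported messages keep this exact phrasing.
OPEN_FAILURE = re.compile(
    r'(?P<process>\S+) \((?P<pid>\d+)\) An attempt to open the file "(?P<path>[^"]+)" '
    r"for (?P<access>.+?) access failed with system error (?P<code>\d+)"
)

SHARING_VIOLATION = 32  # Windows ERROR_SHARING_VIOLATION: file in use by another process


def parse_open_failure(message):
    """Extract process, PID, file path, and system error from an event 489 message.

    Returns None when the message does not match the expected wording.
    """
    match = OPEN_FAILURE.search(message)
    if match is None:
        return None
    code = int(match["code"])
    return {
        "process": match["process"],
        "pid": int(match["pid"]),
        "path": match["path"],
        "access": match["access"],
        "system_error": code,
        "is_sharing_violation": code == SHARING_VIOLATION,
    }


# The start of the message from the event shown in this post.
sample = (
    r'msexchangerepl (4456) An attempt to open the file "D:\Databases\DB01.EDB" '
    "for read only access failed with system error 32 (0x00000020)"
)

details = parse_open_failure(sample)
if details and details["is_sharing_violation"]:
    print(f"{details['process']} could not open {details['path']}: file is in use by another process")
```

The event itself does not say which process holds the lock, so this only narrows down which files to watch with a tool such as Process Explorer.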

We also have instances where we need to take a process dump (often called a ProcDump) or a Time Travel Trace (often referred to as an iDNA).  With iDNA in particular, unless antivirus is removed, the trace will almost never complete successfully.  In those instances, we will have to remove the third party software in order to continue troubleshooting.

What we have discovered in working with many third party vendors is that customers have often ignored the advice of either their third party vendors or Microsoft.  As Ross Smith IV once put it so eloquently in his blog post titled "Concerning Trends Discovered During Several Critical Escalations":

"Another concerning trend I witnessed is that customers repeatedly ignored recommendations from their product vendors. There are many reasons I’ve heard to explain away why a vendor’s advice about configuring or managing their own product was ignored, but it’s rare to see a case where a customer honestly knows more about how a vendor’s product works than does the vendor. If the vendor tells you to configure X or update to version Y, chances are they are telling you for a reason, and you would be wise to follow that advice and not ignore it.

Microsoft’s recommendations are grounded upon data– the data we collect during a support call, the data we collect during a Risk Assessment, and the data we get from you. All of this data is analyzed before recommendations are made. And because we have a lot of customers, the collective learnings we get from you plays a big part."

In one instance, a third party vendor told a customer not to install their product on Exchange, as it had difficulties handling the bursty nature of Exchange under high traffic loads, and recommended a different version which was supported on the customer's version of Exchange.  However, this customer then turned to Microsoft and requested that we assist them in configuring the original third party software, over the objections of their vendor.  In cases like this, because Microsoft does not have any knowledge of the inner workings of the vendor's software, we are unable to help with the implementation.

In almost every instance, we find the third party software doing exactly what it is supposed to be doing - hooking into the operating system or processes to perform certain tasks, or to prevent certain tasks from taking place.  However, alongside extremely busy software such as Exchange, the third party software can reach a point where it can only handle so many instructions in a given period of time.  This could be due to a limitation in how the code is executed or any number of other reasons.  Our job is not to assign blame, but to get your environment back up and operating as quickly as possible.

Help Us Help You

The biggest way we can help you is for you to carry out the action plans we provide.  Help us eliminate any roadblocks which might stand in the way of enacting those action plans.  Even if removing the third party software doesn't eliminate the issue, it allows us to perform additional troubleshooting steps which we may need to do.  Remember, more advanced troubleshooting data such as ProcDumps and iDNA traces is difficult, if not impossible, for us to collect if we do not remove the hooks these third party pieces of software put into our products.

We know uninstalling third party software is not what any of us want to do - from you, to your account manager, to the engineer you are working with.  We understand there are risks associated with uninstalling AV and other products.  We've been there - in your shoes, most likely trying to figure out the same question - "How do I do what my vendor is saying?"  However, in certain cases, it is unavoidable.

In every scenario outlined above, the step that resolved the issue and restored service was the removal of the third party product.  It wasn't necessarily because the third party product was doing anything wrong; rather, it was either misconfigured or simply could not handle the resource-intensive requests that occur within Exchange environments.

Troubleshooting is like taking a very large onion, peeling back the layers, and finding out what is going on.  In large and complex environments, such as Exchange, we will find many more layers, or unnecessary complexity.  As engineers, you may have been working with your environment for over ten years, but on the phone, we have to get up to speed in an hour or less.  Help us reduce the complexity, and help us help you get your environments back up as quickly as we can.  We understand talking to an Exchange engineer was most likely not in your plans that day, but we are here to help you as much as we can.  I hope this article helps explain some of the steps we take and the items we look for when troubleshooting large and complex Exchange environments.