The Compound Case of the Outlook Hangs


This case was shared with me by a friend of mine, Andrew Richards, a Microsoft Exchange Server Escalation Engineer. It’s a really interesting case because it highlights the use of a Sysinternals tool I specifically wrote for use by Microsoft support services, and it’s actually two cases in one.

The case unfolds with a systems administrator at a corporation contacting Microsoft support to report that users across their network were complaining of Outlook hangs lasting up to 15 minutes. The fact that multiple users were experiencing the problem pointed at an Exchange issue, so the call was routed to Exchange Server support services.

The Exchange team has developed a Performance Monitor data collector set that includes several hundred counters that have proven useful for troubleshooting Exchange issues, including LDAP, RPC and SMTP message activity, Exchange connection counts, memory usage, and processor usage. Exchange support had the administrator collect a log of the server’s activity in 12-hour cycles, the first running from 9pm until 9am the next morning. When the Exchange support engineers viewed the log, two patterns were clear despite the heavy density of the plots: first, as expected, the Exchange server’s load increased during the morning as users came into work and started using Outlook; and second, the counter graphs showed a difference in behavior between about 8:05 and 8:20am, a window that corresponded exactly to the long delays users were reporting:

[Figure: 12-hour Performance Monitor log of the Exchange counter set, showing the morning load increase and the change in behavior between about 8:05 and 8:20am]
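For readers who want to reproduce a similar, much smaller baseline capture, the built-in Logman tool can create a binary circular counter log. The two counters below are only illustrative stand-ins for the several hundred in the Exchange data collector set, and the counter names should be confirmed in Performance Monitor on the target server:

logman create counter ExchangeBaseline -f bincirc -max 512 -si 00:00:15 -o c:\perflogs\ExchangeBaseline -c "\Process(store)\% Processor Time" "\MSExchangeIS\RPC Averaged Latency"

logman start ExchangeBaseline

The circular 512 MB log keeps a 12-hour capture from growing without bound, and the counters are sampled every 15 seconds.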

The support engineers zoomed in and puzzled over the counters in the timeframe and could see Exchange’s CPU usage drop, the active connection count go down, and outbound response latency drastically increase, but they were unable to identify a cause:

[Figure: zoomed view of the counters during the anomaly, showing the CPU usage drop, the falling connection count, and the increased outbound response latency]

They escalated the case to the next level of support and it was assigned to Andrew. Andrew studied the logs and concluded that he needed additional information about what Exchange was doing during an outage. Specifically, he wanted a process memory dump of Exchange when it was in the unresponsive state. This would contain the contents of the process address space, including its data and code, as well as the register state of the process’s threads. Dump files of the Exchange process would allow Andrew to look at Exchange’s threads to see what was causing them to stall.

One way to obtain a dump is to “attach” to the process with a debugger like WinDbg from the Debugging Tools for Windows package (included with the Windows Software Development Kit) and execute the .dump command, but downloading and installing the tools, launching the debugger, attaching to the right process, and saving dumps is an involved procedure. Instead, Andrew directed the administrator to download the Sysinternals Procdump utility (a single executable that can be run without installing any software on the server). Procdump makes it easy to obtain dumps of a process and includes options that create multiple dumps at a specified interval. Andrew asked the administrator to run Procdump the next time the server’s CPU usage dropped so that it would generate five dumps of the Exchange Server engine process, Store.exe, spaced three seconds apart:

procdump -n 5 -s 3 store.exe c:\dumps\store_mini.dmp

The next day the problem reproduced and the administrator sent Andrew the dump files Procdump had generated. When a process temporarily hangs it’s often because one thread in the process acquires a lock protecting data that other threads need to access, and holds the lock while performing some long-running operation. Andrew’s first step was therefore to check for held locks. The most commonly used intra-process synchronization lock is a critical section and the !locks debugger command lists the critical sections in a dump that are locked, the thread ID of the thread owning the lock, and the number of threads waiting to acquire it. Andrew used a similar command, !critlist from the Sieext.dll debugger extension (the public version of which, Sieextpub.dll, is downloadable from here). The output showed that multiple threads were piled up waiting for thread 223 to release a critical section:

[Figure: !critlist output showing multiple threads waiting on a critical section owned by thread 223]
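For reference, the command sequence behind that output is roughly the following (a sketch that assumes Sieextpub.dll has been copied somewhere in the debugger's extension search path; otherwise .load needs the full path to the DLL):

.load sieextpub

!locks

!critlist

!locks is the built-in command described above; !critlist comes from the SIE extension and condenses the owned critical sections and their waiting threads into one list.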

His next step was to see what the owning thread was doing, which might point at the code responsible for the long delays. He switched to the owning thread’s register context using the ~ command and then dumped the thread’s stack with the k command:

[Figure: stack trace of the owning thread, including frames in Savfmsevsapi]
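In command form, with 223 being the owning thread reported by !critlist, that is simply:

~223s

k

(kb or kv can be substituted for k to also display stack-frame parameters.)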

As we’ve seen in previous Cases of the Unexplained, the debugger was unsure how to interpret the stack when it came across a stack frame pointing into Savfmsevsapi, an image for which it couldn’t obtain symbols. Most Windows images have their symbols posted on the Microsoft symbol server, so this was likely a third-party DLL loaded into the Store.exe process and therefore a suspect in the hangs. The list modules (“lm”) command dumps version information for loaded images, and the image’s path made it obvious that Savfmsevsapi was part of Symantec’s mail security product:

[Figure: lm output showing the Savfmsevsapi image path and version information, identifying Symantec’s mail security product]
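In the debugger that lookup is a one-liner, for example:

lmvm savfmsevsapi

lmvm restricts the verbose module listing to images matching the name pattern, showing the on-disk path and file version information for just the module of interest.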

Andrew checked the other dumps and they all had similar stack traces. With the anecdotal evidence seeming to point at a Symantec issue, Andrew forwarded the dumps and his analysis, with the administrator’s permission, to Symantec technical support. Several hours later they reported that the dumps indeed revealed a problem with the mail application’s latest antivirus signature distribution and forwarded a patch to the administrator that would fix the bug. He applied it and continued to monitor the server to verify the fix. Sure enough, the server settled into fairly regular activity levels and the long delays disappeared.

However, over the subsequent days the administrator started to receive complaints, albeit at a lower rate, from several users that Outlook was sporadically hanging for up to a minute. Andrew asked the administrator to send another correlating 12-hour Performance Monitor capture taken with the Exchange data collector set, but this time there was no obvious anomaly:

[Figure: follow-up 12-hour Performance Monitor capture with no obvious anomaly]

Wondering if the hangs would be visible in the CPU usage history of Store.exe, he removed all the counters except for Store’s processor usage counter. When he zoomed in on the morning hours, when users began to log in and load on the server increased, he noticed three spikes around 8:30am:

[Figure: Store.exe processor usage showing three spikes around 8:30am]

Because the server had eight cores, the processor usage counter for an individual process has a possible range of 0 to 800, so the spikes were far from taxing the system, but they were definitely higher than Exchange’s typical range on that server. Zooming in further and setting the graph’s vertical scale to make the spikes more distinct, he observed that outside the spikes CPU usage stayed below about 75% of a single core, and that the spikes themselves lasted 15 to 30 seconds:

[Figure: zoomed view of the CPU spikes, each lasting 15 to 30 seconds]

What was Exchange doing during the spikes? They were too short-lived and random for the administrator to run Procdump like he had before and reliably capture dumps when they occurred. Fortunately, I designed Procdump with this precise scenario in mind. It supports several trigger conditions that, when met, cause it to generate a dump. For example, you can configure Procdump to generate a dump of a process when the process terminates, when its private memory usage exceeds a certain value, or even based on the value of a performance counter you specify. Its most basic trigger, though, is the CPU usage of the process exceeding a specified threshold for a specified length of time.
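For reference, those other triggers look roughly like this. These are illustrative sketches only: the -t, -m and -p switches exist in Procdump, but the exact syntax, and the performance counter named in the last example, should be checked against procdump -? and Performance Monitor for the versions in your environment:

procdump -t store.exe c:\dumps\store_exit.dmp

procdump -m 1500 store.exe c:\dumps\store_mem.dmp

procdump -p "\MSExchangeIS\RPC Requests" 100 store.exe c:\dumps\store_counter.dmp

The first writes a dump when the process terminates, the second when committed memory exceeds roughly 1.5 GB, and the third when the specified performance counter exceeds the given threshold.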

The Performance Monitor log gave Andrew the information he needed to craft a Procdump command line that would capture dumps for future CPU spikes:

procdump.exe -n 20 -s 10 -c 75 -u store.exe c:\dumps\store_75pc_10sec.dmp

The arguments configure Procdump to generate a dump of the Store.exe process when Store’s CPU usage exceeds 75% (-c 75) relative to a single core (-u) for 10 seconds (-s 10), to generate up to 20 dumps (-n 20) and then exit, and to save the dumps in the C:\Dumps directory with names that begin with “store_75pc_10sec”. The administrator executed the command before leaving work, and when he checked on its progress the next morning it had finished creating 20 dump files. He emailed them to Andrew, who proceeded to study them in the WinDbg debugger one by one.

When Procdump generates a dump because the CPU usage trigger is met, it sets the thread context in the dump file to the thread that was consuming the most CPU at the time of the dump. Since the debugger’s stack-dumping commands are relative to the current thread context, simply entering the stack-dumping command shows the stack of the thread most likely to have caused a CPU spike. Over half the dumps were inconclusive, apparently captured after the spike that triggered the dump had already ended, or with threads that were executing code that obviously wasn’t directly related to a spike. However, several of the dumps had stack traces similar to this one:

[Figure: stack trace of the busiest thread, including Store’s EcFindRow function]
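Because Procdump has already selected the busiest thread as the current context, inspecting each dump, like the one above, amounts to little more than:

k

!runaway

!runaway is an optional cross-check that lists each thread's accumulated CPU time, which helps confirm which threads have actually been busy.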

The stack frame that stuck out listed Store’s EcFindRow function, which implied that the spikes were caused by lengthy database queries, the kind that execute when Outlook accesses a mailbox folder with thousands of entries. With this clue in hand, Andrew suggested the administrator create an inventory of large mailboxes and pointed him at an article the Exchange support team had written that describes how to do this for each version of Exchange:

[Figure: Exchange team article describing how to inventory folders with high item counts for each version of Exchange]

Sure enough, the script identified several users with folders containing tens of thousands of items. The administrator asked the users to reduce their item count to well below 5000 (the Exchange 2003 recommendation; the limit has been raised in each subsequent version, to 100,000 in Exchange 2010) by archiving the items, deleting them, or organizing them into subfolders. Within a couple of days they had reorganized the problematic folders and user complaints ceased entirely. Ongoing monitoring of the Exchange server over the following week confirmed that the problem was gone.
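The script here ran against Exchange 2003; on Exchange 2007 and later, the Exchange Management Shell reduces the same inventory to roughly a one-liner. The following is a sketch only, with cmdlet and property names as they appear in those later versions:

Get-Mailbox -ResultSize Unlimited | Get-MailboxFolderStatistics | Where-Object { $_.ItemsInFolder -gt 5000 } | Sort-Object ItemsInFolder -Descending | Select-Object Identity, ItemsInFolder

The 5000-item cutoff matches the Exchange 2003 guidance mentioned above; on newer versions the threshold can be raised accordingly.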

With the help of Procdump, the compound case of the Outlook hangs was successfully closed.


Comments (39)

  1. Anonymous says:

    Ray: You need to use the x86 version in the x86 debugger, and the x64 version in the x64 debugger.

  2. Anonymous says:

    – ac –

    Use the Performance Tools for Windows (also in the Windows SDK) – it's like perfmon, but gathers much more granular data. Run it in ring mode with a 1 GB backing file; when you get the glitch, stop the capture.

      xperf -on Base+Diag+Latency+FileIO+DRIVERS+DPC -ring -maxFile 1024

      echo Press a key when you want to stop…

      pause

      xperf -stop -d result.etl

    Then use xPerfView to zoom into the msecs of interest to find the driver/etc. that is causing the delay.

  3. Anonymous says:

    – SysAdmin-E —

    Procdump is now the preferred dump tool used by Exchange Support because it:

    1) has better triggers,

    2) is a tiny download & is xcopy-deployable – the Debugging Tools for Windows are also xcopy-able, but require an installation on another client first (via the Windows SDK).

    In this case, mini-dumps were taken (no -ma switch on procdump), so the hang (pause) lasted for <5 sec (i.e. no outage), which is why we could take so many (-n 20)…

    In the end, ADPlus.vbs and Procdump call the same APIs to do the dumping; Procdump is just easier to install, configure, run, remove, and debug (because of the added comment and default thread context).

    – ExpectedJoke –

    This was Exchange 2003, hence the low (<5,000) folder item limit mentioned.  Exchange 2007 and Exchange 2010 handle these expensive searches much better/faster.  

    PS: You can tell this is Exchange 2003 because it uses x86 stacks; all other versions of Exchange are x64. :-)

  4. Anonymous says:

    Debugging can be tackled two ways: use experience or use some generic techniques. As an Exchange Escalation Engineer I mainly use experience. I look at lots of user dumps (of Exchange) and kernel dumps (of Windows) every day, and just know what is or isn't a concern (it took less than 30min to review these 20 dumps).  This of course isn't possible for the vast majority of people. As such, if you are not familiar with an application's thread stacks, you can apply some generic debugging techniques, and in most cases, review a dump with the public (or no) symbols, and only use a minidump (no -ma switch). Read on…

    As Mark writes, the first step is to get a good set of dumps (at least a few hits out of the 20). The spike threshold used was very close to the operational (maximum) norm – close enough that a pegged single thread (+100%) would be easily picked up in the midst of 100-200 threads doing a few msecs of work each (per second) in a low-usage period of the day.  In this case, the collective total of all these relatively dormant threads is equivalent to a single thread using 20-75% of a single CPU.  As mentioned, the first step is to gather performance counters so you can account for the background hum of your process; doing this correctly yields lots of positive dumps, and only a few false positives.  In this case, the biggest spike generated a maximum CPU usage of 147%, which means that the background hum (the other 200+ threads) only used 47% (147-100=47) at this time.

    Usually, procdump scales CPU from 0..100 (no -u switch), so the 75% threshold has to be divided by 8 (the number of cores) to get the correct value for the switch. By using the -u switch, the -c value is much easier to determine when you have performance counters as the "% Processor" counter and "-u -c NN" value are scaled the same: 0..100% x Cores. Plus, using -u has the added benefit that you don't have to find out how many cores the server has so as to do the division!

    "1. How did you deduce that the 'inconclusive dumps' were not related to the spikes?"

    "2. How did the EcFindRow frame 'stick out' from the other frames?"

    The Answer to #1 is in the 4 of 20 section.

    The Answer to #2 is in the 6 of 20 section.

    10 of 20: Because the dumps were taken overnight, backup and online maintenance ran on the server… These operations involve only a single thread and generate a significant amount of CPU usage for 45-60 sec.  As such, they generated 10 of the 20 dumps, around 11pm and 2am… The background usage of the Outlook clients was still there, at around 20%, during this period. In other words, these disk-bound threads only needed to use 55% (75-20=55) at this time to trigger the dumps. These were positive dumps in that they met the criteria; they just were not what we were looking for/concerned about.

    4 of 20: When Procdump makes a CPU-triggered dump, the default thread context is set to the thread that used the most CPU in the monitoring period.  When you open the dump in WinDbg, you'll get a cursor homed to this thread (e.g. 0:132>) – note, this can sometimes be thread #0 (0:000>). In these 4 cases, the busy thread was actually doing nothing – how can this be?  Well, in these cases, the worker thread WAS busy, but finished its 10 seconds of usage JUST before the dump was taken. For example, it could have used 60% of the last second, and the other 200+ threads contributed 55% to the CPU usage – thus the total for that second was 115%, and the dump triggered, albeit too late for the culprit thread to be seen. Since Exchange uses worker threads, the thread behind the CPU spike had already gone back to being dormant, so I classified these as false-positive dumps.

    6 of 20: Six of the dumps had the EcFindRow operation. EcFindRow is the folder search operation of Exchange and happens every time you open a folder for viewing or searching. The bigger the folder, the longer it takes to run. Of interest, two of the triggers were adjacent. That is, the spike lasted over 20 sec (2x10 sec) and two dumps were taken for the same EcFindRow operation. You can correlate two minidumps by using the stacks: if you find a matching thread ID ("~.") and matching stack data ("kb"), you can assume that the operation is continuing in the second dump.

    - Andrew Richards (MSFT)

  5. Anonymous says:

    Hey Mark and Andrew, you have always been an inspiration. Thanks for the posts. I have a small query though: can Xperf be used to take care of this issue, and can we analyze the ETL files without internal symbols? If yes, can you shed some light on it?

    Regards

    Puneet

    puneet.banga@hotmail.com

  6. Anonymous says:

    Great Post .. Good work Andrew ! Thanks for sharing Mark !

  7. Anonymous says:

    Puneet Banga – As with any use of dbghelp.dll (e.g. in xperf, in procexp, in procmon, in WinDbg, etc.), the extent to which you can see a stack is dependent on the symbols available. Most Windows symbols are available publicly, but only a limited amount is published for other Microsoft applications.

    To use the public web symbols with Xperf, use this cmd file:

    md C:\Symbols_Public

    md C:\SymCache_Public

    set _NT_SYMBOL_PATH=SRV*C:\Symbols_Public*msdl.microsoft.com/…/symbols

    set _NT_SYMCACHE_PATH=C:\SymCache_Public

    xperf.exe %1

  8. Luciano Evaristo Guerche (Gorše) says:

    I must completely agree: SysInternals rocks!

  9. Mike DiBella says:

    This is a fasinating discourse into the complexity of diagnosing complex client/server performance issues.  Many thanks for sharing.

  10. James Bray says:

    Superbly fascinating as always.  Looking forward to the next one…

  11. wr says:

    Halfway through reading this article I was already suspecting Symantec to be involved. Bingo! :) I had myself experienced server freezes because of the Application Control driver of Symantec Endpoint Protection, though I didn't use Procdump, just examined the dumps with WinDbg. And we also do not install the Network Threat Protection module on servers as it's likely to cause problems, especially on Hyper-V enabled servers. Everything from Symantec (server-side and client-side) seems to be evil.

    Sysinternals tools indeed rock. I have inspected and solved lots of problems with them, including finding malware with Autoruns and examining hangs with Process Monitor.

  12. SysAdmin-E says:

    Great article. I have a few questions: 1) What's the advantage of using procdump over adplus.vbs, as in adplus.vbs -hang -pn store.exe? I've had MS support advise me to run that command before. I know that it pauses the store.exe process while it's taking the dump, which means Exchange isn't available during that time. 2.) Does procdump also pause the store.exe process when used as shown in this article?

  13. Nick says:

    Fantastic as usual, Mark.  Thanks for another great post.

  14. Bård says:

    Fantastic! Thanks for sharing this. I just love the science of debugging. And SysInternals does indeed rock :-)

  15. ExpectedJoke says:

    Is this Exchange 2010? Too bad that the symbols are not available publicly for 2007 or 2010

  16. Isaac Gardner says:

    Thanks Mark, once again, great post. I have a couple of questions.

    You said:

    "Over half the dumps were inconclusive, apparently captured after the spike that triggered the dump had already ended, or with threads that were executing code that obviously wasn’t directly related to a spike."

    1. How did you deduce that the 'inconclusive dumps' were not related to the spikes?

    2. How did the EcFindRow frame 'stick out' from the other frames?

    Cheers

    Isaac

  17. Isaac Gardner says:

    Thanks Mark, once again, great post. I have a couple of questions.

    You said:

    "Over half the dumps were inconclusive, apparently captured after the spike that triggered the dump had already ended, or with threads that were executing code that obviously wasn’t directly related to a spike."

    1. How did you deduce that the 'inconclusive dumps' were not related to the spikes?

    2. How did the EcFindRow frame 'stick out' from the other frames?

    Cheers

    Isaac

  18. Johnny Sutherland says:

    A fantastic post – as usual. I always enjoy reading your posts. Thank you for taking the time to write it.

  19. ac says:

    It's nice when you can get some graphs or data to look at, but the issue I've been having is one that doesn't register in any Sysinternals tool I've tried, and it occurs at most daily, taking a few to a dozen minutes before going away.

    Thinking about it now, I do get some ideas. Since it rears its head as dropped frames in UI rendering, one should be able to create an app that notices the drop in frame rate. But since the rendering glitches last just milliseconds, during which it's possible that something has exclusive use of the system, it may be really problematic to take a dump during the glitch if you can't even get dump code running during one. Then again, I don't recall the glitches affecting audio, so maybe they only affect some threads.

  20. Brice says:

    Hello,

    Wow, that was an excellent entry. It's actually inspiring for those of us who work on performance problems. I'm working on Java applications, but the process is the same. We don't have the Procdump facility to monitor performance counter spikes, but it is still possible to script something similar, either on Unix or Windows.

    Cheers

    Brice

  21. ExpectedJoke says:

    I remember using PFDAVAdmin for high item counts in Exchange 2003, but the link for finding items refers to a PowerShell script (2007)!

    need more posts Andrew! cheers

  22. Dustin says:

    Awesome Andrew and Mark!  I, like one of the other posters, had a hunch that Symantec was being used.  I literally laughed out loud when I saw the SAVFMSE… entry!

    Love to read the thought process and methods!

    Thanks guys!

  23. Andrew says:

    Great!

    Can you say anything about when to expect Windows Internals 6th edition?

    Thanks!

  24. James Sutherland says:

    I had two support incidents in the last month or two with some similarities, and some frustrating differences as well:

    One involved a laptop that was locking up hard for short periods (30 seconds) of apparent frantic disk activity. Since the machine was unresponsive, anything I attempted while the problem was visible took effect too late, so my next approach was to try reproducing the problem with Process Monitor running. These logs seemed to point to a disk-intensive process – but not a specific one. Eventually, the issue turned out to be a driver problem: the SATA chipset driver was locking up under heavy load until some timeout reset the SATA subsystem and operations resumed. Somewhat ironically, reverting to the generic SATA driver resolved the issue.

    The other was a co-worker's personal system with network communication problems. IE could surf fine, Windows Update would apply updates – but anti-malware tools, iTunes, Safari etc all failed. The culprit, of course, much like this tale, was some Symantec remnant interfering with network traffic despite having been "uninstalled". As soon as I scraped it out manually and rebooted, everything was fine!

  25. Wayne Robinson says:

    Amazing stuff, just amazing :)

    Thanks for sharing your experiences!

  26. Patrick Hoban says:

    Another excellent article. I'm impressed as usual.

  27. brian says:

    yes, all fine and good. but you failed to mention the solution for the most difficult part of the whole scenario – "The administrator asked the users to reduce their item count …" You want me to do what? I don't have time to do that. I'm too busy to do that. I'm too important to spend my time…

  28. Fm says:

    Just out of curiosity, how much does an Exchange Escalation Engineer earn (ballpark)?

    This was amazing detective work that, one would hope, is well remunerated.

  29. ASB says:

    Very well done, Mark.  And excellent follow-up by Andrew, too.

    Much appreciated for both the tools and the methodology to get these issues resolved.

    http://xeesm.com/AndrewBaker

  30. Ray says:

    Great article.  I wanted to walk through some of it but unfortunately the SIEExtPub.dll that the link downloads doesn't load up in WinDbg.  I get this:

    The call to LoadLibrary(SIEExtPub) failed, Win32 error 0n193

       "%1 is not a valid Win32 application."

    I am a WinDbg neophyte so it is probable that I'm doing something wrong…

  31. Hmmm says:

    You said this was a "really interesting case"…

  32. Mike Foley says:

    Great article.. It reminded me of many late nights chasing locks on OpenVMS. We had a live tool that would allow us to get right into the guts of a hung system and poke around, looking for resource wait conditions. Once you found that, it was a simple (relatively) matter of tracking back to find the process that was stuck and dump its memory to see what was broken.

  33. Parkarab says:

    Very funny that I just read this after resolving an issue with several Outlook users here in my company. Again, Symantec corporate client was the culprit.

    Seems users were upgraded to Symantec endpoint protection. Unfortunately, the Outlook add-in from corporate client v 10.1 was not removed. It kept causing unusual freezing of Outlook.

  34. WinSE-4VR says:

    We've had this before…

    Back in 1999, on the old WinSE NT4SP6 team, we had this exact same thing happen but on Outlook. A buggy version of the standard AV package would occasionally exec-lock the mailbox pst, and Outlook would think it was corrupted and prompt the user with an ambiguous message to 'fix' it. The exec lock would prevent a fix, so the pst would be flattened to 32k or whatever the default brand-new empty size was, with the usual catastrophic results. The AV patch had been pushed to the 2 main distro servers 6 months prior, but not to the team's semi-private local-server-of-convenience…

    Much lolz to see this again 11 years later and the scourge of 3rd party exec locks persists…

  35. Acemutha says:

    It's always very nice to read your articles so much to learn from

  36. jader3rd says:

    While I do like the debugging that went into finding out why the Exchange server wasn't being as responsive as it should have been, I don't like how the issue is considered solved. The real issue is "Why does Outlook hang when the server is being unresponsive?" Outlook is supposed to be a smart/fat client, yet the Outlook team has passed most (if not all) of their performance issues along to the server. Having the UI on the client be dependent on the server is ridiculous. The Exchange team writes a more performant server, and the Outlook team takes that as a free pass not to create a more performant design for their product.

    I expect the client to be perfectly responsive regardless of the performance of my connection to the server.

  37. BillP says:

    I found the trouble-shooting discussion useful, but the end result even more interesting. Reducing the "item count" in some users' folders fully solved the issue. Outlook Express, still used by untold millions, has similar odd issues when the number of items in the various folders (INBOX, SENT, DRAFT, TRASH, etc.) exceeds "a limit". In every case I've looked at, when any given folder exceeded 1 GB in size, Outlook Express had a variety of issues – not being able to compact folders being the most usual. In every case, deleting messages or moving them to newer folders and reducing the file size (say, below 500 MB) solved the problem(s).

  38. XMLFREAK says:

    Again Mark, you show the way to go with Sysinternals tools. Thanks for making such great debugging and troubleshooting tools. On top of that, thank you very much for explaining in detail how to use all these tools together. Greetings from Holland

  39. Oscar says:

    Better than a Dan Brown best seller: Sysinternals and Exchange ROCK!!!