623 events - an alternate approach to data gathering


I am assuming as I write this that everyone has already seen the blogs by Nagesh and Sushil (http://blogs.technet.com/b/exchange/archive/2006/04/19/425722.aspx and http://blogs.technet.com/b/sushil_sharma/archive/2011/05/04/version-store-issues-revisited-again-updates-on-troubleshooting-and-data-gathering-techniques.aspx respectively).

623 events can occur in every version of Exchange.  Essentially, the changes that Exchange is planning to make to the database, but has not yet written, have exceeded the memory space reserved for tracking them.  If the server is simply busy this should not happen.  It typically needs to be 'helped' into this state by some factor external to the code that is doing the work related to this memory space.

One of the problems with this event is that there are so many possible causes that it is impossible for me to give you a handful of concise steps that will usually resolve the issue.  Some of the most common problems that cause the 623 event are:

  • Performance problems (especially with the disk(s))
  • Corrupt mail or calendar items (I have seen two cases where this manifested as a user sending an insanely large message in an Org with no message limits)
  • Database corruption
  • Third-party software interaction

For some of these the resolution is obvious, but the problem is that each of these barely makes it into double digits as a percentage of cases where 623 events are reported.  So what do we do?

The first thing is to know what is happening on your Exchange server.  Are there any other problems or symptoms leading up to the 623 event?  Do the Application, System, or Cluster logs have suspicious entries leading up to the event?  Is ExBPA clean?  If you gather performance data, is there any indication of problems with the disk, memory, or processor subsystems?  Does ExMon show any unusual user activity?  Has anything on the server changed recently?
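
If you do not already have a performance collection running, the in-box logman tool is one lightweight way to start one.  The sketch below is only a starting point under a few assumptions: the collector name and output path are arbitrary, and the exact name of the version buckets counter object differs between Exchange versions (for example "Database ==> Instances" on older versions versus "MSExchange Database ==> Instances" on newer ones), so confirm the counter paths in Performance Monitor on your own server first.

    logman create counter VerStore623 -o "C:\PerfLogs\VerStore623" -si 00:00:15 -c ^
      "\Processor(_Total)\% Processor Time" ^
      "\Memory\Available MBytes" ^
      "\LogicalDisk(*)\Avg. Disk Queue Length" ^
      "\MSExchange Database ==> Instances(*)\Version buckets allocated"
    logman start VerStore623
    rem ...later, after the problem has reproduced:
    logman stop VerStore623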

All of the items above have a chance of providing clues as to how to proceed without needing to call support.  For example: if you see the disk queue spiking higher than 2 per spindle, try relocating some of the disk load to additional disks (preferably on a different controller or server).  If you see a handful of users gobbling up massive amounts of resources, try relocating those users to a different server.  Basically, look for and follow the symptoms to a possible resolution if your company's internal political situation and the frequency of the problem allow you to do so.

If you call us I expect my Support Engineers to gather data on all of the above.  However, the most efficient way to troubleshoot this from a support perspective is to gather data on the circumstances of the problem while simultaneously setting yourself up to gather dumps the next time the problem occurs.  While we are waiting for the problem to be reproduced and for the dumps to be uploaded and analyzed, the Support Engineer can look at the other data provided to see if they can make suggestions that might provide relief before the dump analysis is completed.

So this brings us to the heart of what I wanted to write about: how to collect the dumps.

There are two ways to collect the dumps.  The first is the one Nagesh and Sushil have detailed in their blogs (http://blogs.technet.com/b/exchange/archive/2006/04/19/425722.aspx and http://blogs.technet.com/b/sushil_sharma/archive/2011/05/04/version-store-issues-revisited-again-updates-on-troubleshooting-and-data-gathering-techniques.aspx respectively). The second is to:

  • Follow their steps to set up three hang dumps when version buckets exceed 85% (collect performance data and have procdump trigger at the proper counter threshold as they describe in their blogs)
  • Use weventmon or Task Scheduler to run a bat file that captures a second hang dump as soon as the 623 event is logged.  If neither of these is an option you can try setting up procdump to capture a single dump at 100% consumption of version buckets.  (A rough sketch of both steps follows this list.)
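
To make the second option a little more concrete, here is a rough sketch of what both pieces might look like, assuming Sysinternals procdump.exe lives in C:\Tools, dumps go to C:\Dumps, and the scheduled task runs a bat file called capture623.bat.  The numeric counter threshold (12000 below), the counter instance name, the task name, and the ESE event provider name are all placeholders: take the real threshold and counter path from the calculations in the blogs referenced above, and verify the source name on your server's 623 events before creating the task.

    rem Piece 1: three full hang dumps of store.exe, triggered when the
    rem "Version buckets allocated" counter reaches the value that corresponds
    rem to roughly 85% of this server's maximum (12000 is only a placeholder).
    cd /d C:\Dumps
    C:\Tools\procdump.exe -accepteula -ma -n 3 -s 15 ^
      -p "\MSExchange Database ==> Instances(Information Store)\Version buckets allocated" 12000 store.exe

    rem Piece 2: a scheduled task that fires when event ID 623 is written to the
    rem Application log and runs the bat file that grabs one more full dump.
    schtasks /create /tn "Capture623Dump" /tr "C:\Dumps\capture623.bat" /ru SYSTEM ^
      /sc ONEVENT /ec Application /mo "*[System[Provider[@Name='ESE'] and (EventID=623)]]"

    rem Contents of C:\Dumps\capture623.bat:
    @echo off
    C:\Tools\procdump.exe -accepteula -ma store.exe C:\Dumps\store_623.dmp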

The trick here is that if usage of the Version Store exceeds 85% but the 623 event does not occur, you have to throw out the first hang dumps and set up procdump to collect them again.  A single hang dump is like a single photo of an action scene from a movie.  The photo may be cool, but there is little that can be discerned about the plot, direction, and outcome of the movie from a single frame/photo.  The 623 event is simply telling us that usage of the Version Store hit 100%.  Any values less than 100% will be assumed to be normal operation of the Store.exe process.

The method of dump collection previously blogged about by Nagesh and Sushil has an Achilles' heel in that we often have customers collect the dumps and the performance data, but no actual 623 event occurs.  What happens is that usage of the version store climbs above the counter threshold, but then goes back down shortly thereafter.  This means that we don't actually have a problem.  Sure, usage of the Version Store spiked, but Store.exe recovered and was able to carry on normal operation.  Another problem is that sometimes the counter threshold is set too low.  If that happens we get the 3 dumps, but the transaction that actually pushes the Version Store to 100% isn't in the dumps.  If either of these situations occurs, it is likely that your Microsoft Support Engineer will have to tell you the dumps were inconclusive and ask you to set them up again.

Regardless of which method you use to collect the dumps, we will probably fail to deliver a useful result that you, as the administrator of the server, can act upon if the counter threshold hang dump(s) is (are) NOT accompanied by a 623 event.  This is simply a consequence of how dumps work.  Each dump is like a single photo of a horse race.  You can't tell from the photo whether a horse tripped and spilled all the riders a second after the photo was taken, or which horse won the race cleanly.

Comments (2)

  1. charlie says:

    Hi Chrispol,

    Very informative article. We have been experiencing 623 event errors for over 3 months now and still haven't been able to find the culprit. Do you have any additional tips on how we can fix this issue? Is there a way for me to analyze the dumps instead of having to upload them to Microsoft?

    Your input is greatly appreciated.

    Thanks,

  2. chrispol says:

    Hi Charlie!

    Anyone who has a copy of WinDbg can analyze a dump.  However, the task is almost impossible without access to the symbols and previous programming experience.  I did a casual search of the Internet and I didn't see the Exchange symbols posted publicly, so that would be a significant barrier to tackling this yourself.

    Within Microsoft the support staff who can analyze a dump are just a small subset of the group that stands behind Exchange.  Those staff also have the advantage of source code access.  With this they can locate the problem in WinDbg and then go look at the code that was running at the time in an attempt to reconstruct what was happening.

    If you have the requisite experience you can attempt the analysis, but it would be faster and less painful for Microsoft to do the analysis for you.  If you have concerns about privacy, security, and/or clearances, let the Support Engineer know when you call in and we can address those concerns on a case by case basis.

    As for additional tips...  The only one that leaps to mind is that recently several of the 623 events we have seen have been related to unusually large, recurring calendar items.  For whatever reason the item needs to be edited again, and when the edit is put through the problem occurs.  If you look at my blog entry called "One cause of RPC Latency" you will see a method for checking item sizes in the calendar.  If you see a large meeting, check out its properties.  Recurring meetings that have been around for years and that have been through hundreds (even thousands) of edits can get a little bloated over time.  Normally they cause RPC issues for the client long before they cause Version Store issues; however, that isn't always the case, or the user doesn't always report the problem.

    The tips topic is a little too broad for this particular discussion forum.  Please give us a call if this does not help or is among the things you have already considered.

    Chris