I am assuming as I write this that everyone has already seen the blogs by Nagesh and Sushil (http://blogs.technet.com/b/exchange/archive/2006/04/19/425722.aspx and http://blogs.technet.com/b/sushil_sharma/archive/2011/05/04/version-store-issues-revisited-again-updates-on-troubleshooting-and-data-gathering-techniques.aspx respectively).
623 events can occur in every version of Exchange. Essentially, the changes that Exchange is planning to make to the database but hasn't written yet have exceeded the memory space reserved for tracking them. If the server is simply busy, this should not happen. The server typically needs to be 'helped' into this state by some factor external to the code that does the work related to this memory space.
One of the problems with this event is that there are so many possible causes it is impossible for me to give you a handful of concise steps that will usually resolve the issue. Some of the most common problems that cause the 623 event are:
- Performance problems (especially with the disk(s))
- Corrupt mail or calendar item (I have seen two cases where this manifested as a user sending an insanely large message in an org with no message limits)
- Database corruption
- Third-party software interaction
For some of these the resolution is obvious, but the problem is that each of these barely makes it into double digits as a percentage of cases where 623 events are reported. So what do we do?
The first thing is to know what is happening on your Exchange Server. Are there any other problems or symptoms leading up to the 623 event? Do the App, Sys or Cluster logs have suspicious entries leading up to the event? Is ExBPA clean? If you gather performance data, is there any indication of problems with the disk, memory or processor subsystems? Does ExMon show any unusual user activity? Has anything on the server changed recently?
All of the items above have a chance of providing clues as to how to proceed without needing to call support. For example, if you see the disk queue spiking higher than 2 per spindle, try relocating some of the disk load to additional disks (preferably on a different controller or server). If you see a handful of users gobbling up massive amounts of resources, try relocating those users to a different server. Basically, look for the symptoms and follow them to a possible resolution, if your company's internal political situation and the frequency of the problem allow you to do so.
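As a rough illustration of the disk queue rule of thumb above, the following Python sketch picks out the samples where the queue length is higher than 2 per spindle. The function name, the sample values and the spindle count are all hypothetical; in practice the samples would come from the performance data you gathered.

```python
def queue_exceeds_guideline(queue_lengths, spindles, per_spindle_limit=2.0):
    """Return the samples whose disk queue length is above the per-spindle limit.

    queue_lengths: disk queue length samples for one volume (hypothetical input).
    spindles: number of physical disks backing that volume.
    """
    if spindles <= 0:
        raise ValueError("spindles must be a positive number")
    # The guideline scales with the number of disks behind the volume.
    limit = per_spindle_limit * spindles
    return [q for q in queue_lengths if q > limit]


# Illustrative numbers only: a 4-spindle volume has a guideline limit of 8.
samples = [3.0, 7.5, 12.0, 9.1, 2.2]
print(queue_exceeds_guideline(samples, spindles=4))
```

A couple of isolated spikes may not mean much; sustained runs of samples above the limit are a stronger hint that the disk load should be spread out.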
If you call us, I expect my Support Engineers to gather data on all of the above. However, the most efficient way to troubleshoot this from a support perspective is to gather data on the circumstances of the problem while simultaneously setting yourself up to gather dumps the next time the problem occurs. While we wait for the problem to reproduce, and for the dumps to be uploaded and analyzed, the Support Engineer can look at the other data provided to see if they can make suggestions that might provide relief before the dump analysis is completed.
So this brings us to the heart of what I wanted to write about: how to collect the dumps.
There are two ways to collect the dumps. The first is the one Nagesh and Sushil have detailed in their blogs (http://blogs.technet.com/b/exchange/archive/2006/04/19/425722.aspx and http://blogs.technet.com/b/sushil_sharma/archive/2011/05/04/version-store-issues-revisited-again-updates-on-troubleshooting-and-data-gathering-techniques.aspx respectively). The second is to:
- Follow their steps to set up three hang dumps when version buckets exceed 85% (collect performance data and have procdump trigger at the proper counter threshold as they describe in their blogs)
- Use weventmon or Task Scheduler to run a batch file that captures a second hang dump as soon as the 623 event is logged. If neither of these is an option, you can try setting up procdump to capture a single dump at 100% consumption of version buckets.
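The 85% and 100% figures above have to be turned into counter values before they can be used as a trigger. Here is a minimal Python sketch of that arithmetic, assuming the version bucket counter reports a raw bucket count rather than a percentage. The maximum of 1000 buckets is a made-up number for illustration; use the maximum that applies to your Exchange version and configuration.

```python
def bucket_threshold(max_buckets, percent):
    """Convert a whole-number percentage of the version store into a bucket count."""
    if max_buckets <= 0:
        raise ValueError("max_buckets must be positive")
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return max_buckets * percent // 100


# Hypothetical maximum, for illustration only.
MAX_BUCKETS = 1000

print(bucket_threshold(MAX_BUCKETS, 85))   # threshold for the three hang dumps
print(bucket_threshold(MAX_BUCKETS, 100))  # threshold for the single fallback dump
```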
The trick here is that if usage of the Version Store exceeds 85% but the 623 event does not occur, you have to throw out the first hang dumps and set up procdump to collect them again. A single hang dump is like a single photo of an action scene from a movie. The photo may be cool, but there is little that can be discerned about the plot, direction and outcome of the movie from a single frame. The 623 event is simply telling us that usage of the Version Store hit 100%. Any values less than 100% will be assumed to be normal operation of the Store.exe process.
The method of dump collection previously blogged about by Nagesh and Sushil has an Achilles' heel in that we often have customers collect the dumps and the performance data, but no actual 623 event occurs. What happens is that usage of the Version Store climbs above the counter threshold, but then goes back down shortly thereafter. This means that we don't actually have a problem. Sure, usage of the Version Store spiked, but Store.exe recovered and was able to carry on normal operation. Another problem is that sometimes the counter threshold is set too low. If that happens we get the 3 dumps, but the transaction that actually pushes the Version Store to 100% isn't in the dumps. If either of these situations occurs, it is likely that your Microsoft Support Engineer will have to tell you the dumps were inconclusive and ask you to set them up again.
Regardless of which method you use to collect the dumps, we will probably fail to deliver a useful result that you, as the administrator of the server, can act upon if the counter threshold hang dumps are NOT accompanied by a 623 event. This is simply a consequence of how dumps work. Each dump is like a single photo of a horse race. You can't tell from the photo whether a horse tripped and spilled all the riders a second after the photo was taken, or which horse won the race cleanly.
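The two failure modes described above can be written down as a small decision function. This Python sketch is only my way of illustrating the reasoning, not a tool we use in support. The function name, the shape of the samples and the timing check for a threshold that is set too low are assumptions made for the example.

```python
def classify_capture(samples, trigger_percent, dump_window_s, event_623_logged):
    """Classify one attempt at collecting threshold-triggered hang dumps.

    samples: (seconds, usage_percent) pairs in time order (hypothetical input).
    trigger_percent: Version Store usage at which the dumps start.
    dump_window_s: seconds it takes to collect all of the dumps (assumed known).
    event_623_logged: True if a 623 event was logged during the capture.
    """
    crossed = [t for t, pct in samples if pct >= trigger_percent]
    if not crossed:
        return "not triggered"
    if not event_623_logged:
        # Usage spiked and recovered: Store.exe carried on normally.
        return "inconclusive: no 623 event"
    dumps_end = crossed[0] + dump_window_s
    full = [t for t, pct in samples if pct >= 100]
    if full and full[0] > dumps_end:
        # The dumps finished before the Version Store reached 100%.
        return "inconclusive: threshold too low"
    return "usable"


# Illustrative data: usage spikes to 90% and then recovers.
print(classify_capture([(0, 50), (10, 90), (20, 60)], 85, 45, False))
```

In both inconclusive cases the next step is the same: set the dumps up again, and raise the threshold if it turned out to be too low.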