EDIT: 5/6/2010: In order to see some additional ways to collect information related to those problems, please see this blog post.
Recently we have been seeing several issues in Support Services where Event ID 623 is being logged on the server. The symptoms include "Users unable to access their inbox", "Email backing up in the local delivery queue" and "Email awaiting directory lookup". Clients may also see emails stuck in the Outbox. Rebooting the server usually fixes the issue. In this blog, I have tried to explain what "Version Store" is and how to troubleshoot this issue.
A transaction is a series of operations that are treated as atomic (indivisible). So the operations in a transaction are either all done and permanently saved, or none of them are done. As an example we can consider the operations involved when we do something very simple like move a message from the inbox to the deleted items folder. The Version Store gives ESE the ability to track and manage current transactions; this allows ESE to implement Isolated and Consistent transactions.
What is Version Store?
The Version Store keeps an in-memory list of modifications made to the database. This list has several uses:
If a transaction needs to rollback it looks in the Version Store to get the list of operations it performed. By performing the inverse of all the operations the transaction can be rolled-back
If two different sessions try to modify the same record the Version Store will notice and reject the second modification
When a session begins a transaction it always sees the same view of the database, even if other sessions modify the records it is looking at. When a session reads a record the Version Store is consulted to determine what version of the record the session should see
A complicated optimization that lets us log less data than "other" database engines.
In simple terms, the Version Store is where transactions are held in memory until they can be written to disk. If something is preventing us from completing transaction or writing to disk we will consume this cache and the store will stop responding to request until there is room in the cache again.
Event 623 is the result, typically, of a long running transaction. The result of this long running transaction is to exhaust resources in the version store. As a result, the Version Store no longer reaps deleted records causing unneeded data, which is marked as deleted, to accumulate in the database. The accumulation of unneeded data can exacerbate performance problems which can lead to event id 623. No more transaction can continue until this is clear. This error is NOT the result of the system running out of memory. If we failed to allocate more memory and NT refuses to give it to us we will fail with a different error. Increasing the RAM in the server will NOT fix this problem. Also, there's no relationship between write-back caching and the Version Store. Enabling or disabling write-back caching has no effect on the Version Store whatsoever.
The events 623, 1022, 1097 are a result of Version Store entries not getting cleaned up, either not at all or not in a timely enough fashion. This can happen for one of two reasons.
First, in order to properly reconcile write-conflicts and properly support repeatable reads, a given entry in the Version Store cannot be cleaned up until it is older than the oldest active transaction. This means that if the store opened a transaction and kept it open indefinitely (perhaps for some long-running operation such as grovelling the entire database for whatever unknown reason) or otherwise orphaned the transaction, Version Store cleanup would be precluded due to this active transaction and it would therefore continue to build in size until you reached the maximum.
The second reason Version Store entries might not be getting cleaned up is that version cleanup simply can't keep up with the load on the machine. The cleanup of Version Store entries is performed by an asynchronous background thread. As transactions commit or roll back, this asynchronous background thread will clean up the Version Store entries that are older than the oldest of the remaining active transactions. However, if there is so much write activity going on that it outpaces this cleanup thread, well, then you're going to get into a state where the Version Store continues to build up in size, but the version cleanup thread can't keep up with cleaning out old entries, so eventually you'll reach the maximum Version Store size. Of the two reasons why the Version Store can't be cleaned up, it's almost always caused by a long-transaction as opposed to the Version Store cleanup thread not being able to keep up with the load on the machine.
The size of the Version Store is determined by the msExchESEParamMaxVerPages on the "Storage Group" object in Active Directory. This setting is in units of 16K. If the attribute is not set, the default is 9280 in decimal. The attribute can be increased to 10280 and then 11280 to avoid this situation. This will result in less memory available for other function in the store and could result in other performance issues.
What happens when the Version Store reaches its maximum?
Well, for one thing, you'll see the 623 event indicating that the maximum Version Store size has been reached. Write operations to the database will start failing with -1069 (JET_errVersionStoreOutOfMemory) because there's no more Version Store space in which to record the operation. Additionally, you will start seeing the 602 events when the Version Store size starts approaching the maximum (you should actually see the 602 events even before the 623 event). What this event means is that any expensive cleanup operations will be skipped in an effort to get the asynchronous background Version Store cleanup thread to run as fast as possible. Every 100 operations that are skipped will cause us to generate the 602 event. If you're seeing tons of the 602 events right after a 623 event, then what is likely happening is that hitting the -1069 (version-store-out-of-memory) conditions has caused the long-running transaction to rollback. Since this transaction is no longer present to hold back version cleanup, version cleanup now proceeds to try to clean all the entries in the Version Store, which is now huge. Because it's so large, we will skip all expensive cleanup operations until the size is reduced. Because we skip so many operations, this leads you to see the spam of 602 events.
As for why the 602 event is generated even though online defrag has been scheduled for later on, that's because Jet has no knowledge of online defrag scheduling. The scheduling is done by the Store.
The key question to answer is whether 623 is the result of regular 602. Have you tried widening the on-line defragmentation window as the event indicates? Are the server's affected very heavily loaded, relative to their hardware when compared to servers that are unaffected? It is possible that regular 602 messages indicating that on-line compaction is not running to completion can results in indexes that are so inefficient, that transactions take longer and longer. The result would be the 623 seen here which makes the problem of inefficient queries even worse. Lastly, in such cases, off-line defragmentation will dramatically improve the database efficiency such that event 623 will not occur for some time, but then later occur after the database has become relatively defragmented because on-line defragmentation has not been able to complete. Try to take steps to ensure that on-line defragmentation is completing.
Increasing the Version Store size may not necessarily be the correct solution here. It's probably better to try to debug the problem and figure out why the Version Store got so big in the first place. The 623 event reports the Jet session holding the oldest transaction (and the thread it was on) at the time the version-store-out-of-memory condition was hit. Ideally, you'd want to set a breakpoint at the place in Jet that returns JET_errVersionStoreOutOfMemory and then use the information in the 623 event to identify the thread holding the long-running transaction and then debug what the Store is doing in that transaction that's taking so long.
Troubleshooting this event
There are lots of different reasons that can cause event id 623; database corruption, huge messages, at times third party software integration, to name a few.
I would start by doing the following:
1. Use Microsoft Operations Manager to monitor for 623 events.
2. Rule out Antivirus by possibly applying the registry modifications/tunes noted in the EXBPA report for the Antivirus. Use the Microsoft Exchange Server Best Practices Analyzer Tool to resolve any outstanding critical issues.
3. Confirm Integrity of JET and Information Store level.
a) Run ISINTEG -
b) Confirm Jet integrity. This is accomplished through eseutil /g.
4. Try to track down a large message going through the store. A few articles that can help:
830762 Messages that are larger than global size limits can affect server performance
Large Messages Affect Server Performance blog at http://blogs.msdn.com/jeremyk/archive/2004/02/29/81773.aspx
911836 E-mail client users experience a decrease in performance, and event 623 is logged when you receive a large message in Exchange Server 2003
893599 You may experience slow client performance if the version store runs out of memory in Exchange Server 2003
317722 Client latencies occur when Exchange 2000 Converts Mail from MAPI to MIME format
5. Verify online maintenance is completing and its schedule is not conflicting with online backup times. Widen the online maintenance window during off-peak hours if the online maintenance is not completing.
6. Disk I\O performance issues can also be attributed to this issue. Disk I\O latency can account for a slow read transaction and may lead to event id 623 occurring. Use the Microsoft Exchange Server Performance Troubleshooting Analyzer Tool to analyze and fix any Exchange Server performance issues.
The best way to find the root cause of a 623 event id is to enable the " Show Advanced Counters" (aka "Squeaky Lobster" - though Squeaky Lobster is still supported for backward compatibility) & monitor Database > Version Buckets Allocated counter and take a dump once this counter hits 70%, the next time this event id occurs. Jet's "Version buckets allocated" counter will show up when you enable the "Show Advanced Counters" in the registry.
We can determine the cause of 623's if we have a userdump of the store process as the long running transaction is occurring. Unfortunately, this doesn't always catch the culprit because by the time the dump is grabbed, whichever session/thread owning the transaction that was causing the Version Store to build up may have error'd out and gone away. Normally these can be difficult to catch manually.
1. We need to set the "Show Advanced counters" registry key so we can view Jet's "Version buckets allocated" counter. This can be accomplished by setting the following registry key:
For Exchange 2000 Sp2 or later, add the following value in the registry:
On the Edit menu, click Add Value, and then add the following registry value:
Value Name: Squeaky Lobster
Data Type: REG_DWORD
For Exchange 2003 Sp1 or later, add the following value in the registry:
On the Edit menu, click Add Value, and then add the following registry value:
Value Name: Show Advanced Counters
Data Type: REG_DWORD
You will need to restart the perfmon service after setting the key to see the additional counters.
2. We are setting this key so we can set an alert for "Version Buckets Allocated", we have the following 623 event in the case:
Event Type: Error
Event Source: ESE
Event Category: Transaction Manager
Event ID: 623
Description: The version store for this instance (0) has reached its maximum size of y MB(For purpose of this blog I will use 155Mb). It is likely that a long-running transaction is preventing cleanup of the version store and causing it to build up in size. Updates will be rejected until the long-running transaction has been completely committed or rolled back.
A version bucket is 64K in size in perfmon or the perfmon counter is in units of 64k in Exchange 2000/Exchange 2003. It is in 32k units on 64-bit Exchange 12 and 16k units on 32-bit Exchange12.
The calculation is as follows: x/1024 *64 = y, where x is the number of version buckets allocated and y is the total Version Store memory. Now, we know that the maximum Version Store memory (i.e. y) is 155Mb and we can therefore work out the maximum number of version buckets allocated. x= (155*1024)/64 so we can see that this is 2480.
When we see the version buckets allocated reaching 70% of this maximum then we can say in all probability that we are experiencing a long running transaction and can therefore start to take dumps accordingly. 70% * 2480 = 1736 buckets
3. Create a batch file (call it user.bat and place it in the root of C: drive), with the following information:
cscript d:\debuggers\adplus.vbs -quiet -hang -pn store.exe -r 3 300 -o d:\dumps
Note: You need to ensure that the path to adplus.vbs has been modified to the appropriate location where the Windows debuggers were installed and the directory where you want to place the dump files has enough space to allow for 3 store dump files to be created 5 minutes (300 seconds) apart.
For more information on ADPlus, please review the below article:
How to use ADPlus to troubleshoot "hangs" and "crashes"
4. We need a perfmon alert that will trigger our batch file to dump the store:
a. Open Perfmon
b. Under "Performance Logs and Alerts", right click Alerts and select "New Alert Settings".
c. Give it any name, like Version Buckets or 623. Click OK.
d. Click Add; Select the "Database" as the Performance object, then under Counters select "Version Buckets Allocated". Make sure that only Information Store is selected under instances. Select Add and then close.
e. Then set Alert when the value is "Over" the "Limit" to the value calculated above 1736.
f. Set the interval to 120 Seconds.
g. Go to the Action tab, check "Log an Entry in the application Event log", also check "Run this program" and browse to the batch file which will dump the information store.
h. You may want to configure an alert that will notify you when the Alert is triggered.
Send in the dump file, application log and performance monitor log that were running when the dump was collected to PSS for further analysis.
FYI, in Exchange 12, a new "health check" task has been added in the Store that runs once a minute by default and evaluates usage levels of various Jet resources, including Version Store buckets (the other resources are sessions, cursors, b-trees, and checkpoint depth). For any resources whose usage exceeds a certain threshold (65% by default), a new warning event is emitted. A lot of thought process is going into this to maybe provide an option to crash or otherwise generate a dump somehow if the threshold is exceeded.
Hope this information is useful.