Those pesky lazy indices

The_Exchange_Team · ‎Oct 03 2014

In Exchange 2013 there are indices within a given mailbox database. The indices are created, maintained, and deleted by the Information Store Worker Process associated with a given database. These indices are not to be confused with the Exchange content indexes that are built via the Search Foundation engine as they are completely different.

Within an Exchange database there can exist any combination of primary, secondary, and lazy indices. There is exactly one primary index and one secondary index on the messages table. For example, a messages table may have a primary index created on DocumentID and a secondary index created on FolderID, IsHidden, and MessageID. Additionally, other lazy indices may be created that reflect client views, search folders, and even views of search folders. Exchange maintains these indices through two methods, an eager method and a lazy method. Primary and secondary indices are always maintained eagerly. Lazy indices are maintained through the lazy indexing process (although some may be maintained eagerly). There are many lazy indices per mailbox and usually multiple per folder. Confused yet? Let us see if we can explain further…

The eager method says that when an object is inserted into the table the indices must be immediately updated. In the previous example an insert into the mailbox would require updating the primary index and then all secondary indices created against the same mailbox. This results in a random write being issued on each insertion. If a mailbox had 10 indices this would result in 10 random writes. The performance impacts could be significant depending on the structure of the indices. In some cases, the lazy indices exist but are actually not utilized. This results in update cycles being incurred for data that may actually not be utilized. There do exist certain indices where immediate updating is required – this is why the eager method exists.

The lazy method is often utilized to mitigate the performance impact that indexing could cause. When an item is inserted into a folder, an entry is created in a lazy indexing maintenance table with information on the lazy indices that require updating. With this method, two random writes are incurred regardless of the number of lazy indices that require updating. Subsequently, when an index is accessed, we apply the maintenance records found within the table before returning the results of that index, ensuring the index is up to date. Three major benefits are derived from this method:

Fewer random writes are incurred on indices.
If an index exists but is never used we expend no random writes updating it.
If multiple records are inserted at the same time when bringing an index current we can derive some write coalescing.
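
The difference between the two methods can be illustrated with a small sketch. This is a toy model, not the Information Store's implementation: the class names, the index names, and the cost accounting are all assumptions made for illustration. The only figure taken from the text above is that a lazy insert costs two random writes regardless of how many lazy indices exist; applying all pending records to an index is counted as a single coalesced write.

```python
# Article's figure: a lazy insert costs two random writes no matter how many
# lazy indices exist. The sketch does not model what those two writes are.
WRITES_PER_LAZY_INSERT = 2


class EagerFolder:
    """Every insert updates every index immediately: one write per index."""

    def __init__(self, index_names):
        self.indices = {name: [] for name in index_names}
        self.writes = 0

    def insert(self, key):
        for keys in self.indices.values():
            keys.append(key)
            keys.sort()
            self.writes += 1


class LazyFolder:
    """Inserts only log a maintenance record; an index catches up when read."""

    def __init__(self, index_names):
        self.indices = {name: [] for name in index_names}
        self.applied = {name: 0 for name in index_names}  # records consumed per index
        self.maintenance = []  # pending keys, in arrival order
        self.writes = 0

    def insert(self, key):
        self.maintenance.append(key)
        self.writes += WRITES_PER_LAZY_INSERT

    def read(self, name):
        pending = self.maintenance[self.applied[name]:]
        if pending:
            # Coalescing (assumed cost model): all pending records for this
            # index are applied together and counted as a single write.
            self.indices[name].extend(pending)
            self.indices[name].sort()
            self.applied[name] = len(self.maintenance)
            self.writes += 1
        return list(self.indices[name])


names = [f"view{i}" for i in range(10)]  # hypothetical index names
eager, lazy = EagerFolder(names), LazyFolder(names)
for key in (5, 3, 8):
    eager.insert(key)
    lazy.insert(key)

first_view = lazy.read("view0")  # only this index is brought current
print(eager.writes, lazy.writes, first_view)
```

Under this cost model, three inserts against ten indices cost the eager folder 30 writes and the lazy folder 7 (six for the maintenance records plus one coalesced update of the single index that was read). The nine indices that were never read cost nothing.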

Issue

Customers have noted that on versions of Exchange 2013 prior to Cumulative Update 6 the following errors are recorded in the Application log, resulting in Information Store Worker Process termination and a subsequent database failover. The failovers and terminations may affect single or multiple databases and often result in databases failing over multiple times a day.

Source: MSExchangeIS
Event ID: 1001
Level: Error
Description:
Microsoft Exchange Server Information Store has encountered an internal logic error. Internal error text is (Unable to apply maintenance GetNonKeyColumnValuesForPrimaryKey-norow, index corruption?) with a call stack of (   at Microsoft.Exchange.Server.Storage.Common.ErrorHelper.AssertRetail(Boolean assertCondition, String message)
   at Microsoft.Exchange.Server.Storage.LazyIndexing.LogicalIndex.HandleIndexCorruptionInternal(Context context, Boolean allowFriendlyCrash, String maintenanceOperation, Nullable`1 messageDocumentId, Exception exception)
...(remaining stack redacted)
   at EcPoolSessionDoRpc_Managed(_RPC_ASYNC_STATE* pAsyncState, Void* cpxh, UInt32 ulSessionHandle, UInt32* pulFlags, UInt32 cbIn, Byte* rgbIn, UInt32* pcbOut, Byte** ppbOut, UInt32 cbAuxIn, Byte* rgbAuxIn, UInt32* pcbAuxOut, Byte** ppbAuxOut)).

Source: MSExchangeIS
Event ID: 1002
Level: Error
Description:
Unhandled exception (Microsoft.Exchange.Diagnostics.ExAssertException: ASSERT: Unable to apply maintenance GetNonKeyColumnValuesForPrimaryKey-norow, index corruption?
   at Microsoft.Exchange.Diagnostics.ExAssert.AssertInternal(String formatString, Object[] parameters)
   at Microsoft.Exchange.Server.Storage.Common.ErrorHelper.AssertRetail(Boolean assertCondition, String message)
   at Microsoft.Exchange.Server.Storage.LazyIndexing.LogicalIndex.HandleIndexCorruptionInternal(Context context, Boolean allowFriendlyCrash, String maintenanceOperation, Nullable`1 messageDocumentId, Exception exception)
   at Microsoft.Exchange.Server.Storage.LazyIndexing.LogicalIndex.HandleIndexCorruption(Context context, Boolean allowFriendlyCrash, String maintenanceOperation, Nullable`1 messageDocumentId, Exception exception)
   at Microsoft.Exchange.Server.Storage.LazyIndexing.LogicalIndex.GetNonKeyColumnValuesForPrimaryKey(Context context, Object[] primaryKeyValues)
   at Microsoft.Exchange.Server.Storage.LazyIndexing.LogicalIndex.DoMaintenanceDelete(Context context, Byte[] propertyBlob)
   at Microsoft.Exchange.Server.Storage.LazyIndexing.LogicalIndex.ApplyMaintenance(Context context, LogicalOperation operation, Byte[] propertyBlob)
   at Microsoft.Exchange.Server.Storage.LazyIndexing.LogicalIndex.SaveOrApplyMaintenanceRecord(Context context, MaintenanceRecordData maintenanceRecord, Boolean allowDeferredMaintenanceMode)
   at Microsoft.Exchange.Server.Storage.LazyIndexing.LogicalIndex.BuildDeleteRecords(Context context, IColumnValueBag updatedPropBag, Int64& firstUpdateRecord)
   at Microsoft.Exchange.Server.Storage.LazyIndexing.LogicalIndex.BuildUpdateRecords(Context context, IColumnValueBag updatedPropBag, Int64& firstUpdateRecord)
   at Microsoft.Exchange.Server.Storage.LazyIndexing.LogicalIndex.LogUpdate(Context context, IColumnValueBag updatedPropBag, LogicalOperation operation)
   at Microsoft.Exchange.Server.Storage.LazyIndexing.LogicalIndexCache.TrackIndexUpdate(Context context, Mailbox mailbox, ExchangeId folderId, LogicalIndexType indexType, LogicalOperation operation, IColumnValueBag updatedPropBag)
...(remaining stack redacted)
   at Microsoft.Exchange.Common.IL.ILUtil.DoTryFilterCatch(TryDelegate tryDelegate, GenericFilterDelegate filterDelegate, GenericCatchDelegate catchDelegate, T state)).

Source: MSExchange Common
Event ID: 4999
Level: Error
Description:
Watson report about to be sent for process id: 9112, with parameters: E12, c-RTL-AMD64, 15.00.0913.022, M.E.Store.Worker, M.E.S.Storage.LazyIndexing, M.E.S.S.L.LogicalIndex.HandleIndexCorruptionInternal, M.E.Diagnostics.ExAssertException, a762, 15.00.0913.000.
ErrorReportingEnabled: True

Why does this issue occur?

Prior to CU6 it was possible for certain index maintenance operations to overlap each other. This would result in an inconsistency between an eager index update and a lazy index update that would happen later. More specifically, we missed outputting an update that was needed later, which left the index in a corrupted state. When the lazy index operation could not be applied later, the index was marked as corrupted, causing the crash and requiring the index to be rebuilt.

How did Microsoft find this bug?

Although the result of the index corruption is a failover, there was no availability impact in the service. The failovers successfully fixed the indices and no client impact occurred. (It should be noted that the same is reported in on-premises installations.) Thanks to customers who have enabled automatic error reporting, an uptick in reports related to lazy indexing was noticed. This allowed our development teams to evaluate the code in question and issue a fix.

Why does the Information Store Worker Process terminate due to lazy indexing?

The lazy indexing maintenance process encounters issues arising from the maintenance of the indices. When inserting a record into an index we assume the record should not already be in the index. When removing or updating a record within the index we expect that the record already exists. Due to an issue in how the indices were previously built, these constraints are violated. Our method of handling this is to terminate the Information Store Worker Process, which results in a database copy failover. The index itself is also deleted and then rebuilt the next time the index is accessed. Although a failover occurred, the high availability framework should quickly restore access to the database and the end user should not be impacted. The corrupted index is self-healed. Indices and index information are logged via transaction logging and subsequently replicated to other database copies if a Database Availability Group is utilized.
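
The two constraints and the drop-and-rebuild behavior described above can be sketched as follows. This is an illustration only: `LazyIndex`, `IndexCorruption`, and `load_keys` are hypothetical names, and the raised exception stands in for the worker process termination and failover that Exchange actually performs.

```python
class IndexCorruption(Exception):
    """Raised when a maintenance record contradicts the index contents."""


class LazyIndex:
    def __init__(self, load_keys):
        # Hypothetical callable returning the current keys of the underlying table.
        self._load_keys = load_keys
        self._keys = None  # None means "dropped, rebuild on next access"

    def apply(self, operation, key):
        if self._keys is None:
            return  # nothing to maintain; the next read rebuilds from the table
        if operation == "insert":
            if key in self._keys:  # an inserted record must not already exist
                self._drop(f"insert of existing key {key!r}")
            self._keys.add(key)
        elif operation == "delete":
            if key not in self._keys:  # a removed record must already exist
                self._drop(f"delete of missing key {key!r}")
            self._keys.remove(key)
        else:
            raise ValueError(f"unknown operation {operation!r}")

    def _drop(self, reason):
        self._keys = None  # discard the index so it is rebuilt on next access
        raise IndexCorruption(reason)

    def read(self):
        if self._keys is None:
            self._keys = set(self._load_keys())  # rebuild from the table
        return sorted(self._keys)


table = {1, 2}
index = LazyIndex(lambda: table)
index.read()
table.add(3)
index.apply("insert", 3)
try:
    index.apply("delete", 99)  # record 99 was never in the index
    crashed = False
except IndexCorruption:
    crashed = True
rebuilt = index.read()  # the dropped index is rebuilt from the table
```

The design point the sketch tries to capture is that a violated constraint is not repaired in place. The index is discarded and rebuilt from the authoritative table the next time it is accessed.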

Does rebuilding the content index or reseeding the content indices correct this issue?

The status of the content index catalogs has no impact on this issue. They are two separate indexing concepts unrelated to each other in the context of this issue.

Does reseeding the database copies or removing/adding the database copies correct this issue?

No. The indices are stored within the database and any corrupted indices would be reseeded with the database.

How do I correct the lazy indexing failures and prevent database failovers?

The majority of incorrect indices occur with Exchange 2013 Cumulative Update 5. The issue was first identified through automatic error reporting in Exchange 2013 and subsequently identified in Office 365. It was then fixed in a post-CU5 build of Exchange 2013 deployed to Office 365, where the incidence of index corruption decreased. The fix has been incorporated into Exchange 2013 Cumulative Update 6. Customers should upgrade to Exchange 2013 CU6 to correct the index creation issue and allow future index operations to proceed successfully.

Does Exchange 2013 Cumulative Update 6 prevent a lazy indexing failure?

No. There are certain reasons an index may be considered corrupted. Isolated index corruption with subsequent self healing may occur. It is also important to note that indices could have been created in Exchange 2013 CU5 that are corrupted. When these indices are accessed on Exchange 2013 CU6 and newer a database failover may result as the indices are still corrupted. Exchange 2013 CU6 corrects the building of the initial indices which should decrease the frequency of lazy indexing resulting in database failovers.

Can I expect failovers for the foreseeable future?

In the short term after application of Exchange 2013 CU6, customers may continue to experience failovers if a corrupted index is accessed. The Information Store process automatically cleans up indices that have not been accessed for 30 days. The issue should self-correct either through failover and immediate cleanup or after the indices are aged out.
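
As a rough illustration of the aging behavior, the sketch below splits a set of indices into those kept and those dropped based on last access time. The function name, the index names, and the handling of the exact 30-day boundary are assumptions; only the 30-day figure comes from the text above.

```python
from datetime import datetime, timedelta

MAX_IDLE = timedelta(days=30)  # from the article; the exact boundary handling is assumed


def age_out(last_access, now, max_idle=MAX_IDLE):
    """Split index names into (kept, dropped) by time since last access."""
    kept, dropped = [], []
    for name, accessed in sorted(last_access.items()):
        (dropped if now - accessed > max_idle else kept).append(name)
    return kept, dropped


now = datetime(2014, 10, 3)
last_access = {  # hypothetical index names
    "inbox_by_date": now - timedelta(days=2),
    "old_search_folder_view": now - timedelta(days=45),
}
kept, dropped = age_out(last_access, now)
```

In this model a corrupted index that nobody reads simply disappears once it has been idle long enough, without ever triggering a failover. Calling `age_out` with `max_idle=timedelta(0)` drops every previously accessed index, which loosely mirrors an index age timeout of 0.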

Can I do something today to correct the corrupted indices?

There are two interventions administrators can utilize to discard corrupted indices. The first is to move the mailboxes to a different database. The move process discards indices during the move. The second is to execute a New-MailboxRepairRequest with the CorruptionType “DropAllLazyIndices” parameter against the mailbox. The New-MailboxRepairRequest effectively sets the index age timeout for a given mailbox to 0. The repair process will render the mailbox inaccessible while the repair is in progress and could have significant performance impacts on the server. WE DO NOT recommend either of these options since they would have to be run in bulk against all mailboxes whether or not they have corrupted indices. There is no proactive method to scan for index corruption and identify mailboxes to target the move or request against.

Customers that have opened cases with support have reported a significant decrease in the number of failovers associated with lazy indexing after the application of Exchange 2013 CU6. Failovers will continue until all corrupted indices have been accessed, deleted, and subsequently rebuilt. Customers who experience this issue are advised to test and deploy Exchange 2013 CU6 as soon as possible. If upgrading is not possible the database failovers may continue with no other negative side effects noted.

Tim McMichael
Senior Support Escalation Engineer
