MSExchangeRepl 2147 / MSExchangeRepl 2104 / MSExchangeRepl 2127 occurring on Windows 2008 or Windows 2008 R2 with Exchange 2007 Cluster Continuous Replication (CCR)

When Exchange 2007 CCR is installed on Windows 2008 or Windows 2008 R2, the following error may be noted in the application log of the passive node:

Log Name: Application
Source: MSExchangeRepl
Event ID: 2104
Task Category: Service
Level: Error
Keywords: Classic
User: N/A
Computer: MACHINE
Description:
Log file action LogCopy failed for storage group EXCLUST01\SG2. Reason:
CreateFile(\\Server\StorageGroupGUID$\LogFile.log) = 2

If the CCR cluster is not utilizing continuous replication host names, the following event series may also be noted:

Event ID : 2147
Raw Event ID : 2147
Source : MSExchangeRepl
Type : Error
Machine : SERVER
Message : There was a problem with 'ActiveNode', which is an alternate name for 'ActiveNode'. The list of aliases is now 'ActiveNode', and the alias 'was' removed from the list. The specific problem is 'CreateFile(\\ActiveNode\StorageGroupGuid$\LogFile.log) = 2'.

ID: 2127
Level: Information
Provider: MSExchangeRepl
Machine: SERVER
Message: The system has detected a change in the available replication networks. The system is now using network 'ActiveNode' instead of network 'ActiveNode' for log copying from node ActiveNode.

In this situation, if the solution is aggressively monitored, you may note that replication temporarily fails and then automatically resumes as healthy. This occurs because replication pauses temporarily when the error condition is detected, while the replication service attempts to find other replication paths. The service then automatically retries the same copy operation.

If the CCR cluster is utilizing continuous replication host names, the following event series may also be noted:

Event ID : 2147
Raw Event ID : 2147
Source : MSExchangeRepl
Type : Error
Machine : SERVER
Message : There was a problem with ‘ReplicationHostName’, which is an alternate name for 'ActiveNode'. The list of aliases is now 'ActiveNode', and the alias 'was' removed from the list. The specific problem is 'CreateFile(\\ReplicationHostName\StorageGroupGUID$\LogFile.log) = 2'.

ID: 2127
Level: Information
Provider: MSExchangeRepl
Machine: SERVER
Message: The system has detected a change in the available replication networks. The system is now using network 'ActiveNode' instead of network ‘ReplicationHostName’ for log copying from node ActiveNode.

Error 2 is ERROR_FILE_NOT_FOUND.
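When reviewing many of these events, it can help to pull the path and the Win32 error code out of the event description. The following is a minimal sketch in Python, assuming the description contains text in the `CreateFile(<path>) = <code>` form shown in the events above. The function name and the small error table are illustrative and not part of Exchange.

```python
import re

# Small illustrative subset of Win32 error codes (values from winerror.h).
WIN32_ERRORS = {
    2: "ERROR_FILE_NOT_FOUND",
    3: "ERROR_PATH_NOT_FOUND",
    5: "ERROR_ACCESS_DENIED",
}

# Matches text such as: CreateFile(\\Server\Share$\LogFile.log) = 2
# Assumption: the path itself does not contain a ')' character.
CREATEFILE_RE = re.compile(r"CreateFile\((?P<path>[^)]*)\)\s*=\s*(?P<code>\d+)")


def parse_createfile_failure(message):
    """Return (path, code, symbolic_name) from an event description, or None."""
    match = CREATEFILE_RE.search(message)
    if match is None:
        return None
    code = int(match.group("code"))
    return match.group("path"), code, WIN32_ERRORS.get(code, "UNKNOWN")
```

For example, passing the description text of the 2104 event above yields the UNC path of the log file, the code 2, and the name ERROR_FILE_NOT_FOUND.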

In this situation the error is detected on the replication host name. The replication service will temporarily pause replication while other network paths are enumerated. If other continuous replication host names are in use, the replication service will select an alternate replication host name and automatically resume log copying. If the only valid path is the “public” path, the replication service will begin copying log files over the “public” network. Eventually this error occurs on the public network as well, forcing network re-enumeration and causing replication to automatically switch back to the replication network. If the solution is aggressively monitored, the replication status may show as failed during this switch, but it will automatically return to healthy.

In almost all instances these errors are benign to the operation of the Exchange server.

The replication service is extremely aggressive in its attempts to copy log files. The replication service is always aware of the next log file in the series that requires copying to the passive node. As part of normal processes the replication service may query multiple times for the presence of this file and make copy attempts. These attempts may result in the replication service querying for a log file that is not fully available. Under Windows 2003 this was not necessarily an issue. Windows 2008 introduces a component into SMBv2 that may cause this to be a problem.

SMBv2 introduces status caching into the LanManWorkstation service. When an application requests information from a file share, the workstation service caches the response from the server hosting the share. Subsequent requests for the same information are returned from cache rather than re-contacting the server hosting the share. Eventually this cache expires (in our case it expires by the time replication is failed and resumed, or a switch between replication host names occurs). The replication service has received feedback that the log file in question should now be available for copy, attempts to copy it, and receives an older cached status that the file is not found (even though the file does exist on the source at the time the attempt is made). In turn the replication service detects this as an error condition and takes action.
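The effect of a cached "file not found" status can be illustrated with a toy model. This Python sketch is not the actual LanManWorkstation implementation. The class name, the explicit `now` timestamps, and the lifetime of 5 time units are assumptions chosen only to show how a stale cached answer can hide a file that already exists on the server.

```python
ERROR_SUCCESS = 0
ERROR_FILE_NOT_FOUND = 2


class CachingRedirector:
    """Toy client that caches 'file not found' answers for a fixed lifetime."""

    def __init__(self, server_files, not_found_lifetime):
        self.server_files = server_files          # set of names on the "server"
        self.not_found_lifetime = not_found_lifetime
        self._not_found = {}                      # name -> time the cached miss expires

    def open(self, name, now):
        expires = self._not_found.get(name)
        if expires is not None and now < expires:
            # Answered from cache; the server is not contacted.
            return ERROR_FILE_NOT_FOUND
        self._not_found.pop(name, None)           # cached entry expired or absent
        if name in self.server_files:
            return ERROR_SUCCESS
        if self.not_found_lifetime > 0:
            self._not_found[name] = now + self.not_found_lifetime
        return ERROR_FILE_NOT_FOUND


server = set()
client = CachingRedirector(server, not_found_lifetime=5)
first = client.open("LogFile.log", now=0)    # file not yet on server; miss is cached
server.add("LogFile.log")                    # active node finishes the log file
second = client.open("LogFile.log", now=1)   # stale cached miss is returned
third = client.open("LogFile.log", now=6)    # cache expired; server is asked again
```

In this model `first` and `second` are both ERROR_FILE_NOT_FOUND and `third` is ERROR_SUCCESS. With `not_found_lifetime=0` nothing is cached, so the second attempt would succeed immediately.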

From a Windows 2008 / Windows 2008 R2 perspective this is by design.

To correct these errors on an Exchange 2007 / Windows 2008 or Exchange 2007 / Windows 2008 R2 implementation, the following registry values should be set to zero (0) under the key below, and the nodes rebooted:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Lanmanworkstation\Parameters

FileInfoCacheLifetime [DWORD]

FileNotFoundCacheLifetime [DWORD]

DirectoryCacheLifetime [DWORD]

If the DWORD values are not present, they will need to be created. The recommended value is 0 (hexadecimal or decimal).
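For administrators who prefer to apply the change from a file, the settings above can be expressed as a registry import (.reg) file. The following Python sketch only builds the text of such a file. The function name is illustrative, and writing the text to disk in an encoding that the registry editor accepts, importing it, and rebooting the nodes are left to the administrator. Review the generated text before importing it on any node.

```python
KEY_PATH = (
    r"HKEY_LOCAL_MACHINE\System\CurrentControlSet"
    r"\Services\Lanmanworkstation\Parameters"
)

VALUE_NAMES = (
    "FileInfoCacheLifetime",
    "FileNotFoundCacheLifetime",
    "DirectoryCacheLifetime",
)


def build_reg_file(key_path=KEY_PATH, value_names=VALUE_NAMES, data=0):
    """Return .reg file text that sets each named DWORD value to `data`."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{key_path}]"]
    # DWORD data in .reg files is written as eight hexadecimal digits.
    lines += [f'"{name}"=dword:{data:08x}' for name in value_names]
    return "\r\n".join(lines) + "\r\n"


reg_text = build_reg_file()
```

The resulting text contains one `"Name"=dword:00000000` line per value under the Lanmanworkstation\Parameters key.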

More information on these values can be found here: https://technet.microsoft.com/en-us/library/ff686200(WS.10).aspx (Note that the registry path in the article is missing the Services key; the correct path is the one given in this post.)