Exchange 2010: VSS backups of passive database copies fail with error C7FF07D7 when preparing snapshot.

Recently I worked with a customer who was seeing backup failures when attempting to back up passive database copies on an Exchange 2010 DAG member.  Active database copies backed up without any issues.

 

The issue reproduced both with the commercial VSS backup product and with VSS test procedures using the DISKSHADOW utility.

 

When reviewing the application log at the time of the backup, the following events were noted:

 

Time: 6/28/2011 10:34:49 AM
ID: 2021
Level: Information
Source: MSExchangeRepl
Machine: server.company.com
Message: The Microsoft Exchange VSS Writer has successfully collected the metadata document in preparation for backup.

Time: 6/28/2011 10:35:15 AM
ID: 9606
Level: Information
Source: MSExchangeIS
Machine: server.company.com
Message: Exchange VSS Writer (instance 0afd4825-b904-4bf0-87ee-93568351c4ca) has prepared for backup successfully.

Time: 6/28/2011 10:35:16 AM
ID: 2110
Level: Information
Source: MSExchangeRepl
Machine: server.company.com
Message: The Microsoft Exchange VSS Writer instance 0afd4825-b904-4bf0-87ee-93568351c4ca has successfully prepared for a full or a copy backup of database 'nambx1-old'. The following database will be backed up: <DATABASE>.

Time: 6/28/2011 10:35:16 AM
ID: 2023
Level: Information
Source: MSExchangeRepl
Machine: server.company.com
Message: The Microsoft Exchange Replication service VSS Writer (Instance 0afd4825-b904-4bf0-87ee-93568351c4ca) successfully prepared for backup.

Time: 6/28/2011 10:35:17 AM
ID: 2021
Level: Information
Source: MSExchangeRepl
Machine: server.company.com
Message: The Microsoft Exchange VSS Writer has successfully collected the metadata document in preparation for backup.

Time: 6/28/2011 10:35:55 AM
ID: 9539
Level: Information
Source: MSExchangeIS Mailbox Store
Machine: server.company.com
Message: The Microsoft Exchange Information Store database "b79d42eb-c574-4ebb-8467-b3d0ec166817: /o=Organization/ou=Exchange Administrative Group(FYDIBOHF23SPDLT)/cn=Configuration/cn=Servers/cn=server.company.com/cn=Microsoft Private MDB" was stopped.

Time: 6/28/2011 10:37:05 AM
ID: 2027
Level: Information
Source: MSExchangeRepl
Machine: server.company.com
Message: The Microsoft Exchange VSS Writer instance 0afd4825-b904-4bf0-87ee-93568351c4ca has successfully frozen the databases.

Time: 6/28/2011 10:37:26 AM
ID: 2026
Level: Error
Source: MSExchangeRepl
Machine: server.company.com
Message: The Microsoft Exchange Replication service VSS Writer (Instance 0afd4825-b904-4bf0-87ee-93568351c4ca) failed with error C7FF07D7 when preparing for snapshot.

Time: 6/28/2011 10:37:26 AM
ID: 8229
Level: Warning
Source: VSS
Machine: server.company.com
Message: A VSS writer has rejected an event with error 0x800423f3, The writer experienced a transient error. If the backup process is retried,
the error may not reoccur.
. Changes that the writer made to the writer components while handling the event will not be available to the requester.
Check the event log for related events from the application hosting the VSS writer.

Operation:
PrepareForSnapshot Event

Context:
Execution Context: Writer
Writer Class Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
Writer Name: Microsoft Exchange Replica Writer
Writer Instance Name: Exchange Replication Service
Writer Instance ID: {a31bfcaa-668f-4a81-9cde-f9dfa2cadd5a}
Command Line: "C:\Program Files\Microsoft\Exchange Server\V14\bin\msexchangerepl.exe"
Process ID: 3972

Time: 6/28/2011 10:37:26 AM
ID: 2031
Level: Information
Source: MSExchangeRepl
Machine: server.company.com
Message: The Microsoft Exchange Replication service VSS Writer (Instance 0afd4825-b904-4bf0-87ee-93568351c4ca) has successfully terminated the snapshot.

The event sequence essentially told us that the backup had reached the point where the freeze of the database is invoked.  A failure directly before this step caused the Replication service VSS writer to abort the backup.  The abort was in turn returned to the VSS framework, and the passive copy backup was cleaned up.
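Picking the first non-informational entry out of a long event dump is easier with a little tooling. The following is a minimal Python sketch, not part of the original troubleshooting. It assumes the events have been pasted as text in the Time / ID / Level / Source / Machine / Message layout shown above, and the names parse_events and problems are hypothetical.

```python
# Two events copied from the listing above, used as sample input.
SAMPLE = """\
Time: 6/28/2011 10:37:05 AM
ID: 2027
Level: Information
Source: MSExchangeRepl
Machine: server.company.com
Message: The Microsoft Exchange VSS Writer instance 0afd4825-b904-4bf0-87ee-93568351c4ca has successfully frozen the databases.

Time: 6/28/2011 10:37:26 AM
ID: 2026
Level: Error
Source: MSExchangeRepl
Machine: server.company.com
Message: The Microsoft Exchange Replication service VSS Writer (Instance 0afd4825-b904-4bf0-87ee-93568351c4ca) failed with error C7FF07D7 when preparing for snapshot.
"""


def parse_events(text):
    """Split pasted event text into a list of dicts, one per event.

    Assumption: every event starts with a 'Time: ' line, and everything
    after 'Message:' belongs to the message until the next 'Time: ' line.
    """
    events, current, in_message = [], None, False
    for line in text.splitlines():
        if line.startswith("Time: "):
            current = {"Time": line[len("Time: "):].strip(), "Message": ""}
            events.append(current)
            in_message = False
        elif current is None:
            continue  # ignore anything before the first event
        elif in_message:
            if line.strip():
                # Multi-line messages (such as event 8229) are joined with spaces.
                current["Message"] += " " + line.strip()
        elif line.startswith("Message:"):
            current["Message"] = line[len("Message:"):].strip()
            in_message = True
        elif ":" in line:
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    return events


def problems(events):
    """Keep only the events logged at Error or Warning level."""
    return [e for e in events if e.get("Level") in ("Error", "Warning")]


if __name__ == "__main__":
    for event in problems(parse_events(SAMPLE)):
        print(event["Time"], event["ID"], event["Source"], event["Message"])
```

Run against the full listing, a filter like this would be expected to surface only events 2026 and 8229, which is where the investigation went next.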

 

I focused specifically on event 2026 and its error C7FF07D7.  While researching, I noticed that other products and components also produce C7FF07D7 errors.  In those cases the error was returned when an RPC call between services failed, and a common theme was a networking or connectivity issue.

 

With this information in hand, I started running basic ping tests between the nodes to check connectivity, dropped packets, and so on.  This is where the breakthrough on this particular issue came.  When pinging the nodes by NetBIOS name, the output looked as follows:

 

Pinging NODE [W.X.Y.Z] with 32 bytes of data:
Reply from W.X.Y.Z: bytes=32 time=3ms TTL=128
Reply from W.X.Y.Z: bytes=32 time<1ms TTL=128
Reply from W.X.Y.Z: bytes=32 time<1ms TTL=128
Reply from W.X.Y.Z: bytes=32 time<1ms TTL=128

Ping statistics for W.X.Y.Z:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 3ms, Average = 0ms

 

When pinging the NetBIOS name of an Exchange server, we expect the fully qualified domain name to be displayed, with the domain appended to the short name.  Generally the domain is appended from the DNS search suffix list (if specified) or from the AD domain the server is a member of.  In this case no domain name was appended to the server name.  That points either to an issue with the DNS search suffix list (which was populated appropriately and therefore not our problem) or to an entry in the hosts file.
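The same check can be scripted instead of read off ping output. The sketch below is an illustration only and assumes Python is available on the node. It uses the standard library call socket.getfqdn, whose result depends entirely on the local resolver (hosts file, DNS suffixes, reverse lookups), so treat it as a hint rather than a verdict. The helper names is_fully_qualified and check_member are hypothetical.

```python
import socket


def is_fully_qualified(name):
    """Return True when name looks like host.domain rather than a bare short name.

    A dotted string made up only of digits (an IPv4 address) is not
    counted as a fully qualified name.
    """
    stripped = name.rstrip(".")
    if "." not in stripped:
        return False
    return not stripped.replace(".", "").isdigit()


def check_member(short_name, resolver=socket.getfqdn):
    """Resolve short_name and report whether the result is fully qualified.

    The resolver argument defaults to socket.getfqdn and exists so that
    the check can be exercised without touching the network.
    """
    resolved = resolver(short_name)
    return resolved, is_fully_qualified(resolved)


# Example with stand-in resolvers that mimic the two ping outputs in this post:
# check_member("NODE", resolver=lambda n: "NODE")              -> ("NODE", False)
# check_member("NODE", resolver=lambda n: "NODE.COMPANY.COM")  -> ("NODE.COMPANY.COM", True)
```

Calling check_member("NODE") with the default resolver on each DAG member, once per peer, gives a quick way to spot a node where short names are not being expanded.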

 

When reviewing the hosts file, the following contents were noted:

 

# Copyright (c) 1993-2009 Microsoft Corp.
#
# This is a sample HOSTS file used by Microsoft TCP/IP for Windows.
#
# This file contains the mappings of IP addresses to host names. Each
# entry should be kept on an individual line. The IP address should
# be placed in the first column followed by the corresponding host name.
# The IP address and the host name should be separated by at least one
# space.
#
# Additionally, comments (such as these) may be inserted on individual
# lines or following the machine name denoted by a '#' symbol.
#
# For example:
#
# 102.54.94.97 rhino.acme.com # source server
# 38.25.63.10 x.acme.com # x client host

# localhost name resolution is handled within DNS itself.
# 127.0.0.1 localhost
# ::1 localhost

W.X.Y.Z NODE

 

At or near the time the backup issue started occurring, the hosts files on the nodes had been modified to include an entry for each member of the DAG.  The entries included only the NetBIOS name of each member, not the fully qualified domain name.  Once the entry was removed from the hosts file and the DNS resolver cache was flushed (ipconfig /flushdns), a ping test displayed the expected results:

 

Pinging NODE.COMPANY.COM [W.X.Y.Z] with 32 bytes of data:
Reply from W.X.Y.Z: bytes=32 time=3ms TTL=128
Reply from W.X.Y.Z: bytes=32 time<1ms TTL=128
Reply from W.X.Y.Z: bytes=32 time<1ms TTL=128
Reply from W.X.Y.Z: bytes=32 time<1ms TTL=128

Ping statistics for W.X.Y.Z:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 3ms, Average = 0ms
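
To look for the same kind of entry on other DAG members, the hosts file can be scanned for lines that map an address to short names only. This is a minimal Python sketch under the assumption that the file follows the standard hosts layout (address first, then one or more names, with # starting a comment). The function name short_name_only_entries is hypothetical.

```python
# The relevant tail of the hosts file shown earlier in this post.
SAMPLE_HOSTS = """\
# localhost name resolution is handled within DNS itself.
# 127.0.0.1 localhost
# ::1 localhost

W.X.Y.Z NODE
"""


def short_name_only_entries(hosts_text):
    """Return (address, names) for entries where no name contains a dot.

    Comments and blank lines are skipped, and 'localhost' is ignored
    because it is expected to be a short name.
    """
    flagged = []
    for raw_line in hosts_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # an address with no name is not a usable entry
        address = fields[0]
        names = [n for n in fields[1:] if n.lower() != "localhost"]
        if names and all("." not in name for name in names):
            flagged.append((address, names))
    return flagged


if __name__ == "__main__":
    for address, names in short_name_only_entries(SAMPLE_HOSTS):
        print("short name only:", address, " ".join(names))
```

On a real node, the text would be read from the hosts file under %SystemRoot%\System32\drivers\etc before being passed to the function.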

When a passive copy backup is performed (a surrogate backup), certain mandatory information must be exchanged between the Replication service on the server hosting the passive copy and the Information Store service on the server hosting the active copy.  This exchange happens before the database is frozen to service the snapshot.  If the information cannot be exchanged for any reason, the Replication service aborts the VSS backup and the backup fails.  In this case, name resolution between the nodes was not working as expected, so the connection and the information exchange failed, which prevented passive copy backups from succeeding.