Exchange and VSS — My Exchange writer is in a failed retryable state…


In Exchange 2007 and Exchange 2010 many customers rely on VSS-based backups to retain and protect their Exchange data.  By default Exchange provides two VSS writers that share the same VSS writer ID but are loaded by two different services.  The first is the Exchange Information Store VSS writer and the second is the Exchange Replication Service VSS writer.  The Information Store writer allows for the backup of active / mounted databases, and the Replication Service writer allows for the backup of passive databases (should a replicated database model be utilized).  You can see the writers by running the command VSSADMIN LIST WRITERS from a command prompt.

 

Here is sample output of VSSAdmin List Writers from a Windows 2008 R2 SP1 server with Exchange 2010 SP1.  Note how both writers share the same writer ID within the VSS framework.

 

Writer name: ‘Microsoft Exchange Replica Writer’
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {17e8df11-a8a2-4ee3-a3fb-e552b7da2d83}
   State: [1] Stable
   Last error: No error

 

Writer name: ‘Microsoft Exchange Writer’
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {e0ad4b68-8938-4be5-9b88-4c74df2b2d65}
   State: [1] Stable
   Last error: No error
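
To make the shared writer ID easy to spot, the output above can be parsed into one record per writer. This is only a minimal sketch: the sample text is pasted in rather than captured from a live server, the function name parse_writers is made up for illustration, and it assumes the output keeps the "key: value" layout shown above.

```python
import re

# Sample text copied from the output above (plain quotes used for simplicity).
SAMPLE = """\
Writer name: 'Microsoft Exchange Replica Writer'
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {17e8df11-a8a2-4ee3-a3fb-e552b7da2d83}
   State: [1] Stable
   Last error: No error

Writer name: 'Microsoft Exchange Writer'
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {e0ad4b68-8938-4be5-9b88-4c74df2b2d65}
   State: [1] Stable
   Last error: No error
"""


def parse_writers(text):
    """Split vssadmin-style output into one dict per writer (hypothetical helper)."""
    writers = []
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # blank or unrelated line
        key, value = key.strip(), value.strip()
        if key == "Writer name":
            # Each "Writer name" line starts a new record.
            writers.append({"name": value.strip("'\u2018\u2019")})
        elif writers and key == "Writer Id":
            writers[-1]["writer_id"] = value
        elif writers and key == "Writer Instance Id":
            writers[-1]["instance_id"] = value
        elif writers and key == "State":
            match = re.match(r"\[(\d+)\]\s*(.*)", value)
            if match:
                writers[-1]["state_code"] = int(match.group(1))
                writers[-1]["state"] = match.group(2)
        elif writers and key == "Last error":
            writers[-1]["last_error"] = value
    return writers


for w in parse_writers(SAMPLE):
    print(w["name"], w["writer_id"], w["state"], w["last_error"])
```

Both records come back with the same writer_id but different instance_id values, which is why the writer name (and the service behind it) is the useful thing to look at when one of them fails.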

In the course of protecting Exchange servers there may be conditions that cause a backup job to fail.  When an Exchange backup job fails, the VSS framework aborts the backup and Exchange subsequently clears the backup-in-progress settings.  When a failure is encountered, either a single Exchange writer or both Exchange writers may be left in a FAILED RETRYABLE state.  We can use VSSAdmin List Writers again to query the writer status and see these results.  Here is an example showing the Exchange Replication Service writer with a state of 8 FAILED and a last error of RETRYABLE.

 

Writer name: ‘Microsoft Exchange Replica Writer’
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {17e8df11-a8a2-4ee3-a3fb-e552b7da2d83}
   State: [8] Failed
   Last error: Retryable error

 

Writer name: ‘Microsoft Exchange Writer’
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {e0ad4b68-8938-4be5-9b88-4c74df2b2d65}
   State: [1] Stable
   Last error: No error

 

Now the typical question that comes up at this point is: how do I actually deal with an Exchange writer that consistently disallows backups?  The answer – restart the service that the writer is associated with and/or fix whatever configuration issue is causing the failures.  For example, given the above output I would restart the Exchange Replication Service in an attempt to return the writer to a Stable / No error state.  (Had it been the Microsoft Exchange Writer, I would have restarted the Exchange Information Store Service.)
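
The writer-to-service pairing described above can be written down as a small lookup. This is a sketch under stated assumptions: the short service names MSExchangeIS and MSExchangeRepl are what I would expect on a default Exchange 2007 / 2010 install and should be verified locally, and the helper name services_to_restart is invented for illustration.

```python
# Writer name -> Windows service that loads it. The short service names are
# assumptions based on default Exchange 2007/2010 installs; verify locally.
WRITER_SERVICE = {
    "Microsoft Exchange Writer": "MSExchangeIS",            # Information Store
    "Microsoft Exchange Replica Writer": "MSExchangeRepl",  # Replication Service
}


def services_to_restart(writers):
    """Return the services behind writers that are not Stable / No error."""
    services = []
    for w in writers:
        healthy = w["state"] == "Stable" and w["last_error"] == "No error"
        service = WRITER_SERVICE.get(w["name"])
        if not healthy and service and service not in services:
            services.append(service)
    return services


# Hypothetical input mirroring the failed example above.
observed = [
    {"name": "Microsoft Exchange Replica Writer", "state": "Failed",
     "last_error": "Retryable error"},
    {"name": "Microsoft Exchange Writer", "state": "Stable",
     "last_error": "No error"},
]
print(services_to_restart(observed))
```

The lookup only tells you which service owns the writer. Whether a restart is needed at all is a separate question.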

The real question though is: do I need to deal with a writer that is in a failed state?  Unfortunately many administrators find themselves having to deal with a writer in a failed state because their experience is that while the writer is in a failed state subsequent backup jobs fail.  If you review the issues carefully, what you’ll find is that the backup jobs are not failing because of a VSS failure; rather, they are failing because a writer was found in a failed state.  From an Exchange / VSS perspective this is unexpected –> after all, although the writer is failed, the error is RETRYABLE –> essentially saying “hey…something failed but come on back and try me again…”

 

Let’s take a look at why this might be happening….

 

Within the VSS framework there are two states that we are interested in –> the Session State and the Current State.  When a VSS session is in progress and an administrator runs VSSAdmin List Writers, the state that is displayed is the state of the current session.  When the VSS snapshot creation has completed, the current state becomes a session-specific state and the status of the most recently completed session is copied to the current state.  At this point when the administrator runs VSSAdmin List Writers the state of the most recently completed session is displayed.  This is an important distinction –> THE STATE DISPLAYED AT THIS POINT REFLECTS THE STATUS OF THE LAST SESSION!  The status of the last session does not imply anything about the success or failure of future sessions.

Now that we know where VSSAdmin List Writers gets its information we’ll take a look at how the backup process should progress.  (I’m going to attempt to present an overly simplified timeline of an expected backup)

The process starts with the VSS requester establishing a VSS session. 

 


 

After the session is established the VSS requester requests metadata from the VSS framework.

 


 

At this point the VSS requester and the VSS framework advance the snapshot process by determining components and preparing the snapshot set.

 


 

Once the components and snapshot sets have been prepared, the VSS requester issues a PrepareForBackup.  This in turn causes the VSS framework to prepare the components for backup.

 


 

After PrepareForBackup is called, the individual application-level writers are responsible for the current writer status.  The VSS requester is now allowed to call GatherWriterStatus.  This call in turn should return the current writer status.  For example, current writer status at this stage could be FREEZE / THAW / etc.  This is regardless of whether the previous status was FAILED or HEALTHY.  This is the status that the VSS requester should be using to make logic decisions at this point.

 


 

Once the snapshot is created the contents can then be transferred to the backup media.  Once the transfer is complete, the VSS requester can inform the VSS framework that the backup has completed successfully, and the VSS session is subsequently ended.

 


 

In summary if the VSS requester is performing operations in an order that is expected, the writer status should be queried after the framework has received a prepare for backup event.  This will ensure the writer status reflects that of the CURRENT SESSION IN PROGRESS and not the SESSION STATE OF THE PREVIOUS BACKUP.
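
The difference the ordering makes can be shown with a toy model. This is not the VSS API: the ToyWriter class, its state names, and both requester functions are hypothetical stand-ins that only mimic the behavior described above, where a writer keeps reporting the last session's result until a new session reaches prepare for backup.

```python
class ToyWriter:
    """Toy model: remembers the last session's outcome until a new session
    reaches prepare for backup. Not the real VSS API."""

    def __init__(self, state="STABLE"):
        self.state = state

    def on_prepare_backup(self):
        # A new session overwrites whatever the previous session left behind.
        self.state = "WAITING_FOR_FREEZE"

    def gather_status(self):
        return self.state


def requester_checks_first(writer):
    """Out-of-order requester: judges the writer before preparing it."""
    if writer.gather_status().startswith("FAILED"):
        return "aborted: writer reported " + writer.gather_status()
    writer.on_prepare_backup()
    return "proceeding"


def requester_prepares_first(writer):
    """Expected order: prepare, then judge the current session's status."""
    writer.on_prepare_backup()
    if writer.gather_status().startswith("FAILED"):
        return "aborted: writer reported " + writer.gather_status()
    return "proceeding"


# Same leftover state from a previous failed backup, two different outcomes.
print(requester_checks_first(ToyWriter("FAILED_AT_FREEZE")))
print(requester_prepares_first(ToyWriter("FAILED_AT_FREEZE")))
```

In this sketch the first requester gives up on a stale status without ever exercising VSS, while the second one moves the writer into the new session before looking at it.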

 

The administrator can verify the functionality of the Exchange writer by using the VSHADOW or DISKSHADOW utilities.  These utilities follow the workflow outlined above, which handles a failed retryable writer successfully.  If either of these utilities is successful in creating the backup, and the writer in turn is returned to a healthy state, you might consider following up with the backup vendor to ensure VSS calls are being made appropriately.  Microsoft can also assist you in verifying the calls are made appropriately by assisting with both Exchange and OS VSS tracing.


Comments (47)

  1. TIMMCMIC says:

    @Armando…

    So for Exchange tracing you can turn on some trace tags as well as turn up diagnostic logging. This captures the Exchange portions of these events. Honestly though – to get it right – you have to be working with support. The output file is not readable by you
    as an administrator.

    TIMMCMIC

  2. TIMMCMIC says:

    @Adam:

    OK – so we can focus on the first error which is backup in progress. Let’s expand on our previous steps:

    1) Gather metadata.
    2) Call preparation
    3) Prepare snapshot
    4) Present snapshot
    5) Validate snapshot
    6) Copy snapshot to media
    7) Invoke backup complete

    Now – invoking backup complete is a three step process:

    1) Your backup software tells VSS backup is complete.
    2) VSS tells exchange backup is complete.
    3) Exchange acks backup complete.

    Now the fact that you are getting blocked up front because backup is in progress is expected if there is an actual backup in progress.

    Here’s what you need to do to effectively troubleshoot this.

    1) Restart the exchange information store and replication service.
    2) Ensure all vss writers are healthy.
    3) Ensure that if you do a get-mailboxDatabase -status | Fl *backup* that backup in progress shows as false.

    Then you need to monitor your backup application. If your backup application shows that the backup is "complete" but get-mailboxdatabase -status shows that a backup is still in progress, something went wrong in steps 1-3 of backup complete. What do we look at here? We look
    at the application logs around the time the backup software said the backup completed. We should see in the logs where a backup complete was received. If not, we need to troubleshoot this from the outside in. What I mean by that: we need to work with the vendor to
    ensure that backup complete is being called, then we can look at VSS and Exchange tracing (which unfortunately would require a case).

    Nine times out of ten I find that backup complete was never called, the agent crashed, etc., which then causes everything to topple over moving forward.

    TIMMCMIC

  3. TIMMCMIC says:

    @Adam:

    Yes – that’s what we’re here for.

    TIMMCMIC

  4. Anonymous says:

    @TT:

    I am suggesting that there are occasions where this type of issue occurs because of the order of calls the backup vendor is making.

    More importantly what I'm suggesting though is that a failed writer is not necessarily something that needs to be fixed.  In supporting these types of cases what I've noticed is a lot of attention paid to the state of the writer and "fixing" the writer from a failed state.  In theory the writer being failed is simply telling you that a previous operation failed and should not preclude the taking of future backups (and therefore is not something that requires fixing before continuing to troubleshoot a backup issue).

    TIMMCMIC

  5. Anonymous says:

    @TT:

    In most cases a writer left in a failed state means the previous backup failed.  What you are hopefully looking at is what led up to the failed writer and not the failed writer itself…

    TIMMCMIC

  6. ItalianDutch75 says:

    indeed we do disagree, the retryable error is the worst-written error Microsoft has ever written and I will not point our customers to this thread, as it won’t help them. end of story.

  7. Anonymous says:

    Andreas:

    First and foremost I think it's a great assumption on your part that there's some deficiency in either VSS or Exchange that is causing your issue.

    If you read the error you'll see that it indicates a timeout has occurred.  VSS / Exchange has 30 seconds to allocate the snapshot in order to satisfy the backup.  When a timeout error occurs it's usually related to ancillary items:

    1) The incorrect volume formatting is utilized (for example an Exchange server should utilize 64K formatting not the default 4K)

    2)  The volume is defragmented (being exacerbated by item 1 in this list).

    3)  The hardware is having issues at the time of the backup.

    Maybe you should open a support case and allow the issue to be investigated.  I would also have expected the backup vendor to have some insight into these types of issues.

    TIMMCMIC

  8. TIMMCMIC says:

    @MD2000…

    Maybe it’s time for me to just write a different blog post – this one has taken on a life of its own…

    To answer your question…

    The item addressed in this blog post was written when we saw a rise in VSS cases due to writers being left retryable. When we started tracing we found that if something had previously failed, and the writer was left failed retryable, all backups from that point forward
    would fail without VSS even being attempted. (IE – the backup software was making a determination that the server was unhealthy based simply on the writer status). Fortunately it’s been quite some time since we have seen an issue like that.

    The second thing I was trying to highlight here is that we still today see customer calls, and references from other vendors, saying that you need your VSS writer fixed because it’s failed / retryable -> a failed retryable writer is a symptom /
    result, not something that in itself needs to be fixed.

    So in your particular case you’ve ended up with a VSS writer that is in a failed / retryable state. This state indicates that something failed -> come try it again. How do you go about fixing it – you go about finding the failure to begin with. (That sounds
    simple – but sometimes it is not so simple).

    A VSS backup really boils down to a few stages:

    1) Collection of VSS metadata and components.
    2) Execution of the snapshot.
    3) Verification of the exchange data (optional).
    4) Transfer of data to media.
    5) Notification of backup complete.

    To determine why a VSS writer is failed / retryable we need to start by looking at the application log. There is an event sequence that fires for each one of these items. You need to follow the event sequence through and see where the failure actually occurred.

    If the metadata and snapshot complete successfully -> and the errors occur in data transfer -> then we need to consult the backup job log. Actually – we really need to consult the backup job log anyway to see where it thinks there was a failure too.

    I don’t have time at the moment to list the actual event sequence – and it varies by Exchange version – but hopefully this helps you.

    BTW – to get the writer back to no error (which should not be necessary) – you need to restart the service associated with the writer.

    TIMMCMIC

  9. TIMMCMIC says:

    @Adam:

    That screen shot looks good.

    TIMMCMIC

  10. TIMMCMIC says:

    @8bit_pirate… Thanks for the comment. It would depend on what it means for diskshadow to have an issue. The writer being in a failed retryable state prior to running diskshadow should not be an issue and should not cause diskshadow to fail. If the
    writer is found in this state – and diskshadow fails – then it’s possible there is an issue with core VSS that needs to be investigated.

    TIMMCMIC

  11. TIMMCMIC says:

    @Adam…

    So let’s take stock of what we know. The VSStester leverages disk shadow…and we know that we can create a snapshot and present it to the OS. This is good – diskshadow completely exercises the VSS framework. So…we know we can create a snapshot, that volsnap
    can mount it, and that we have access to it via the operating system.

    Now the event IDs you posted here could be generic – if you want to post specific ones I’ll be more than happy to take a look at them.

    The snapshot process is not that difficult – to understand why it’s failing we need to understand where it is in the process.

    1) Create the snapshot.
    2) Perform consistency check.
    3) Transfer data.

    TIMMCMIC

  12. ItalianDutch75 says:

    Tim,
    what you are referring to are corner cases that you may have had with badly written VSS requestors;
    however my experience lists hundreds of cases per year with a well written requestor which won’t be able to backup exchange simply because one of the exchange writers (‘Microsoft Exchange Writer’ or ‘Microsoft Exchange Replica Writer’) decided to sit in a bad
    state and only a service restart, enable disable, or in many cases an exchange server reboot would resolve the problem for a good while (days or months).
    Our VSS provider and VSS requestors are designed according to Microsoft guidelines on
    http://msdn.microsoft.com/en-us/library/aa384615%28v=vs.85%29.aspx, so in accordance with what you describe should be the correct way.

    I guess the reason why I keep replying to you is that I don’t like to see a well written blog with such a big flaw in it and I am referring to when you write about "the retryable error is re-tryable and the only thing to do is go to the third party software vendor".
    Although that may be the case for some other corner cases (that I haven’t seen) that’s definitely wrong for many others and it makes all backup vendors (including Microsoft partners) unnecessarily look bad.

    You also mentioned that in those cases end users can also seek help at Microsoft and that’s great, but I guess the ask is to add some more useful tips on how else to resolve the problem and add a more detailed description of the error. the error is not a retryable
    error, not in my experience, and you always need to restart some services or reboot to resolve it. if you look this error up on google, you find tons of hits from people who resolved the case this way. What you described here so far is just a corner case that you
    have experienced and don’t get me wrong, that’s fine, as long as you list the other thousand cases which are generated because of other more valid reasons 😉
    and just to be clear, I am referring to cases where a windows backup software backup of the same database copy FAILS too.
    hope that helps getting my point through this time.

  13. TIMMCMIC says:

    @Domenico

    No problem – most customers locate this thread on their own without needing direction…

    Should you want to further a discussion offline I can be contacted through the blog.

    TIMMCMIC

  14. Anonymous says:

    @Habibablby:

    Without the full application log it's going to be hard to predict why the freeze of Exchange is failing.

    I would suggest opening a case with PSS if you could.

    TIMMCMIC

  15. TIMMCMIC says:

    @Domenico:

    I guess we will need to agree to disagree. At the time this blog post was written I’d say this was far from a corner case.

    To your broader point though – when diskshadow / betest / and windows server backup fail there is plenty of evidence that this is not a third party problem per se. (I find a lot of times we introduce other issues not related to the original question just trying
    to get some of these things to work).

    I’ll still stand by what I’ve said here though – the retryable error is not what needs to be fixed.

    TIMMCMIC

  16. TIMMCMIC says:

    @Domenico:

    I appreciate the feedback. Still though -> an exchange writer that is in a failed retryable state still does not need to have services restarted in order to “fix” something. Take an example that I see pretty regularly. We’re using third party software X. Third party software X experiences a failure to communicate with a central server. This causes that backup to “fail” and leaves the writer in a “failed retryable state”. I then take that same product and I attempt to perform a backup with third party software X -> and it’s successful! Yes – the operation is completely successful when the writer was originally found in a failed retryable state. And yes – in many cases a writer in failed retryable has nothing to do with VSS itself and most likely should be referred to the third party backup software.

    You can also observe the same thing using the VSS tester. In many cases when the writer is failed retryable the diskshadow / vss tester script will work just fine and a backup is taken successfully. Further evidence that a failed retryable writer is not necessarily something that needs to be fixed.

    Now – there are several cases where the writer is failed retryable and subsequently all backup attempts fail not only with third parties but also with the vss tester script. This usually indicates some form of legitimate issue within Exchange or VSS.

    TIMMCMIC

  17. TIMMCMIC says:

    @Armando….

    It’s always the backup software 🙂

    Honestly in my experience I almost always use the VSS tracing from the operating system. Most errors that have anything to do with Exchange are apparent in the logs. If you get the properties of the server, under MSExchange IS you’ll see entries for VSS where
    you can change them to MAX. This will give you app log events that show the process in more detail.

    The best thing you can do is look at the expected event sequence that Exchange produces and match that up to the VSS workflow. I find when working with vendors we fail to have a common understanding. For example, collect metadata, call on prepare for X components,
    this triggers exchange to do Y event, etc.

    TIMMCMIC

  18. TIMMCMIC says:

    @Adam

    ok, i see. so just to confirm, in case no backup is being executed the query get-mailboxDatabase -status | Fl *backup* will not show any database being backed up

    *Correct – if no backup is in progress there should be no backup showing TRUE.

    could potentially rule out an issue with the 3rd party backup vendor, yes

    *yes and no. You need to get yourself to a steady state first -> no backup in progress set to true and then the writers in a stable status. Only then can you answer this question.

    btw. just to get another thing right because it looks like it is not a 100% easy question – does Windows Backup support backup of passive copies in an Exchange 2010 – 2013 DAG ?

    *2010 = no
    *2013 = yes

    TIMMCMIC

  19. ItalianDutch75 says:

    @ TIMMCMIC,
    if you really are here to help, please post some advice for all end users who are hitting the VSS_E_WRITERERROR_RETRYABLE error.
    Simply telling users to go to the third party backup software vendor will not solve the problem, it will delay resolution and it will make lots of people feel frustrated, including yourself as, in the end, they will likely come back to you.
    I have 15 years experience working with email databases like lotus notes and exchange on windows and as I am working at a Backup Software company I learned how to resolve Microsoft issues, issues which Microsoft should resolve themselves.
    What customers need is an answer from Microsoft as to what they need to do to resolve the case, now that’s what I call a helpful attitude. Pushing back is not what I call helpful.
    Thanks.

    Domenico.

  20. TIMMCMIC says:

    @Domenico:

    Thanks for taking the time to comment. Considering your stated experience I thought you might appreciate the actual issue that is being highlighted in this article.

    When this article was first authored it was written for two reasons:

    1) Highlight a legitimate issue that existed in third party backup software.
    2) Explain what it means to have a retryable writer.

    At the time, on a weekly basis we were seeing customers complaining that they had to restart the information store or replication service every time they had a failed backup. The failures had nothing to do with the Exchange or Windows infrastructure – and were
    mostly linked to the inability to commit to media servers or agents losing connections etc. So of course when this happens the backup fails and the writers go into a failed retryable state. The real complaint here was not why did the backup fail (most customers
    knew this) but why would Exchange / Windows force us to clear a writer to healthy before allowing us to back up again. In these instances windows server backup or diskshadow would back up without an issue, clearly demonstrating the ability to exercise VSS and
    take a backup, yet third party products were failing. What was discovered in this investigation was what is highlighted in this article as an issue – the VSS calls in question were being made out of order. The future success of backups should not be determined
    by the gather writer status prior to issuing an on prepare. The gather writer status should occur after the on prepare (allowing us to determine if the writer went from failed / retryable to preparing -> in which case VSS has successfully started).

    Ironically we still see advice to customers that indicate restarting services to reset writer status is a necessary pre-requisite for backups to function.

    So as to the issue that is documented here – although it is not necessarily running wild any longer (I cannot think of a vendor VSS log that I’ve seen this in for quite a while) it is nonetheless still legitimate.

    To the overall concept of what it means to have a writer that is retryable I think this information is also still valid. The prevailing idea that just because a writer is retryable means something is broken and we need to fix that in order to be successful
    is not correct.

    Feel free to contact me through comments or the contact link on the blog if you’d like to discuss further.

    TIMMCMIC

  21. Anonymous says:

    @AdamJ:

    The answer is that it’s very circumstantial and would depend on the specific error of the writer as well as what the state of the Exchange portion of the backup was left in.

    TIMMCMIC

  22. Anonymous says:

    @Marianne Rooney :

    Thanks for posting some feedback that definitely confirms my point.

    This error in my experience is usually generic – and sometimes it most likely does not reproduce. By itself I do not have an explanation without looking further in the logs at anything around it.

    To your restore tab – this is not uncommon. In many cases the database and log files are backed up and this error is thrown when calling backup complete. So you actually have a valid restoration point from the perspective of the software but not from Exchange
    (ie – backup complete was never successfully called). I’d bet you’d be most likely ok restoring that backup. This happens a lot too when multiple databases are included in a backup job, and it failed on 3 of 4 – the previous 2 already committed to media are
    still restorable from that job even though backup complete was never successfully called on them.

    TIMMCMIC

  23. TT says:

    Hi…so you are suggesting the logic needs to be fixed by the backup vendor rather than MS? I believe vendors have worked together with MS before releasing the product and most likely they are already aware of the issues/fixes? This affects all backup products…

  24. TT says:

    but most of the time after fixing that into stable/no error it works.. 🙂

  25. andreas says:

    @TIMMCMIC: What then, if TSM TDP backup fails consistently for weeks with the following error:

    ANS5261W An attempt to create a snapshot has failed.

    ANS1327W The snapshot operation for 'INT-EXCHDB-01Microsoft Exchange Writer{76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}Mailbox Database 09c2244697-3994-4587-8234-e2bc9bbd4e79' failed with error code: -1.

    Clearly, the VSS "Last Error" message indicates that something isn't working as intended.

    Writer name: 'Microsoft Exchange Replica Writer'

      Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}

      Writer Instance Id: {15642e98-5677-43ce-9a56-8dd1a6f32745}

      State: [7] Failed

      Last error: Retryable error

    Writer name: 'Microsoft Exchange Writer'

      Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}

      Writer Instance Id: {bddc9190-b6e2-4a64-b07b-58280cd7de33}

      State: [7] Failed

      Last error: Timed out

    What is the backup application supposed to do other than report the problem and ask Microsoft to please fix their product so it doesn't fall over every time someone tries to use it in the real world?

  26. martola says:

    Hi TIMMCMIC,

    what do you mean when you say "The volume is defragmented (being exacerbated by item 1 in this list)."

    did you mean the volume is fragmented?  Please let us know.

  27. Habibalby says:

    Been with this for awhile and now again it's happened on my Exchange 2007 and Veeam Backup.

    11/16/2012 8:43:20 PM :: Unable to release guest. Error: Unfreeze error: [Backup job failed.

    Cannot create a shadow copy of the volumes containing writer's data.

    A VSS critical writer has failed. Writer name: [Microsoft Exchange Writer]. Class ID: [{76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}]. Instance ID: [{8ea7190d-337c-448f-b264-3401303b586b}]. Writer's state: [VSS_WS_FAILED_AT_FREEZE]. Error code: [0x800423f2].]

  28. Domenico says:

    Tim,
    First of all, thanks for the great article.

    the frustrating problem people face and which I felt is not ‘spelled out’ well in this article is that when the VSS writer goes into a retryable state, you can try as many times as you want, but backup will fail. So it is not what it says it is, if you know what I mean. You seem to be under the impression that the error should be taken literally, but if you do have some backup experience, I am pretty sure you will be disappointed and find out that it isn’t retryable as it says it is.

    Backup may fail for whatever reason and we are not asking VSS to fix all backup related issue, of course.
    I think the problem with the VSS writer is that even if you have fixed the root cause for the writer to be in an error state, the writer won’t allow you to back up exchange.

    for the VSS requestor software makers, this is a problem because customers keep coming to the backup application software and they want an answer from them. What customers need is an answer from Microsoft as to why you need to restart VSS writer related services to clear the state.
    So, my experience with Exchange is that ONCE Microsoft Exchange Writer goes into a bad state, it is not possible to get out of it without restarting a bunch of microsoft exchange related services.
    Now, it seems you have been focusing on the reason why this happened in the first place.
    That’s good so you can prevent the same problem from happening next time.
    but it will not clear the state of the writer.

    hope that clarifies this for all readers.

    Domenico.

  29. 8bit_pirate says:

    Okay, but what if it’s DISKSHADOW that is having this issue?

  30. Geezman says:

    Getting the Retryable Error always on the same DB. When i restart the Replication Service, i am able to back up the DB but during the nightly backup schedule always the same DB gets hung on retryable. Why?

  31. Armando says:

    Hi Tim, you mentioned "..assisting with both Exchange and OS VSS tracing". It appears it is possible to trace VSS just for Exchange? If that’s so, can you please tell me or point me to a procedure that explains how to perform Exchange VSS tracing? Thanks
    in advance!

  32. Armando says:

    Thanks for your quick response. The reason I ask is because I work doing support for a backup software that protects Exchange and other Microsoft applications integrated with VSS and I was hoping I could learn better ways to troubleshoot backup failures
    related to VSS where the symptoms are not clearly pointing at either side. I am aware of how to set up VSS tracing in general, however, since you separated Exchange tracing from OS VSS tracing, I thought there was a difference. I’ll explore using an increased
    diagnostic logging level though. Regards!

  33. adam says:

    hi.

    ok. so a failed retryable error is nothing bad. lets consider the following scenario. i have an Exch 2010 fully patched. the Exchange Writer is stable without any errors (checked with vssadmin list writers). i take diskshadow.exe to utilize the VSS functionality
    on my Exchange DB and Logs drives. all is "good", all drives get exposed in Windows Explorer but still i see the same errors in Event Viewer as at the time when backing up with a 3rd party sw:

    ID 9840 ID 9814 ID 2007 and ID 9609

    so there’s clearly something going wrong ?

    used VSSTester.ps1 to troubleshoot but ExTRA logs can’t be just simply investigated by non – Microsoft technicians, see:

    http://support.microsoft.com/kb/971878

    well, it’s easier to say "it’s not us ? " huh 😉

    but let’s be honest. there are no really good KB’s etc. describing how to troubleshoot those issues – sure we can pay for MS support but from my experience i would expect MORE than what i used to get in the future 😉

    still looking forward for an answer.

    regards

    adam

  35. adam says:

    * i meant in the past 😉

  36. adam says:

    hi,

    wow i’m more than positively surprised by such a quick response.

    the event ID’s generated during the DISKSHADOW.exe test are as follows (my OS is in German, so i will try to translate or attach an MS KB to each event):

    ID 9840 Source MSExchangeIS

    An attempt to prepare the storage group ‘Normal’ for backup failed because the storage group is already in the process of being backed up. The error code is -2403

    http://www.microsoft.com/technet/support/ee/transform.aspx?ProdName=Exchange&ProdVer=8.0&EvtID=9840&EvtSrc=MSExchangeIS&LCID=1033

    ID 9814 Source MSExchangeIS

    Exchange VSS Writer (instance a3bde017-6a6c-45ad-be73-43511e2eab56:33) failed with error code -2403 when preparing the database engine for backup of Database "Normal"

    ID 2007 ESE

    Shadow copy instance 32 aborted.

    ID 9609 Source MSExchangeIS

    Error (instance 3202f545-e55e-4245-b32a-0d26a1e20f6b:32) when preparing for snapshot. Error code -2403

    i did net helpmsg 2403 – This share name or password is invalid.

    But what’s strange after i executed diskshadow the drive where the "Normal" DB is stored at, was exposed but as a RAM instead of a disk ??? never seen something like this….

    Plus i have 2 other DB’s on this drive and those can get backed up with a 3rd party software but it fails on the "Normal" database.

    with vsstester.ps1 the "backup" of the Normal DB failed also:

    http://blogs.technet.com/b/exchange/archive/2013/04/29/troubleshoot-your-exchange-2010-database-backup-functionality-with-vsstester-script.aspx

    i’m not sure if those errors are generic in this case.

    any help welcome !

    regards

    Adam

  37. adam says:

    well it’s a VMware VM (Exchange 2010 standalone). the VM is additionally getting backed up, or at least we could see multiple snapshots created during day time (at the same time we executed our scheduled backups) … but according to the customer they are now disabled.

    what about scenarios where the full is successful but the incremental keeps on failing (CL disabled already)..

    thank you !

  38. adam says:

    ok, i see. so just to confirm: in case no backup is executed and the query Get-MailboxDatabase -Status | fl *backup* does not show any database being backed up, i could potentially rule out an issue with the 3rd party backup vendor, yes ? or from your experience is that not the case ?

    btw. just to get another thing right, because it does not look like a 100% easy question – does Windows Backup support backup of passive copies in an Exchange 2010 – 2013 DAG ?

    many thanks TIMMCMIC !

  39. adam says:

    ok. i would like to thank you for now and i need to say that it’s the very very first time i got some serious and helpful hints from MS, which is sad but good on the other hand 🙂

    if i’ll have any other queries i will post again. hope you won’t mind !

    regards

    Adam 🙂

  40. adam says:

    hi TIMMCMIC

    Me again. I was thinking about what you wrote about the steps you mentioned yesterday:

    1) Gather metadata.
    2) Call preparation
    3) Prepare snapshot
    4) Present snapshot
    5) Validate snapshot
    6) Copy snapshot to media
    7) Invoke backup complete

    Should I see in the debug logs of the 3rd party backup SW this line :

    IVssBackupComponents::BackupComplete.

    I’ve found this :

    “The requester indicates that the backup has completed by calling IVssBackupComponents::BackupComplete.”

    In this place :

    http://msdn.microsoft.com/en-us/library/aa384323(v=vs.85).aspx

    actually what I would like to know if any 3rd party backup SW would use those commands ?

    thanks for confirmation !

    Regards

    Adam
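The seven steps quoted in the comment above can be written down as an ordered sequence, which makes it easier to reason about where a job stopped. The following is only a minimal sketch: the names `BackupStage` and `describe_outcome` are hypothetical and do not come from any backup product or from the VSS API. It assumes, as discussed in this thread, that data already copied to media may be restorable from the backup software's point of view even when the final backup-complete call never succeeded.

```python
from enum import IntEnum


class BackupStage(IntEnum):
    """The seven steps from the comment above, in order (names are illustrative)."""
    GATHER_METADATA = 1
    CALL_PREPARATION = 2
    PREPARE_SNAPSHOT = 3
    PRESENT_SNAPSHOT = 4
    VALIDATE_SNAPSHOT = 5
    COPY_TO_MEDIA = 6
    BACKUP_COMPLETE = 7


def describe_outcome(last_completed: BackupStage) -> str:
    """Summarize what a job that stopped after `last_completed` likely left behind."""
    if last_completed >= BackupStage.BACKUP_COMPLETE:
        return "backup complete was called; Exchange regards the backup as finished"
    if last_completed >= BackupStage.COPY_TO_MEDIA:
        return "data is on media, but backup complete was never confirmed"
    return "job stopped before the snapshot was copied to media"


if __name__ == "__main__":
    for stage in BackupStage:
        print(f"{stage.value}) {stage.name}: {describe_outcome(stage)}")
```

In a requester's debug log, the last stage would correspond to the `IVssBackupComponents::BackupComplete` call mentioned in the comment; whether a given product logs that call by name is up to the vendor.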

  41. MD2000 says:

    I’m still confused. If the VSS *is* retryable, but retrying won’t get you a backup – what’s the solution? You’re saying the Backup software is misunderstanding the message. It seems to me what’s needed then is a way to clear the warning(?) so the backup
    software thinks the VSS is Ok. How would one clear this warning so the VSS says No Error?

  42. Joshua says:

    Thanks for this post. Old but still great.

  43. AdamJ says:

    hi TIMMCMIC, Thanks for your time on this! So as I understand it, if you find VSS in an error state after a failed backup and, without having to manually intervene with services, the subsequent backups are ok, then the backup software vendor is at fault. However, if the VSS is in an error/failed/timeout state following a failed backup and no subsequent backups will succeed unless the VSS related services are restarted etc., then that is a VSS / MS error? Is that correct?
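For anyone who wants to compare writer state before and after a backup run, the output of `vssadmin list writers` can be checked with a few lines of script. This is a rough sketch, not a supported tool: `parse_writers` and `needs_attention` are hypothetical helper names, and the failed entry in `SAMPLE` is an assumed illustration of what a state 8 writer looks like, modeled on the output shown in the post.

```python
import re

# Illustrative text only: the second entry follows the post's sample output,
# the first is an assumed example of a writer left in a failed state.
SAMPLE = """\
Writer name: 'Microsoft Exchange Replica Writer'
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   State: [8] Failed
   Last error: Retryable error

Writer name: 'Microsoft Exchange Writer'
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   State: [1] Stable
   Last error: No error
"""


def parse_writers(text):
    """Return one dict per writer with its name, numeric state, state text and last error."""
    writers = []
    for raw in text.splitlines():
        line = raw.strip()
        name = re.match(r"Writer name:\s*(.+)$", line)
        state = re.match(r"State:\s*\[(\d+)\]\s*(.*)$", line)
        error = re.match(r"Last error:\s*(.+)$", line)
        if name:
            # Strip straight or typographic quotes around the writer name.
            writers.append({"name": name.group(1).strip("'\"‘’"),
                            "state_code": None, "state": None, "last_error": None})
        elif state and writers:
            writers[-1]["state_code"] = int(state.group(1))
            writers[-1]["state"] = state.group(2)
        elif error and writers:
            writers[-1]["last_error"] = error.group(1)
    return writers


def needs_attention(writers):
    """Writers that are not in state 1 (Stable) or that report a last error."""
    return [w for w in writers
            if w["state_code"] != 1 or w["last_error"] != "No error"]


if __name__ == "__main__":
    for w in needs_attention(parse_writers(SAMPLE)):
        print(f"{w['name']}: [{w['state_code']}] {w['state']}, last error: {w['last_error']}")
```

To use it against a real server, the command output could be redirected to a file from an elevated prompt and passed to `parse_writers` in place of `SAMPLE`.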

  44. Marianne Rooney says:

    Wow, you wrote this three years ago…talk about beating a dead horse BUT – I got the error this morning – ha ha ha! From my logs:
    This is from the backup software log:
    2015-05-02 18:29:34 avexvss Error <10976>: ERROR: Selected writer ‘Microsoft Exchange Replica Writer’ reported an error!
    – Status: 1 (VSS_WS_STABLE)
    – Writer Failure code: 0x800423f3 (The writer experienced a transient error. If the backup process is retried,
    the error may not reoccur

    From Windows AppEventLog, same date and time:
    Microsoft Exchange VSS Writer instance 3f075e65-5ac3-4046-8afe-77cf83402077 failed with error C7FF1004. No log files were truncated for database ‘DB01’.

    The Microsoft Exchange Replication service VSS Writer (Instance 3f075e65-5ac3-4046-8afe-77cf83402077) failed with error C7FF1004 when processing the backup completion event.

    This happened Friday night, and come Monday morning, all of my writers on this server show "Stable, no error".

    AND, later on Friday night there WAS a successful backup of our Public Folders – which are on the same server.

    This goes to show that you are absolutely correct: after the backup of DB01 failed on this server, the associated VSS Writers must have shown an error and failed status. And an hour later, along comes the backup software, to the same server, looking to back up the Public Folders, and no problem, Public Folders are backed up successfully.

    However, the failure of DB01 mystifies me, BUT – I just want to point out that The Exchange Replication VSS Writer must have shown that there was an error, and along came our backup software, and ignored it, then made a successful backup.

    Interestingly, my "Restore" view of my backup and recovery system shows DB01 available for restore from a May 2nd backup – so – some data was indeed backed up. I hope I don’t have to find out what wasn’t…
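The writer failure code in the backup log above, 0x800423f3, is one of the standard VSS writer error values defined in the Windows SDK header `vsserror.h`. A small lookup like the one below can turn such codes into their symbolic names when reading third-party logs. It is a sketch under assumptions: the function name `failure_name` is hypothetical, the table is limited to a handful of writer errors, and the regular expression assumes the "Writer Failure code:" wording from this particular log.

```python
import re

# A few writer failure codes and their symbolic names from vsserror.h.
VSS_WRITER_ERRORS = {
    0x800423F0: "VSS_E_WRITERERROR_INCONSISTENTSNAPSHOT",
    0x800423F1: "VSS_E_WRITERERROR_OUTOFRESOURCES",
    0x800423F2: "VSS_E_WRITERERROR_TIMEOUT",
    0x800423F3: "VSS_E_WRITERERROR_RETRYABLE",
    0x800423F4: "VSS_E_WRITERERROR_NONRETRYABLE",
}


def failure_name(log_line):
    """Extract the hex failure code from a log line and return its symbolic name.

    Returns None when the line carries no 'Writer Failure code:' field.
    """
    match = re.search(r"Writer Failure code:\s*(0x[0-9a-fA-F]+)", log_line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    return VSS_WRITER_ERRORS.get(code, f"unknown (0x{code:08x})")


if __name__ == "__main__":
    line = "– Writer Failure code: 0x800423f3 (The writer experienced a transient error."
    print(failure_name(line))
```

Note that in the log above the writer status is reported as stable while the failure code is retryable, so a check that looks only at the state field would miss it.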

  45. sarah says:

    "If you get the properties of the server, under MSExchange IS you’ll see entries for VSS where you can change them to MAX. This will give you app log events that show the process in more detail."

    what do you exactly mean with this ?

  46. Tony P says:

    Hello TIMMCMIC

    Just stumbled across this while looking at a MS Exchange FULL backup to TSM which has been failing, and have been reading through the entire blog and your post at 4 May 2015 10:44 PM caught my eye.

    Really struggling to find out why the FULL backup is failing, but decided to check on Exchange DB’s backed up to TSM available for restore, and I see ALL DB’s available for restore even though the backup reported as failed.

    The backup log shows

    Files examined 38521
    Files completed 38500
    Files failed 21

    So unsure what to believe here as to whether my backups are consistent, with some files failing.

    _________________________________________________________________________________

    To your restore tab – this is not uncommon. In many cases the database and log files are backed up and this error is thrown when calling backup complete. So you actually have a valid restoration point from the perspective of the software but not from Exchange (i.e. backup complete was never successfully called). I’d bet you’d most likely be ok restoring that backup. This also happens a lot when multiple databases are included in a backup job and it failed on the 3rd of 4 – the previous 2 already committed to media are still restorable from that job even though backup complete was never successfully called on them.