An Update on Exchange Server 2010 SP1 Rollup Update 4


The Exchange Sustained Engineering team recently made the decision to recall the June 22, 2011 release of Exchange 2010 SP1 Rollup 4. This was not an action we took lightly and we understand how disruptive this was to customers. We would like to provide you with some details that will give you a deeper understanding of what actually happened and, more importantly, what improvements we are making to prevent this in the future.

  • Q: What actually triggered the recall?

    A: While fixing a bug that prevented deleted public folders from being recovered, we exposed an untested set of conditions with the Outlook client. When moving or copying a folder, Outlook passes a flag on a remote procedure call that instructs the Information Store to open deleted items which haven’t been purged. Our fix inadvertently caused the RPC to skip all content that wasn’t marked for deletion because we were not expecting this flag on the call from Outlook on the copy and move operations.

  • Q: Why didn’t you test this scenario?

    A: The short answer is we thought we did. We didn’t realize we had missed a key interaction between Exchange and Outlook. The Exchange team has well over 100,000 automated tests that we use to validate our product before we ship it. With the richness and number of scenarios and behaviors that Exchange supports, automated testing is the only scalable solution. We execute these tests in varying scenarios and conditions repeatedly before we release the software to our customers. We also supplement these tests with manual validation where necessary. The downside of our tests is that they primarily exercise the interfaces we expose and are designed around our specifications. They do test positive and negative conditions to catch unexpected behavior, and we did execute numerous folder copy and move tests against the modified code, all of which passed. What we did not realize is that our tests were not emulating the procedure call as executed by Outlook.

  • Q: Exchange has been around a while. Why did this happen now?

    A: In Exchange 2010 we introduced a feature called RPC Client Access. This functionality is responsible for serving as the MAPI endpoint for Outlook clients. It allowed us to abstract client connections away from the Information Store (on Mailbox servers) and cause all Outlook clients to connect to the RPC Client Access service.

    As part of our investigation, we discovered that there was some specific code added to the Exchange 2003 Information Store to handle the procedure call from Outlook using the extra flag. This code was also carried forward into Exchange 2007. But when the Exchange team added the RPC Client Access service to Exchange 2010, that code was not incorporated into the RPC Client Access service because it was mistakenly believed to be legacy Outlook behavior that was no longer required. That, unfortunately, turned out not to be the case. The fact that we were not allowing a deleted public folder to be recovered was masking this new bug completely.

  • Q: Are there other similar issues lurking in RPC Client Access?

    A: We do not believe so. The RPC Client Access functionality has been well tested at scale and proven to be reliable for the millions of mailboxes hosted in on-premises deployments and in our own Office 365 and Live@edu services.

  • Q: What are you doing to prevent similar things from happening in the future?

    A: We have conducted a top-to-bottom review of the process we use to triage, develop and validate changes for Rollups and Service Packs and are making several improvements. We have changed the way we evaluate a customer-requested fix to ensure that we more accurately identify the risk and usage scenarios that must be validated for a given fix. Recognizing the diversity of clients used to connect to Exchange, we are increasing our client-driven test coverage to broaden the usage patterns validated prior to release. Most notably, we are working even more closely with our counterparts in Outlook to use their automated test coverage against each of our releases. We are also looking to increase coverage for other clients.
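
To make the failure mode from the first two answers concrete, here is a minimal sketch. It is not Exchange code: the flag name `OPEN_SOFT_DELETED`, the `Item` type and both `copy_folder_*` functions are hypothetical, and it assumes the intended behavior is that live content is always copied and the flag only widens the set. It shows how a test written against the specification (no flag) passes on the broken code, while a test that emulates the client's call (flag set) exposes the problem.

```python
from dataclasses import dataclass

# Hypothetical flag name; the post does not name the real flag Outlook passes.
OPEN_SOFT_DELETED = 0x1


@dataclass(frozen=True)
class Item:
    name: str
    soft_deleted: bool = False  # deleted but not yet purged


def copy_folder_buggy(items, flags=0):
    """Broken variant: the flag is read as 'only soft-deleted content',
    so everything that is not marked for deletion is skipped."""
    if flags & OPEN_SOFT_DELETED:
        return [i for i in items if i.soft_deleted]
    return [i for i in items if not i.soft_deleted]


def copy_folder_fixed(items, flags=0):
    """Assumed intent: live content is always copied; the flag only
    adds soft-deleted content on top of it."""
    include_deleted = bool(flags & OPEN_SOFT_DELETED)
    return [i for i in items if include_deleted or not i.soft_deleted]


def names(items):
    return [i.name for i in items]


folder = [
    Item("budget.xlsx"),
    Item("notes.txt"),
    Item("old-draft.doc", soft_deleted=True),
]

# Spec-style test: calls the interface without the flag.
# Both variants behave the same, so this test cannot see the bug.
spec_buggy = names(copy_folder_buggy(folder))
spec_fixed = names(copy_folder_fixed(folder))

# Client-emulating test: passes the flag the way the client does.
# Only here do the two variants diverge.
client_buggy = names(copy_folder_buggy(folder, OPEN_SOFT_DELETED))
client_fixed = names(copy_folder_fixed(folder, OPEN_SOFT_DELETED))

print("spec-style  :", spec_buggy, spec_fixed)
print("client-style:", client_buggy, client_fixed)
```

In this sketch the spec-style call returns the two live items from both variants, while the client-style call returns only `old-draft.doc` from the broken variant. That divergence is the kind of gap client-driven test coverage is meant to close.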

Kevin Allison
General Manager
Exchange Customer Experience

Comments (22)
  1. Gulab Prasad says:

    I have been talking to clients and they are not happy with the way MS is releasing broken updates. The concern clients have is, why doesn't MS test these RUs before releasing them?

    I didn't have any answer for that. They say MS has 90K+ employees and still there is no proper testing of new code.

    I hope these things will change in future!

  2. Courtenay Snell says:

    Thanks for the clear and open explanation Kevin.

  3. Error503 says:

    Anybody still willing to test these alpha-quality updates? I suggest to wait several weeks to allow early adopters to discover new "features" of this piece of … well… code.

  4. I have to admit the thought of installing this update still makes me nervous, but I do genuinely appreciate the clear and detailed explanation of what happened. Contrary to what others may suggest, as an administrator I can appreciate the complexity of testing code before its release, and there are always things that get missed. That's why continued improvements are made. However, I have generally followed the idea that update rollups should be a collection of bug fixes and should NEVER introduce "new features"… save new features for the service packs.

    I would also point out that while customers are right to be upset about this type of mistake, they must also take responsibility for their own implementations of software into their environments. This is why we are encouraged to run test lab environments and thoroughly test any significant update before putting it into production, precisely because every environment is different and generates different scenarios than those Microsoft is testing. Microsoft makes it very clear in the KB articles what has changed and why, so that makes testing an update easier for us… we just need to actually do that testing.

  5. Thanks for the detailed update Kevin.

  6. D says:

    I have taken this update as I was troubleshooting a different problem.

    Should I contact support or wait this out? So far I have not had any problems with the update.

    Thanks,

    D

  7. Jim Sullivan says:

    Thanks for the clear and detailed communication Kevin. The initial reaction to these events is always one of frustration and disappointment, but we like the team's approach towards such issues. We weren't impacted by this bug and although we test all new updates to our satisfaction, we doubt we would've tested for this.

  8. Paul Betts says:

    Thank you for the concise update, better response and handling than normal mistakes by Microsoft. At least this time we got a post mortem as it were.

  9. Scotte says:

    Just wanted to add my appreciation for the clear, concise, and informative update and explanation.

  10. Daniel Wolf says:

    Thank you for the detailed and transparent update on this situation.

    Posts like this make it clear to customers you are dedicated to finding why mistakes were made and solving them.

  11. Christian Schindler says:

    Thank you for the clear and open information. This proves that the Exchange PG is a mature team. Also good to hear that you will work more closely with your peers in the Outlook team! Christian

  12. JohnM says:

    Kevin,

    I have been trying to get the interim patch for RU4 for a couple of days and the Engineer that I'm working with at MS has told me that it is no longer available as the new RU4 is going to be released in a day or two.

    Please either make the interim patch available or release the update.

    Thanks.

    john

  13. JohnM says:

    Sorry … I didn't see the update.

    Many thanks for re-releasing it and for your explanation. I have 2500 mailboxes to move on Sunday night and I've been hanging out for it. As with RU3 I had a lot of problems.

    Thanks again.

  14. Chris D. says:

    Thanks for the upfront disclosure Kevin. Hope to see the results of your efforts and have incident-free RU and service pack releases in the future. In addition to the recent RU-related incidents, the release of Exchange 2010 SP1 is another example of poor handling. Maybe you could have every single member of your Customer Experience team install the next service pack manually before shipping it?

  15. Brian says:

    While buggy patches are inconvenient, I have to shake my head at the "professionals" that blame Microsoft 100% for the problems.  Saying "why wasn't this tested?" should be asked in the mirror first, then directed to Microsoft.  What IT professional installs a major patch into a production Exchange environment without testing properly themselves in a lab environment?  

  16. Gavin says:

    @ Brian – While I agree with your comment in essence – "What IT professional installs a major patch into a production Exchange environment without testing properly themselves in a lab environment?" – I present you with the following question – What IT professional realistically has the resources of the Microsoft Exchange quality assurance division? My point is that if the Microsoft Exchange team and their 100,000 automated tests do not uncover an issue, what chance does any "IT Professional" have? I accept in this instance it can be uncovered by typical usage of Outlook, but as a general rule you'd expect to have fewer resources than the company that develops the software you pay for, otherwise you'd be writing it yourself?

    +1 for appreciating the open explanation.

    My 2c.

  17. bamideleogunmakin@hotmail.com says:

    While the situation must have brought some dire consequences to customers, one thing which has always been lacking has been the "Why". I really commend this blog for providing this information. Additionally we have the assurance such will not be expected again (even with our fingers crossed).

    I hope the same attention can be given to KB974571 for OCS R2.

  18. David says:

    After using E2K10, I've found a few bugs that I would like to report.  I don't want to open an expensive support case but I would like to be a good citizen and show how the bugs can be reproduced.  Does anyone know how I can report these bugs without being charged?

  19. Bharat Suneja [MSFT] says:

    @David: CSS (Support) doesn't charge for reporting a bug.

  20. Adam Winwood says:

    Thanks for the update; it makes interesting reading.

    Can anyone confirm that the deleted item recovery bug will be fixed in UR5 and also indicate when UR5 will be released? I have read August but it would be helpful to have an idea if that is still the case.

    I have over 60k public folders in use in production and not being able to recover deleted folders without a full DB restore is a nightmare.

  21. pesos says:

    Brand new exchange installation with ru4v2, seeing this issue on two of four dag members:

    social.technet.microsoft.com/…/339d06d6-0f40-42c2-85c9-d4fac6174d60
