Why doesn't Office just fix all of the bugs before they ship it?

This is the first in a series of 3 related blog entries that will speak to common questions about how we develop and release the next version of Office. Specifically, I’ll address the following topics:

1) Why doesn’t Office just fix all of the bugs before they ship it?

2) How does the ‘triage’ process (used to determine which bugs we should fix) work?

3) Why does Office choose to sustain the products the way that they do?

Part 1: So why doesn't Office just fix all of the bugs before they ship it?

Let’s start with the obvious. We don’t find them all. I may not be the brightest knife in the deck, but I do recognize that and I’m sure it doesn’t come as a big surprise to you either. But why is that the case? I can assure you that it is not due to a lack of effort on our part. Nor is it due to insufficient processes in place to find those bugs. Our software development process pairs developers with testers on day 1 and prioritizes finding and fixing bugs more than anything else throughout the entire process.

 

Now, ask any exterminator worth their mettle and they will tell you that bugs are hard to find. Fortunately, our testers are exceptionally clever. They will find just about every “easy” bug and they will find the vast majority of the “difficult” bugs. Exposing some bugs to the light of day, however, may require using the feature in ways that our development team could never have anticipated. The millions of customers using Office will accomplish their everyday tasks via a daunting range of options. As a result, they will do stuff with our products that we never imagined. While we may have thought we were creating a spiffy new kind of hammer, customers may end up using it more like a serving utensil and ask us why we designed a fork like that. The scenarios are infinite and it would be truly impossible to represent each and every one of them. Even for exceptionally clever testers.

Related to that, software bugs are not as easy to identify as their counterparts in the wild. What one user would consider faulty behavior is perfectly normal and expected behavior to another. There are countless discussions between the development team members on whether or not a given behavior is “by design”. Disagreements are common because there are different perspectives. We do our best to represent what we think will be the most common customer perspective, using data from previous releases, beta releases and numerous customer visits. In the end, representing that perspective is a fickle thing (something about one man’s treasure being another man’s garbage comes to mind) and our conclusions are not always accurate.

And finally, one thing that’s true of all bugs is that they have excellent survival skills. When you kill them, they find a way to come back. Even worse, they can morph into different, more annoying bugs. In software development terms, that’s called a regression. Fixing a bug requires changing code (I have an astounding grasp of the obvious). Changing code introduces the potential for a new bug in a different scenario. Some regressions are worse than the original bug that was fixed. That could be like trading an ‘ant problem’ for a ‘termite problem’. One is a nuisance, the other is disastrous. Generally speaking, it’s a good idea to avoid disasters.

Trying to avoid disaster is another reason our products ship with bugs. That may seem counterintuitive, but to put it bluntly, due to the reality of regressions, if we were to fix every bug we found, we would never ship. That would be a disaster. And so we will always be faced with this reality - if we are to ship a high-quality product on time, we will need to make some very difficult decisions.

For example, if it were possible to fix every bug (and it’s not), is that the most important thing we could do? Or is it more important for customers to be able to take advantage of our new features more than once every 20 years? Is it more important that OEMs, ISVs, and retailers are able to count on our announced ship dates? Is it more important that we respond to competitive functionality in a timely manner? There are always lots and lots of tradeoffs to consider.

Some tradeoffs are obvious, like overall quality being more important than the ship date. If quality is the issue, we will slip the ship date. We’ve done this before (maybe you’ve noticed…). But the vast majority of decisions are not obvious. The tradeoff between fixing a seemingly obscure bug and shipping on time becomes increasingly difficult to make the closer we get to our advertised ship date.

Maybe a real-world example of how this process can look would prove helpful. After we released Office 2007, a customer found an issue that was a big deal to them (garbage, not treasure). Specifically, they found that opening a PowerPoint presentation that contained multiple links to Excel files took a lot longer because of the new way each link updated (we were keeping track of more metadata for each file). Generally speaking, this doesn’t affect many people, but due to the nature of this customer’s presentations it was a big hit to them. They requested that we fix the bug as a hotfix. We did, but sometime later it was discovered that the fix had introduced a new bug that caused a crash when running VBA code in a presentation with external links. That bug was definitely worse, and fixing it jeopardized even more scenarios, so we decided to back out the original fix and helped the customer determine a suitable workaround for the performance issue.

Our goal is to make the best decisions we can in each and every case with all these variables in mind. To accomplish that goal, we ‘triage’ all bugs to determine whether or not fixing them is the best thing to do. Consistently making that decision well is a complicated process to be sure. It requires the perspective of numerous experienced development team members arguing for and against any given change. More often than not, we make the right choice. Sometimes we make a mistake and choose to fix a bug that introduces a worse one – one that goes undetected. Sometimes we miss the customer scenario that truly captures the essence of the bug and choose not to fix a bug that we should have. While definitely not perfect, the triage process is a good one and will be the topic of part 2 in this series of blogs.

The summary is this. We will move the ship date if quality is the issue, but we will work hard to meet our advertised ship dates. We expect to ship each new release of Office with no bugs that have a noticeable customer impact. Unfortunately, we don’t find all of the bugs that our customers will, and we will incorrectly triage some bugs as inconsequential when in fact they create a significant issue for our customers. As a result of these two realities, we have worked hard to become a world-class ‘servicing’ organization, providing necessary updates to the products that we have already released. We hope that this provides the best value proposition for our customers. The topic of servicing will be part 3 in this blog series. Stay tuned…