Exchange Mailbox Storage Bricks

This time I wanted to respond to one of the comments from my last blog post: "... apart complementing Perry for a great presentation ... a common mistake done by many companies is mixing two concepts, such as Data Protection and Service Continuity..."

It is a great comment (and not just because he has such great taste in presentations).  The interesting question is whether Exchange has been guilty of muddling some of these concepts in our thinking around storage.  Depending on the context of the charge, we either plead enthusiastically GUILTY! or huffily NOT GUILTY!

What do I mean by that?  I think it is very important for everyone in this space to think through scenario requirements quite rigorously.  Without a clear understanding of requirements and their strict prioritization, it is unlikely that a particular deployment will actually meet those needs.  The biggest enemy of clean, prioritized requirements is fuzzy thinking about needs: confusing service availability with data loss; conflating the smoking-hole datacenter implosion with planned datacenter downtime; listing backups when the actual requirement is retention.  In this context I think the Exchange team has done a pretty good job of teasing out and separating the underlying requirements that customers have, very often by reverse engineering them from the many interesting deployments customers have created.

The other context is implementation.  In order to take cost and complexity out of a system, it is important to look at the requirements and find commonalities that can be solved with a single elegant architecture.  Sometimes, though, it is much better to implement solutions in refined and elegant but stand-alone ways: the automatic egg-cooking, coffee-making toaster does have some common elements, but it should largely be viewed as a Rube Goldberg monstrosity.  There is also the opposite failure, the expensive monstrosity that accretes from continuously bolting together one-off pieces to solve many different problems rather than stepping back and implementing a comprehensive solution with fewer moving parts.  In this context we have been deliberately guilty of muddying the boundaries between solutions for the large set of requirements that exist in the storage space.

The central architectural concept that has been our guiding light over the past two releases is the Exchange Mailbox Storage Brick. The core idea is borrowed from Jim Gray ( https://www.usenix.org/publications/library/proceedings/fast02/gray/index.htm and https://research.microsoft.com/apps/pubs/default.aspx?id=64151 ): essentially, that a simple scale-out model, built from a basic storage+compute building block that provides linear scaling characteristics and relies on shared-nothing clustering, is the best approach for cost-effectively deploying storage for email servers.
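To make the building block concrete, here is a minimal sketch of the pattern. It is plain Python with invented names (Brick, BrickCluster, deliver), not Exchange code, and a real deployment would locate mailboxes through a directory rather than a hash; the point is simply that each brick pairs compute with its own local storage, and you grow the system by adding bricks:

```python
import hashlib

class Brick:
    """A hypothetical storage brick: one server that owns its disks outright.
    Compute and storage scale together; no shared fabric sits behind it."""
    def __init__(self, name):
        self.name = name
        self.mailboxes = {}  # mailbox id -> list of messages, on local disks

    def deliver(self, mailbox, message):
        self.mailboxes.setdefault(mailbox, []).append(message)

class BrickCluster:
    """Scale out by adding bricks; each mailbox lives on exactly one brick."""
    def __init__(self, bricks):
        self.bricks = bricks

    def brick_for(self, mailbox):
        # A stable hash stands in for the mailbox-to-brick directory here.
        digest = hashlib.md5(mailbox.encode()).digest()
        return self.bricks[int.from_bytes(digest[:4], "big") % len(self.bricks)]

    def deliver(self, mailbox, message):
        self.brick_for(mailbox).deliver(mailbox, message)

cluster = BrickCluster([Brick(f"brick-{i}") for i in range(4)])
cluster.deliver("alice@example.com", "hello")
```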

There are many benefits to this model, but the one I am going to stress here is the win from the shared-nothing aspect, a win you only get by combining service continuity and data protection.  In the shared-storage model there is a clean separation between server failover and data protection, but the result is a very complicated system once you consider the storage network needed to maintain the multiple links between each compute node and each piece of storage.  On top of that, a huge amount of complexity has to be built into the system to guarantee that never, ever, ever do we allow more than one compute node to write to the same unit of storage.  In the shared-nothing world, compute nodes are tied to their storage directly and exclusively.  All the complexity around the fabric goes away, and you are architecturally guaranteed never to have more than one writer to a database.
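Continuing the hypothetical sketch above, the single-writer guarantee can fall out of ownership rather than out of a fencing protocol: if the database object is created by, and visible to, exactly one brick, a second writer cannot even be expressed.

```python
class Database:
    """A mailbox database bound to its one owning brick for its lifetime."""
    def __init__(self, owner, name):
        self.owner = owner
        self.name = name
        self.log = []

class Brick:
    def __init__(self, name):
        self.name = name
        self.databases = {}

    def create_database(self, db_name):
        # Only this brick ever holds a reference to the Database object, so
        # "two nodes writing one database" is unrepresentable: no SAN zoning,
        # no distributed lock manager, no fencing needed.
        self.databases[db_name] = Database(self, db_name)
        return db_name  # callers get a name, never the object itself

    def write(self, db_name, record):
        self.databases[db_name].log.append(record)

brick = Brick("brick-1")
handle = brick.create_database("MDB01")
brick.write(handle, "message 42")
```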

There is one further win from the brick model that is in many ways the most important, and it is about the nature of redundancy.  If you have FULLY independent systems, the availability of the overall system can be calculated very simply: you just concatenate 9's, because the system is only down when every copy is down at once, and independent failure probabilities multiply.  For example, the availability of at least one of 3 fully independent systems, each with a relatively moderate availability of 99%, is 99.9999% (the chance of all three failing together is 0.01 × 0.01 × 0.01 = 0.000001).  In most real-world cases things are not fully independent, so your mileage will vary dramatically; for dependent chains your availability is no better than that of the weakest link.  The challenge is to find ways to make things truly independent.  The brilliance of the brick model is that the individual nodes end up much more independent, and nodes that are not individually very reliable can then be combined into an extremely reliable system.  In the shared-storage world you simply do not get this independence benefit.
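The arithmetic is worth seeing in one place; this is nothing Exchange-specific, just the standard redundancy math in Python:

```python
def parallel_availability(a, n):
    """n fully independent replicas: the service is up if at least one is.
    It is down only when all n fail simultaneously, so failures multiply."""
    return 1 - (1 - a) ** n

def series_availability(parts):
    """A dependent chain is up only when every link is up, so availability
    is the product of the links and is never better than the weakest one."""
    result = 1.0
    for a in parts:
        result *= a
    return result

print(parallel_availability(0.99, 3))      # ~0.999999: three 99% copies give six 9's
print(series_availability([0.999, 0.99]))  # ~0.98901: worse than either link alone
```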

I'll be posting another video next week to answer some of the other questions from the data protection blog post - stay tuned!

Perry