Understanding Exchange Server 2007 I/O improvements from 64-bit


By now, most of you have heard that Exchange 2007 will only be supported on x64 hardware.  That decision hinged on the impressive performance gains to be had from 64-bit's access to much more RAM.  More RAM means less disk I/O for Exchange, which can translate into reduced hardware requirements in the data center.  How does it help?  I'll be glad to explain.

Database 101

First off, let’s step into the classroom and learn about databases.  Exchange is a server application built on top of a database, specifically ESE (Extensible Storage Engine), which is also the database engine for Active Directory.  Like all database engines, ESE provides tables, columns, rows, and indexes along with a transaction facility including logging, replay, isolation, and recovery. 

A common problem with databases is that they are big.  They are HUGE when compared against system memory.  For example, it's not uncommon to see Exchange installations with 500GB to 2TB of information on a given server.  This means that, for the most part, any data that is requested will be on the hard drive only and not in RAM, so reading it requires a disk I/O.  Once that disk I/O completes, the information is in RAM and doesn't require additional disk I/Os.  However, that comes at a price.  Since RAM is only so big, some other data has to 'leave' the cache; database folks call this 'eviction'.  Sometimes that's fine, as in the case of a year-old email: it's highly unlikely to be needed again, so it's probably safe to evict.  Other data is used more frequently, though.  Some examples include the top 50 items in the Inbox, the calendar and contacts, and rules.  That's a lot of data.  While eviction isn't the end of the world, it does mean we'll have to re-read the data if it is needed.  Not only does that take 1000x longer (memory is usually accessible in nanoseconds while disk is in the tens of milliseconds), but disk drives are limited in their ability to do work, usually around 100 disk I/Os per second.  Imagine keeping your phonebook at work and having to drive there every time you wanted to make a phone call.  This is why RAM is so important: it keeps commonly used data around.  The RAM used to cache database information is usually called a buffer cache.
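The buffer-cache behavior described above can be sketched as a simple least-recently-used (LRU) cache. This is a toy model for illustration only; ESE's real cache management is considerably more sophisticated than plain LRU.

```python
from collections import OrderedDict

class BufferCache:
    """Toy LRU buffer cache: frequently used pages stay in RAM,
    cold pages are evicted and must be re-read from disk."""

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.pages = OrderedDict()   # page_id -> data, ordered by recency
        self.disk_reads = 0          # count of (slow) disk I/Os

    def read(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)   # cache hit: just re-rank the page
            return self.pages[page_id]
        self.disk_reads += 1                  # cache miss: a ~10 ms disk I/O
        data = f"page-{page_id}"              # stand-in for a real disk read
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)    # evict the least recently used page
        return data

cache = BufferCache(capacity_pages=3)
for p in [1, 2, 3, 1, 4, 1]:   # page 1 is "hot"; page 4 forces an eviction
    cache.read(p)
print(cache.disk_reads)         # 4 reads hit disk; both re-reads of page 1 were free
```

Notice that the hot page (1) never leaves the cache, while the cold page (2) is evicted to make room. The bigger the cache, the less often useful pages get evicted, which is exactly the effect more RAM has on Exchange.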

Another common feature of a database is the checkpoint.  Checkpointing, simply put, is a way to defer writing information to the database until it is convenient.  This is perfectly safe because the data is written twice: once to the transaction logs and once to the database.  Since the data is safe in the transaction logs, we can delay writing it to the database file.  In the event of a crash, the database engine will read the transaction logs on startup to get the changes and write the pages to the database.  This is called replay.  Checkpointing helps for a number of reasons.  First, certain data is updated frequently, for example critical information about each mailbox and the calendar folder.  If we hold off writing, sometimes we get lucky: if a page that hasn't been written yet is changed again, we've saved a write I/O to the database (it was written to the log both times, though).  This can be HUGE.  Taking every piece of trash to the dump the moment you're done with it is silly, and so is writing data every time it is changed.  We get savings in bulk.  A second improvement is that it's highly probable that nearby data is also changed.  The longer we wait, the better the chance we can write both pages at the same time, again saving I/O.  In Exchange, we control how long to wait with something called the checkpoint depth.  The checkpoint depth is basically the size (in megabytes) of log files whose database writes may still be outstanding.  Any database page referred to by those log files can have its write to the database delayed.  The default checkpoint depth for Exchange 2000/2003 is 20MB per storage group.  This 20MB is the size of the logs, not the size of the pages referred to by the logs.  You can imagine this as a card catalog: the catalog fits on a desk but refers to a whole library of books that take much more space.
If the amount of data changed per page was large, for example all 4KB of every page, then the size of the dirty database pages and the checkpoint depth would be roughly the same.  If, however, only a few bytes of each page changed, the number of pages referred to by the logs could be in the millions, which at a 4KB page size could mean 4GB of dirty pages!
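The write-coalescing win from checkpointing can be shown with a toy model: every update is logged immediately, but the database write for a page is merely scheduled, so repeated updates to the same page collapse into a single database write. (Illustrative only; real ESE checkpointing tracks log generations, not a simple set.)

```python
# Toy model of deferred database writes under checkpointing.
log_writes = 0
dirty_pages = set()

def update(page_id):
    global log_writes
    log_writes += 1           # every change is written to the transaction log
    dirty_pages.add(page_id)  # but the database write is only scheduled

# Ten updates that touch only three distinct pages (hot mailbox metadata):
for page in [7, 7, 8, 7, 9, 8, 7, 7, 9, 8]:
    update(page)

db_writes = len(dirty_pages)  # pages actually flushed at checkpoint time
print(log_writes, db_writes)  # 10 log writes, but only 3 database writes
```

The deeper the checkpoint, the longer pages stay dirty and the more updates each database write absorbs; that is the "savings in bulk" described above.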

This brings up a good question.  Why do we not care about the cost of the I/O to the log?  Highly scaled-up databases usually do care, but Exchange doesn't usually hit a bottleneck here.  Exchange's high costs are in the random database I/Os; log I/O is sequential, and most disk drives handle sequential I/O much better.  Caching controllers (controllers with RAM) usually make this I/O almost free, because they can wait and do the operation in bulk.

Exchange 2000 and 2003

So why not just add more memory?  Well, it's not that easy.  Exchange 2000 and Exchange 2003 were both 32-bit applications.  This imposes a fundamental limit on how much RAM can be seen at one time: specifically 4GB, since 2^32 bytes is 4GB.  However, Windows needs memory for itself and usually takes half, leaving 2GB.  This was modified later to take only 1GB (leaving 3GB for Exchange).  That's still not much RAM.  So where does this 3GB go?  Some of it holds the Exchange program files and some is needed for processing.  This memory also cannot be moved around freely; it has a tendency to fragment, just like a disk.  In memory, however, defragmentation is difficult and hugely expensive.  To prevent fragmentation, Exchange has to be very careful about how it uses memory.  As a result, Exchange typically can only depend on about 900MB for the Jet database buffer cache.  900MB may seem like a lot, but when shared by 4000 users, each user only gets about 225KB of RAM.

Think about your mailbox.  How big are the last 50 items?  The average message size is usually between 20KB and 40KB.  Assuming 20KB, 50 messages is 1MB.  That's about 4 times as large as the 225KB allotment, and we haven't even counted rules, the calendar, or contacts.  You can quickly see the problem.  We need more RAM.
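The arithmetic in the last two paragraphs is worth making explicit (using round decimal units, 1MB = 1000KB, to match the article's figures):

```python
# Per-user share of the 32-bit Jet buffer cache.
jet_cache_kb = 900 * 1000          # ~900 MB of usable Jet cache on 32-bit
users = 4000
per_user_kb = jet_cache_kb / users
print(round(per_user_kb))          # 225 KB of cache per user

# The "hot" working set of just the top of the Inbox:
hot_set_kb = 50 * 20               # top 50 items at ~20 KB each
print(hot_set_kb)                  # 1000 KB
print(round(hot_set_kb / per_user_kb, 1))  # ~4.4x the per-user allotment
```

Even this optimistic estimate leaves no room for the calendar, contacts, or rules, which is why the cache thrashes under load on 32-bit servers.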

Exchange 2007

A major motivation for Exchange to use 64-bit is not the ability to crunch bigger numbers, but to get more memory.  In fact, we can access a lot more: most 64-bit computers on the market can address a few hundred GBs of RAM.  As mentioned before, more RAM means we can keep data in memory longer and save repeated trips to disk.  But doesn't RAM cost money?  Yes it does, but it's much cheaper than disk up to about 32GB.  Based on this, to optimize for I/O reduction we recommend about 5MB of Jet database buffer cache for each user plus 2GB.  So for 4000 users, you'd want 20GB + 2GB, or about 24GB of server RAM.  This means about 20GB of Jet cache versus roughly 1GB in Exchange 2000/2003.  In our lab tests, we started at 1.0 IOPS per user and went to .54, entirely from a reduction in reads; a MAJOR savings.
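The sizing rule of thumb above can be written as a one-line helper. This is a sketch of the guidance in this post, not an official sizing tool; the parameter defaults are the figures quoted above.

```python
def recommended_ram_gb(users, cache_per_user_mb=5, base_gb=2):
    """Rule of thumb from this article: ~5 MB of Jet database buffer
    cache per user, plus 2 GB for the OS and Exchange itself."""
    return base_gb + users * cache_per_user_mb / 1024

print(round(recommended_ram_gb(4000)))   # ~22 GB, rounded up to 24 GB in practice
```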

Our next bit of magic was to increase the number of storage groups.  Moving from 1 storage group (one set of logs) for 5 databases to a 1:1 relationship means more transaction log streams, but no net change in bytes logged, since we have the same number of users.  In Exchange 2000/2003, large servers typically deployed with 1000 users per storage group, and the checkpoint depth was 20MB.  That corresponds to 20KB of checkpoint per user, which limited the number of pages that could be delayed.  By deploying more storage groups, we can delay more pages and get more batching and optimization.  Also, the parts of the database that store views can fit more messages on a single page.  In our lab test (as described above), this moved our I/O from .54 IOPS to .43 IOPS, stemming from a drop in write I/Os.
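The per-user checkpoint math behind this change is simple: the 20MB depth is per storage group, so fewer users per storage group means more deferred (batchable) writes per user. Using the deployment figures from this post:

```python
# Checkpoint depth available per user as storage groups are added.
users = 4000
checkpoint_per_sg_mb = 20

for storage_groups in (4, 28):
    users_per_sg = users / storage_groups
    kb_per_user = checkpoint_per_sg_mb * 1024 / users_per_sg
    print(storage_groups, round(kb_per_user))  # 4 SGs -> ~20 KB; 28 SGs -> ~143 KB
```

Going from 4 to 28 storage groups gives each user roughly seven times the checkpoint headroom, which is where the extra write batching comes from.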

We didn't stop there.  Now that the cache was bigger, we also increased the page size from 4KB to 8KB.  The page size is the size of the 'packets' of data that Jet stores on disk; it is the minimum size Exchange will fetch from the disk.  The catch is that sometimes we need all 8K (a message body) and other times we don't (a simple message with no body).  Overall each page holds twice as much data, but we can only fit half as many pages in the Jet cache.  Because of this, 8K pages *could possibly* hurt instead of help.  Having a larger cache decreases that risk significantly by keeping useful pages around longer.  The huge positive of 8K pages is that the internal structures in the database (trees) can be shorter.  Shorter trees mean fewer I/Os to get to the pages that store actual user data.  We also get the added benefit of storing more in the same place.  In Exchange 2000/2003, we stored messages and message bodies in separate locations, meaning at least 2 disk I/Os.  Now, if the message and its body together are less than 8K (our data indicates around 75% of messages are), we store them in 1 location.  This means savings on writes and savings on reads.  In our lab tests, this change took us from .43 to .27 IOPS!
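The "shorter trees" effect can be sketched with the standard B+ tree height estimate: height is roughly the log of the record count, base fanout. The record count and fanouts below are hypothetical round numbers chosen for illustration; real ESE trees have different fanout for internal and leaf pages.

```python
import math

def btree_height(total_records, records_per_page):
    """Approximate height of a B+ tree: ceil(log base fanout of N).
    A simplification for illustration, not ESE's actual layout."""
    return math.ceil(math.log(total_records, records_per_page))

# Doubling the page size from 4 KB to 8 KB roughly doubles the fanout:
n = 5_000_000                        # hypothetical records in a large table
print(btree_height(n, 100))          # 4 KB pages, ~100 entries each -> 4 levels
print(btree_height(n, 200))          # 8 KB pages, ~200 entries each -> 3 levels
```

One fewer level in the tree is one fewer page read on every lookup that misses the cache, which is a large part of the .43 to .27 IOPS drop.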

These changes helped us to achieve a roughly 70% reduction in I/O.  Here’s some information about the test:

1. Based on usage of a real server here at Microsoft

2. We used Exmon to help ‘understand’ the load

3. We built a new load simulation tool similar to Loadsim (but more accurate)

4. We iterated on this until the load looked similar (our simulated Exchange 2003 load compared with the real Exchange 2003 server).  We then took our baseline, resulting in 1.0 IOPS

5. We then used the exact same load against our ‘modified’ Exchange 2007 test server.

6. The numbers I quote are for 4000 users, moving from 4GB of RAM to 24GB, and increasing the storage groups from 4 to 28.

7. Your mileage will vary depending on user load, mail flow, and the effect of cosmic rays.
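Putting the per-user IOPS figures from the lab test together shows how the three changes compound:

```python
# Per-user IOPS measured at each step of the lab test described above.
steps = {
    "Exchange 2003 baseline": 1.00,
    "+ large 64-bit cache":   0.54,
    "+ more storage groups":  0.43,
    "+ 8 KB pages":           0.27,
}
baseline = steps["Exchange 2003 baseline"]
final = steps["+ 8 KB pages"]
print(f"{(1 - final / baseline):.0%} total I/O reduction")  # prints "73% total I/O reduction"
```

The exact 73% rounds to the "roughly 70%" quoted above; as noted, your mileage will vary with the workload.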

I also think it is important to note that our tests were conducted over the summer, on ‘early’ builds of Exchange 2007.  These numbers can change as different features of Exchange become ‘active’. 

– Chris Mitchell


Comments (19)
  1. Brian.Kronberg says:

    "The numbers I quote are for 4000 users, moving from 4GB of RAM to 24GB, and increasing the storage groups from 4 to 28."

    At what user size quota?  250 MB or 2 GB?  What was the average mailbox size?

  2. Brian.Kronberg,

       The test was at 200MB mailboxes.

  3. Juergen says:

    Do the 5 MB in the sizing rule of thumb “2 GB + 5 MB per user” refer to a user with an active RPC connection, or per user with a mailbox hosted on the server?

    Which topics have an influence on the required memory size per “user”: user activity (frequency of sending or receiving mails) , number of open connections per mailbox, number of items stored in special folders like the inbox, overall size of a mailbox, or size of the database file?

  4. David says:

    When will be available the "new Loadsim" for Exchange 2007?

  5. Josh Maher says:

    How much lag time was required to get the 20GB of jet cache into memory?

  6. Lanlogic says:

    What about small and mid-size customers with less than 250 employees ?  Will Exchange 2007 really use 64-bit hardware and more than 2-4GB of RAM if there aren’t that many users ?  We know what SBS can do up to 50-75 users, but for clients that need to go beyond SBS, will 64-bit hardware and tons of RAM really make a huge difference ?

  7. Robert Snyder says:

    So, with the increase of performance to 70% on the DB luns and adding up to 50 SG’s with a single DB, what happens to log performance? More specifically, is the conversion factor for allocation of LUNS between the DB and LOGS on 2K7 the same as 2K3?

    If you’re looking at a 1 IOP environment in 2K3 it will equate to a .3 IOP (based on your recommendations and data). Then to calculate spindle count for DB luns would be:

    5000 users x 2GB Mailbox Size x 1.5 for Data compensation and growth x 2 (VRaid1)/ 300GB disk size

    = 100 Total Disks (DB LUN aggregate size and VRaid1)

    Factor out the VRaid1 overhead (divide by 2) and we end up with 50 Total (usable) disks for Aggregate Storage

    = 15TB Total aggregate Storage

    Now, in E2003 to calculate the Spindle count for LOGS we would take:

    #DB spindles/10 = #LOG IOPS

    then

    # LOG IOPS/Disk IOP Capacity = Total # Spindles for LOG disk Group

    Is this still the same in E2K7?

  8. David: the "new Loadsim" will be available on the web around the time when Exchange 2007 is shipped.

  9. Juergen:  So first, we won’t require 5MB per user, but for some profiles we’ll recommend it.  We’re in the process of finalizing our system requirements and recommendations.  The amount of memory per user we recommend varies with profile.  As an admin/architect, the ability (but not requirement) to add RAM will give you a powerful lever in pricing and building your Exchange Server.

    A few main drivers are the amount of mailflow (send and receive), the concurrency (how many users are accessing the system at the same time), and applications used.  There are MANY factors that also contribute, but usually to a lesser degree like mailbox size and folder size.

    Mailflow is the biggest, to no surprise.  Mailbox size drives IOPS when there are applications that perform lots of searches or frequently download the contents of the mailbox.  Using Cached Mode relieves the server of a large burden of searching and sorting lots of items.  Folder size is important for the previous reason and to make sure the Cached Mode experience remains positive.

  10. Curtis Johnstone says:

    Good article.

    re: "2. We used Exmon to help ‘understand’ the load "

    So Exmon runs on Exchange 2007?

  11. Curtis Johnstone: I don't believe the released version of Exmon runs on Exchange 2007.  There have been a few updates.  I will look into when we'll release those updates.

  12. Chris Mitchell says:

    Robert Snyder:

    The amount of logging IO in E2K7 vs. E2K3 should be within 10-15% (from memory).  Therefore you’ll probably need the same IOPS capacity for log drives as in E2K3.  As for the ratio for E2K7, I’ll have to get back to you on that (I need to look at data).

  13. Josh Maher:

    It depends on how much IOPS you've provisioned for.  Now, obviously, we're reducing the IOPS requirement by adding RAM.  Given the specifics of this test (from memory), I calculate we hit 1/2 cache in use (10GB) within about 20 minutes; the next half takes a little longer.

  14. AML says:

    Thanks Chris, interesting article.

    The improvements in disk performance in part rely on increasing the number of storage groups used.  In the past the general recommendation was to have separate physical disks for each set of logs (i.e. storage group).  Is this still the case with Exchange 2007?  Thanks in advance.

  15. Jim McBee says:

    Very interesting article.  It would nice to see a comparison of the IOPS requirements for E2K3 as compared with the projected IOPS requirements for E2K7 given light, average, heavy, and large user types.

  16. RPM says:

    I have to ask, why didn’t the Exchange team pursue a switch to the SQL 2005 storage engine? I know this change has been rumored for many years.

    In my mind, the SQL engine has always seemed far more performant and memory efficient than ESE. Virtual memory use is a prime example: on a 4 GB 32-bit server, I get nearly 3 GB of buffer cache with SQL, but less than 1 GB with Exchange 2003.

    With the addition of VARCHAR(MAX), full-text search, and native XML data types, surely it makes sense for MSFT to have just one database engine to support?

  17. Ananda Sankaran says:

    Chris,

    Well written article! straight to the relevant stuff…

    Additional memory made available by x64 definitely helps reduce reads.

    Interesting to note the checkpoint depth limit per storage group is still 20MB. In contrast with SQL server, where the checkpoint limit is specified as time (recovery interval parameter). Default interval is approx 1 min (i.e) SQL checkpoints its dirty pages every min or so. Thus the amount of flushed pages depends on the available memory size (i.e buffer cache size) and the write % in the load.

    It will be interesting to know the read/write split of your load. I assume the write % of the load to be less to effect the .54 to .43 IOPS improvement.

  18. Anonymous says:

    In order to assist customers in designing their storage layout for Exchange 2007 (especially after…

  19. Anonymous says:

    In order to assist customers in designing their storage layout for Exchange 2007, we have put together

Comments are closed.