By now, most of you have heard that Exchange 2007 will only be supported on x64 hardware. That decision hinged on the impressive performance gains to be had from 64-bit’s access to far more RAM. More RAM means less disk I/O for Exchange and can translate into reduced hardware requirements in the data center. How does it help? I’ll be glad to explain.
First off, let’s step into the classroom and learn about databases. Exchange is a server application built on top of a database, specifically ESE (Extensible Storage Engine), which is also the database engine for Active Directory. Like all database engines, ESE provides tables, columns, rows, and indexes along with a transaction facility including logging, replay, isolation, and recovery.
A common problem with databases is that they are big. They are HUGE when compared against system memory. For example, it’s not uncommon to see Exchange installations with 500GB to 2TB of information on a given server. This means that for the most part, any data that is requested will be on the hard drive only and not in RAM, so it takes a disk I/O to read it. Once that disk I/O completes, the information will be in RAM and won’t require additional disk I/Os. However, that comes at a price. Since RAM is only so big, some other data has to ‘leave’ the cache to make room. Database folks call this ‘eviction’. Sometimes it’s ok, like in the case of a year-old email. It’s highly unlikely to be needed again, so it’s probably safe to evict it. However, other data is used more frequently. Some examples include the top 50 items in the Inbox, the calendar and contacts, and rules. That’s a lot of data. While eviction isn’t the end of the world, it does mean we’ll have to re-read the data from disk if it is needed again. Not only does that take at least 1000x longer (memory is usually accessible in nanoseconds while disk is in the tens of milliseconds), but disk drives are also limited in how much work they can do, usually around 100 disk I/Os per second. Imagine keeping your phonebook at work and having to drive to work every time you wanted to make a phone call. This is why RAM is so important; it keeps commonly used data close at hand. Specifically, RAM used to cache database information is usually called a buffer cache.
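To make the eviction problem concrete, here is a minimal sketch of a buffer cache in Python. It uses plain least-recently-used (LRU) eviction as a stand-in; that is an assumption for illustration only and is not meant to describe the replacement policy ESE really uses. The class name, the page counts, and the access pattern are all hypothetical.

```python
from collections import OrderedDict

class BufferCache:
    """Toy LRU page cache that counts the disk reads caused by misses."""

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.pages = OrderedDict()   # page number -> page contents
        self.disk_reads = 0

    def read(self, page_no):
        if page_no in self.pages:
            self.pages.move_to_end(page_no)    # hit: mark as most recently used
            return self.pages[page_no]
        self.disk_reads += 1                   # miss: we have to go to disk
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)     # evict the least recently used page
        self.pages[page_no] = f"contents of page {page_no}"
        return self.pages[page_no]

def disk_reads_for(capacity_pages, accesses):
    """Replay an access pattern against a fresh cache and return the disk reads."""
    cache = BufferCache(capacity_pages)
    for page_no in accesses:
        cache.read(page_no)
    return cache.disk_reads

# A 'hot' working set of 8 pages that is touched over and over again.
accesses = list(range(8)) * 100
small = disk_reads_for(4, accesses)   # cache holds only half the working set
large = disk_reads_for(8, accesses)   # cache holds the whole working set
print(f"4-page cache: {small} disk reads, 8-page cache: {large} disk reads")
```

With this particular pattern the small cache goes to disk on every single access, while the cache that fits the whole working set only pays for the first read of each page. Real mailbox access is far less regular, so treat the gap as an illustration of the effect, not a measurement.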
Another common feature of a database is the checkpoint. A checkpoint, simply put, is a way to defer writing information to the database until it is convenient. This is perfectly safe since the data is written 2 times: it is written to the transaction logs and also to the database. Since the data is safe in the transaction logs, we can delay writing it to the database file. In the event of a crash, the database engine will read the transaction logs on startup to get the changes and write the pages to the database. This is called replay. This ‘checkpointing’ helps for a number of reasons. First, certain data is updated frequently, for example critical information about each mailbox and the calendar folder. If we hold off writing, sometimes we can get lucky. If the page hasn’t been written to the database yet and is changed again, we’ve saved a write I/O to the database (it was written to the log both times though). This can be HUGE. Taking every piece of trash to the dump when you’re done with it is silly and so is writing data every time it is changed. We get savings in bulk. A second improvement is the fact that it’s highly probable that data located nearby is also changed. The longer we wait, the better the chance that we can write both pages at the same time, again saving I/O. In Exchange, we control how long to wait by something called the checkpoint depth. The checkpoint depth is basically the size (in megabytes) of log files to keep in memory. Any database page referred to by those log files can be delayed in writing to the database. The default checkpoint depth for Exchange 2000/2003 is 20MB per storage group (SG). This 20MB is the size of the logs, not the size of the pages referred to by the logs. You can imagine this as a ‘card catalog’; the card catalog can fit on a desk but can refer to a whole library of books that take up much more space.
If the amount of data changed was large, for example all 4KB of every page changed, the total size of the dirty database pages and the checkpoint depth would be roughly the same. If, however, only a few bytes changed on each page, the number of pages referred to by the logs could be in the millions, which at a 4KB page size could mean 4GB!
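The arithmetic behind those two extremes can be sketched in a few lines of Python. The 20 bytes per change is a hypothetical figure picked to stand in for ‘a few bytes’, and the sketch assumes every log record touches a different page and ignores any per-record overhead in the log.

```python
PAGE_SIZE = 4 * 1024                   # 4KB database page (Exchange 2000/2003)
CHECKPOINT_DEPTH = 20 * 1024 * 1024    # 20MB of log per storage group

def dirty_pages(bytes_logged_per_page):
    """Return (pages, bytes of cache) that 20MB of log can refer to.

    Assumes each log record touches a different page and ignores
    log record overhead; both are simplifications for illustration.
    """
    pages = CHECKPOINT_DEPTH // bytes_logged_per_page
    return pages, pages * PAGE_SIZE

cases = [("whole 4KB page changed", PAGE_SIZE), ("20 bytes changed", 20)]
for label, logged in cases:
    pages, cache_bytes = dirty_pages(logged)
    print(f"{label}: {pages:,} pages, {cache_bytes / 1024**2:,.0f}MB of dirty cache")
```

Under these assumptions the first case works out to 5,120 pages (20MB of dirty cache) and the second to roughly a million pages (4GB), which is the ‘card catalog’ effect in numbers.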
This brings up a good question. Why do we not care about the cost of the I/O to the log? Highly scaled-up databases usually care, but Exchange doesn’t usually hit a bottleneck here. Exchange usually has its high costs in the random database I/Os. Log I/O is sequential, and most disk drives handle sequential I/O much better. Caching controllers (controllers with RAM) usually make this I/O almost free, because the controller is able to wait and do the operation in bulk.
Exchange 2000 and 2003
So why not just add more memory? Well, it’s not that easy. Exchange 2000 and Exchange 2003 were both 32-bit applications. This means a fundamental limit on how much memory can be addressed at one time. Specifically, 4GB: 2^32 bytes (2 multiplied by itself 32 times) is 4GB. However, Windows needs memory for itself and usually takes half, leaving 2GB. This was modified later so that Windows only takes 1GB (leaving 3GB for Exchange). However, this is still not that much RAM. So where does this 3GB go? Some of it is used to hold the Exchange program files and some is needed for processing. This memory also cannot be moved around freely; it has a tendency to fragment, just like a disk. In memory, however, defragmentation is difficult and hugely expensive. To prevent fragmentation, Exchange has to be very careful about how it uses memory. As a result, Exchange typically can only depend on about 900MB for the Jet database buffer cache. 900MB may seem like a lot, but when it is shared by 4000 users, each user only gets about 225KB of RAM.
Think about your mailbox. How big are the last 50 items? The average message size is usually between 20KB and 40KB. Assuming 20KB, 50 messages is 1MB. That’s more than 4 times as large as the 225KB allotment, and we haven’t even counted rules, the calendar, or contacts. You can quickly see the problem. We need more RAM.
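For anyone who wants to check the numbers, the 32-bit budget above works out as follows. This is just the arithmetic from the text written as Python; it uses round units (1MB = 1000KB) to match the figures quoted here.

```python
# 32-bit address space: 2^32 bytes expressed in GB
address_space_gb = 2**32 / 2**30

# Jet buffer cache that Exchange 2000/2003 can typically rely on
cache_kb = 900 * 1000        # about 900MB, in round units
users = 4000
per_user_kb = cache_kb / users

# The 'last 50 items' working set, at the low end of the size range
inbox_kb = 50 * 20           # 50 messages of 20KB each
shortfall = inbox_kb / per_user_kb

print(f"Address space: {address_space_gb:.0f}GB")
print(f"Cache per user: {per_user_kb:.0f}KB")
print(f"Last 50 items: {inbox_kb}KB, or {shortfall:.1f}x the per-user cache")
```

Using 40KB messages instead of 20KB doubles the shortfall, and that is still before rules, the calendar, and contacts are counted.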
A major motivation for Exchange to use 64-bit is not the ability to crunch bigger numbers, but to get more memory. In fact, we can access a lot more. Most 64-bit computers on the market can address a few hundred GBs of RAM. As mentioned before, more RAM means we can keep data in memory longer and save repeated trips to disk. But doesn’t RAM cost money? Yes it does, but it’s much cheaper than disk up to about 32GB. Based on this, to optimize for I/O reduction we recommend about 5MB of Jet database buffer cache for each user plus 2GB. So for 4000 users, you’d want 20GB + 2GB = 22GB, or about 24GB in practice. This would mean 20GB of Jet cache vs. roughly 1GB in Exchange 2000/2003. In our lab tests, we started at 1.0 IOPS and went to .54, entirely from a reduction in reads; a MAJOR savings.
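The sizing rule is simple enough to write down as a small helper. The function name and parameters below are hypothetical, and the sketch uses round units (1GB = 1000MB) to match the figures in the text.

```python
def recommended_ram_gb(users, cache_mb_per_user=5, base_gb=2):
    """Rule of thumb from the text: 5MB of Jet cache per user plus 2GB."""
    jet_cache_gb = users * cache_mb_per_user / 1000   # round units: 1GB = 1000MB
    return jet_cache_gb + base_gb

for users in (1000, 2000, 4000):
    print(f"{users} users: about {recommended_ram_gb(users):.0f}GB of RAM")
```

For 4000 users this gives 22GB, which is why the test server described in this post was configured with 24GB.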
Our next bit of magic was to increase the number of storage groups. Moving from 1 storage group (one set of logs) for 5 databases to a 1:1 relationship means more transaction log streams (but not more log files). Overall, there’s no net change in bytes logged (same number of users). In Exchange 2000/2003, large servers were typically deployed with 1000 users per storage group and the checkpoint depth was 20MB. This corresponds to 20KB of checkpoint per user, which limited the number of pages whose writes could be delayed. By deploying more storage groups, we can delay more pages and get more batching and optimization. Also, the parts of the database that store views can store more messages on a single page. In our lab test (as listed above) this moved our I/O from .54 IOPS to .43 IOPS, stemming from a drop in write I/Os.
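Here is a rough sketch of how the per-user checkpoint grows with the number of storage groups. It uses the 4 and 28 storage groups from the test described later in this post, and it assumes the checkpoint depth stays at 20MB per storage group, which the text does not state for Exchange 2007. The function name is hypothetical and the units are round (1MB = 1000KB).

```python
def checkpoint_kb_per_user(storage_groups, users, depth_mb_per_sg=20):
    """Total checkpoint depth across all storage groups, divided per user.

    depth_mb_per_sg=20 is the Exchange 2000/2003 default; keeping it
    unchanged for the larger storage group count is an assumption.
    """
    total_kb = storage_groups * depth_mb_per_sg * 1000   # round units
    return total_kb / users

before = checkpoint_kb_per_user(storage_groups=4, users=4000)
after = checkpoint_kb_per_user(storage_groups=28, users=4000)
print(f"4 storage groups: {before:.0f}KB of checkpoint per user")
print(f"28 storage groups: {after:.0f}KB of checkpoint per user")
```

Under that assumption each user goes from 20KB to 140KB of checkpoint, so 7 times as much logged change can sit in memory waiting to be batched.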
We didn’t stop there. Now that the cache was bigger, we also increased the page size from 4KB to 8KB. The page size is the size of the ‘packets’ of data that Jet stores on disk. It is the minimum size Exchange will fetch from the disk. The problem with this is that in some cases we might need all 8K (a message body) and other times we might not (a simple message with no body). Overall each page has twice as much data, but we can only have half as many pages in the Jet cache. Because of this, 8K pages *could possibly* hurt instead of help. Having a larger cache decreases the chances of this significantly by helping keep useful pages longer (minimizing the risk that we don’t have the useful page in memory). The huge positive of 8K pages is that our internal structures in the database (trees) can be shorter. Shorter trees mean fewer I/Os to get to the pages that store actual user data. We also get the added benefit of storing more in the same place. In Exchange 2000/2003, we stored messages and message bodies in separate locations, meaning at least 2 disk I/Os. Now, if the message and the body together are less than 8K (our data indicates around 75% of messages are less than 8K) we store them in 1 location. This means savings on writes and savings on reads. In our lab tests, this change took us from .43 to .27 IOPS!
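To see why bigger pages make for shorter trees, here is a back-of-the-envelope sketch. The 32-byte entry size and the 10 million records are hypothetical numbers chosen only to show the effect; the sketch also ignores page headers and partially filled pages, and it is not a description of how ESE lays out its trees.

```python
def tree_levels(records, fanout):
    """Number of levels a tree needs so that fanout**levels >= records."""
    levels, reach = 1, fanout
    while reach < records:
        reach *= fanout
        levels += 1
    return levels

ENTRY_SIZE = 32          # hypothetical bytes per tree entry
RECORDS = 10_000_000     # hypothetical number of records in the tree

levels_4k = tree_levels(RECORDS, 4 * 1024 // ENTRY_SIZE)   # fanout of 128
levels_8k = tree_levels(RECORDS, 8 * 1024 // ENTRY_SIZE)   # fanout of 256
print(f"4KB pages: {levels_4k} levels, 8KB pages: {levels_8k} levels")
```

With these made-up numbers the 8KB tree is one level shorter, which means one less page to fetch on the way down to the data whenever that level isn’t already in the cache.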
These changes helped us to achieve a roughly 70% reduction in I/O. Here’s some information about the test:
1. Based on usage of a real server here at Microsoft
2. We used Exmon to help ‘understand’ the load
3. We built a new load simulation tool similar to Loadsim (but more accurate)
4. We iterated on this until the simulated load looked similar to the real one (Exchange 2003 compared with Exchange 2003). We then took our baseline, which resulted in 1.0 IOPS.
5. We then used the exact same load against our ‘modified’ Exchange 2007 test server.
6. The numbers I quote are for 4000 users, moving from 4GB of RAM to 24GB, and increasing the storage groups from 4 to 28.
7. Your mileage will vary depending on user load, mail flow, and the effect of cosmic rays.
I also think it is important to note that our tests were conducted over the summer, on ‘early’ builds of Exchange 2007. These numbers can change as different features of Exchange become ‘active’.
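To put the lab numbers in one place, here is a small sketch that walks through the steps and turns them into a rough spindle count. It assumes the quoted figures are IOPS per user, which this post does not spell out, and it uses the ballpark of 100 I/Os per second per drive mentioned earlier. The helper name is hypothetical, and real sizing also has to account for capacity, RAID overhead, and headroom.

```python
import math

DISK_IOPS = 100    # rough per-drive limit quoted earlier in this post
USERS = 4000

# (change, IOPS measured after the change) as quoted in the text
steps = [
    ("Exchange 2003 baseline", 1.00),
    ("larger Jet cache", 0.54),
    ("more storage groups", 0.43),
    ("8KB pages", 0.27),
]

def disks_needed(users, iops_per_user, disk_iops=DISK_IOPS):
    """Drives needed to serve the load, assuming the IOPS figure is per user."""
    return math.ceil(users * iops_per_user / disk_iops)

baseline = steps[0][1]
for name, iops in steps:
    saved = round((1 - iops / baseline) * 100)
    print(f"{name}: {iops:.2f} IOPS, {saved}% below baseline, "
          f"{disks_needed(USERS, iops)} drives")

total_reduction = round((1 - steps[-1][1] / baseline) * 100)
```

Going from 1.0 to .27 is a 73% reduction, in line with the ‘roughly 70%’ quoted above. As noted, these come from early builds and one load profile, so your mileage will vary.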