A few basic concepts in disk sizing


While I was at TechEd, I had a great time talking with our customers and hearing about their experiences as Exchange administrators.  One of the areas that came up in discussion a lot was planning sufficient disk throughput for the back-end Exchange servers.  A Lead Program Manager on the Exchange team started pestering me to write it up in a blog, since if so many customers at TechEd had questions, it followed that there might be other Exchange administrators who would be interested as well.

So….I’ve written up the basics of what you need to know about disk sizing for the Exchange server.   I’m not claiming it’s comprehensive (I’m not going to go into SAN technology details, for example, since they differ from vendor to vendor), but it should be enough to:

a) determine if you have sufficient disks

b) detect if you have a disk bottleneck

c) calculate the number of disk I/Os per second per user (also known as IOPS/user)

d) estimate how many disks you need for a new server, based on past user behavior.

First, let me define IO (also written as I/O).  “The amount of IO” is the number of reads and writes to a drive.    The actual bytes that are read or written are less interesting than the number of times the disk head has to move to a location.    There is often confusion around the size of a disk (the number of bytes that can be stored on it) and the throughput (the number of IOs per second that can be read and written).   Throughput is usually measured in IOPS (IOs per second or io/sec).  It’s important to know the maximum throughput (the maximum number of IOs your disks can sustain), because if you exceed that maximum, Hello Outlook Popup!   RPC latencies will quickly go through the roof when maximum disk throughput is exceeded.   When someone is referring to a disk bottleneck, they are referring to a throughput bottleneck, not a limitation of disk space.

Also, for this discussion, when I say IO, I’m usually referring to the physical disk\disk transfers per second to the database drives, but the basic principles can apply to sizing the rest of the drives as well.   The reason I focus on the database drives is that Exchange server makes heavy use of the disks that house the databases.   For comparison, the store writes 1/10 the number of IOs to the log drive compared to the database drives.  Even though I focus on database drives, be aware that SMTP queue drives and Exchange temp drives, depending on your company’s email users, can also be heavy consumers of IO, and you will want to make sure you aren’t exceeding the maximum disk throughput of those drives either.

Determining if you have enough disks

How do you know if your drives are healthy?   The simplest way to check is to measure how long a read and a write take (referred to as the read and write latency).  Take a look at the physical disk\avg. disk sec/read and avg. disk sec/write counters for your database drives.   The server reports these in seconds, but we generally talk about them in ms.   Are the latencies under 20 ms?   If so, excellent.  Your users are probably happy (or at least, complaining about something other than email responsiveness).  If the latencies are larger than 20 ms, it’s time to take a look at your disk usage.   Do the physical disk\disk transfers/sec counters exceed the maximum throughput of your drives?   Aah, now you ask, how do I determine the maximum throughput of my drives?
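Once you’ve exported those latency counters from Performance Monitor, a quick sanity check might look like this (a minimal sketch; the function name, threshold default, and sample values are mine, not from any Exchange tool):

```python
def disk_healthy(read_latencies_sec, write_latencies_sec, threshold_ms=20):
    """Return True if average read and write latencies are under the threshold.

    Perfmon reports avg. disk sec/read and avg. disk sec/write in seconds,
    so convert to milliseconds before comparing against the 20 ms guideline.
    """
    avg_read_ms = 1000 * sum(read_latencies_sec) / len(read_latencies_sec)
    avg_write_ms = 1000 * sum(write_latencies_sec) / len(write_latencies_sec)
    return avg_read_ms < threshold_ms and avg_write_ms < threshold_ms

# ~10 ms reads and ~1 ms writes: healthy
print(disk_healthy([0.008, 0.012, 0.010], [0.001, 0.001, 0.002]))  # True
```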

The best way to determine maximum throughput is to measure it.  The jetstress tool is an excellent way to measure the maximum throughput of your disks.  The documentation explains how to do this, so I’ll skip that here.   However, to use jetstress, you have to test your disks in a lab (not in a production environment).   So what do you do if you already have a server in production and suspect you have exceeded the maximum throughput?  The best thing you can do is make an estimate.  Here’s how I make estimates (there are many tricks, but these are fairly simple):

1.  Most disks can do between 130 and 180 IOPS. 

2.  Exchange typically has a Read-to-Write (R:W) ratio of 3:1 or 2:1.

3.  We recommend that you plan to use less than 80% disk utilization at peak load.


Raid 0 (striping) has the same cost as no raid.  Reads and writes happen once.

Raid 0+1 requires two disk IOs for every write (the mirrored data is written twice)

Raid 5 requires four disk IOs for every write (two reads, two writes to calculate and write parity)


I’m skipping the math unless someone asks for it.  Essentially, this translates into the values in the tables below.  These are the values you should use when you estimate how much disk throughput is available for users during peak load.
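For those who do want the math, it can be sketched in a few lines (the function name and structure are mine): divide each logical IO into its physical cost using the RAID write penalty, then apply the 80% utilization ceiling.

```python
def usable_iops(disk_iops, reads, writes, write_penalty, utilization=0.8):
    """Estimated per-disk throughput available to the application.

    Each logical read costs 1 physical IO; each logical write costs
    `write_penalty` physical IOs (1 for no raid/raid 0, 2 for raid 0+1,
    4 for raid 5).
    """
    total = reads + writes
    cost_per_io = (reads + write_penalty * writes) / total
    return disk_iops * utilization / cost_per_io

print(round(usable_iops(130, 3, 1, write_penalty=2)))  # 83  (raid 0+1, 3:1)
print(round(usable_iops(180, 2, 1, write_penalty=4)))  # 72  (raid 5, 2:1)
```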

Tables to lookup recommended maximum disk throughput per disk:

Table 1.  Estimated maximum disk throughput for No Raid or Raid 0

R:W ratio \ Disk speed | 130 IOs per second | 180 IOs per second
3:1                    | 104 IOPS           | 144 IOPS
2:1                    | 104 IOPS           | 144 IOPS

Table 2.  Estimated maximum disk throughput for Raid 0+1 (or Raid 10)

R:W ratio \ Disk speed | 130 IOs per second | 180 IOs per second
3:1                    | 83 IOPS            | 115 IOPS
2:1                    | 78 IOPS            | 108 IOPS

Table 3.  Estimated maximum disk throughput for Raid 5

R:W ratio \ Disk speed | 130 IOs per second | 180 IOs per second
3:1                    | 59 IOPS            | 82 IOPS
2:1                    | 52 IOPS            | 72 IOPS

I’m too lazy to even use tables, so I take the conservative approach and assume I can safely get a throughput of 80 IOs per second for most disks, in a raid 0+1 configuration (Raid 0+1 is generally recommended for most database drives).

If you have multiple drives (or “spindles”) connected in a raid configuration, multiply the throughput by the number of drives.  Thus, 10 disks in raid 0+1 will safely support a load of 800 IOPS.   I spoke with one customer who recently changed disks.  The company had previously had 6 small disks, and recently replaced them with 3 large disks.  Since then, users had been seeing a lot of Outlook popups while waiting for messages to open, changing folders, etc.   The 3 larger disks were unable to deliver the IO throughput that the 6 smaller disks had delivered.  The disks were bottlenecked; IO latency went up, and so did RPC latency as a consequence.   Solution?   Put more disks in there!   I want to stress that this is not an uncommon scenario – it seems perfectly reasonable to move to fewer, larger disks….but you can see here how it can get your server into trouble.

What if your disks’ throughput is below the maximum, but the latencies are still high?  Sometimes the problem is a configuration problem (e.g., max queue depth), or it occurs because you are sharing SAN drives with another application that is consuming a lot of IO bandwidth.  When Exchange is competing with another application for IO, user experience suffers.  If you are seeing poor latency but the throughput is well below the disk maximum, you will have to go back to your disk guru and start troubleshooting.   In general, we don’t recommend sharing database SAN spindles with other applications.   And never, never share log drives with any other application (this significantly reduces the throughput).

Detecting disk bottlenecks

It’s pretty simple to tell if you have a disk bottleneck.  If the latencies to your disk drives are greater than 20 ms (0.020 as measured from physical disk\disk seconds per read and disk seconds per write), then disks are starting to be an issue.   You can survive on disks with 50 ms latencies, but the user experience improves significantly if they are reduced.   On our internal Exchange servers, we keep the latency to 10 ms for read IOs, and around 1 ms for write IOs (write latencies can be very low if you have a battery-backed write-back cache).

You should be able to confirm the cause of your bottleneck (exceeding maximum disk throughput) by measuring the physical disk\disk transfers per second and comparing with your estimated maximum throughput.

Calculate your IOPS/user

If you’ve been reading some of the whitepapers or attending talks on Exchange server, you’ve probably seen references to IOPS per user.   Generally, this refers to the number of IO read and write requests to the database drive, divided by the number of users. 

Measure the physical disk\disk transfers per second for all database drives for 20 minutes to 2 hours during your most active time (for example, this is from 9-11 AM on a Monday here at Microsoft).   During this time, also measure the number of active users (MSExchangeIS\Active User Count).   Take an average of these counters.  Sum the disk transfers/sec for each database drive, divide that sum by the average active user count and… Voila!  You have just calculated the number of IOPS per user.  
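The calculation above is just a ratio of two averages. As a sketch (the function name and sample numbers are illustrative, not measured values):

```python
def iops_per_user(db_drive_transfers_per_sec, avg_active_users):
    """IOPS/user: summed disk transfers/sec across the database drives,
    divided by the average active user count over the same window."""
    return sum(db_drive_transfers_per_sec) / avg_active_users

# Two database drives averaging 210 and 158 transfers/sec, 920 active users:
print(round(iops_per_user([210.0, 158.0], 920), 2))  # 0.4
```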

Keep in mind that the number of IOPS/user is determined by how active your users are.   You may find that this differs from server to server (and database to database).   Don’t sweat it.   These numbers are used as guidelines, but accurate numbers aren’t always necessary…as long as you build in a little overhead when planning & populating your servers.  However, you can use these numbers to help decide when you want to move users from a busy server to another server.

(Note that, as a general practice, it’s a good idea to always measure the server when it’s at peak load.  When you are sizing your servers, you always need to plan for maximum usage… and then leave a little buffer overhead for those extra special days…like when all the users return from Christmas break).

Estimate how many disks you need for a new server, based on past user behavior.

Now that you know how to measure (via jetstress) or estimate (from above) maximum disk throughput, and you know the IOPS/user, it’s a simple task to plan for how many disks you’ll need for a new server.

Assuming the new users have a similar email usage profile (are using the same clients, have the same percentage of plugins, send about the same mail), then here’s how you go about it:

Calculate the throughput you will need. (multiply the number of users on the new server by the number of IOPS/user)

Divide the throughput by the maximum throughput of the disks you are using (use the numbers from the tables above, or the result from jetstress * 0.8; the numbers in the tables above already include the 80% max usage to build in some overhead).    Round up.  This will give you the minimum number of disks that you will need for the server.   Next, divide by the number of databases, and round up.  This will give you the number of disks you need per database (or repeat with storage groups if your databases share the same physical drive).

That’s it!  Oh, I suppose it’s always a good idea to do an example:

Ok, suppose I am hiring 5000 people (growth is good!), and I want to figure out how to size my server.   My current users require 0.4 IOPS per user, and I expect the new guys to be just as hard working as my current employees.  I will need a total of 2000 IOPS.

I’m going to buy fast disks capable of 180 IOPS, which I’m going to configure in Raid 0+1.   From the table, I can expect to get around 108 IOs per second per disk.    2000 IOPS / 108 IOPS per disk = 18.5.   This implies that I would need 19 disks if all IOs were going to the same place.  But they aren’t, of course (backup times would be unwieldy!!!) – I’m planning to have 20 databases spread across 4 storage groups.  The databases in the same storage group will share the same disk set.  So each storage group will need to support 2000/4 = 500 IOPS.   That means each storage group will need 500/108 = 4.6 disks.  Rounding up shows that I will need 5 disks for each storage group.  So the total number of disks I will need is 5*4 = 20 disks.
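The worked example above can be sketched as follows (a hypothetical helper of my own, using the per-storage-group rounding from the text):

```python
import math

def disks_for_server(users, iops_per_user, per_disk_iops, storage_groups):
    """Total disks needed, rounding up within each storage group.

    per_disk_iops should already include the 80% utilization buffer
    (i.e., a value from the tables above, or the jetstress result * 0.8).
    """
    total_iops = users * iops_per_user                      # 5000 * 0.4 = 2000
    per_group_iops = total_iops / storage_groups            # 2000 / 4 = 500
    disks_per_group = math.ceil(per_group_iops / per_disk_iops)  # ceil(4.6) = 5
    return disks_per_group * storage_groups

print(disks_for_server(5000, 0.4, 108, 4))  # 20
```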

Suppose that after buying my disks, I test them in the lab, and my jetstress tests show only 120 IOs per second per disk, which gives me 96 IOPS to play with after I’ve multiplied by 0.8 for a 20% safety buffer.    I redo my calculations: each storage group now needs 500/96 = 5.2 disks, which rounds up to 6, for a total of 6*4 = 24 disks.  Once I’ve added the extra disks, I’m ready to build out my server and add the new users. 

(Note I haven’t calculated the amount of disk space capacity I’ll need…I’ll leave this for the readers…unless I get specific requests.  In many cases, disk capacity is less of an issue because disk capacities have grown significantly.   For most Exchange customers, the real issue is disk throughput).

Thanks to you (the reader) for taking the time to read this – I hope you have found the content interesting.  If you have questions about my examples/explanations, send them our way and I’ll do my best to address them.  I’ll take requests for topics as well – I’d love to know what other areas are interesting to you.  

Nicole Allen

Comments (22)
  1. Goran Husman says:

    Excellent information, Nicole!

    Thank you very much!!

    /Göran

    Exchange MVP

  2. keith hanna says:

    Excellent article!!!

    Much appreciated :)

  3. Karan says:

    Thanks Nicole … I’d read the Exchange Disk sizing whitepaper but this post does a nice job of summing it up quite nicely.

  4. Lok says:

    As mentioned, Exchange typically has a Read-to-Write (R:W) ratio of 3:1 or 2:1. is there any perfmon counter that we can calculate the real life read write ratio in the production environment for particular company?

  5. Nicole Allen says:

    To measure your R:W ratio, look at the ratio of LogicalDisk\Disk Reads/sec to LogicalDisk\Disk Writes/sec for the database drives. (You can look at the same counters in PhysicalDisk if you don’t have the LogicalDisk counters enabled.)

    Note that for corporate servers with a large number of users (I am defining large here as 500 users or more), the R:W ratios are usually 3:1 or 2:1. However, servers that have fewer than 500 users will have lower R:W ratios (approaching 0:1 as the number of users decreases and as the amount of data in the database decreases). This is because for servers with few users, much of the user’s data will be in the database cache (in memory), so some of the read actions will be satisfied by data in memory. This reduces the number of read operations. Of course, all of the write operations will still have to be written to disk. Thus, the net effect of having a smaller number of users on the server is that the ratio of R:W goes down.

    Below are the values for an R:W ratio of 0:1.

    Estimated maximum disk throughput for No Raid or Raid 0:

    R:W ratio \ Disk speed | 130 IOs per second | 180 IOs per second
    0:1                    | 104 IOPS           | 144 IOPS

    Table 2. Estimated maximum disk throughput for Raid 0+1 (or Raid 10):

    R:W ratio \ Disk speed | 130 IOs per second | 180 IOs per second
    0:1                    | 52 IOPS            | 72 IOPS

    Table 3: Estimated maximum disk throughput for Raid 5:

    R:W ratio \ Disk speed | 130 IOs per second | 180 IOs per second
    0:1                    | 26 IOPS            | 36 IOPS

  6. Steve McGovern says:

    Do you have any metrics for write latency when SAN synchronous replication is enabled?

  7. Barry says:

    Question on disk latency… While deriving disk latency, if we use the counter "physical disk\disk reads/sec" and we see an average value of 13.882, does that translate to a latency of 72ms? (Using 1/13.882 = 0.072035…) Or should I not use that counter, or is my interpretation totally wrong?

    Thanks,

  8. Barry says:

    ummm, found another reference that points out the counters should be "physical disk\average disk sec/read" and "physical disk\average disk sec/write". So, please ignore the last question. Instead, can you confirm these are the counters that you refer to?

    Thanks,

  9. Steve McGovern says:

    Barry, you are correct. Disk IO latency can be identified by using the physical disk\average disk sec/write & physical disk\average disk sec/read counters. You should also monitor the MSExchangeIS\RPC Averaged Latency counter.

  10. Justin says:

    Great article! I’m running servers with only 5 physical disks, and I don’t see my company helping me out with anything more. What are the pros and cons of separating the logs, OS, page file, etc. onto separate logical disks but leaving them on the same physical disk, then putting the databases on a separate physical disk?

    Thanks,

  11. Nicole Allen says:

    Barry,

    As Steve said, yes, you are correct. It’s important to look at the seconds/write or seconds/read counters.

  12. Nino Bilic says:

    Justin,

    There are close to no advantages to putting things on separate logical disks if they are on the same physical disk. One could say that doing this could help because of possibly less file level fragmentation, but one could also say that disk will do a lot more seeking on the same physical drive.

    There are definitely advantages to putting the database onto a separate physical disk. Keeping the database random IO from the rest of the IO that could be sequential (transaction logs, other databases like SQL etc) is generally a good thing.

  13. Justin says:

    Thanks Nino, so would you say that it is more important to separate the database or the logs? SMTP, system, MTA, and indexes are all random IO, like the databases. The logs are sequential, so would it be more beneficial for me to just give the logs their own physical disk?

  14. Nino Bilic says:

    I would definitely say the database. On your typical Exchange server, the database is responsible for about 90% of disk IO as opposed to the logs that are responsible for about 10% of disk IO. So – it is the database that will get disk IO bottlenecked much sooner than the transaction log drive.

  15. Mike Salim says:

    Hi,

    Nice article. You have mentioned the word "database" a few times, could you please expand and define what is a "database" in the Exchange context. particularly the statement "all the databases" – how do I identify "all the databases" ?

    Thanks!

    Mike

  16. Nino Bilic says:

    Hi Mike,

    "The database" in Exchange 200x world really means the .edb and .stm files. Those two files are really the same database – meaning, every database (mailbox or public folder) will have two database files, one .edb and one .stm. It is those database files that are referred as the "database".

  17. Louis says:

    Here is the basic info.

    Server A

    Time slice from 9 am to 6 pm.

    Disk Transfers/Sec

    Min/Avg/Max

    34.400/339.052/1012.903

    MSExchangeIS\Active User Count

    Min/Avg/Max

    519/920/1177

    Disk Reads/sec

    Min/Avg/Max

    10.2/206.3/577.8

    Disk Writes/sec

    Min/Avg/Max

    18.467/132.75/766.715

    Total Mailboxes

    563 on the server

    Calculate your IOPS/user

    I’ve seen a couple different options here.

    Worst Case scenario

    Max Disk Transfers/sec / Total Mailboxes on the server

    1012/563

    1.797

    Max Disk Transfers/sec / Max Active user count from MSExchangeIS

    1012/1177

    .85

    From the Exchange 2003 performance and scalability guide

    IOPS/mailbox = (average disk transfer/sec) ÷ (number of mailboxes)

    339/563

    .602

    and then

    Avg Disk Transfers/sec / Avg Active user count from MSExchangeIS

    339/920

    .368

    1. So which one is correct or should be used ?

    The table Tables to lookup recommended maximum disk throughput per disk:

    Table 1. Estimated maximum disk throughput for No Raid or Raid 0

    2. Where do these numbers come from ? I understand the 180, but how do you reduce the achived throughput to 115 or 108 (using RAID 0+1)

    3. The R:W ratio, is it as simple at looking at the averages on Disk Reads/sec and Disk Writes/Sec ?

    So my ratio is roughly about 2:1 from the numbers listed above ?

    Thanks this info is very good!

  18. Nicole Allen says:

    Steve,

    Use the same metric as in a non-synchronous-replication scenario (20 ms), as measured from LogicalDisk\Avg. Disk sec/Transfer. What really matters to the users is the amount of latency the server is seeing – so you still want to keep Avg. Disk sec/Transfer low.

    -Nicole

  19. David Wilhoit (kidego) Exchange MVP says:

    Nicole,

    I’m experiencing high latency on my 5.5 servers, due to some poor configuration when the servers were originally built. Although write-back cache is enabled, with battery backup (SmartArray 5i with internal disks), and my avg disk sec/transfer values are low, I think the cache is full, and it’s not flushing out to disk fast enough. %Disk Time can spike at over 1100, and over 2 hours in the A.M. it can average over 500%, with disk queue lengths hovering around 7. Only 4 spindles for the database in a RAID5, and I know that the server is pounded flat. Would increasing the read cache size help me out on this, or am I wasting time until I convince them to buy the SAN?

  20. Anonymous says:

    Based on the questions that we got on another post, it seemed appropriate to address the "Requesting…


Comments are closed.