Virtualization, the SAN and why one big RAID 5 array is wrong.

Article
08/23/2011

Today I've had no fewer than three ad hoc conversations about disk sharing for VMs. This isn't about RAID5 versus RAID10, specific performance requirements for SharePoint, or cache for simultaneous read/write requests. This is simply aimed at giving SharePoint admins the knowledge to take to the SAN admins to ensure a successful SharePoint deployment via best practices. Our focus is on SQL Server in this scenario, but the concepts apply to other workloads as well. Ready...let's dive in.

First of all, some definitions:

Term	Definition
SAN	Acronym for Storage Area Network. It’s a physical device used to extend storage space. The SAN is the parent device for all disk I/O.
Enclosure	Each SAN is made up of enclosures. It’s a physical part of the SAN and is the device holding all the hard drives. Normally, a set of fiber channels, iSCSI channels, HBAs and other miscellaneous hardware is attached to each enclosure.
Hard drive	Regular hard drive you’re familiar with. You know - that spinny thing (or if you're really good, not spinny thing)
RAID Array	Normally RAID 5 when we discuss SANs. A RAID array is a grouping of hard drives (all within the same enclosure) that provides fault tolerance: if a single hard drive goes down, no data is lost. In RAID 5, usable capacity is: Size of Drives * (Number of Drives - 1). The minimum number of disks for RAID 5 is 3: the equivalent of 2 for the data and 1 for the parity.
LUN	Acronym for Logical Unit Number. A LUN is a logical section of a RAID array and is the actual drive letter that is exposed to Windows.
IOPS	Acronym for Input/Output Operations per Second. IOPS is a measurement of the performance of a disk (or Array). To calculate IOPS, we can use http://blogs.technet.com/b/cotw/archive/2009/03/18/analyzing-storage-performance.aspx, or a much easier method: SQLIO.
SQLIO	A disk benchmarking utility which gives results in IOPS for different loads. https://www.microsoft.com/download/en/details.aspx?displaylang=en&id=20163
Bandwidth	The theoretical maximum of a given resource without any additional load. Imagine bandwidth as a 4 lane highway without traffic with a 70mph speed limit. As law abiding citizens, we can drive up to 70mph all the time. To go 10 miles will take about 8.5 minutes every time because we’re always traveling 70mph.
Throughput	The actual maximum of a given resource with additional load factored in. Imagine throughput as a 4 lane highway WITH traffic and a 70mph speed limit. Because of congestion, the actual speed we can travel varies from 45mph to 70mph. It’s never exactly the same.
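
The highway analogy for bandwidth and throughput comes down to simple arithmetic. Here is a minimal sketch of it; the function name `travel_minutes` is made up for illustration, and the speeds are the ones used in the definitions above.

```python
def travel_minutes(distance_miles, speed_mph):
    """Minutes needed to cover a distance at a constant speed."""
    if speed_mph <= 0:
        raise ValueError("speed must be positive")
    return distance_miles / speed_mph * 60


DISTANCE_MILES = 10

# "Bandwidth": the empty highway, always at the 70mph limit.
best_case = travel_minutes(DISTANCE_MILES, 70)

# "Throughput": congestion drops us as low as 45mph.
worst_case = travel_minutes(DISTANCE_MILES, 45)

print(f"No traffic:   {best_case:.1f} minutes")
print(f"With traffic: up to {worst_case:.1f} minutes")
```

The first number never changes; the second depends on who else is on the road, which is exactly the position a shared RAID array puts us in.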

Now every SAN administrator has some brochure, or PDF, or something from their SAN vendor that says: For peak performance, create one RAID array of all the hard drives in the enclosure. I'm sure there's some balloon that says "To minimize waste", or "To load balance across multiple hard drives is a good thing!" This is wrong. But before you go and blow away all your LUNs and RAID arrays, let's examine a scenario:

Let’s assume that we have a SAN with 4 enclosures. Each enclosure is capable of holding 10 hard drives, and we decide to fill each one with 100GB drives rated at 100 IOPS apiece. Total possible space is 1TB and our bandwidth is 1,000 IOPS per enclosure. To maximize our investment (thereby minimizing waste), we follow our SAN vendor’s recommendation and create one big RAID 5 array and lose 1 disk to the parity calculation. So our available space is 900GB and our bandwidth is 1,000 IOPS.
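
The enclosure numbers above follow directly from the RAID 5 capacity formula in the definitions. A minimal sketch of that arithmetic is below; the function names are invented for illustration, and the IOPS figure uses the same simple model as this article (per-drive ratings just add up, with no allowance for parity overhead or cache).

```python
def raid5_usable_gb(drive_count, drive_size_gb):
    """Usable capacity of a RAID 5 array: size * (drives - 1)."""
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return drive_size_gb * (drive_count - 1)


def aggregate_iops(drive_count, iops_per_drive):
    """Naive bandwidth model: every drive contributes its full rating."""
    return drive_count * iops_per_drive


DRIVES_PER_ENCLOSURE = 10
DRIVE_SIZE_GB = 100
IOPS_PER_DRIVE = 100

raw_gb = DRIVES_PER_ENCLOSURE * DRIVE_SIZE_GB
usable_gb = raid5_usable_gb(DRIVES_PER_ENCLOSURE, DRIVE_SIZE_GB)
bandwidth_iops = aggregate_iops(DRIVES_PER_ENCLOSURE, IOPS_PER_DRIVE)

print(f"Raw: {raw_gb}GB, usable: {usable_gb}GB, bandwidth: {bandwidth_iops} IOPS")
```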

Next, we decide to deploy SharePoint 2010 via Hyper-V with all disks on the SAN. Our server architecture is 3 servers: 1 SQL, 1 APP and 1 WFE. We decide our drive needs are:

Server	Disk Description	Requirements
SQL	OS Drive	100GB and 50 IOPs
	Data Files	100GB and 250 IOPs
	Transaction Logs	100GB and 250 IOPs
WFE	OS Drive	100GB and 50 IOPs
APP	OS Drive	100GB and 50 IOPs

We send this to the SAN admin, who says to themselves: “Self, enclosure #1 has a 900GB capacity and 1,000 IOPS. SharePoint 2010 needs 500GB and 650 IOPS.” The admin then promptly carves up the RAID 5 array of enclosure #1 into 5 different LUNs: 3 for SQL, 1 for WFE and 1 for APP.
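
On paper, the SAN admin's reasoning checks out. The sketch below simply totals the requirements table and compares it against the enclosure's 900GB and 1,000 IOPS; the tuple layout and variable names are assumptions made for the example.

```python
# (server, disk, size in GB, IOPS) taken from the requirements table.
REQUIREMENTS = [
    ("SQL", "OS Drive", 100, 50),
    ("SQL", "Data Files", 100, 250),
    ("SQL", "Transaction Logs", 100, 250),
    ("WFE", "OS Drive", 100, 50),
    ("APP", "OS Drive", 100, 50),
]

ENCLOSURE_USABLE_GB = 900
ENCLOSURE_IOPS = 1000

total_gb = sum(gb for _, _, gb, _ in REQUIREMENTS)
total_iops = sum(iops for _, _, _, iops in REQUIREMENTS)

# The "looks fine" test: totals fit under the enclosure's headline numbers.
fits_on_paper = total_gb <= ENCLOSURE_USABLE_GB and total_iops <= ENCLOSURE_IOPS

print(f"Needed: {total_gb}GB and {total_iops} IOPS, fits on paper: {fits_on_paper}")
```

The check passes, and that is the trap: it compares totals against bandwidth and says nothing about which physical disks each load ends up sharing.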

Here’s where the problem arises. A hard drive only has one armature and uses it to read and write, but it can’t do both simultaneously. If the hard drive is writing, and we request a read for some file, then the read gets queued until the disk I/O completes. RAID is a double edged sword and the root of our problem: when using any kind of RAID, the data is dispersed amongst all the drives. On one hand, RAID is a huge performance boost because if we have 20 bits to write and 10 disks to use, then each disk only has to write 2 bits. On the other hand, if we have 20 bits to write, and 10 bits to read, the reading bits will have to wait because all the drives are used for the writes. Now granted, this read and write happens EXTREMELY fast, but the pause is still present and when we're talking about operations per second, they add up quickly.
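
To make the double-edged sword concrete, here is a deliberately crude toy model, not a description of how any real controller schedules I/O. It assumes each disk handles one unit of data per tick, data is spread round-robin across the disks, and a read that arrives behind a write has to wait for the busiest disk to finish.

```python
def stripe(units, disk_count):
    """Spread units of data round-robin; return the units landing on each disk."""
    per_disk = [0] * disk_count
    for i in range(units):
        per_disk[i % disk_count] += 1
    return per_disk


def read_wait_ticks(write_units, disk_count):
    """Ticks a read queued behind the write must wait in this toy model.

    Every disk is involved in the striped write, so the read cannot start
    until the busiest disk has finished its share.
    """
    return max(stripe(write_units, disk_count))


# The article's example: 20 bits to write, 10 disks to use.
print(stripe(20, 10))           # each disk's share of the write
print(read_wait_ticks(20, 10))  # the read waits, but only briefly
print(read_wait_ticks(20, 1))   # a single disk makes the read wait far longer
```

Striping shortens the wait compared with one disk, which is the performance boost, but the wait never drops to zero because the read and the write are still competing for the same spindles.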

Now, according to SQL best practices, we split our data files and transaction logs onto different disks to alleviate this queuing. The transaction logs are very write heavy (1000 writes: 1 read or more). The data file is very read heavy (1 write: 500 reads or so). As far as SQL is concerned, they’re on separate disks because we put them on separate LUNs. BUT WAIT: remember that all the LUNs are on one RAID array, AND that when you read or write from a RAID array you utilize all the disks. Thus you haven’t actually split your data files and transaction logs: they’re on the same disks. Now stack on top of that the write heavy load of your APP server and the read heavy load of your WFE, and our throughput plummets. While we are still under our bandwidth of 1,000 IOPS for the enclosure, the SQL LUN can't write while the WFE LUN is reading and vice versa.
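
One way to see why separate LUNs are not separate disks is to trace each LUN back to the spindles behind it. The sketch below models the single-array layout from this scenario with plain dictionaries; the array name and helper functions are hypothetical.

```python
# One RAID 5 array spanning all 10 disks in enclosure #1 (hypothetical name).
ARRAYS = {"enclosure1_raid5": frozenset(range(1, 11))}

# All five LUNs are carved out of that one array.
LUN_TO_ARRAY = {
    "SQL OS Drive": "enclosure1_raid5",
    "SQL Data Files": "enclosure1_raid5",
    "SQL Transaction Logs": "enclosure1_raid5",
    "WFE OS Drive": "enclosure1_raid5",
    "APP OS Drive": "enclosure1_raid5",
}


def disks_behind(lun):
    """Physical disks that service I/O for a LUN."""
    return ARRAYS[LUN_TO_ARRAY[lun]]


def shared_disks(lun_a, lun_b):
    """Physical disks two LUNs have in common."""
    return disks_behind(lun_a) & disks_behind(lun_b)


overlap = shared_disks("SQL Data Files", "SQL Transaction Logs")
print(f"Data and log LUNs share {len(overlap)} physical disks")
```

Windows and SQL see two drive letters, yet the overlap is every disk in the enclosure.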

So we’ve identified the underlying problem. What’s the solution? Let’s revisit our server needs:

Server	Disk Description	Requirements
SQL	OS Drive	100GB and 50 IOPs
	Data Files	100GB and 250 IOPs
	Transaction Logs	100GB and 250 IOPs
WFE	OS Drive	100GB and 50 IOPs
APP	OS Drive	100GB and 50 IOPs

RAID 5 requires a minimum of 3 drives: 2 data drives and a parity drive. SQL best practice is to break up the OS, data files and transaction logs onto separate drives. How do we follow best practices and leverage the SAN?

One possible solution would be to break the enclosure up into 3 RAID 5 volumes like so (note our wasted space is now 300GB instead of 100GB):

RAID 5 volume	Disks	Available space and bandwidth
#1	1-4	300GB and 400 IOPs
#2	5-7	200GB and 300 IOPs
#3	8-10	200GB and 300 IOPs
Total		700GB and 1,000 IOPs

Server	Disk Description	Requirements	Allocation
SQL	OS Drive	100GB and 50 IOPs	100GB from RAID5 volume #1
	Data Files	100GB and 250 IOPs	200GB from RAID5 volume #2
	Transaction Logs	100GB and 250 IOPs	200GB from RAID5 volume #3
WFE	OS Drive	100GB and 50 IOPs	100GB from RAID5 volume #1
APP	OS Drive	100GB and 50 IOPs	100GB from RAID5 volume #1

In the proposed solution, we are sacrificing an additional 200GB of disk space, but we’re gaining the performance of splitting our data files and transaction logs, and SQL is capable of leveraging the entire 300 IOPS of each dedicated volume. The OS disks are still sharing a RAID array (just like in the rejected solution), so we’re not making the problem any worse, but since we moved the data files and transaction logs to dedicated spindles, we gain performance by reducing the overall load.
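
The proposed layout can be sanity-checked the same way. The sketch below assumes each drive contributes the 100GB and 100 IOPS from the scenario, so a 4-drive RAID 5 volume works out to 300GB and 400 IOPS and a 3-drive volume to 200GB and 300 IOPS. The names and data layout are illustrative only.

```python
DRIVE_GB = 100
IOPS_PER_DRIVE = 100

# RAID 5 volume number -> disk numbers within the enclosure.
VOLUMES = {1: range(1, 5), 2: range(5, 8), 3: range(8, 11)}

# (disk description, volume, allocated GB, required IOPS)
PLACEMENTS = [
    ("SQL OS Drive", 1, 100, 50),
    ("SQL Data Files", 2, 200, 250),
    ("SQL Transaction Logs", 3, 200, 250),
    ("WFE OS Drive", 1, 100, 50),
    ("APP OS Drive", 1, 100, 50),
]


def volume_capacity(disks):
    """Return (usable GB, naive IOPS) for a RAID 5 volume on these disks."""
    count = len(disks)
    return DRIVE_GB * (count - 1), IOPS_PER_DRIVE * count


def check_layout(volumes, placements):
    """Map each volume to True if its allocations fit its space and IOPS."""
    report = {}
    for volume, disks in volumes.items():
        cap_gb, cap_iops = volume_capacity(disks)
        used_gb = sum(gb for _, v, gb, _ in placements if v == volume)
        used_iops = sum(iops for _, v, _, iops in placements if v == volume)
        report[volume] = used_gb <= cap_gb and used_iops <= cap_iops
    return report


print(check_layout(VOLUMES, PLACEMENTS))

# Data files and transaction logs no longer share any physical disk.
print(set(VOLUMES[2]) & set(VOLUMES[3]))
```

Every volume fits, and the data and log volumes have no disks in common, which is the whole point of giving up the extra space.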

Lastly, I’ve fudged the numbers a little bit to make my point. I’m not sure if 50 IOPS is too little or too much for an OS drive. I don’t know if your enclosure has 10 100GB drives with 100 IOPS in RAID5 or 25 300GB drives with 10,000 IOPS in RAID10. And the read/write challenge is nothing new to hard drive manufacturers or SAN vendors: great strides have been made to reduce the performance impact of a read and a write request coming in simultaneously, namely cache.

But here are the facts: if we plan ahead of time for these scenarios and work with the SAN admins, we can increase the SAN’s overall efficiency. By knowing the loads of other servers on the RAID array, we can intelligently place our loads and prevent any cross-server thrashing of the disks. Sometimes the answer may be one big RAID 10 array, or three smaller RAID 5 arrays. Proper benchmarking is the key and should be the first step in any new server installation to ensure we can identify any issues as soon as possible.

HTH
