More On Storage Spaces and ReFS

Since my last post below, I’ve had quite some time to work with Storage Spaces and ReFS alongside more traditional hardware RAID. Having now used the new technology and the “old stuff” side by side, through both good times and bad, I’ve gained some useful perspective.

In defense of good, old-fashioned RAID: it generally works very well at what it does. The traditional argument for it (offloading the parity calculations to a dedicated hardware device) is, in my opinion, no longer compelling in the age of multi-core, high-frequency processors. Today it’s all about feature set and manageability, and that’s where Storage Spaces shines.

A while back, a power failure in my home lab revealed a few things. One thing happened to each of two servers: the outage left the data partitions (where I house my Hyper-V VMs) in a state that needed to be rebuilt. One machine had a regular Windows Server software mirror of two drives on NTFS; the other had a hardware RAID card with a similar setup. On reboot, the Windows mirror simply started rebuilding automatically, while the hardware RAID required manual input before it would begin its rebuild. Then, right in the middle of the rebuilds, an “aftershock” power outage hit. The first machine came back just fine, and the Windows mirror picked up its resync where it had left off. The hardware device, however, was not so lucky. I was able to mount the volume but could not get it into a state where it would rebuild onto the mirror drives, and several VHD files were left unrecoverable. I managed to recover the data off the partitions, but not to start the virtual machines, and it took over a week to rebuild the affected machines and restore my lab to what it was.

Fast forward to today.

I have now fully deployed Windows Server 2012 on all of the host servers in my lab. Then, suddenly, I had two large hard drive failures, fortunately on separate machines. In the meantime I had also purchased an external hard drive storage array. It currently holds four 2 TB drives configured as RAID 10, yielding about 4 TB of usable storage. It’s cheap, but it does what it’s supposed to. One of those drives, however, reported a failure, and here is my one real problem with this setup: I don’t know what KIND of failure. A bad sector? Several bad sectors? Did the drive just need to be re-initialized and re-synced? No way to tell. My only option was to buy a replacement for a three-month-old drive. I did that, swapped it into the enclosure in place of the old one, powered everything back on, and the rebuild completed smoothly and automatically. Not bad, and it behaved as expected.
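For reference, RAID 10 hands back half of the raw capacity as usable space, since every stripe member is mirrored. For this enclosure that works out to:

$$
C_{\text{usable}} = \frac{n \cdot c}{2} = \frac{4 \times 2\,\text{TB}}{2} = 4\,\text{TB}
$$

where $n$ is the number of drives and $c$ is the capacity of each one.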

Then my other host, the one with an internal RAID card, had a disk fail in my VM storage array of four 1 TB drives, also configured as RAID 10 with about 2 TB of usable storage. With this one I at least knew what went bad: the drive’s controller failed outright and the disk just “disappeared” from the card (and at least this drive was a little older). The problem came after I had replaced the drive in the slot and issued the rebuild command: the rebuild ran for several hours and then failed on what looked to be several bad sectors on the surviving mirror of that segment. I could still access the data and run the VMs, but it was now apparent that I needed to do something different with the underlying architecture, because what I had wasn’t working for me.

What I came up with was to use the RAID card as a simple SATA port “extender”: present each drive as its own physical disk with no RAID at all, and then pool them in a Windows Storage Space configured for mirroring.
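Before pooling anything, it’s worth confirming that the controller really is presenting each drive as its own poolable physical disk. A minimal check with the built-in storage cmdlets (nothing here is specific to my hardware):

```powershell
# Each drive should now show up individually, with CanPool = True
Get-PhysicalDisk |
    Format-Table FriendlyName, SerialNumber, CanPool, Size, HealthStatus -AutoSize
```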

First, I needed to get the virtual machines off of the affected volume. Enter Hyper-V Replica. I had already replicated most of the machines to the server with the 4 TB partition, so it was a simple matter of performing a planned failover, reversing the direction of replication, removing replication from each machine, and deleting the VM from the affected host. I repeated this until they were all moved over and gone.

Once the volume was empty, I destroyed the array on the card and presented the drives as individual disks. I created a new Storage Space with all four disks using the simple wizard, which then launched a second wizard that stepped me through creating a virtual disk. I chose the “mirror” layout, and without my having to specify anything further, Windows knew that with four disks I wanted a RAID 10-style mirror/stripe set. A final wizard let me create the logical volume, which I named the same as the old one and formatted, this time, with ReFS. That last choice was critical, since what I really want is for the server to be able to repair damaged data instead of just flagging sectors as unusable and ignoring them. With these tools I might even be able to salvage the 2 TB drive that “failed” in my enclosure above and add it to a new data array.
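For anyone who prefers scripting this over clicking through the wizards, the whole move maps fairly cleanly onto built-in cmdlets. Treat these as sketches rather than a recipe: “LabVM01”, “VMPool”, “VMSpace”, and “VMStorage” are placeholder names from my own notes, and the failover commands have to be run on the correct host (old primary versus replica).

```powershell
# --- Evacuating a VM with Hyper-V Replica ---

# On the old (primary) host: shut the VM down and prepare the planned failover.
Stop-VM -Name "LabVM01"
Start-VMFailover -VMName "LabVM01" -Prepare

# On the replica host: complete the failover, reverse replication, start the VM.
Start-VMFailover -VMName "LabVM01"
Set-VMReplication -VMName "LabVM01" -Reverse
Start-VM -Name "LabVM01"

# Once the VM is confirmed healthy in its new home, drop the replication
# relationship and delete the stale copy from the old host.
Remove-VMReplication -VMName "LabVM01"
Remove-VM -Name "LabVM01" -Force
```

The wizard steps for the new pool, the mirrored space, and the ReFS volume look roughly like this in PowerShell:

```powershell
# Pool every disk the pass-through controller now exposes as poolable.
$disks = Get-PhysicalDisk -CanPool $true
New-StoragePool -FriendlyName "VMPool" `
    -StorageSubSystemFriendlyName "Storage Spaces*" `
    -PhysicalDisks $disks

# Carve a mirrored space out of the whole pool. With four disks, Storage Spaces
# stripes across two mirrored pairs, which is the RAID 10-style layout described above.
New-VirtualDisk -StoragePoolFriendlyName "VMPool" -FriendlyName "VMSpace" `
    -ResiliencySettingName Mirror -UseMaximumSize

# Initialize the new disk, create a partition, and format it with ReFS.
$disk = Get-VirtualDisk -FriendlyName "VMSpace" | Get-Disk
Initialize-Disk -Number $disk.Number -PartitionStyle GPT
New-Partition -DiskNumber $disk.Number -UseMaximumSize -AssignDriveLetter |
    Format-Volume -FileSystem ReFS -NewFileSystemLabel "VMStorage"
```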

I’m very pleased with the performance of the volume as well, and I haven’t noticed any performance hit from Windows handling the resiliency itself (which is minimal work at this point anyway, since a stripe/mirror set involves no parity calculations). Plus, everything is far more manageable now that I can administer the disks with the native tools. The old RAID configuration utility does nothing but present the physical disks, so it’s no longer needed for any rebuild.
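Health checks and repairs now use those same native cmdlets, which is the manageability win in a nutshell (again using my placeholder pool and space names from above):

```powershell
# Check the pool, the space, and the member disks in one pass.
Get-StoragePool -FriendlyName "VMPool"
Get-VirtualDisk -FriendlyName "VMSpace"
Get-StoragePool -FriendlyName "VMPool" | Get-PhysicalDisk |
    Format-Table FriendlyName, HealthStatus, OperationalStatus -AutoSize

# If the space reports as degraded after a disk swap, kick off a repair.
Repair-VirtualDisk -FriendlyName "VMSpace"
```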

Next time the power fails, Windows will just know what to do if there’s any array maintenance to be done.