Deep Dive: The Storage Pool in Storage Spaces Direct

Hi! I’m Cosmos. Follow me on Twitter @cosmosdarwin.

Review

The storage pool is the collection of physical drives which form the basis of your software-defined storage. Those familiar with Storage Spaces in Windows Server 2012 or 2012R2 will remember that pools took some managing – you had to create and configure them, and then manage membership by adding or removing drives. Because of scale limitations, most deployments had multiple pools, and because data placement was essentially static (more on this later), you couldn’t really expand them once created.

We’re introducing some exciting improvements in Windows Server 2016.

What’s new

With Storage Spaces Direct, we now support up to 416 drives per pool, the same as our per-cluster maximum, and we strongly recommend you use exactly one pool per cluster. When you enable Storage Spaces Direct (as with the Enable-ClusterS2D cmdlet), this pool is automatically created and configured with the best possible settings for your deployment. Eligible drives are automatically discovered and added to the pool and, if you scale out, any new drives are added to the pool too, and data is moved around to make use of them. When drives fail they are automatically retired and removed from the pool. In fact, you really don’t need to manage the pool at all anymore except to keep an eye on its available capacity.

Nonetheless, understanding how the pool works can help you reason about fault tolerance, scale-out, and more. So if you’re curious, read on!

To help illustrate certain key points, I’ve written a script (open-source, available at the end) which produces this view of the pool’s drives, organized by type, by server (‘node’), and by how much data they’re storing. The fastest drives in each server, listed at the top, are claimed for caching.

The storage pool forms the physical basis of your software-defined storage.

The confusion begins: resiliency, slabs, and striping

Let’s start with three servers forming one Storage Spaces Direct cluster.

Each server has 2 x 800 GB NVMe drives for caching and 4 x 2 TB SATA SSDs for capacity.

poolsblog-servers-3

We can create our first volume (‘Storage Space’) and choose 1 TiB in size, two-way mirrored. This implies we will maintain two identical copies of everything in that volume, always on different drives in different servers, so that if hardware fails or is taken down for maintenance, we’re sure to still have access to all our data. Consequently, this 1 TiB volume will actually occupy 2 TiB of physical capacity on disk, its so-called ‘footprint’ on the pool.

Our 1 TiB two-way mirror volume occupies 2 TiB of physical capacity, its ‘footprint’ on the pool.

Our 1 TiB two-way mirror volume occupies 2 TiB of physical capacity, its ‘footprint’ on the pool.

(Storage Spaces offers many resiliency types with differing storage efficiency. For simplicity, this blog will show two-way mirroring. The concepts we’ll cover apply regardless which resiliency type you choose, but two-way mirroring is by far the most straightforward to draw and explain. Likewise, although Storage Spaces offers chassis and/or rack awareness, this blog will assume the default server-level awareness for simplicity.)

Okay, so we have 2 TiB of data to write to physical media. But where will these two tebibytes of data actually land?

You might imagine that Spaces just picks any two drives, in different servers, and places the copies in whole on those drives. Alas, no. What if the volume were larger than the drive size? Okay, perhaps it spans several drives in both servers? Closer, but still no.

What actually happens can be surprising if you’ve never seen it before.

Storage Spaces starts by dividing the volume into many 'slabs', each 256 MB in size.

Storage Spaces starts by dividing the volume into many ‘slabs’, each 256 MB in size.

Storage Spaces starts by dividing the volume into many ‘slabs’, each 256 MB in size. This means our 1 TiB volume has some 4,000 such slabs!

For each slab, two copies are made and placed on different drives in different servers. This decision is made independently for each slab, successively, with an eye toward equilibrating utilization – you can think of it like dealing playing cards into equal piles. This means every single drive in the storage pool will store some copies of some slabs!

The placement decision is made independently for each slab, like dealing playing cards into equal piles.

This can be non-obvious, but it has some real consequences you can observe. For one, it means all drives in all servers will gradually “fill up” in lockstep, in 256 MB increments. This is why we rarely pay attention to how full specific drives or servers are – because they’re (almost) always (almost) the same!

Slabs of our two-way mirrored volume have landed on every drive in all three servers.

Slabs of our two-way mirrored volume have landed on every drive in all three servers.

(For the curious reader: the pool keeps a sprawling mapping of which drive has each copy of each slab called the ‘pool metadata’ which can reach up to several gigabytes in size. It is replicated to at least five of the fastest drives in the cluster, and synchronized and repaired with the utmost aggressiveness. To my knowledge, pool metadata loss has never taken down an actual production deployment of Storage Spaces.)

Why? Can you spell parallelism?

This may seem complicated, and it is. So why do it? Two reasons.

Performance, performance, performance!

First, striping every volume across every drive unlocks truly awesome potential for reads and writes – especially larger sequential ones – to activate many drives in parallel, vastly increasing IOPS and IO throughput. The unrivaled performance of Storage Spaces Direct compared to competing technologies is largely attributable to this fundamental design. (There is more complexity here, with the infamous column count and interleave you may remember from 2012 or 2012R2, but that’s beyond the scope of this blog. Spaces automatically sets appropriate values for these in 2016 anyway.)

(This is also why members of the core Spaces engineering team take some offense if you compare mirroring directly to RAID-1.)

Improved data safety

The second is data safety – it’s related, but worth explaining in detail.

In Storage Spaces, when drives fail, their contents are reconstructed elsewhere based on the surviving copy or copies. We call this ‘repairing’, and it happens automatically and immediately in Storage Spaces Direct. If you think about it, repairing must involve two steps – first, reading from the surviving copy; second, writing out a new copy to replace the lost one.

Bear with me for a paragraph, and imagine if we kept whole copies of volumes. (Again, we don’t.) Imagine one drive has every slab of our 1 TiB volume, and another drive has the copy of every slab. What happens if the first drive fails? The other drive has the only surviving copy. Of every slab. To repair, we need to read from it. Every. Last. Byte. We are obviously limited by the read speed of that drive. Worse yet, we then need to write all that out again to the replacement drive or hot spare, where we are limited by its write speed. Yikes! Inevitably, this leads to contention with ongoing user or application IO activity. Not good.

Storage Spaces, unlike some of our friends in the industry, does not do this.

Consider again the scenario where some drive fails. We do lose all the slabs stored on that drive. And we do need to read from each slab’s surviving copy in order to repair. But, where are these surviving copies? They are evenly distributed across almost every other drive in the pool! One lost slab might have its other copy on Drive 15; another lost slab might have its other copy on Drive 03; another lost slab might have its other copy on Drive 07; and so on. So, almost every other drive in the pool has something to contribute to the repair!

Next, we do need to write out the new copy of each – where can these new copies be written? Provided there is available capacity, each lost slab can be re-constructed on almost any other drive in the pool!

(For the curious reader: I say almost because the requirement that slab copies land in different servers precludes any drives in the same server as the failure from having anything to contribute, read-wise. They were never eligible to get the other copy. Similarly, those drives in the same server as the surviving copy are ineligible to receive the new copy, and so have nothing to contribute write-wise. This detail turns out not to be terribly consequential.)

While this can be non-obvious, it has some significant implications. Most importantly, repairing data faster minimizes the risk that multiple hardware failures will overlap in time, improving overall data safety. It is also more convenient, as it reduces the ‘resync’ wait time during rolling cluster-wide updates or maintenance. And because the read/write burden is spread thinly among all surviving drives, the load on each drive individually is light, which minimizes contention with user or application activity.

Reserve capacity

For this to work, you need to set aside some extra capacity in the storage pool. You can think of this as giving the contents of a failed drive “somewhere to go” to be repaired. For example, to repair from one drive failure (without immediately replacing it), you should set aside at least one drive’s worth of reserve capacity. (If you are using 2 TB drives, that means leaving 2 TB of your pool unallocated.) This serves the same function as a hot spare, but unlike an actual hot spare, the reserve capacity is taken evenly from every drive in the pool.

Reserve capacity gives the contents of a failed drive "somewhere to go" to be repaired.

Reserve capacity gives the contents of a failed drive “somewhere to go” to be repaired.

Reserving capacity is not enforced by Storage Spaces, but we highly recommend it. The more you have, the less urgently you will need to scramble to replace drives when they fail, because your volumes can (and will automatically) repair into the reserve capacity, completely independent of the physical replacement process.

When you do eventually replace the drive, it will automatically take its predecessor’s place in the pool.

Check out our capacity calculator for help with determining appropriate reserve capacity.

Automatic pooling and re-balancing

New in Windows 10 and Windows Server 2016, slabs and their copies can be moved around between drives in the storage pool to equilibrate utilization. We call this ‘optimizing’ or ‘re-balancing’ the storage pool, and it’s essential for scalability in Storage Spaces Direct.

For instance, what if we need to add a fourth server to our cluster?

Add-ClusterNode -Name <Name>

poolsblog-servers-4

The new drives in this new server will be added automatically to the storage pool. At first, they’re empty.

The capacity drives in our fourth server are empty, for now.

After 30 minutes, Storage Spaces Direct will automatically begin re-balancing the storage pool – moving slabs around to even out drive utilization. This can take some time (many hours) for larger deployments. You can watch its progress using the following cmdlet.

Get-StorageJob

If you’re impatient, or if your deployment uses Shared SAS Storage Spaces with Windows Server 2016, you can kick off the re-balance yourself using the following cmdlet.

Optimize-StoragePool -FriendlyName "S2D*"

The storage pool is 're-balanced' whenever new drives are added to even out utilization.

The storage pool is ‘re-balanced’ whenever new drives are added to even out utilization.

Once completed, we see that our 1 TiB volume is (almost) evenly distributed across all the drives in all four servers.

The slabs of our 1 TiB two-way mirrored volume are now spread evenly across all four servers.

The slabs of our 1 TiB two-way mirrored volume are now spread evenly across all four servers.

And going forward, when we create new volumes, they too will be distributed evenly across all drives in all servers.

This can explain one final phenomena you might observe – that when a drive fails, every volume is marked ‘Incomplete’ for the duration of the repair. Can you figure out why?

Conclusion

Okay, that’s it for now. If you’re still reading, wow, thank you!

Let’s review some key takeaways.

  • Storage Spaces Direct automatically creates one storage pool, which grows as your deployment grows. You do not need to modify its settings, add or remove drives from the pool, nor create new pools.
  • Storage Spaces does not keep whole copies of volumes – rather, it divides them into tiny ‘slabs’ which are distributed evenly across all drives in all servers. This has some practical consequences. For example, using two-way mirroring with three servers does not leave one server empty. Likewise, when drives fail, all volumes are affected for the very short time it takes to repair them.
  • Leaving some unallocated ‘reserve’ capacity in the pool allows this fast, non-invasive, parallel repair to happen even before you replace the drive.
  • The storage pool is ‘re-balanced’ whenever new drives are added, such as on scale-out or after replacement, to equilibrate how much data every drive is storing. This ensures all drives and all servers are always equally “full”.

U Can Haz Script

In PowerShell, you can see the storage pool by running the following cmdlet.

Get-StoragePool S2D*

And you can see the drives in the pool with this simple pipeline.

Get-StoragePool S2D* | Get-PhysicalDisk

Throughout this blog, I showed the output of a script which essentially runs the above, cherry-picks interesting properties, and formats the output all fancy-like. That script is included below, and is also available at http://cosmosdarwin.com/Show-PrettyPool.ps1 to spare you the 200-line copy/paste. There is also a simplified version at here which forgoes my extravagant helper functions to reduce running time by about 20x and lines of code by about 2x. 🙂

Let me know what you think!

# Written by Cosmos Darwin, PM
# Copyright (C) 2016 Microsoft Corporation
# MIT License
# 11/2016

Function ConvertTo-PrettyCapacity {
    <#
    .SYNOPSIS Convert raw bytes into prettier capacity strings.
    .DESCRIPTION Takes an integer of bytes, converts to the largest unit (kilo-, mega-, giga-, tera-) that will result in at least 1.0, rounds to given precision, and appends standard unit symbol.
    .PARAMETER Bytes The capacity in bytes.
    .PARAMETER UseBaseTwo Switch to toggle use of binary units and prefixes (mebi, gibi) rather than standard (mega, giga).
    .PARAMETER RoundTo The number of decimal places for rounding, after conversion.
    #>

    Param (
        [Parameter(
            Mandatory = $True,
            ValueFromPipeline = $True
            )
        ]
    [Int64]$Bytes,
    [Int64]$RoundTo = 0,
    [Switch]$UseBaseTwo # Base-10 by Default
    )

    If ($Bytes -Gt 0) {
        $BaseTenLabels = ("bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
        $BaseTwoLabels = ("bytes", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB")
        If ($UseBaseTwo) {
            $Base = 1024
            $Labels = $BaseTwoLabels 
        }
        Else {
            $Base = 1000
            $Labels = $BaseTenLabels
        }
        $Order = [Math]::Floor( [Math]::Log($Bytes, $Base) )
        $Rounded = [Math]::Round($Bytes/( [Math]::Pow($Base, $Order) ), $RoundTo)
        [String]($Rounded) + $Labels[$Order]
    }
    Else {
        0
    }
    Return
}


Function ConvertTo-PrettyPercentage {
    <#
    .SYNOPSIS Convert (numerator, denominator) into prettier percentage strings.
    .DESCRIPTION Takes two integers, divides the former by the latter, multiplies by 100, rounds to given precision, and appends "%".
    .PARAMETER Numerator Really?
    .PARAMETER Denominator C'mon.
    .PARAMETER RoundTo The number of decimal places for rounding.
    #>

    Param (
        [Parameter(Mandatory = $True)]
            [Int64]$Numerator,
        [Parameter(Mandatory = $True)]
            [Int64]$Denominator,
        [Int64]$RoundTo = 1
    )

    If ($Denominator -Ne 0) { # Cannot Divide by Zero
        $Fraction = $Numerator/$Denominator
        $Percentage = $Fraction * 100
        $Rounded = [Math]::Round($Percentage, $RoundTo)
        [String]($Rounded) + "%"
    }
    Else {
        0
    }
    Return
}

Function Find-LongestCommonPrefix {
    <#
    .SYNOPSIS Find the longest prefix common to all strings in an array.
    .DESCRIPTION Given an array of strings (e.g. "Seattle", "Seahawks", and "Season"), returns the longest starting substring ("Sea") which is common to all the strings in the array. Not case sensitive.
    .PARAMETER Strings The input array of strings.
    #>

    Param (
        [Parameter(
            Mandatory = $True
            )
        ]
        [Array]$Array
    )

    If ($Array.Length -Gt 0) {

        $Exemplar = $Array[0]

        $PrefixEndsAt = $Exemplar.Length # Initialize
        0..$Exemplar.Length | ForEach {
            $Character = $Exemplar[$_]
            ForEach ($String in $Array) {
                If ($String[$_] -Eq $Character) {
                    # Match
                }
                Else {
                    $PrefixEndsAt = [Math]::Min($_, $PrefixEndsAt)
                }
            }
        }
        # Prefix
        $Exemplar.SubString(0, $PrefixEndsAt)
    }
    Else {
        # None
    }
    Return
}

Function Reverse-String {
    <#
    .SYNOPSIS Takes an input string ("Gates") and returns the character-by-character reversal ("setaG").
    #>

    Param (
        [Parameter(
            Mandatory = $True,
            ValueFromPipeline = $True
            )
        ]
        $String
    )

    $Array = $String.ToCharArray()
    [Array]::Reverse($Array)
    -Join($Array)
    Return
}

Function New-UniqueRootLookup {
    <#
    .SYNOPSIS Creates hash table that maps strings, particularly server names of the form [CommonPrefix][Root][CommonSuffix], to their unique Root.
    .DESCRIPTION For example, given ("Server-A2.Contoso.Local", "Server-B4.Contoso.Local", "Server-C6.Contoso.Local"), returns key-value pairs:
        {
        "Server-A2.Contoso.Local" -> "A2"
        "Server-B4.Contoso.Local" -> "B4"
        "Server-C6.Contoso.Local" -> "C6"
        }
    .PARAMETER Strings The keys of the hash table.
    #>

    Param (
        [Parameter(
            Mandatory = $True
            )
        ]
        [Array]$Strings
    )

    # Find Prefix

    $CommonPrefix = Find-LongestCommonPrefix $Strings

    # Find Suffix

    $ReversedArray = @()
    ForEach($String in $Strings) {
        $ReversedString = $String | Reverse-String
        $ReversedArray += $ReversedString
    }

    $CommonSuffix = $(Find-LongestCommonPrefix $ReversedArray) | Reverse-String

    # String -> Root Lookup

    $Lookup = @{}
    ForEach($String in $Strings) {
        $Lookup[$String] = $String.Substring($CommonPrefix.Length, $String.Length - $CommonPrefix.Length - $CommonSuffix.Length)
    }

    $Lookup
    Return
}

### SCRIPT... ###

$Nodes = Get-StorageSubSystem Cluster* | Get-StorageNode
$Drives = Get-StoragePool S2D* | Get-PhysicalDisk

$Names = @()
ForEach ($Node in $Nodes) {
    $Names += $Node.Name
}

$UniqueRootLookup = New-UniqueRootLookup $Names

$Output = @()

ForEach ($Drive in $Drives) {

    If ($Drive.BusType -Eq "NVMe") {
        $SerialNumber = $Drive.AdapterSerialNumber
        $Type = $Drive.BusType
    }
    Else { # SATA, SAS
        $SerialNumber = $Drive.SerialNumber
        $Type = $Drive.MediaType
    }

    If ($Drive.Usage -Eq "Journal") {
        $Size = $Drive.Size | ConvertTo-PrettyCapacity
        $Used = "-"
        $Percent = "-"
    }
    Else {
        $Size = $Drive.Size | ConvertTo-PrettyCapacity
        $Used = $Drive.VirtualDiskFootprint | ConvertTo-PrettyCapacity
        $Percent = ConvertTo-PrettyPercentage $Drive.VirtualDiskFootprint $Drive.Size
    }

    $Node = $UniqueRootLookup[($Drive | Get-StorageNode -PhysicallyConnected).Name]

    # Pack

    $Output += [PSCustomObject]@{
        "SerialNumber" = $SerialNumber
        "Type" = $Type
        "Node" = $Node
        "Size" = $Size
        "Used" = $Used
        "Percent" = $Percent
    }
}

$Output | Sort Used, Node | FT