Hello, Claus here again.
One of the most important decisions when creating a volume is choosing the resiliency settings. Resiliency protects data against failures, such as a failed drive or a failed server. It also keeps data available during maintenance, such as server hardware replacement or operating system updates. Storage Spaces Direct supports two resiliency types: mirror and parity.
Mirror resiliency is relatively simple. Storage Spaces Direct generates multiple copies of the same data block. By default, it generates 3 copies. Each copy is stored on a drive in a different server, providing resiliency to both drive and server failures. The diagram shows 3 data copies (A, A’ and A’’) laid out across a cluster with 4 servers.
Figure 1 3-copy mirror across 4 servers
Assume there is a failure on the drive in server 2 where A’ is written. A’ is regenerated by reading A or A’’ and writing a new copy of A’ to another drive in server 2 or to any drive in server 3. A’ cannot be written to drives in server 1 or server 4, because two copies of the same data are not allowed in the same server.
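The placement rule above can be sketched in a few lines of Python. This is an illustration only: the function name and the exact mapping of copies to servers 1, 2 and 4 are assumptions made for the example, not the actual Storage Spaces Direct allocator.

```python
def eligible_repair_servers(all_servers, copy_locations, failed_server):
    """Return the servers that may receive the regenerated copy.

    A server is eligible if it holds no surviving copy of the same data.
    The failed copy's own server stays eligible, because the new copy
    goes to a different, healthy drive in that server.
    """
    surviving = {s for s in copy_locations.values() if s != failed_server}
    return sorted(s for s in all_servers if s not in surviving)


# Assumed layout: A on server 1, A' on server 2, A'' on server 4.
copies = {"A": 1, "A'": 2, "A''": 4}
targets = eligible_repair_servers([1, 2, 3, 4], copies, failed_server=2)
print(targets)  # [2, 3]
```

Servers 1 and 4 are excluded because each already holds a surviving copy.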
If the administrator puts a server in maintenance mode, the corresponding drives also enter maintenance mode. While maintenance mode suspends IO to the drives, the administrator can still perform drive maintenance tasks, such as updating drive firmware. Data copies stored on the server in maintenance mode will not be updated, since IO is suspended. Once the administrator takes the server out of maintenance mode, the data copies on the server are updated using data copies from other servers. Storage Spaces Direct tracks which data copies changed while the server was in maintenance mode, to minimize data resynchronization.
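The idea of tracking changes during maintenance can be sketched roughly as follows. This is a toy model with hypothetical names (`MirrorCopyTracker`, integer region numbers); it only illustrates why resynchronization touches changed regions rather than the whole drive, and says nothing about how Storage Spaces Direct implements it internally.

```python
class MirrorCopyTracker:
    """Toy model: remember which regions changed while a server was paused."""

    def __init__(self):
        self.in_maintenance = False
        self.dirty_regions = set()

    def enter_maintenance(self):
        self.in_maintenance = True

    def record_write(self, region):
        # Writes still land on the other servers; the paused copy goes stale.
        if self.in_maintenance:
            self.dirty_regions.add(region)

    def exit_maintenance(self):
        """Return the regions to resynchronize, then clear the tracking state."""
        self.in_maintenance = False
        to_resync = sorted(self.dirty_regions)
        self.dirty_regions.clear()
        return to_resync


tracker = MirrorCopyTracker()
tracker.record_write(7)      # server is online, nothing to track
tracker.enter_maintenance()
tracker.record_write(3)
tracker.record_write(12)
tracker.record_write(3)      # repeated writes to a region are tracked once
resync = tracker.exit_maintenance()
print(resync)  # [3, 12]
```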
Mirror resiliency is relatively simple, which means it has great performance and does not have a lot of CPU overhead. The downside to mirror resiliency is that it is relatively inefficient, with 33.3% storage efficiency when storing 3 full copies of all data.
Parity resiliency is much more storage efficient than mirror resiliency. Parity resiliency uses parity symbols across a larger set of data symbols to drive up storage efficiency. Each symbol is stored on a drive in a different server, providing resiliency to both drive and server failures. Storage Spaces Direct requires at least 4 servers to enable parity resiliency. The diagram shows two data symbols (X1 and X2) and two parity symbols (P1 and P2) laid out across a cluster with 4 servers.
Figure 2 Parity resiliency across 4 servers
Assume there is a failure on the drive in server 2 where X2 is written. X2 is regenerated by reading the other symbols (X1, P1 and P2), recalculating the value of X2 and writing X2 to another drive in server 2. X2 cannot be written to drives in other servers, because two symbols from the same symbol set are not allowed in the same server.
Parity resiliency works similarly to mirror resiliency when a server is in maintenance mode.
Parity resiliency has better storage efficiency than mirror resiliency. With 4 servers the storage efficiency is 50%, and it can be as high as 80% with 16 servers. The downside of parity resiliency is twofold:
- Performing data reconstruction involves all of the surviving symbols. All symbols are read, which is extra storage IO. Lost symbols are recalculated, which consumes expensive CPU cycles, and written back to disk.
- Overwriting existing data involves all symbols. All data symbols are read, data is updated, parity is recalculated, and all symbols are written. This is also known as Read-Modify-Write and incurs significant storage IO and CPU cycles.
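To make the IO cost concrete, here is a small sketch that simply counts reads and writes the way the two bullets describe them, next to a 3-copy mirror overwrite for comparison. The function names are made up for this example, and real IO patterns depend on stripe alignment and caching, which this ignores.

```python
def parity_overwrite_io(data_symbols, parity_symbols):
    """Read-Modify-Write: read every data symbol, then write every symbol."""
    return {"reads": data_symbols, "writes": data_symbols + parity_symbols}


def parity_rebuild_io(data_symbols, parity_symbols, lost=1):
    """Reconstruction: read every surviving symbol, write back the lost ones."""
    return {"reads": data_symbols + parity_symbols - lost, "writes": lost}


def mirror_overwrite_io(copies=3):
    """A mirror overwrite reads nothing and writes each copy."""
    return {"reads": 0, "writes": copies}


# The 4-server layout from Figure 2: two data symbols, two parity symbols.
print(parity_overwrite_io(2, 2))  # {'reads': 2, 'writes': 4}
print(parity_rebuild_io(2, 2))    # {'reads': 3, 'writes': 1}
print(mirror_overwrite_io())      # {'reads': 0, 'writes': 3}
```

The gap widens with wider layouts: with six data symbols and two parity symbols, the same overwrite costs 6 reads and 8 writes.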
Local Reconstruction Codes
Storage Spaces Direct uses Reed-Solomon error correction (also known as erasure coding) for parity calculation in smaller deployments, for the best possible efficiency and resiliency to two simultaneous failures. A cluster with four servers has 50% storage efficiency and resiliency to two failures. With larger clusters, storage efficiency increases because there can be more data symbols without increasing the number of parity symbols. On the flip side, data reconstruction becomes increasingly inefficient as the total number of symbols (data symbols + parity symbols) increases, because all surviving symbols have to be read in order to calculate and regenerate the missing symbol(s). To address this, Microsoft Research invented Local Reconstruction Codes, which are used in Microsoft Azure and Storage Spaces Direct.
Local Reconstruction Codes (LRC) optimize data reconstruction for the most common failure scenario, which is a single drive failure. They do so by grouping the data symbols and calculating a single (local) parity symbol across each group using simple XOR. A global parity is then calculated across all the symbols. The diagram below shows LRC in a cluster with 12 servers.
Figure 3 LRC in a cluster with 12 servers
In the above example we have 11 symbols: 8 data symbols (X1, X2, X3, X4, Y1, Y2, Y3 and Y4), 2 local parity symbols (PX and PY), and one global parity symbol (Q). This particular layout is also sometimes described as (8,2,1), representing 8 data symbols, 2 groups and 1 global parity.
Inside each group the parity symbol is calculated as a simple XOR across the data symbols in the group. XOR is not a computationally intensive operation and thus requires few CPU cycles. Q is calculated using the data symbols and local parity symbols across all the groups. In this particular configuration, the storage efficiency is 8/11 or ~72%, as there are 8 data symbols out of 11 total symbols.
As mentioned above, a single failure is more common in storage systems than multiple failures. LRC is more efficient and incurs less storage IO when reconstructing data after a single device failure, and even in some multi-failure scenarios.
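Here is a minimal sketch of the local parity calculation and the efficiency arithmetic. The 4-byte payloads are made-up stand-ins for real data symbols, and `xor_symbols` is a helper written for this example. It also shows why XOR parity is convenient: XOR-ing the parity with the remaining members of the group gives back the missing one.

```python
from functools import reduce


def xor_symbols(symbols):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*symbols))


# Hypothetical 4-byte payloads standing in for the data symbols of group X.
group_x = {
    "X1": b"\x10\x20\x30\x40",
    "X2": b"\x01\x02\x03\x04",
    "X3": b"\xaa\xbb\xcc\xdd",
    "X4": b"\x0f\x0e\x0d\x0c",
}

px = xor_symbols(group_x.values())  # local parity for group X

# A lost X2 is recovered from the rest of its group plus PX: four reads.
rebuilt_x2 = xor_symbols([group_x["X1"], group_x["X3"], group_x["X4"], px])

# Storage efficiency of the (8,2,1) layout: data symbols over total symbols.
data_symbols, local_parities, global_parities = 8, 2, 1
efficiency = data_symbols / (data_symbols + local_parities + global_parities)

print(rebuilt_x2 == group_x["X2"], f"{efficiency:.1%}")  # True 72.7%
```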
Using the example from figure 3 above:
What happens if there is one failure, e.g. the disk that stores X2 fails? In that case X2 is reconstructed by reading X1, X3, X4 and PX (four reads), performing a simple XOR operation, and writing X2 (one write) to a different disk in server 2. Notice that none of the Y symbols or the global parity Q are read or involved in the reconstruction.
What happens if there are two simultaneous failures, e.g. the disk that stores X1 fails and the disk that stores Y2 also fails? In this case, because the failures occurred in two different groups, X1 is reconstructed by reading X2, X3, X4 and PX (four reads), performing an XOR operation, and writing X1 (one write) to a different disk in server 1. Similarly, Y2 is reconstructed by reading Y1, Y3, Y4 and PY (four reads), performing an XOR operation, and writing Y2 (one write) to a different disk in server 5. That is a total of eight reads and two writes. Notice that only simple XOR was involved in data reconstruction, which reduces the pressure on the CPU.
What happens if there are two failures in the same group, e.g. the disks that store X1 and X2 have both failed? In this case X1 is reconstructed by reading X3, X4, PX, Y1, Y2, Y3, Y4 and Q (eight reads), performing the erasure code computation, and writing X1 to a different disk in server 1. It is not necessary to read PY, since it can be calculated from Y1, Y2, Y3 and Y4. Once X1 is reconstructed, X2 can be reconstructed using the same mechanism described for one failure above, except no additional reads are needed.
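The same-group case is the only one that needs real erasure code math, so here is a rough sketch of it. Big caveat: the global parity below is an assumption made for illustration. It is a weighted sum of the eight data symbols over GF(2^8) with made-up coefficients 1 to 8, which is not necessarily how Storage Spaces Direct or the LRC papers construct Q. Symbols are single bytes to keep it short.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by x^8+x^4+x^3+x^2+1 (0x11D)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse by brute force; fine for a 256-element field."""
    for candidate in range(1, 256):
        if gf_mul(a, candidate) == 1:
            return candidate
    raise ZeroDivisionError("0 has no inverse")


NAMES = ["X1", "X2", "X3", "X4", "Y1", "Y2", "Y3", "Y4"]
# Assumed coefficients: distinct non-zero field elements, one per data symbol.
COEFF = {name: i + 1 for i, name in enumerate(NAMES)}


def encode(data):
    """Return (PX, Q): XOR local parity for group X and the assumed global parity."""
    px = data["X1"] ^ data["X2"] ^ data["X3"] ^ data["X4"]
    q = 0
    for name in NAMES:
        q ^= gf_mul(COEFF[name], data[name])
    return px, q


def rebuild_x1_x2(survivors, px, q):
    """Recover X1 and X2 from X3, X4, Y1..Y4 plus PX and Q (eight reads)."""
    s = px ^ survivors["X3"] ^ survivors["X4"]      # s = X1 ^ X2
    t = q
    for name, value in survivors.items():
        t ^= gf_mul(COEFF[name], value)             # t = c1*X1 ^ c2*X2
    c1, c2 = COEFF["X1"], COEFF["X2"]
    x1 = gf_mul(t ^ gf_mul(c2, s), gf_inv(c1 ^ c2))
    x2 = s ^ x1                                     # plain XOR, no extra reads
    return x1, x2


# Made-up one-byte symbol values.
data = dict(zip(NAMES, [0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0]))
px, q = encode(data)
survivors = {n: v for n, v in data.items() if n not in ("X1", "X2")}
recovered = rebuild_x1_x2(survivors, px, q)
print(recovered == (data["X1"], data["X2"]))
```

Note how the sketch mirrors the text: PY is never read, and once X1 is known, X2 falls out of the local XOR relation without touching the disks again.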
Notice how, in the example above, one server does not hold any symbols? This configuration allows reconstruction of symbols even in the case where a server has malfunctioned and is permanently retired, after which the cluster will effectively have only 11 servers until a replacement server is added.
The number of data symbols in a group depends on the cluster size and the drive types being used. Solid state drives perform better, so the number of data symbols in a group can be larger. The table below outlines the default erasure coding scheme (RS or LRC) and the resulting efficiency for hybrid and all-flash storage configurations in various cluster sizes.
| Servers | SSD + HDD scheme | SSD + HDD efficiency | All-flash scheme | All-flash efficiency |
|---|---|---|---|---|
| 4 | RS 2+2 | 50% | RS 2+2 | 50% |
| 5 | RS 2+2 | 50% | RS 2+2 | 50% |
| 6 | RS 2+2 | 50% | RS 2+2 | 50% |
| 7 | RS 4+2 | 66% | RS 4+2 | 66% |
| 8 | RS 4+2 | 66% | RS 4+2 | 66% |
| 9 | RS 4+2 | 66% | RS 6+2 | 75% |
| 10 | RS 4+2 | 66% | RS 6+2 | 75% |
| 11 | RS 4+2 | 66% | RS 6+2 | 75% |
| 12 | LRC (8,2,1) | 72% | RS 6+2 | 75% |
| 13 | LRC (8,2,1) | 72% | RS 6+2 | 75% |
| 14 | LRC (8,2,1) | 72% | RS 6+2 | 75% |
| 15 | LRC (8,2,1) | 72% | RS 6+2 | 75% |
| 16 | LRC (8,2,1) | 72% | LRC (12,2,1) | 80% |
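If you want to play with the numbers, the table can be expressed as a small lookup plus the efficiency formula (data symbols divided by all symbols). The thresholds are copied from the table; the function and variable names are made up for this sketch and are not a Storage Spaces Direct API.

```python
# (minimum servers, scheme), highest threshold first, taken from the table above.
HYBRID = [(12, ("LRC", 8, 2, 1)), (7, ("RS", 4, 2)), (4, ("RS", 2, 2))]
ALL_FLASH = [(16, ("LRC", 12, 2, 1)), (9, ("RS", 6, 2)),
             (7, ("RS", 4, 2)), (4, ("RS", 2, 2))]


def default_scheme(servers, all_flash=False):
    """Look up the default scheme for a cluster size of 4 to 16 servers."""
    if not 4 <= servers <= 16:
        raise ValueError("the table covers clusters of 4 to 16 servers")
    for minimum, scheme in (ALL_FLASH if all_flash else HYBRID):
        if servers >= minimum:
            return scheme


def efficiency(scheme):
    """Data symbols over total symbols; works for RS k+m and LRC (k, l, g)."""
    _kind, data, *parity = scheme
    return data / (data + sum(parity))


for servers in (4, 9, 12, 16):
    for flash in (False, True):
        scheme = default_scheme(servers, all_flash=flash)
        print(servers, "all-flash" if flash else "hybrid", scheme,
              f"{efficiency(scheme):.1%}")
```

The computed values for RS 4+2 and LRC (8,2,1) are 66.7% and 72.7%; the table shows them truncated to 66% and 72%.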
Accelerating parity volumes
In Storage Spaces Direct it is possible to create a hybrid volume. A hybrid volume is essentially a volume where part of the volume uses mirror resiliency and the rest uses parity resiliency.
Figure 4 Hybrid Volume
The purpose of mixing mirror and parity in the volume is to provide a balance between storage performance and storage efficiency. Hybrid volumes require the use of the ReFS on-disk file system as it is aware of the volume layout:
- ReFS always writes data to the mirror portion of the volume, taking advantage of the write performance of mirror
- ReFS rotates data into the parity portion of the volume when needed, taking advantage of the efficiency of parity
- Parity is only calculated when rotating data into the parity portion
- ReFS writes updates to data stored in the parity portion by placing the new data in the mirror portion and invalidating the old data stored in the parity portion, again to take advantage of the write performance of mirror
ReFS starts rotating data into the parity portion at 60% utilization of the mirror portion and gradually becomes more aggressive in rotating data as utilization increases. It is highly desirable to:
- Size the mirror portion to twice the size of the active working set (hot data) to avoid excessive data rotation
- Size the overall volume to always have 20% free space to avoid excessive fragmentation due to data rotation
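The two sizing guidelines and the 60% rotation threshold translate into simple arithmetic. Here is a rough sketch. The function names and the example numbers are made up, sizes are usable volume capacity rather than raw disk capacity, and it assumes the parity portion is simply whatever the mirror portion does not cover.

```python
def size_hybrid_volume(working_set_tb, total_data_tb):
    """Apply the two sizing guidelines from the list above.

    working_set_tb: size of the hot data.
    total_data_tb: all data the volume is expected to hold.
    Returns (mirror_tb, parity_tb, volume_tb).
    """
    mirror_tb = 2 * working_set_tb       # mirror portion = 2x the working set
    volume_tb = total_data_tb * 1.25     # data fills at most 80%, leaving 20% free
    parity_tb = max(volume_tb - mirror_tb, 0.0)
    return mirror_tb, parity_tb, volume_tb


def rotation_active(mirror_used_tb, mirror_tb, threshold=0.60):
    """True once mirror utilization reaches the 60% mark where rotation starts."""
    return mirror_used_tb / mirror_tb >= threshold


# Hypothetical workload: 2 TB of hot data, 16 TB of data overall.
mirror_tb, parity_tb, volume_tb = size_hybrid_volume(2, 16)
print(mirror_tb, parity_tb, volume_tb)        # 4 16.0 20.0
print(rotation_active(3.0, mirror_tb))        # True  (75% of the mirror used)
print(rotation_active(2.0, mirror_tb))        # False (50% of the mirror used)
```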
I hope this blog post helps provide more insight into how mirror and parity resiliency work in Storage Spaces Direct, how data is laid out across servers, and how data is reconstructed in various failure cases.
We also discussed how Local Reconstruction Codes (LRC) make data reconstruction more efficient by reducing both storage IO churn and CPU cycles, which helps the system return to a healthy state more quickly.
And finally we discussed how hybrid volumes provide a balance between the performance of mirror and the efficiency of parity.
Let me know what you think.
Until next time