The Impact of a Storage Device Failure in vSAN ESA versus OSA

vSAN stores data resiliently so that it can tolerate a wide range of failures, such as host failures, storage device failures, and network partitions. When it identifies a failure, it automatically reconstructs or resynchronizes the affected, less resilient data so that the data regains its prescribed level of resilience.

Over the years, VMware has developed several mechanisms that resynchronize this data faster, more efficiently, and less invasively. But those enhancements couldn’t address one characteristic of the Original Storage Architecture (OSA) and its handling of a failed storage device: a relatively large boundary of impact.

One of the many design goals of the vSAN ESA was to reduce this boundary of impact on data and resources upon the occurrence of a storage device failure or maintenance event. Minimizing the failure or maintenance domain of a storage device not only reduces the amount of data impacted but also decreases the amount of data, time, and resources consumed to regain its prescribed level of resilience.

But this improvement leads to an inevitable question: how much better is the ESA at handling storage device failures than the Original Storage Architecture? Let's look at how the ESA achieves this improvement, and walk through a few simple examples to understand the degree of improvement the ESA provides over the OSA.

Shrinking the Boundary of Failure

Since the very first edition of vSAN, the concept of a disk group has been used in the OSA to provide storage resources to a vSAN cluster. Each disk group in a host consists of one higher-performing caching/buffering device, and one or more value-based capacity devices to provide a blend of performance and capacity. All I/O is funneled through the caching/buffering device before it is stored on a capacity device in the disk group.

The combination of high-performing NVMe storage devices paired with the architectural differences of the ESA allowed VMware to do away with the limitations and complexities of disk groups. This makes the ESA significantly easier to configure and operate, as there are no longer decisions on how many devices should be in a disk group, and which devices should be used as caching versus capacity. Simply claim the desired devices for use with vSAN ESA, and it takes care of the rest.

But there is another, possibly more important characteristic in how the ESA uses claimed storage devices. Every storage device claimed by vSAN ESA remains independent of the others. Unlike a disk group, where the capacity devices had a dependency on the caching device, and sometimes even on the other devices in the disk group, the NVMe-based storage devices used by vSAN ESA store data and metadata independently of each other. This means there is no dependency between storage devices on a host.
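The difference in dependency can be sketched as a small model. This is only an illustration of the behavior described in this post, not vSAN code: the DiskGroup class, the function names, and the device names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class DiskGroup:
    """Hypothetical model of an OSA disk group (illustration only)."""
    cache: str                      # the single caching/buffering device
    capacity: Tuple[str, ...]       # one or more capacity devices
    dedup_compression: bool = False


def osa_capacity_devices_lost(group: DiskGroup, failed: str) -> Set[str]:
    """Capacity devices taken offline when `failed` fails in an OSA disk group."""
    if failed == group.cache:
        # Losing the caching/buffering device takes out the whole disk group.
        return set(group.capacity)
    if failed in group.capacity:
        # With deduplication and compression enabled, a capacity device
        # failure has a similar, group-wide result.
        return set(group.capacity) if group.dedup_compression else {failed}
    return set()


def esa_devices_lost(devices: Tuple[str, ...], failed: str) -> Set[str]:
    """In the ESA, claimed devices are independent: only the failed one is lost."""
    return {failed} if failed in devices else set()


capacity = tuple(f"cap{i}" for i in range(6))
group = DiskGroup(cache="cache0", capacity=capacity)
print(len(osa_capacity_devices_lost(group, "cache0")))  # 6
print(len(esa_devices_lost(capacity, "cap3")))          # 1
```

The point of the model is that the OSA failure domain is a property of the disk group, while the ESA failure domain is the device itself.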

Figure 1. Storage devices claimed for use by the host in a cluster running vSAN ESA.

The approach used by the ESA shrinks the boundary of impact for both maintenance activities and failure scenarios of a discrete storage device. The examples below compare the impact of a discrete device failure in an OSA cluster with the same failure in an ESA cluster. This is a theoretical calculation, simplified for comparison and using rounded numbers for clarity. It assumes the following:

The hosts in the OSA cluster and ESA cluster use as similar hardware as possible.
The average capacity consumption of each device is about 50%.
Both environments store data using RAID-6 erasure coding, with at least 7 hosts in the cluster (ensuring sufficient capacity and/or fault domains to regain the prescribed level of resilience).
Each OSA host uses 6 capacity devices per disk group, with either 2 or 4 disk groups per host depending on the example.
The OSA uses an additional device per disk group acting as the caching/buffering device.
The device that fails on the OSA host is the caching/buffering device, although a similar result would occur if a capacity device failed while the deduplication and compression service is enabled in the OSA.
The effective performance of I/O processing in the ESA cluster is 2 times the performance of the hosts in the OSA cluster. Given the examples used, this is a conservative estimate.
Networking is not a bottleneck in either environment.
Other variables were omitted for clarity and simplicity of the calculation.

Example using 12 Storage Devices per Host

In the example shown in Figure 2, the hosts in the OSA cluster and ESA cluster each use twelve 8TB storage devices for capacity, providing about 96TB of raw capacity per host. The assumption that 50% of the capacity is used means there is about 48TB of data stored on each host in their respective clusters.

vSAN OSA Host: In this example of a storage device failure in a host within a cluster running the OSA, the percentage of host capacity impacted would be 50%, and the area of impact would be 24TB of data (6 capacity devices x 8TB x 50% used), as the failure of the caching/buffering device would take out an entire disk group. While the data would still be available, that 24TB would need to be resynchronized elsewhere in the cluster to regain the prescribed level of resilience.

vSAN ESA Host: In this example, any storage device claimed by vSAN could fail, and the area of impact would only be 4TB of data (one 8TB device at 50% used). That is an 83% decrease in the amount of data that needs to be resynchronized to regain its prescribed level of resilience. That is a massive saving on its own. But given the assumed 2x performance advantage the ESA has over the OSA, it would yield about a 92% reduction in the time to regain the prescribed level of resilience. In other words, it will take about 1/12th the time to regain the prescribed level of resilience in the ESA versus the OSA, for a similar failure of a storage device.
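The arithmetic behind these figures can be reproduced in a few lines. This is a minimal sketch using only the assumptions listed earlier in this post; the constant and function names are made up for illustration, and time to resynchronize is modeled simply as data divided by relative I/O performance.

```python
DEVICE_TB = 8                # capacity of each storage device
UTILIZATION = 0.5            # average capacity consumed per device
DEVICES_PER_DISK_GROUP = 6   # OSA capacity devices per disk group
ESA_SPEEDUP = 2.0            # assumed ESA I/O performance relative to OSA


def resync_comparison():
    """Return (osa_tb, esa_tb, data_reduction, time_reduction)."""
    # OSA: a failed caching/buffering device takes out the whole disk group.
    osa_tb = DEVICES_PER_DISK_GROUP * DEVICE_TB * UTILIZATION
    # ESA: only the failed device is affected.
    esa_tb = DEVICE_TB * UTILIZATION
    data_reduction = 1 - esa_tb / osa_tb
    # Simplified model: time is proportional to data / relative performance.
    time_ratio = (esa_tb / ESA_SPEEDUP) / osa_tb
    time_reduction = 1 - time_ratio
    return osa_tb, esa_tb, data_reduction, time_reduction


osa_tb, esa_tb, data_red, time_red = resync_comparison()
print(f"OSA data to resynchronize: {osa_tb:.0f}TB")
print(f"ESA data to resynchronize: {esa_tb:.0f}TB")
print(f"Reduction in data: {data_red:.0%}")
print(f"Reduction in time: {time_red:.0%}")
```

Under these assumptions the ratio of ESA to OSA resynchronization time works out to (4 / 2) / 24, or 1/12.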

Figure 2. Comparing the area of impact of a single device failure in the OSA versus the ESA using hosts with 12 storage devices.

Example using 24 Storage Devices per Host

In the example shown in Figure 3, the hosts in the OSA cluster and ESA cluster each use twenty-four 8TB storage devices for capacity, providing about 192TB of raw capacity per host. The assumption that 50% of the capacity is used means there is about 96TB of data stored on each host in their respective clusters.

vSAN OSA Host: Unlike the previous example, the percentage of host capacity impacted is reduced from 50% to 25% because of the use of 4 disk groups instead of 2. But the area of impact would still be 24TB of data, which would need to be resynchronized elsewhere in the cluster to regain the prescribed level of resilience.

vSAN ESA Host: In this example, any storage device claimed by vSAN could fail, and the area of impact would only be 4TB of data. Just as with the first example, that is an 83% decrease in the amount of data that needs to be resynchronized, and about a 92% reduction in the time to regain the prescribed level of resilience.
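To see why the absolute numbers stay the same while the share of the host changes, the sketch below computes the fraction of a host's capacity devices affected by a single failure for 2 and 4 disk groups. The function name is hypothetical, and the ESA percentages are derived from the assumptions in this post rather than quoted from it.

```python
def host_share_impacted(disk_groups, devices_per_group=6):
    """Fraction of a host's capacity devices affected by one device failure.

    Returns (osa_share, esa_share). The OSA case assumes the failed device
    is a caching/buffering device, which takes out one whole disk group.
    The ESA host is assumed to have the same number of capacity devices.
    """
    capacity_devices = disk_groups * devices_per_group
    osa_share = devices_per_group / capacity_devices
    esa_share = 1 / capacity_devices
    return osa_share, esa_share


for groups in (2, 4):
    osa, esa = host_share_impacted(groups)
    devices = groups * 6
    print(f"{devices} devices: OSA {osa:.0%} of host, ESA {esa:.1%} of host")
```

Adding disk groups shrinks the OSA percentage, but the amount of data behind one disk group does not change, so the 24TB versus 4TB comparison holds in both examples.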

Figure 3. Comparing the area of impact of a single device failure in the OSA versus the ESA using hosts with 24 storage devices.

So even though host configurations and other variables can change the resulting improvement, the ESA in vSAN can accommodate device failures or maintenance activities in a much more efficient and less impactful way.

Summary

The fastest resynchronizations are the ones that never happen. By minimizing the boundary of impact of a discrete storage device failure to just the device itself, the design of vSAN ESA offers yet another reason to use it for all your new clusters and cluster refreshes.

@vmpete
