Improving vSAN's Resilience against Unrecovered Read Errors on Devices

Resilience against errors and failures is one of the most important attributes of enterprise-class storage. vSAN is no exception: it is already highly resilient against many kinds of issues, and VMware continues to improve its existing functionality and error handling. This post describes an improvement to how vSAN reacts to and handles Unrecovered Read Errors encountered on flash devices.

Before diving into the improvement itself, let’s first recap what an Unrecovered Read Error is, when such an error can be hit, and how it’s reported in the logs.

An Unrecovered Read Error, or URE, is a type of Medium Error that occurs when a read is attempted against a bad block on a device. When a read IO is issued against a bad block, the hardware reports a Medium Error of the type URE up the stack to vSAN. In the vmkernel logs this shows up as a SCSI error:

2022-04-25T07:14:58.072Z cpu0:262165)ScsiDeviceIO: 4161: Cmd(0x4548c0eed888) 0x28, CmdSN 0xf from world 0 to dev "mpx.vmhba0:C0:T1:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0 Medium Error, LBA: 2048

The relevant parts of this message translate as follows:

Cmd … 0x28: 10-byte SCSI read command
D:0x2: Device-side error code Check Condition
0x3: Sense Key for Medium Error
0x11 0x0: Additional Sense Code and Qualifier for Unrecovered Read Error

The above values are defined in the T10 SCSI standard: https://www.t10.org/
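
The field translations above can be sketched as a small parser. This is a minimal illustration, not VMware tooling: the regular expression is derived only from the single sample line shown in this post, and the function name parse_scsi_error and the is_ure key are hypothetical. Other ESXi builds may format the line differently.

```python
import re

# Pattern built from the one sample vmkernel line in this post (an assumption).
SCSI_RE = re.compile(
    r'Cmd\(0x[0-9a-f]+\) (?P<opcode>0x[0-9a-f]+),.*?'
    r'to dev "(?P<dev>[^"]+)" failed '
    r'H:(?P<host>0x[0-9a-f]+) D:(?P<device>0x[0-9a-f]+) P:(?P<plugin>0x[0-9a-f]+) '
    r'Valid sense data: (?P<key>0x[0-9a-f]+) (?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+)',
    re.IGNORECASE,
)


def parse_scsi_error(line):
    """Return the decoded fields of a ScsiDeviceIO error line, or None if it does not match."""
    m = SCSI_RE.search(line)
    if m is None:
        return None
    # Every captured field except the device name is a hex number.
    fields = {k: (v if k == "dev" else int(v, 16)) for k, v in m.groupdict().items()}
    # URE = Check Condition (D:0x2) + Sense Key 0x3 + ASC/ASCQ 0x11/0x0.
    fields["is_ure"] = (
        fields["device"] == 0x2
        and fields["key"] == 0x3
        and fields["asc"] == 0x11
        and fields["ascq"] == 0x0
    )
    return fields


SAMPLE = ('2022-04-25T07:14:58.072Z cpu0:262165)ScsiDeviceIO: 4161: '
          'Cmd(0x4548c0eed888) 0x28, CmdSN 0xf from world 0 to dev '
          '"mpx.vmhba0:C0:T1:L0" failed H:0x0 D:0x2 P:0x0 '
          'Valid sense data: 0x3 0x11 0x0 Medium Error, LBA: 2048')

result = parse_scsi_error(SAMPLE)
print(result["dev"], "URE" if result["is_ure"] else "other error")
```

Checking all four values together (device status, Sense Key, ASC, and ASCQ) avoids flagging other Medium Errors that share the Sense Key but are not UREs.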

For NVMe devices a status message can be seen in the logs:

2022-04-25T10:25:51.450Z cpu24:2097243)0000:89:00.0 nvme_ScsiEnd: status:0281

The second digit (2) of the NVMe status defines the so-called Status Code Type and the last two digits (81, in hexadecimal) define the Status Code. Status Code Type 2 stands for “Media and Data Integrity Errors” and Status Code 81 stands for “Unrecovered Read Error”. These are defined in the NVMe specification here: https://nvmexpress.org/.
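
The NVMe status can be decoded in the same spirit. The sketch below assumes the four characters after "status:" are hexadecimal digits laid out as described above, with the Status Code Type in the low three bits of the second digit and the Status Code in the last two digits. The function name decode_nvme_status is hypothetical.

```python
import re

# Matches the status field of the sample nvme_ScsiEnd line from this post.
NVME_RE = re.compile(r'nvme_ScsiEnd: status:(?P<status>[0-9a-fA-F]{4})')


def decode_nvme_status(line):
    """Split the logged status into Status Code Type (SCT) and Status Code (SC)."""
    m = NVME_RE.search(line)
    if m is None:
        return None
    status = int(m.group("status"), 16)
    sct = (status >> 8) & 0x7   # Status Code Type: 3 bits above the Status Code
    sc = status & 0xFF          # Status Code: low 8 bits (the last two hex digits)
    return {"sct": sct, "sc": sc, "is_ure": sct == 0x2 and sc == 0x81}


SAMPLE = '2022-04-25T10:25:51.450Z cpu24:2097243)0000:89:00.0 nvme_ScsiEnd: status:0281'
decoded = decode_nvme_status(SAMPLE)
print(hex(decoded["sct"]), hex(decoded["sc"]), decoded["is_ure"])
```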

Such a URE can be encountered in two regions of a vSAN device: the vSAN metadata region and the region where the “user data” lives, such as VMDK objects, namespace objects, and so on.

There is no change in how vSAN handles UREs encountered in the user data region. When data from one component of an object with a storage policy of Failures To Tolerate, or FTT, of 1 or more can’t be read due to a URE, vSAN reads the data from the remaining healthy component(s) and rebuilds the bad data block for the affected component. This is also when you would see the message “vSAN detected and fixed a medium or checksum error for component …” in the logs.

On the other hand, UREs encountered in the vSAN metadata region need to be handled differently and this is where this new improvement in vSAN 7.0 U3c comes in.

Handling UREs in the vSAN Metadata Region with vSAN 7.0 U3c

If a URE is encountered in the vSAN metadata region, vSAN flags the affected disk as unhealthy and in need of data evacuation. If deduplication and compression is enabled, or if the affected device is the caching device for the disk group, the entire disk group is flagged instead. vSAN then attempts to move all objects with a storage policy of FTT=0 to another disk group. If an object with a storage policy of FTT=1 or more has its last active, healthy component on the affected disk or disk group, vSAN attempts to move that component to another disk group as well.

Once the mentioned components are evacuated, vSAN will remove the affected disk from its disk group, perform a trim operation on it, and then re-add the disk to its former disk group. If deduplication and compression is enabled or the affected disk is used as the caching device for the disk group, the whole disk group gets removed, a trim operation is performed on all involved disks, and then the disk group will be re-created.

The trim operation causes the device firmware to remap bad blocks to spare blocks on the device, which ensures that the device can be fully used again and vSAN won’t encounter a URE again at that particular address on the device.
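
The decisions described in this section can be summarized in a few lines of code. This is only a simplified model of the rules as stated in this post, not vSAN's implementation: the Component class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical stand-in for a vSAN component on the affected disk."""
    object_name: str
    ftt: int                     # Failures To Tolerate from the storage policy
    is_last_healthy_copy: bool   # True if no other active, healthy component exists


def rebuild_scope(dedup_enabled, is_cache_device):
    """Whole disk group if dedup/compression is on or the cache device is hit, else one disk."""
    return "disk group" if (dedup_enabled or is_cache_device) else "disk"


def components_to_evacuate(components):
    """FTT=0 objects always move; FTT>=1 objects move only if this is their last healthy copy."""
    return [c for c in components if c.ftt == 0 or c.is_last_healthy_copy]


components = [
    Component("vm-a.vmdk", ftt=0, is_last_healthy_copy=True),
    Component("vm-b.vmdk", ftt=1, is_last_healthy_copy=False),
    Component("vm-c.vmdk", ftt=1, is_last_healthy_copy=True),
]
print(rebuild_scope(dedup_enabled=False, is_cache_device=False))
print([c.object_name for c in components_to_evacuate(components)])
```

In this model, vm-b.vmdk stays where it is because another healthy component of the object exists elsewhere and can be used to resynchronize after the disk is re-added.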

What to look for in the H5 client and the logs

When the evacuation starts, the user is alerted to it in Skyline Health (Physical disk -> Operation health):

Once the evacuation is completed, the “Operation State Description” changes to “The disk is evacuated and will be rebuilt”:

In the vobd logs the following message is logged for the evacuation:

[vob.vsan.lsom.metadataURE] vSAN device <uuid> encountered unrecoverable read error. This disk will be evacuated and rebuilt. If the device is part of a dedup disk group, the entire disk group will be evacuated and rebuilt.

After the rebuild, one of the following messages appears in the vobd logs. The first is logged when a single device has been added back to its disk group, the second when an entire disk group has been rebuilt:

[vob.vsan.lsom.devicerebuild] Device <deviceID> added back successfully. Old UUID <uuid> New UUID <uuid>. 

[vob.vsan.lsom.diskgrouprebuild] Diskgroup <deviceID> rebuilt successfully. Old UUID <uuid> New UUID <uuid>.

Based on the above log messages you can also create custom alerts in vRealize Log Insight to get notified about such an operation straight away.
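
If vRealize Log Insight is not available, the same idea can be approximated with a short script. The sketch below is an assumption-laden illustration: it matches only the three message IDs quoted above, and the function name classify_vobd_lines and the short descriptions are hypothetical. Reading the actual vobd log file is left out.

```python
import re

# Message IDs taken from the vobd log excerpts in this post.
VOB_EVENTS = {
    "vob.vsan.lsom.metadataURE": "metadata URE detected, evacuation starting",
    "vob.vsan.lsom.devicerebuild": "device added back to its disk group",
    "vob.vsan.lsom.diskgrouprebuild": "disk group rebuilt",
}

VOB_RE = re.compile(r'\[(vob\.vsan\.lsom\.\w+)\]')


def classify_vobd_lines(lines):
    """Return (message_id, description) for every line carrying a known URE-related ID."""
    hits = []
    for line in lines:
        m = VOB_RE.search(line)
        if m and m.group(1) in VOB_EVENTS:
            hits.append((m.group(1), VOB_EVENTS[m.group(1)]))
    return hits


sample_lines = [
    "[vob.vsan.lsom.metadataURE] vSAN device <uuid> encountered unrecoverable read error.",
    "[vob.some.other.event] unrelated message",
    "[vob.vsan.lsom.devicerebuild] Device <deviceID> added back successfully.",
]
for message_id, description in classify_vobd_lines(sample_lines):
    print(message_id, "->", description)
```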

What if there is not enough space available on the remaining disk groups for the evacuation?

Enough spare capacity should normally be kept to tolerate the evacuation of an ESXi host. If there is nevertheless not enough space left for the evacuation, a guardrail ensures that the rebuild is not triggered prematurely.

In such a scenario vSAN will pause the evacuation and flag this in both Skyline Health and the logs:

[vob.vsan.lsom.UREEvacuationFailed] Evacuation has failed for device <deviceID> due to insufficient resources, rebuild will be skipped. Please make resource available.

vSAN will then wait for the user to make more space available. Once enough space is available, the evacuation resumes and the workflow continues as usual.
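
The pause-and-resume behavior can be pictured with a deliberately simplified model. It assumes that the only condition is total free capacity on the remaining disk groups, whereas real placement also has to respect storage policies and fault domains. The function name evacuation_state and the byte figures are hypothetical.

```python
def evacuation_state(bytes_to_move, free_bytes_per_disk_group):
    """Return "running" if the remaining disk groups can absorb the data, else "paused"."""
    total_free = sum(free_bytes_per_disk_group)
    return "running" if total_free >= bytes_to_move else "paused"


GIB = 1024 ** 3

# Not enough room yet: the evacuation is paused and the disk is not rebuilt.
before = evacuation_state(500 * GIB, [100 * GIB, 150 * GIB])

# After the user frees capacity, the same check lets the evacuation resume.
after = evacuation_state(500 * GIB, [300 * GIB, 250 * GIB])

print(before, after)
```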

Note: If you want to ensure enough capacity is reserved for a host rebuild in your environment, have a look at this blog post: Understanding “Reserved Capacity” Concepts in vSAN

Summary

The enhancement described in this post improves vSAN’s ability to handle situations where Unrecovered Read Errors are encountered due to bad blocks on devices and ensures that affected disks are evacuated as much as possible under the circumstances.
