Improving vSAN’s Data Resilience through Intelligent Detection of Failed Devices

VMware Global Support (GS) has unique insight into the common operational and technical issues that our customers experience. This also provides a front-row seat to how enhancements in vSAN materially benefit our customers. This post describes an enhancement that makes vSAN more intelligent in detecting and handling failed storage devices.

The primary charter of enterprise-class storage is the reliable availability of data. VMware looks for every opportunity to improve the availability of data, and regularly introduces subtle but important enhancements to failure handling. One such enhancement is how vSAN handles storage device errors, which is described in more detail below.

When vSAN encounters a failure that degrades resilience, it usually resynchronizes data to regain the level of resilience prescribed by the associated storage policy. This behavior is documented extensively in “Intelligent Rebuilds in vSAN.”

A system could experience an intermittent error or loss of connectivity to the device, or there can be a permanent error declared by the device in the form of a SCSI sense code. The actions to be taken from a permanent error are relatively simple because vSAN will know the future status of the failed device. Intermittent errors are a bit more challenging due to their transient nature. In vSAN 6.7 P02 and vSAN 7, new logic was introduced to minimize potentially unnecessary rebuild activity when a storage device experiences an intermittent issue.

Recommendation: Many I/O-related issues stem from outdated drive or controller firmware and drivers. Keeping ESXi, the controller driver, and the firmware up to date helps to avoid these issues altogether.

I/O Failure Classification

vSAN classifies I/O failures into two major types: permanent errors and transient errors. Any unrecovered error on write I/O activity is classified as a permanent error, whereas everything else, such as an I/O timeout, is considered a transient error.
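The classification rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this post, not vSAN code; the names `IOErrorClass` and `classify_io_failure`, and the `operation` and `unrecovered` parameters, are hypothetical.

```python
from enum import Enum


class IOErrorClass(Enum):
    PERMANENT = "permanent"
    TRANSIENT = "transient"


def classify_io_failure(operation: str, unrecovered: bool) -> IOErrorClass:
    """Classify a failed I/O using the rule described in the post.

    operation   -- "read" or "write" (hypothetical parameter)
    unrecovered -- True if the device reported an unrecovered error
    """
    # Only an unrecovered error on a write is treated as permanent.
    if operation == "write" and unrecovered:
        return IOErrorClass.PERMANENT
    # Everything else, such as an I/O timeout, is treated as transient.
    return IOErrorClass.TRANSIENT
```

For example, `classify_io_failure("write", False)` models a write that timed out without an unrecovered error and falls into the transient class.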

How Does it Work?

As I/O requests leave a guest VM, they traverse the various layers in vSAN before reaching the Pluggable Storage Architecture layer, or PSA. The PSA is the lowest layer in the hypervisor storage stack and exists in both three-tier architectures and vSAN. It is the PSA that interacts with the physical storage devices, as shown in Figure 1.

Figure 1. I/O traversal from vSAN to the storage devices

When a transient error occurs, the PSA retries the I/O (shown as step #3 in Figure 1) until the timeout of the vSAN layers is reached. If the timeout is reached, the I/O is deemed unsuccessful and is reported back to the lower layers of vSAN as a transient error.
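The retry-until-timeout behavior can be modeled roughly as follows. This is a conceptual sketch only: the actual PSA retry policy, intervals, and timeout values are not described in this post, so `retry_until_timeout`, its parameters, and the returned strings are all hypothetical.

```python
import time
from typing import Callable


def retry_until_timeout(
    issue_io: Callable[[], bool],
    timeout_s: float,
    retry_interval_s: float = 0.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry an I/O until it succeeds or the deadline passes.

    issue_io returns True on success and False on a transient failure.
    Returns "success" or "transient_error" (hypothetical labels).
    """
    deadline = clock() + timeout_s
    while True:
        if issue_io():
            return "success"
        # The timeout of the upper layers has been reached: give up and
        # report the I/O as a transient error.
        if clock() >= deadline:
            return "transient_error"
        sleep(retry_interval_s)
```

The `clock` and `sleep` parameters are injectable so that the logic can be exercised without waiting in real time.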

At this stage, the following log message is added to /var/log/vobd.log:

vSAN device 52ba602b-5cb7-07c8-d51c-fac0d669cd96 is being repaired due to I/O failures, and will be out of service until the repair is complete. If the device is part of a dedup disk group, the entire disk group will be out of service until the repair is complete.
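To find out which devices have entered this repair state, the log can be scanned for the message above. The following Python sketch matches only the message text shown in this post; it makes no assumption about the timestamp or prefix that precedes the message on each log line, and the helper name `devices_under_repair` is hypothetical.

```python
import re
from typing import Iterable, List

# Matches the message text quoted in the post; the UUID format is taken
# from the example shown there.
REPAIR_MSG = re.compile(
    r"vSAN device (?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}) "
    r"is being repaired due to I/O failures"
)


def devices_under_repair(log_lines: Iterable[str]) -> List[str]:
    """Return the unique vSAN device UUIDs reported as being repaired,
    in the order in which they first appear."""
    seen: List[str] = []
    for line in log_lines:
        match = REPAIR_MSG.search(line)
        if match and match.group("uuid") not in seen:
            seen.append(match.group("uuid"))
    return seen
```

On a host, this could be fed with the lines of the vobd log file, for example `devices_under_repair(open(path))` where `path` points at a copy of the log.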

In addition, the “needsRepair” flag is set to 1 for the affected disk:

[root@cs-tse-sm01:~] vsish -e get /vmkModules/plog/devices/naa.55cd2e404c2105aa/info
Device information {
   Disk Name:naa.55cd2e404c2105aa
   VMFS Volume Name:
   VSAN UUID:52c9b887-07cb-c32f-2d68-edc962f21d7f
   VSAN host UUID:582d79d9-4664-16e0-0735-0cc47ac73060
   …
   Device Recovery Progress:0
   Device Recovery Started:0
   Does the device need repair (unmount-mount)?:1
}
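If this output is captured to a file or variable, the flag can be extracted programmatically. The Python sketch below parses the "key:value" layout exactly as it appears in the output above; the function names `parse_device_info` and `needs_repair` are hypothetical, and fields elided by "…" in the post are simply skipped rather than guessed.

```python
from typing import Dict

NEEDS_REPAIR_KEY = "Does the device need repair (unmount-mount)?"


def parse_device_info(text: str) -> Dict[str, str]:
    """Parse the key:value lines of the device info output into a dict."""
    fields: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, the opening/closing braces, and lines without a colon
        # (such as an elision marker).
        if not line or line.endswith("{") or line == "}" or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def needs_repair(fields: Dict[str, str]) -> bool:
    """True if the needsRepair flag is reported as 1."""
    return fields.get(NEEDS_REPAIR_KEY) == "1"
```

Splitting on the first colon keeps any colons inside a value intact, which is a design assumption rather than something the output format guarantees.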

This flag is checked by Degraded Device Handling (DDH) every 10 minutes by default.

Figure 2. DDH device check

When DDH finds a disk with the needsRepair flag set to 1, it attempts to remount the device in question. The entire disk group is remounted if deduplication and compression are enabled, or if the device is used as the caching device for the disk group. If the remount operation is successful, the needsRepair flag is reset to 0.
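The decision described above can be summarized in a short Python model. It is a conceptual sketch rather than the DDH implementation: the `Device` class, the `remount_scope` and `ddh_check` functions, and the `remount` callback are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Device:
    name: str
    is_cache_device: bool = False
    needs_repair: bool = False


def remount_scope(device: Device, dedup_and_compression: bool) -> str:
    """Return "disk_group" or "device" depending on what must be remounted."""
    if dedup_and_compression or device.is_cache_device:
        return "disk_group"
    return "device"


def ddh_check(
    devices: Iterable[Device],
    dedup_and_compression: bool,
    remount: Callable[[Device, str], bool],
) -> List[Tuple[str, str]]:
    """One periodic pass: attempt a remount for every flagged device.

    remount(device, scope) is a caller-supplied stand-in that returns
    True when the remount succeeded.
    """
    attempted: List[Tuple[str, str]] = []
    for device in devices:
        if not device.needs_repair:
            continue
        scope = remount_scope(device, dedup_and_compression)
        attempted.append((device.name, scope))
        if remount(device, scope):
            # A successful remount clears the flag.
            device.needs_repair = False
    return attempted
```

In this model, a failed remount leaves the flag set, so the device is picked up again on the next pass.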

From an operational perspective, this functionality is seamless and aside from the log messages and the short unavailability of the disk group (assuming the repair is successful), it probably won’t even be noticed.

Avoiding Device Flapping from Transient Errors

Device “flapping” is a term used to describe when a device experiences frequent disconnect and reconnect events. This type of flapping can create other conditions, such as causing vSAN object components to go into an “absent” state intermittently, which could affect whether the object is compliant with the prescribed storage policy. Therefore, the logic includes a counter to limit how many times a remount and repair operation can be performed before the device is marked as permanently failed. The counter limits this repair operation on a device to three times per week.
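A counter of this kind can be sketched as a simple rate limiter. The Python example below uses the limit stated above and assumes a sliding one-week window; the post does not say whether vSAN uses a sliding or a fixed window, so that choice, along with the `RepairLimiter` class and its method names, is hypothetical.

```python
from collections import deque

WEEK_SECONDS = 7 * 24 * 3600


class RepairLimiter:
    """Allow at most max_repairs repair attempts per window (assumed sliding)."""

    def __init__(self, max_repairs: int = 3, window_s: float = WEEK_SECONDS):
        self.max_repairs = max_repairs
        self.window_s = window_s
        self._attempts = deque()  # timestamps of recent repair attempts

    def allow_repair(self, now: float) -> bool:
        """Record and allow a repair at time `now`, or return False when the
        limit is reached. A False result is the point at which the device
        would be marked as permanently failed."""
        # Forget attempts that have aged out of the window.
        while self._attempts and now - self._attempts[0] >= self.window_s:
            self._attempts.popleft()
        if len(self._attempts) >= self.max_repairs:
            return False
        self._attempts.append(now)
        return True
```

A caller would keep one limiter per device and consult it before each remount attempt.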

Summary

The enhancements described in this post improve vSAN’s ability to account for intermittent, transient error conditions of storage devices. Through more intelligent detection of failed storage devices, vSAN detects these failure conditions more accurately and minimizes the resources spent on potentially unnecessary resynchronization activity.

- Manuel Moser
