vSAN Management, Monitoring & Hardware Testing

vSAN Management Tasks

Maintenance Mode

In this section we shall look at management tasks, such as the behavior when placing a host into maintenance mode and the evacuation of disks from a host. We will also look at how to turn the identifying LEDs on a disk drive on and off.

There are several options available when placing a host into maintenance mode. The first step is to identify a host that has a running VM:

Ensure that there is at least one VM running on the vSAN cluster. Then select the summary tab of the VM to see which host it is running on:


Then, navigate to Monitor > vSAN > Physical disk placement to show which hosts the data components reside on. Clicking on ‘Group components by host placement’ shows this more clearly:


We can see that the host 10.159.21.10 hosts both the running VM and several data components. This is the host that we shall place into maintenance mode.

Find the host in the inventory view. From the Actions menu, select Maintenance Mode, then select Enter Maintenance Mode:


On the screen that appears, we are presented with options for vSAN data migration. Here, click on ‘Go to pre-check’. This will enable us to safely test the various scenarios without affecting the system:


The pre-check can also be accessed by navigating to [cluster] > Monitor > Data Migration Pre-check.

We can see that there are three options available when entering maintenance mode on a vSAN host:

  • Full data migration: move all the data away from the affected resource before maintenance.
  • Ensure accessibility: first check for any issues before resources become (temporarily) unavailable. Only the data needed to keep objects accessible is moved (often none).
  • No data migration: perform no checks nor move any data.
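For hosts managed from the command line, the same three choices correspond to the vSAN mode passed to `esxcli system maintenanceMode set`. The sketch below is illustrative only: the helper name and the lookup table are made up for this example, and it only builds the argument list rather than running anything on a host.

```python
from typing import List

# UI option (lower-cased) -> vSAN mode string passed to esxcli
VSAN_MODES = {
    "full data migration": "evacuateAllData",
    "ensure accessibility": "ensureObjectAccessibility",
    "no data migration": "noAction",
}


def maintenance_mode_command(ui_option: str) -> List[str]:
    """Build the esxcli argument list that enters maintenance mode."""
    key = ui_option.strip().lower()
    if key not in VSAN_MODES:
        raise ValueError(f"unknown maintenance mode option: {ui_option!r}")
    return [
        "esxcli", "system", "maintenanceMode", "set",
        "--enable", "true",
        "--vsanmode", VSAN_MODES[key],
    ]


if __name__ == "__main__":
    # Show the command that matches the recommended option
    print(" ".join(maintenance_mode_command("Ensure accessibility")))
```

Building the command as a list (rather than a single string) keeps it safe to hand to a process launcher without shell quoting concerns.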

On the next screen, we can pick one of the options to test the scenario. For maintenance operations where the host is temporarily out of service, the recommendation is to select ‘ensure accessibility’. This will ensure that there are sufficient resources available to service the VM (albeit in a compromised state) for the duration of the vSAN cluster services timer. We can see this below:


As the host owns a number of data object components, those components will be unavailable while the host is offline. Here we trade full protection of the data objects for a relatively quick maintenance operation.

Perhaps unsurprisingly, if we choose ‘full data migration’ as the scenario (where all the components are moved away from the host before it enters maintenance mode), then the components remain fully protected:


Changing the maintenance mode option back to ‘ensure accessibility’, we can select ‘enter maintenance mode’ to observe what happens:


We receive a warning regarding migrating running VMs. If DRS is set to ‘fully automated’, the virtual machines should be automatically migrated.


After the host has entered maintenance mode, we can examine the state of the components. As expected, since the host is offline, some components are marked as ‘absent’. The VM and its data remain accessible.


You may also notice a component named ‘durability’. From vSAN 7 U1, an extra component was introduced to increase resiliency during offline operations. When a host enters maintenance mode, a durability component on another host starts capturing new writes. When the original host comes out of maintenance mode, these delta writes are applied back to the original components. This helps protect against any other failures while the host is in maintenance mode.

For more details on durability components, visit:
https://blogs.vmware.com/virtualblocks/2020/09/23/enhanced-durability-during-maintenance-mode-operations/

https://blogs.vmware.com/virtualblocks/2021/03/18/enhanced-data-durability-vsan-7-update-2/

 

 


To take the host out of maintenance mode, navigate to the host. From the actions menu select Maintenance Mode > Exit Maintenance Mode:


After exiting maintenance mode, the ‘absent’ components become ‘active’ once more (assuming that the host exited maintenance mode before the 60-minute vSAN cluster services timer expired).
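The role of the timer can be pictured with a small sketch. It is a deliberate simplification: the function name and the two outcome labels are invented for illustration, and the only value taken from the text is the 60-minute default.

```python
DEFAULT_TIMER_MINUTES = 60  # timer value quoted in the text


def absent_component_outcome(minutes_offline: float,
                             timer_minutes: float = DEFAULT_TIMER_MINUTES) -> str:
    """Return what is expected to happen to components marked 'absent'.

    'resync'  -> the host returned before the timer expired, so the
                 existing components only need to catch up.
    'rebuild' -> the timer expired, so the components are expected to be
                 rebuilt elsewhere in the cluster.
    """
    if minutes_offline < 0 or timer_minutes <= 0:
        raise ValueError("durations must be positive")
    return "resync" if minutes_offline < timer_minutes else "rebuild"


if __name__ == "__main__":
    for minutes in (15, 59, 60, 120):
        print(minutes, absent_component_outcome(minutes))
```

In practice this means that maintenance planned to last longer than the timer is a better fit for ‘full data migration’ than for ‘ensure accessibility’.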

We shall now place the host into maintenance mode once more, but this time we will choose ‘full data migration’. As the name suggests, this will move all data components from the affected host to the remaining hosts in the cluster (thus ensuring full availability during the maintenance operation).

Note that this is only possible when there are enough resources in the cluster to cater for the affected components: there must be enough hosts remaining to fulfil the storage policy requirements, along with enough space left on the hosts. When we run the pre-check again, we can see that around 2 GB of data will be moved, and that the VM will be fully protected (health status is fully green).
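The two conditions above (enough hosts, enough space) can be approximated in a few lines. This is a rough model rather than the logic of the actual pre-check: the `Host` class, the function names and the host names are hypothetical, free space is simply summed across hosts, and the policy is assumed to be RAID-1 mirroring, which needs 2n+1 hosts to tolerate n failures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    name: str
    free_gb: float
    in_maintenance: bool = False


def hosts_required_for_mirroring(failures_to_tolerate: int) -> int:
    # RAID-1 mirroring needs 2n+1 hosts to tolerate n failures
    return 2 * failures_to_tolerate + 1


def can_fully_evacuate(hosts: List[Host], target: str,
                       data_to_move_gb: float,
                       failures_to_tolerate: int = 1) -> bool:
    """Rough check: can 'target' be evacuated with full data migration?"""
    remaining = [h for h in hosts
                 if h.name != target and not h.in_maintenance]
    enough_hosts = len(remaining) >= hosts_required_for_mirroring(failures_to_tolerate)
    enough_space = sum(h.free_gb for h in remaining) >= data_to_move_gb
    return enough_hosts and enough_space


if __name__ == "__main__":
    cluster = [Host("esx-a", 500), Host("esx-b", 500),
               Host("esx-c", 500), Host("esx-d", 500)]
    print(can_fully_evacuate(cluster, "esx-a", 2))   # first host
    cluster[0].in_maintenance = True
    print(can_fully_evacuate(cluster, "esx-b", 2))   # second host
```

With four hosts and a mirroring policy that tolerates one failure, the first evacuation leaves three usable hosts, but a second one would leave only two, which is why a further host cannot follow with full data migration.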


As expected, all the components of our VM remain active and healthy. We see that some components have been moved to other hosts:


Now, if we try to place another host into maintenance mode, a warning will be shown. Safeguards are in place for when multiple hosts are requested to enter maintenance mode at the same time (along with similar scenarios where recent operations have caused resync activity). These prevent multiple unintended outages that may cause vSAN objects to become inaccessible.


If we look at the pre-check (for the full data migration scenario) for this additional host, we can see that this produces errors. The second host cannot enter maintenance mode with the current storage policy and resources available:

 

Ensure that you exit maintenance mode of all the hosts to restore the cluster to a fully functional state.

 

Remove a Disk

Here, we demonstrate how to remove a disk in vSAN.

Navigate to [vSAN cluster] > Configure > vSAN > Disk Management. Select a host and then click on ‘view host objects’ to confirm that there are VM objects on the host:


Below we can see that there is indeed VM data on the host 10.159.21.11. Once confirmed, click on ‘back to host list’:


Then click on ‘view disks’ on the host with VM data:


Depending on the architecture of vSAN deployed (OSA or ESA), disk groups or a disk pool will be shown.

 

vSAN ESA Cluster

Here, we can select any disk we’d like to remove.


Just like with host maintenance, we can choose between the three options:

  • Full data migration: move all the data away from the affected resource before maintenance.
  • Ensure accessibility: first check for any issues before resources become (temporarily) unavailable. Only the data needed to keep objects accessible is moved (often none).
  • No data migration: perform no checks nor move any data.

As before, with the ‘ensure accessibility’ scenario, we can see that some objects will become non-compliant with the storage policy. We can, however, remove or unmount the disk without affecting the running VM.


Clicking on remove or unmount will bring up a confirmation window:


Removing or unmounting the disk will cause the VM data components on it to go into the ‘absent’ state.

Remember to re-add any disks that were removed.

 

vSAN OSA Cluster

Here, we are presented with the disk groups on the host:


Expand one of the disk groups. Select a capacity disk and click on ‘go to pre-check’:


Just like with host maintenance, we can choose between the three options:

  • Full data migration: move all the data away from the affected resource before maintenance.
  • Ensure accessibility: first check for any issues before resources become (temporarily) unavailable. Only the data needed to keep objects accessible is moved (often none).
  • No data migration: perform no checks nor move any data.

As before, with the ‘ensure accessibility’ scenario, we can see that some objects will become non-compliant with the storage policy. We can, however, remove the disk without affecting the running VM:


Clicking on remove will bring up a confirmation window:


Removing the disk will cause the VM data components on it to go into the ‘absent’ state.

We can also do the same with the whole disk group. Navigate back to Configure > Disk Management, select the host again and click on ‘view disks’. This time, click on the ellipsis (three dots) to the left of ‘disk group’. This will bring up a list of options:


Click on ‘go to pre-check’. Again, we run the ‘ensure accessibility’ scenario and see that we can remove the whole disk group without affecting the running VM (although its objects become non-compliant with the storage policy).


Again, clicking on unmount, recreate or remove will bring up a confirmation dialog:


 

Removing or unmounting the disk group will cause the VM data components to go into the ‘absent’ state.

Remember to re-add any disks or disk groups that were removed.
 

Turning On/Off Disk LEDs

vSAN supports toggling disk locator LEDs natively for LSI controllers and some NVMe devices. Other controllers are supported via an installed utility (such as hpssacli when using HP controllers) on each host. Refer to vendor documentation for information on how to locate and install this utility.
For Intel NVMe devices specifically, see https://kb.vmware.com/s/article/2151871

To toggle a locator LED, select a disk and click on the ellipsis (three dots) above the table:


This will launch a vCenter task (for instance, ‘turn on disk locator LEDs’). To see if the task was successful, go to the ‘monitor’ tab and check the ‘events’ view. If there is no error, the task was successful. Obviously, a physical inspection of the drive will show the state of the LED.

 

Lifecycle Management

Lifecycle management is performed via vSphere Lifecycle Manager (vLCM). This builds on the previous generation (vSphere Update Manager) with many new features. vLCM operates at the cluster level using a ‘desired-state’ model: it attempts to reconcile the system to the prescribed settings and remediates any drift (in adherence with the VMware Compatibility Guide). This reduces the effort to monitor compliance for individual components and helps maintain a consistent state for the entire cluster. Moreover, vLCM provides lifecycle management for both the hypervisor and the full stack of drivers and firmware.

Note that vSphere Update Manager (VUM) has been deprecated as of vSphere 8.0. See the following support article for more information: https://kb.vmware.com/s/article/89519

 

Using vLCM to set the desired image for a vSAN cluster

There are three prerequisites for using vLCM:

  • All hosts are at version 7.0 or higher
  • Hosts need to be from the same vendor
  • Hosts need to have a local store (should not be stateless)
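As a planning aid, the three prerequisites can be checked against an inventory list before switching a cluster to a single image. The sketch below works on plain data; the `HostInfo` fields and the function name are hypothetical and are not part of any VMware API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HostInfo:
    name: str
    esxi_version: str   # for example "8.0.1"
    vendor: str
    stateless: bool     # True when the host has no local store


def vlcm_prerequisite_issues(hosts: List[HostInfo]) -> List[str]:
    """Return a list of human-readable problems; empty means all checks pass."""
    issues = []
    for host in hosts:
        major = int(host.esxi_version.split(".")[0])
        if major < 7:
            issues.append(f"{host.name}: ESXi {host.esxi_version} is older than 7.0")
        if host.stateless:
            issues.append(f"{host.name}: stateless host (no local store)")
    vendors = sorted({host.vendor for host in hosts})
    if len(vendors) > 1:
        issues.append("hosts come from more than one vendor: " + ", ".join(vendors))
    return issues


if __name__ == "__main__":
    inventory = [
        HostInfo("esx-a", "8.0.1", "VendorA", False),
        HostInfo("esx-b", "6.7.0", "VendorB", True),
    ]
    for issue in vlcm_prerequisite_issues(inventory):
        print(issue)
```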

A vLCM desired state image consists of a base ESXi image (required), plus any vendor addons and firmware and driver addons:

  • Base Image: The desired ESXi version, which can be pulled from the vSphere depot or manually uploaded.
  • Vendor Addons: Packages of vendor specified components such as firmware and drivers.

With vSphere 8.0, when creating a cluster, the option to manage hosts with a single image is pre-selected.

For an existing cluster created without using this option, navigate to [cluster] > Updates and click on ‘manage with a single image’:


Here, we can either choose to set up an image with pre-existing versions and addons, or import an image spec via a JSON file or URL:

 
For further details of setting up and using vLCM, visit:
https://core.vmware.com/resource/introducing-vsphere-lifecycle-management-vlcm

vLCM using Hardware Support Manager (HSM)

In the previous section, an image was created to be used by vLCM to continuously check against and reach the desired state. However, this step only covers the configuration of the ESXi image. To fully take advantage of vLCM, repositories can be configured to obtain firmware and drivers, among others, by leveraging the Hardware Support Manager provided by the vendor.

VMware maintains a compatibility list of HSMs here: https://www.vmware.com/resources/compatibility/search.php?deviceCategory=hsm

In this example, Dell OpenManage Integration for VMware vCenter (OMIVV) is used. Deploying and configuring the HSM is not covered in this guide, as this varies by vendor.

Overview of steps within the HSM prior to vLCM integration (steps may vary):

  • Deploy HSM appliance and register plugin with vCenter
  • Configure host credentials through a cluster profile
  • Configure repository profile (where vLCM will get firmware and drivers)

First navigate to [Cluster] > Updates, then click on ‘edit’:

Then, click on ‘select’ next to ‘firmware and drivers addon’:

Select the desired HSM, then select the firmware and driver addon (the profile previously created in the HSM), and then save the image settings.


An image compliance check will then start, and the option to remediate will become available:


 

Scale Out vSAN

Add a Host into the Cluster

Here, we will add an extra host to our vSAN cluster. First, look at the existing capacity on the cluster. Navigate to [Cluster] > Monitor > vSAN > Capacity:


After installing the correct ESXi version on the host, add it to vCenter and place it in maintenance mode. Ensure that VMkernel adapters for vSAN (and ideally vMotion) have been created on the host, and that IP addresses have been assigned. If a distributed switch is used in the cluster, add this host to the switch.


Having verified that the networking is configured correctly, select the new host in the inventory. From the ‘actions’ menu, select ‘move to’ and select the cluster to place the host into:


On the next screen, accept the default option of not creating a new resource pool:


The host will then be placed into the cluster. Next, navigate to the Hosts and Clusters view and exit maintenance mode. Verify that the cluster now contains the new node.


In this example, we see that there are now four hosts in the cluster. However, you will also notice from the Capacity view that the total and free capacity of the vSAN datastore have not changed. This is because vSAN does not claim any of the new disks automatically. These disks will need to be claimed as an additional step.

At this point, it would be good practice to re-run the health check tests, under [Cluster] > Monitor > vSAN > Skyline Health and address any issues seen. In particular, verify that the host appears in the same network partition group as the other hosts in the cluster. This can be seen with the ‘vSAN cluster partition’ test:


Here we have added a host using the manual procedure for greater clarity. Another way to achieve this would be to use cluster Quickstart: [Cluster] > Configure > Quickstart:


For more information on Quickstart, visit:
https://core.vmware.com/resource/cluster-quickstart

 

Claiming Disks on the New Host

Navigate to [cluster] > Configure > vSAN > Disk Management and click on ‘claim unused disks’:

The procedure varies here for OSA (where disk groups are used) and ESA (storage pool model) clusters.

OSA Cluster

Select the appropriate tier for each type of disk (cache or capacity). The number of cache disks determines the number of disk groups created, i.e., one cache disk per disk group. In the example below, we have two cache disks and four capacity drives (thus the resultant configuration will be two disk groups, each with two capacity disks). Clicking on ‘create’ initiates the disk group creation.
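The arithmetic behind the example (two cache disks and four capacity disks giving two groups of two) can be sketched as follows. The round-robin assignment is an assumption made for illustration; which capacity disk actually lands in which group is decided by vSAN, and the device names are invented.

```python
from typing import Dict, List


def plan_disk_groups(cache_disks: List[str],
                     capacity_disks: List[str]) -> Dict[str, List[str]]:
    """Return one disk group per cache disk, keyed by the cache disk name.

    Capacity disks are spread round-robin across the groups (assumption).
    """
    if not cache_disks:
        raise ValueError("at least one cache disk is required")
    groups: Dict[str, List[str]] = {cache: [] for cache in cache_disks}
    for index, disk in enumerate(capacity_disks):
        cache = cache_disks[index % len(cache_disks)]
        groups[cache].append(disk)
    return groups


if __name__ == "__main__":
    layout = plan_disk_groups(["cache0", "cache1"],
                              ["cap0", "cap1", "cap2", "cap3"])
    for cache, members in layout.items():
        print(cache, members)
```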



ESA Cluster

Select the disks to add to the storage pool and click on ‘create’:


Monitor the creation task (for either type of cluster) in the cluster events view. Verify that the disks have been claimed, and the ‘health’ is green:


 

Verify New vSAN Datastore Capacity

The final step is to ensure that the vSAN datastore has now grown with the addition of the new disks. Return to the capacity view by navigating to [cluster] > Monitor > vSAN > Capacity and examine the total and free capacity fields:


 

Monitoring vSAN

To effectively monitor vSAN, there are several elements that need consideration. Below we will look at the overall health and capacity views, resynchronization and rebalance operations, and the performance metrics available in vCenter.

Overall vSAN Health

For a quick summary of the health of a vSAN cluster, vSAN Skyline Health provides a consolidated list of health checks. These checks are grouped into several categories, such as hardware compatibility, physical disk health and networking.

Navigate to [vSAN Cluster] > Monitor > vSAN > Skyline Health. This will show the holistic health state of the cluster, along with any alerts. When multiple issues occur (and many alerts are generated), the system will try to list the primary issue affecting the cluster:


Selecting the issue and clicking on ‘info’ will show more information about the problem. Further, clicking on ‘ask VMware’ will open a knowledgebase article on how to fix it:


 

More information about this is available here:

https://blogs.vmware.com/virtualblocks/2019/07/18/working-with-vsan-health-checks/

 

 

 

vSAN Capacity

vSAN storage capacity usage may be examined by navigating to [vSAN Cluster] > Monitor > vSAN > Capacity.

This view provides a summary of current vSAN capacity usage and displays historical capacity usage information when Capacity History is selected. From the default view, a breakdown of capacity usage per object type is presented. In addition, a capacity analysis tool is available that estimates the effective free space remaining with respect to individual storage policies.

  
 

Note that beginning with vSAN 7.0, the vSAN UI distinguishes vSphere Replication objects within the capacity view.

Prior to vSAN 7 U1, VMware recommended reserving 25-30% of total capacity for use as “slack space”. This space is utilized during operations that temporarily consume additional storage space, such as host rebuilds, maintenance mode, or when VMs change storage policies. Beginning with vSAN 7 U1, this concept has been replaced with “capacity reservation”.

An improved methodology for calculating the amount of capacity set aside for vSAN operations yields significant gains in capacity savings (up to 18% in some cases). Additionally, the vSAN UI makes it simple to understand what amount of capacity is being reserved for temporary operations associated with normal usage, versus for host rebuilds (one host of capacity reserved for maintenance and host failure events).

This feature should be enabled during normal vSAN operations. To enable this new feature:

Click Reservations and Alerts and toggle the Operations Reserve and the Host Rebuild Reserve options. With ‘customize alerts’ custom thresholds can be set:


 
Note that when Operations Reserve and Host Rebuild Reserve are enabled, “soft” thresholds are implemented that will attempt to prevent over-consumption of vSAN datastore capacity. In addition to triggering warnings/alerts in vSphere when capacity utilization is in danger of consuming space set aside as reserved, once the capacity threshold is met, operations such as provisioning new VMs, virtual disks, FCDs, clones, iSCSI targets, snapshots, file shares, or other new objects consuming vSAN datastore capacity will not be allowed.

Note that I/O activity for existing VMs and objects will continue even if the threshold is exceeded, ensuring that current workloads remain available and functioning as expected.

As VMs will continue to be able to write to provisioned space, it is important that administrators monitor for capacity threshold alerts and take action to free up (or add) capacity to the vSAN cluster before capacity consumption significantly exceeds the set thresholds.
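The effect of the two reserves can be illustrated with simple arithmetic. The sketch below assumes that the provisioning threshold is the total capacity minus both reserves; the real calculation is performed by vSAN and is not described here, and all names and figures are hypothetical.

```python
def provisioning_threshold_gb(total_gb: float,
                              operations_reserve_gb: float,
                              host_rebuild_reserve_gb: float) -> float:
    """Capacity that can be consumed before new provisioning is refused
    (assumed to be total capacity minus both reserves)."""
    return total_gb - operations_reserve_gb - host_rebuild_reserve_gb


def new_provisioning_allowed(used_gb: float, threshold_gb: float) -> bool:
    """New objects (VMs, disks, snapshots, ...) are refused once the
    threshold is met; I/O for existing objects is not affected."""
    return used_gb < threshold_gb


if __name__ == "__main__":
    # Four hosts of 10,000 GB each; one host of capacity kept for rebuilds,
    # plus an illustrative 4,000 GB operations reserve.
    threshold = provisioning_threshold_gb(40_000, 4_000, 10_000)
    print("threshold:", threshold)
    for used in (20_000, 26_000, 30_000):
        print(used, new_provisioning_allowed(used, threshold))
```

Because existing VMs can keep writing beyond the threshold, a monitoring script built along these lines should raise an alert well before `used_gb` reaches it.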

vSAN 7.0 U2 introduced additional monitoring capabilities for oversubscription on the vSAN datastore. Within the vCenter UI, an estimate of the capacity required if thin-provisioned objects were fully provisioned has been added to the monitoring summary at vSAN Datastore > Monitor > vSAN > Capacity.

 

Resync Operations

Another very useful view is the [vSAN Cluster] > Monitor > vSAN > Resyncing Objects view. This will display any resyncing or rebalancing operation that might be taking place on the cluster. For example, if there was a device failure, resyncing or rebuilding activity could be observed here. Resync can also happen if a device was removed or a host failed, and the CLOMd (Cluster Logical Object Manager daemon) timer expired. The Resyncing Objects dashboard provides details of the resync status, amount of data in transit, and estimated time to completion.

With regard to rebalancing, vSAN attempts to keep all physical devices at less than 80% capacity. If any physical device capacity passes this threshold, vSAN will move components from this device to other devices in the cluster to rebalance the physical storage.
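The 80% rule lends itself to a simple check when reviewing per-device usage figures exported from elsewhere. The sketch below is only a model of the rule as stated; the input format and device names are hypothetical.

```python
from typing import Dict, List, Tuple

REBALANCE_THRESHOLD = 0.80  # fraction of device capacity, as stated in the text


def devices_over_threshold(devices: Dict[str, Tuple[float, float]],
                           threshold: float = REBALANCE_THRESHOLD) -> List[str]:
    """Return the names of devices whose used/capacity ratio exceeds the threshold.

    'devices' maps a device name to a (used_gb, capacity_gb) pair.
    """
    return sorted(
        name
        for name, (used_gb, capacity_gb) in devices.items()
        if capacity_gb > 0 and used_gb / capacity_gb > threshold
    )


if __name__ == "__main__":
    usage = {
        "disk-a": (850.0, 1000.0),   # 85% used
        "disk-b": (400.0, 1000.0),   # 40% used
        "disk-c": (800.0, 1000.0),   # exactly 80%, not over the threshold
    }
    print(devices_over_threshold(usage))
```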

In an ideal state, no resync activity should be observed, as shown below. 

 
Resyncing activity usually indicates:

  • Failure of a device or host in the cluster
  • Device has been removed from the cluster
  • Physical disk has greater than 80% of its capacity consumed
  • Policy change has been implemented which necessitates a rebuilding of a VM’s object layout. In this case, a new object layout is created and synchronized with the source object, and the source object is then discarded

vSAN 7.0 also introduces visibility of vSphere replication object types within the Virtual Objects view, allowing administrators to clearly distinguish replica data from other data types.


 

Performance Monitoring

The performance monitoring service can be used for verification of performance as well as quick troubleshooting of performance-related issues. Performance charts are available at several levels:

  • Cluster
  • Hosts
  • Virtual Machines and Virtual Disks
  • Disk groups
  • Physical disk

A detailed list of performance graphs and descriptions can be found here:
https://kb.vmware.com/s/article/2144493 (part 1)
https://kb.vmware.com/s/article/83039 (part 2)

The performance service should be enabled by default when a vSAN cluster is created in vCenter. In case it is not, enable the performance monitoring service by navigating to [Cluster] > Configure > vSAN > Services and clicking on ‘Edit’:

 


Once the service has been enabled, performance statistics can be viewed from the performance menus in vCenter. In the following example, we will examine IOPS, throughput, and latency from the Virtual Machine level and the vSAN Backend level.


vSAN Cluster Performance Graphs

To access cluster-level performance graphs, navigate to [Cluster] > Monitor > Performance. Choose an appropriate time frame and click ‘show results’:


To access the vSAN Backend performance metrics, select the BACKEND tab from the menu at the top:

The FILE SHARE tab shows information on file share performance (note that the file service is currently available for vSAN OSA only):


vSAN 7 U1 introduced some new features around performance monitoring. First, it is easier to compare VM performance. From the cluster level, click Monitor and then Performance. Now we can look at the cluster level or show specific VMs (up to 10 at a time). This makes it easy to compare IOPS, throughput, and latency for multiple VMs:


We can also look at the ‘top contributors’ to performance (the defined metrics are for latency, IOPS and throughput, from VM, host frontend and host backend).

Here we look at the VM write latency (over a 24-hour period): 


The BACKEND tab shows various holistic metrics. In particular, any latency spikes or congestion (for example, due to failing hardware) are easily spotted here:


 

vSAN Host Performance Graphs

In addition to the cluster level, further performance detail per host can be found by navigating to [Host] > Monitor > vSAN > Performance. This includes metrics for the backend cache, physical adapters, and host network. In particular, the physical adapter view can be useful in troubleshooting network issues:


 

IOInsight

To capture a deeper level of metrics, IOInsight gathers traces from the hosts. Navigate to [Cluster] > Monitor > vSAN > Performance and select the IOINSIGHT tab. Then click on NEW INSTANCE:


Then select the targets to monitor. Here we have selected all the hosts in the cluster:


Next, select the duration. The default is 10 minutes:


Then proceed to ‘Review’ and click on ‘Finish’ to start IOInsight. We can then see that the metric gathering has started, and the time remaining:


Once completed, click on the ellipsis (three dots) and ‘View Metrics’:


The results are filtered by VM. Select a VM of interest to see detailed metrics, such as IO size and latency distribution, IO randomness and read/write ratio.
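To make two of these metrics concrete, the sketch below derives a read/write ratio and an IO size distribution from a list of IO records. IOInsight computes these itself from host traces; the record format used here, a list of (operation, size in bytes) pairs, is purely hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize_io(records: Iterable[Tuple[str, int]]) -> Dict[str, object]:
    """Summarize (operation, size_bytes) records.

    Returns the read and write percentages and a histogram of IO sizes
    bucketed in KiB.
    """
    operations = Counter()
    size_histogram = Counter()
    for operation, size_bytes in records:
        operations[operation] += 1
        size_histogram[size_bytes // 1024] += 1
    total = operations["read"] + operations["write"]
    read_pct = 100.0 * operations["read"] / total if total else 0.0
    write_pct = 100.0 * operations["write"] / total if total else 0.0
    return {
        "read_pct": read_pct,
        "write_pct": write_pct,
        "size_kib_histogram": dict(size_histogram),
    }


if __name__ == "__main__":
    trace = [("read", 4096), ("read", 4096), ("write", 8192), ("read", 65536)]
    print(summarize_io(trace))
```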


 

 

I/O Trip Analyzer

I/O Trip Analyzer is a per-VM tool used to obtain a breakdown of latencies from the vSAN stack. To launch an instance, navigate to [VM] > Monitor > vSAN > I/O Trip Analyzer and click on RUN NEW TEST:

Set the time to analyze (the default is five minutes) and click on ‘RUN’ to start the test. Once the test is complete, click on VIEW RESULT:

This will then show a map of how the virtual disk in the host interacts with the network adapters and physical disks:

Clicking on any of the elements will bring up performance details for that object, for instance:


Advanced Statistics

In day-to-day operations, the graphs above should be sufficient for most purposes. To view advanced and debug information, navigate to [Cluster] > Monitor > vSAN > Support and click on PERFORMANCE FOR SUPPORT:

The metrics here are extensive; vSAN layers can be examined, along with network and CPU stats.

 

Advanced Performance Monitoring using vsantop

Beginning in vSphere 6.7 Update 3, a new command-line utility, vsantop, was introduced. This utility is focused on monitoring vSAN performance metrics at an individual ESXi host level. Traditionally with ESXi, an embedded utility called esxtop was used to view real-time performance metrics. This utility assisted in ascertaining the resource utilization and performance of the system.

Similar to esxtop, vsantop collects and persists statistical data in a RAM disk. Based on the configured interval rate, the metrics are displayed on the secure shell console (this interval is configurable, dependent on the amount of detail required). The workflow is illustrated below for a better understanding:


To initiate a vsantop session, open an SSH session to a host and run ‘vsantop’. The default view shows the cluster manager (CMMDS) output. To select another field, type the letter ‘E’ (for entity), which will bring up a menu to choose other views (note, it may take a while for data to populate):

For example, here we can see the vSAN ESA disk layer statistics:

For more information on vsantop, visit:

Monitoring vSAN through Integrated vRealize Operations Manager (Aria Operations Manager) in vCenter

Further metrics and detail can be seen through vRealize Operations Manager dashboards, now available as an integrated view in vCenter. This is enabled by deploying a vRealize Operations VM (or connecting to an existing instance).

You can initiate the workflow by navigating to Menu > vRealize Operations as shown below:

Fill out the details as required for the connection to vCenter and VM details:


Once complete, deployment of the VM and installation will commence. After the process is complete, you can access the predefined dashboards as shown below, using the ‘quick links’ menu:


 

The following out-of-the-box dashboards are available for monitoring purposes:

  • vCenter - Overview
  • vCenter - Cluster View
  • vCenter - Alerts
  • vSAN - Overview
  • vSAN - Cluster View
  • vSAN - Alerts

For example, in the vSAN cluster view, useful metrics such as performance and capacity are shown. This allows administrators to quickly assess the state of the cluster:

 


For further information, please review:
https://core.vmware.com/resource/vrealize-operations-and-log-insight-vsan-environments 
 

VMware Aria Operations documentation is available here:
https://docs.vmware.com/en/vRealize-Operations/index.html

 

 

 

Testing Hardware Failures

Understanding Expected Behaviors


When conducting any failure testing, it is important to consider the expected outcome before the test is conducted. For each test described in this section, first read the preceding description to understand how the test will affect the system.

Note: Test only one scenario at a time, and restore the system completely before moving on to the next test condition.

As with any system design, a configuration is built to tolerate a certain level of availability and performance. It is important that each test is conducted within the limit of the design systematically. By default, VMs deployed on vSAN inherit the default storage policy, with the ability to tolerate one failure. When a second failure is introduced without resolving the first, the VMs will not be able to tolerate the second failure and may become inaccessible. It is important that you resolve the first failure or test within the system limits to avoid such unexpected outcomes.

VM Behavior when Multiple Failures Encountered

A VM remains accessible when a full mirror copy of its objects is available, along with more than 50% of the components that make up the VM (to maintain quorum).
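The accessibility rule above can be sketched in a few lines of Python. This is a simplified model of a mirrored object, not vSAN's actual implementation: the `Component` fields (`replica`, `votes`, `available`) are hypothetical names chosen for illustration, and the sketch assumes quorum is counted in votes across all components of the object.

```python
from dataclasses import dataclass


@dataclass
class Component:
    replica: int     # index of the mirror copy this component belongs to (-1 for a witness)
    votes: int       # votes the component contributes towards quorum (assumed model)
    available: bool  # False when the component is absent or degraded


def is_object_accessible(components):
    """Return True when a full mirror copy is intact AND more than 50% of votes are live."""
    total_votes = sum(c.votes for c in components)
    live_votes = sum(c.votes for c in components if c.available)
    has_quorum = live_votes * 2 > total_votes

    # A replica counts as a full copy only if every one of its components is available.
    replica_ids = {c.replica for c in components if c.replica >= 0}
    has_full_copy = any(
        all(c.available for c in components if c.replica == r)
        for r in replica_ids
    )
    return has_quorum and has_full_copy


# Example: two mirror copies plus a witness, as with a policy tolerating one failure.
obj = [Component(0, 1, True), Component(1, 1, True), Component(-1, 1, True)]
obj[0].available = False          # first failure: still accessible
print(is_object_accessible(obj))
obj[2].available = False          # second failure: quorum is lost
print(is_object_accessible(obj))
```

The second failure leaves one intact mirror copy but only one of three votes, which illustrates why a full copy alone is not sufficient.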

Below, we discuss VM behavior when there are more failures in the cluster than the NumberOfFailuresToTolerate setting in the policy associated with the VM.

VM Powered on and VM Home Namespace Object Goes Inaccessible

If a running VM has its VM Home Namespace object go inaccessible due to failures in the cluster, several different things may happen. Once the VM is powered off, it will be marked "inaccessible" in vCenter. There can also be other effects, such as the VM being renamed to its “.vmx” path rather than VM name, or the VM being marked "orphaned".

VM Powered on and Disk Object is Inaccessible

If a running VM has one of its disk objects become inaccessible, the VM may keep running in memory. Typically, I/O operations to the disk will eventually time out in the guest OS. Operating systems may either crash when this occurs or downgrade the affected filesystems to read-only (the OS behavior, and even the VM behavior, is not vSAN specific). These effects can also be seen on VMs on traditional storage when the host suffers from an APD (All Paths Down) state.

Once the VM becomes accessible again, the status should resolve, and things go back to normal. Of course, data remains intact during these scenarios.

What happens when a Host Fails?

A host failure can occur in numerous ways: it could be a crash, or it could be a network issue (which is discussed in more detail in the next section). However, it could also be something as innocent as a reboot.

Any components that were part of the failed host are marked as ‘absent’. I/O flow to the object is restored by removing the absent component from the active set of components in the object.

The ‘absent’ state is chosen rather than the ‘degraded’ state because of the likelihood of the failure being transient (e.g. due to a reboot). For instance, a host might be configured to auto-reboot after a crash, or the host’s power might have been temporarily interrupted. For this reason, a set amount of time is allowed before starting to rebuild objects on other hosts, so as not to waste resources. By default, this timer is set to 60 minutes. If the timer expires, and the host has not rejoined the cluster, a rebuild of components on the remaining hosts in the cluster commences.
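When planning a test, it can help to work out in advance when a rebuild would be expected to start. The following is a minimal sketch of the timer arithmetic described above; the function names are hypothetical and the 60-minute default is taken from the text.

```python
from datetime import datetime, timedelta

# Default repair delay as described in the text above.
DEFAULT_REPAIR_DELAY = timedelta(minutes=60)


def rebuild_start_time(failure_time, delay=DEFAULT_REPAIR_DELAY):
    """Earliest time at which a rebuild of 'absent' components would commence."""
    return failure_time + delay


def rebuild_triggered(failure_time, rejoin_time=None, delay=DEFAULT_REPAIR_DELAY):
    """True if the host did not rejoin the cluster before the timer expired.

    rejoin_time=None models a host that never comes back.
    """
    deadline = rebuild_start_time(failure_time, delay)
    return rejoin_time is None or rejoin_time > deadline


failed_at = datetime(2024, 1, 15, 9, 0, 0)   # illustrative timestamp only
print(rebuild_start_time(failed_at))
print(rebuild_triggered(failed_at, failed_at + timedelta(minutes=10)))
```

A host that reboots and rejoins after ten minutes does not trigger a rebuild under this model; one that stays away past the deadline does.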

Moreover, if a host fails or is rebooted, this event will trigger a ‘host connection and power state’ alarm in vCenter. If vSphere HA is enabled on the cluster, it will also trigger a ‘vSphere HA host status’ alarm and a ‘Host cannot communicate with all other nodes in the vSAN Enabled Cluster’ warning message on all remaining hosts. If any VMs were running on the failed host, they are restarted on another host in the cluster.

Simulating Failure Scenarios Using Pre-Check

It can be useful to run simulations on the loss of a particular host or disk, to see the effects of planned maintenance or hardware failure. The Data Migration Pre-Check feature can be used to check object availability for any given host or disk. These can be run at any time without affecting VM traffic.

Loss of a Host

Navigate to: 

[vSAN Cluster] > Monitor > vSAN > Data Migration Pre-check


From here, you can select the host to run the simulations on. After a host is selected, the pre-check can be run against three available options, i.e., Full data migration, Ensure accessibility, No data migration:



Select the desired option and click the Pre-Check button. The results of the simulation are shown in three sections: Object Compliance and Accessibility, Cluster Capacity, and Predicted Health.

The Object Compliance and Accessibility view shows how the individual objects will be affected:


 
Cluster Capacity shows how the capacity of the other hosts will be affected. Below we see the effects of the 'Full data migration' option:

Predicted Health shows how the health of the cluster will be affected:

 

Loss of a Disk

Navigate to: 

[vSAN Cluster] > Configure > vSAN > Disk Management

From here, select a host or disk group to bring up a list of disks. Simulations can then be run on a selected disk or the entire disk group:

Once the Pre-Check Data Migration option is selected, we can run different simulations to see how the objects on the disk are affected. Again, the options are Full data migration, Ensure accessibility (default) and No data migration:


 
Selecting 'Full data migration' will run a check to ensure that there is sufficient capacity on the other hosts:


 

 

Conducting Failure Testing

Unlike the previous section (where the effects of known failure scenarios are depicted), here we attempt to re-create real-world issues to see how the system reacts.

Host Failure without vSphere High Availability

On host failure, vSphere High Availability (HA) will restart VMs on other hosts in the cluster. Thus, without vSphere HA, any VMs running on a host that fails will become unavailable (even with vSAN storage policies set to tolerate one or more failures).

In the example below, we have a VM running on host 10.159.21.9:


To simulate the failure, we reboot the host. This is best achieved with the host’s out-of-band management interface, such as an iLO or iDRAC (or as a last resort, via an SSH session).

As expected, vCenter cannot communicate with the host. A ‘not responding’ message is displayed next to the host:


In the cluster view, we can see that alarms are generated pertaining to disconnected hosts:


The remaining hosts in the cluster issue a warning that they cannot communicate with one or more others:


All VMs on the host transition to a ‘disconnected’ state:

Upon examining the state of the virtual objects, we again see a warning about disconnected hosts. Moreover (as expected) there is reduced availability on some components, with the vSAN cluster services timer started.



Host Failure with vSphere High Availability
 

We now repeat the same test as above, but with vSphere HA enabled on the cluster. Ensure that any VMs powered off from the last test are powered on again.

Ensure that both vSphere HA and DRS are enabled (navigate to [vSAN cluster] > Configure > Services).

Navigate to [vSAN cluster] > Summary to make sure:


Select a host with running VMs. We can then reboot that host, as before, and observe what happens with the protection mechanisms in place.


After rebooting the host, we see again that vCenter shows the host as disconnected, and the remaining hosts show a warning.

Several HA related events should be displayed on the ‘Summary’ tab of the host being rebooted (you may need to refresh the UI to see these):


We can see that vSphere HA has restarted the VMs on another host (the column ‘uptime’ has been added to show this clearly):


Note: we can ignore the vCLS VM as this is tied to the host and would not be restarted by HA on another host.

As expected, upon navigating to [vSAN Cluster] > Monitor > vSAN > Virtual Objects we see that some components are temporarily unavailable. Again, vSAN will wait for a prescribed time (which defaults to 60 minutes) before initiating an expensive rebuild operation.


 

vSAN Disk Fault Injection Script for Failure Testing

A script to help with storage device failure testing is included with ESXi and is available on all hosts. The script, vsanDiskFaultInjection.pyc, can be found in /usr/lib/vmware/vsan/bin:

[root@localhost:~] python /usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc -h
Usage:
            injectError.py -t -r error_durationSecs -d deviceName
            injectError.py -p -d deviceName
            injectError.py -z -d deviceName
            injectError.py -T -d deviceName
            injectError.py -c -d deviceName

Options:
  -h, --help            show this help message and exit
  -u                    Inject hot unplug
  -t                    Inject unrecoverable read error
  -p                    Inject permanent error
  -z                    Inject health error
  -c                    Clear injected error
  -T                    Inject Transient error
  -r ERRORDURATION      unrecoverable read error duration in seconds
  -d DEVICENAME, --deviceName=DEVICENAME 

Warning: This command should only be used in test environments. Using this command to mark devices as failed can have a catastrophic effect on a vSAN cluster.
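To reduce the chance of mistyping a flag during a test run, the usage text above can be wrapped in a small command builder. This is a sketch only: it assembles the argument list for the script and does not run anything. The action names (`hot_unplug`, `permanent`, and so on) are hypothetical labels for this example; the flags themselves come from the help output above.

```python
SCRIPT = "/usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc"

# Hypothetical action labels mapped to the flags listed in the script's help text.
FLAGS = {
    "hot_unplug": "-u",
    "unrecoverable_read": "-t",
    "permanent": "-p",
    "health": "-z",
    "transient": "-T",
    "clear": "-c",
}


def fault_injection_command(action, device, duration_secs=None):
    """Build the argument list for one invocation of the fault injection script."""
    if action not in FLAGS:
        raise ValueError(f"unknown action: {action!r}")
    if not device:
        raise ValueError("a device name (e.g. an naa.* identifier) is required")

    args = ["python", SCRIPT, FLAGS[action]]
    if action == "unrecoverable_read":
        # The usage text shows -t together with -r <duration in seconds>.
        if duration_secs is None:
            raise ValueError("unrecoverable_read needs duration_secs")
        args += ["-r", str(int(duration_secs))]
    elif duration_secs is not None:
        raise ValueError("duration_secs only applies to unrecoverable_read")
    args += ["-d", device]
    return args


print(" ".join(fault_injection_command("hot_unplug", "naa.5000cca08000aec0")))
```

Returning a list rather than a string keeps the arguments unambiguous if the command is later passed to an SSH client or a subprocess call.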

Note: vSAN 6.7 P02 and vSAN 7.0 P01 introduced Full Rebuild Avoidance (FRA). Previously, in some circumstances, transient device errors could cause vSAN objects to be marked as degraded, and vSAN could unnecessarily mark a device as failed. vSAN can now differentiate between transient and permanent storage errors, thus avoiding unnecessary object rebuilds.

However, for the purposes of testing, we need to simulate hardware failures and rebuilds. The procedure below outlines how to toggle this feature on or off.

As the setting is applied on a per vSAN node basis, run the following on an ESXi host to view the current value:

esxcli system settings advanced list -o /LSOM/lsomEnableFullRebuildAvoidance

 To disable FRA (required to run the failure tests):

esxcli system settings advanced set -o /LSOM/lsomEnableFullRebuildAvoidance -i 0

Once failure testing is complete, re-enable:

esxcli system settings advanced set -o /LSOM/lsomEnableFullRebuildAvoidance -i 1
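Because the setting is per node, it is easy to forget to re-enable FRA on one of the hosts after testing. The sketch below generates a per-host checklist of the commands shown above; it only builds strings and does not connect to any host. The helper names and the example host addresses are illustrative assumptions.

```python
FRA_OPTION = "/LSOM/lsomEnableFullRebuildAvoidance"


def fra_list_command():
    """Command that shows the current FRA value on a host."""
    return f"esxcli system settings advanced list -o {FRA_OPTION}"


def fra_set_command(enabled):
    """Command that enables (1) or disables (0) FRA on a host."""
    return f"esxcli system settings advanced set -o {FRA_OPTION} -i {1 if enabled else 0}"


def fra_checklist(hosts):
    """Per-host commands to run before and after failure testing."""
    return {
        host: {
            "before": [fra_list_command(), fra_set_command(False)],
            "after": [fra_set_command(True), fra_list_command()],
        }
        for host in hosts
    }


for host, steps in fra_checklist(["10.159.16.115", "10.159.16.116"]).items():
    print(host)
    for phase in ("before", "after"):
        for command in steps[phase]:
            print(f"  [{phase}] {command}")
```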

 
It should be noted that the same tests can be run by simply removing the disk from the host. If physical access to the host is convenient, physically pulling a disk tests the exact physical conditions rather than emulating them in software.

Also, note that not all I/O controllers support hot unplugging drives. Check the vSAN Compatibility Guide to see if your controller model supports the hot unplug feature. 

 

Storage Device is Removed Unexpectedly

When a storage device is suddenly removed from a vSAN host, all the components residing on the device will go into an ‘absent’ state.

The ‘absent’ state is chosen over ‘degraded’ as vSAN assumes that the device is temporarily unavailable (rather than failed). If the disk is placed back in the server before the cluster services timeout (60 minutes by default), then the state will return to a healthy state without the (expensive) rebuild of data.

Thus:

  • The device state is marked as ‘absent’ in vCenter
  • If the object has a policy that dictates the ‘failures to tolerate’ of one or greater, the object will still be accessible from another host in the vSAN Cluster. It is marked with ‘reduced availability with no rebuild - delay timer’
  • If the same device is available again within the timer delay (60 min. by default), no components will be rebuilt
  • If the timer elapses and the device is still unavailable, components on the removed disk will be built elsewhere in the cluster (if capacity is available). This includes any newly claimed devices
  • If the VM Storage Policy has the ‘failures to tolerate’ set to zero, the object will be inaccessible. To restore the object, the same device must be made available again.
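The outcomes listed above can be summarized as a small decision function. This is a sketch of the documented behavior for a single removed device, not of vSAN internals; the function name, parameters, and returned strings are hypothetical.

```python
def device_removal_outcome(ftt, returned_after_minutes,
                           repair_delay_minutes=60, spare_capacity=True):
    """Describe the expected end state of an object after one device is removed.

    ftt: 'failures to tolerate' from the object's storage policy.
    returned_after_minutes: minutes until the same device is available again,
        or None if it never returns.
    """
    if ftt < 0:
        raise ValueError("ftt must be zero or greater")
    if ftt == 0:
        # No other copy exists, so only the original device can restore access.
        return "inaccessible until the same device is available again"
    if returned_after_minutes is not None and returned_after_minutes <= repair_delay_minutes:
        return "accessible; no rebuild (device returned within the delay timer)"
    if spare_capacity:
        return "accessible; components rebuilt elsewhere in the cluster"
    return "accessible; not rebuilt (insufficient capacity)"


print(device_removal_outcome(ftt=1, returned_after_minutes=20))
print(device_removal_outcome(ftt=1, returned_after_minutes=None, repair_delay_minutes=5))
print(device_removal_outcome(ftt=0, returned_after_minutes=None))
```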

 

In this first example, we shall remove a storage device from the host using the vsanDiskFaultInjection.pyc python script rather than physically removing it from the host.

We shall then ‘replace’ the storage device before the object repair timeout delay expires (default 60 minutes), which will mean that no rebuilding activity will occur during this test.

To start, select a running VM. Then navigate to [vSAN Cluster] > Monitor > Virtual Objects and find the VM from the list and select an object. In the example below, we have selected ‘Hard disk 1’:


Select View Placement Details to show which hosts the object has components on. In the example below, we have three components in a vSAN OSA cluster:


The column that we are interested in here is the ‘capacity disk’ identifier and the host it resides on (for a vSAN ESA cluster, there will just be a ‘Disk’ column). Note that it may be easier to see by selecting the column toggle (on the bottom left) and selecting the appropriate information to display.


Copy the disk ID string and SSH into the host that contains the component. We can then inject a hot unplug event using the python script:

[root@10.159.16.116:~] python /usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc -u -d naa.5000cca08000aec0
Injecting hot unplug on device vmhba2:C0:T3:L0
vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x1
vsish -e set /storage/scsifw/paths/vmhba2:C0:T3:L0/injectError 0x004C0400000002

In vCenter, we observe the effects of the action. As expected, the component that resided on that disk on host 10.159.16.116 shows up as absent:


 The ‘Virtual Objects’ page should also show the component state as ‘Reduced availability with no rebuild – delay timer’.


We then simulate the re-insertion of the storage device by performing a SCSI rescan on the affected host. Navigate to the [vSAN host] > Configure > Storage > Storage Adapters and click the Rescan Storage button.


Return to the Virtual Objects page and observe that the storage device has re-appeared, and the components are healthy. If for some reason, the disk doesn’t return after refreshing the screen, try rescanning the host again. If it still doesn’t appear, reboot the ESXi host.

Once the storage device has returned, clear any hot unplug flags set previously with the -c option:

[root@w1-vxrail-160-02-bl-4:~] python /usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc -c -d naa.5000cca08000aec0
vsish -e get /reliability/vmkstress/ScsiPathInjectError
vsish -e set /storage/scsifw/paths/vmhba2:C0:T3:L0/injectError 0x00000
vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x00000 

Note that running the above command again will perform a quick disk group unmount/remount (in a vSAN OSA cluster), which may be needed after some of the failure tests below.

 

Storage Device Removed, Not Replaced Before Timeout

Here, we will repeat the test above, but will leave the device ‘unplugged’ for longer than the timeout. We should expect to see vSAN rebuilding the component to another disk to achieve policy compliance. We begin again by identifying the disk on which the component resides:


Optionally, we can reduce the object repair timer to suit our needs. To do this, navigate to [vSAN Cluster] > Configure > vSAN > Services > Advanced Options > Edit


Then adjust the timer. Here we set it to five minutes, meaning vSAN will wait a total of five minutes before starting any rebuild activities:


To start the test, run the Python script again, taking note of the date:

[root@w1-vxrail-160-02-bl-4:~] date
Fri Oct  7 13:33:58 UTC 2022
[root@w1-vxrail-160-02-bl-4:~] python /usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc -u -d naa.5000cca08000aec0
Injecting hot unplug on device vmhba2:C0:T3:L0
vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x1
vsish -e set /storage/scsifw/paths/vmhba2:C0:T3:L0/injectError 0x004C0400000002

At this point, we can once again see that the state of the component is set to ‘absent’:

 After the object repair timer has elapsed, vSAN will rebuild the component onto another disk in the cluster:


Again, the ‘unplugged’ storage device can now be re-added by scanning the HBA:

Navigate to the [vSAN host] > Configure > Storage > Storage Adapters and click the Rescan Storage button.

Return to the Virtual Objects page to see that the disk has re-appeared, and components are healthy. If for some reason, the disk doesn’t return after refreshing the screen, try rescanning the host again. If it still doesn’t appear, reboot the ESXi host.

Once the disk ID is back, clear any hot unplug flags set previously with the -c option:

[root@w1-vxrail-160-02-bl-4:~] python /usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc -c -d naa.5000cca08000aec0
vsish -e get /reliability/vmkstress/ScsiPathInjectError
vsish -e set /storage/scsifw/paths/vmhba2:C0:T3:L0/injectError 0x00000
vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x00000

 
If the object repair timer was changed, you can retain the setting for the next section. Otherwise remember to reset it back to the default (60 minutes):

 


 

vSAN OSA Cache Device is Removed Unexpectedly and Not Replaced

A sudden loss of a vSAN OSA cache device will cause the affected disk group to go offline. All the vSAN components that form part of the disk group will enter a ‘permanent disk loss’ state and are inaccessible.

Thus:

  • If objects on the disk group are assigned a policy that dictates the ‘failures to tolerate’ of one or greater, the objects will still be accessible from another host in the vSAN Cluster (otherwise they will remain inaccessible).
  • The storage devices in the disk group will enter a ‘permanent disk loss’ state
    • If the storage device is physically placed back on the same host within the timer period, no new objects will be rebuilt.
  • If the timer elapses and the device is still unavailable, components on the failed disk group will be built elsewhere in the cluster (if capacity is available). This includes any newly claimed devices

 

In this test we will remove a vSAN OSA cache device from one of the disk groups in the cluster. As before, we can reduce the object repair timer (from the default of 60 minutes) to suit our needs – see the previous test.

Navigate to the [vSAN cluster] > Configure > vSAN > Disk Management. Select a host and click on ‘view disks’:

 In the next screen, copy the device ID from a ‘vSAN Cache’ device:


Next, we will inject a hot unplug event. As before:

[root@10.159.16.115:~] python /usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc -u -d naa.5000cca04eb03fbc
Injecting hot unplug on device vmhba2:C0:T1:L0
vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x1
vsish -e set /storage/scsifw/paths/vmhba2:C0:T1:L0/injectError 0x004C0400000002

 

Looking back at the disk group in vCenter, we can see the whole group is lost:

Moreover, any components that were residing on that disk group are "Absent".


After the object repair timer has elapsed, vSAN should start rebuilding/resyncing components on different drives (as we saw earlier) in place of the absent components. To display details on resyncing components, navigate to [vSAN cluster] > Monitor > vSAN > Resyncing Objects.

We can then re-add the SSD logical device back to the host by rescanning the HBA: Navigate to the [vSAN host] > Configure > Storage > Storage Adapters and click the Rescan Storage button.

As before, if for some reason the device doesn’t return after refreshing the screen, try rescanning the host again. If it still doesn’t appear, reboot the ESXi host. Once the device ID is back, clear any hot unplug flags set previously with the -c option:

[root@10.159.16.116:~] python /usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc -c -d naa.5000cca04eb03fbc
vsish -e get /reliability/vmkstress/ScsiPathInjectError
vsish -e set /storage/scsifw/paths/None/injectError 0x00000
vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x00000

Warning: If you delete a device that was marked as an SSD, and a logical RAID 0 device was rebuilt as part of this test, you may have to mark the drive as an SSD once more.
 

Injecting ‘Permanent’ Disk Error on a Device

If a disk drive has an unrecoverable error, vSAN marks the device as ‘degraded’ as the failure is permanent.

  • If the object has a policy that dictates the ‘failures to tolerate’ of one or greater, the object will still be accessible from another host in the vSAN Cluster.
  • The disk state is marked as ‘degraded’ in vCenter
  • If the VM Storage Policy has the ‘failures to tolerate’ set to zero, the object will be inaccessible. This will require a restore of the VM from a known good backup.

 

A vSAN OSA cache device failure follows a similar sequence of events to that of a storage device failure, with one major difference: vSAN will mark the entire disk group as ‘degraded’. As the failure is permanent (the disk is offline), the device is no longer visible.

  • If the object has a policy that dictates the ‘failures to tolerate’ of one or greater, the object will still be accessible from another host in the vSAN Cluster.
  • The disk group and the disks under it will be marked as ‘degraded’ in vCenter
  • If the VM Storage Policy has the ‘failures to tolerate’ set to zero, the object will be inaccessible. This will require a restore of the VM from a known good backup.

 

In this test, we use the Python script (detailed above) to mark a device as failed. 

First, select a VM in the cluster and navigate to Monitor > vSAN > Physical disk placement. Adjust the columns to show the hosts and device IDs:


Copy the device ID of a component. As before, we can then SSH to the host and run the fault injection Python script:

[root@10.159.16.118:~] python /usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc -p -d naa.5000cca08000b3dc
/usr/lib/vmware/osfs/bin/objtool create -s 4M -p "((\"hostFailuresToTolerate\" i0) (\"subFailuresToTolerate\" i0) (\"stripeWidth\" i4) (\"forceProvisioning\" i0) (\"locality\" \"HostLocal\"))"
Injecting permanent error on device vmhba2:C0:T3:L0 for write command
vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x1
vsish -e set /storage/scsifw/paths/vmhba2:C0:T3:L0/injectError 0x000003032a000002
vsish -e set /config/LSOM/intOpts/plogRunElevator 1
dd if=/dev/urandom of=/vmfs/devices/vsan/1b564063-4491-e84c-1c75-2c600c990158 count=100 conv=notrunc

100+0 records in
100+0 records out

vsish -e set /config/LSOM/intOpts/plogRunElevator 0
/usr/lib/vmware/osfs/bin/objtool delete -f -u 1b564063-4491-e84c-1c75-2c600c990158

 

Back on the UI, we can see the component is absent, as expected:

Navigating to [vSAN cluster] > Configure > vSAN > Disk Management should show the disk with an error, and the disk group should enter an unhealthy state:

Clicking on ‘view disks’ should show the disk with the permanent error:


This should place any components on the disk into a degraded state (which can be observed via the "Physical Placement" window) and initiate an immediate rebuild of components. There is no object repair delay timer as vSAN can see that this is a permanent failure. Navigating to [vSAN cluster] > Monitor > vSAN > Resyncing Components should reveal the components resyncing.

At this point, we can clear the error, as before:

[root@10.159.16.118:~] python /usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc -c -d naa.5000cca08000848c
vsish -e get /reliability/vmkstress/ScsiPathInjectError
vsish -e set /storage/scsifw/paths/vmhba2:C0:T2:L0/injectError 0x00000
vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x00000

Running the same command for a second time will unmount/remount the disk group and should bring everything back to a healthy state:

[root@10.159.16.118:~] python /usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc -c -d naa.5000cca08000b3dc
vsish -e get /reliability/vmkstress/ScsiPathInjectError
vsish -e set /storage/scsifw/paths/vmhba2:C0:T3:L0/injectError 0x00000
vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x00000
[root@localhost:~] python /usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc -c -d naa.5000cca08000b3dc
vsish -e get /reliability/vmkstress/ScsiPathInjectError
Clearing health on device naa.5000cca08000b3dc
esxcfg-advcfg --default /LSOM/lsomDeviceMonitorInterval
esxcfg-advcfg --default /LSOM/lsomSlowDeviceLatencyTimePeriod
esxcfg-advcfg --default /LSOM/lsomSlowDeviceLatencyIntervals
vsish -e get /storage/scsifw/devices/naa.5000cca08000b3dc/info
vsish -e set /vmkModules/plog/devices/naa.5000cca08000b3dc/health/movingAverageWriteMinimumIOsThreshold 300
vsish -e set /vmkModules/plog/devices/naa.5000cca08000b3dc/health/movingAverageWriteLatencyThreshold 2500000
vsish -e set /vmkModules/plog/devices/naa.5000cca08000b3dc/health/movingAverageLatencyInterval 600
vsish -e get /vmkModules/plog/devices/naa.5000cca08000b3dc/info
vsish -e get /vmkModules/plog/devices/naa.5000cca08000b3dc/info
esxcli vsan storage diskgroup unmount -d naa.5000cca08000b3dc
esxcli vsan storage diskgroup mount -d naa.5000cca08000b3dc

If the script does not bring the device back online, it can be manually removed from, and re-added to, the disk group. This is very simple to do:

Select the disk in the disk group and remove it by clicking on ‘remove disk’:


Confirm the removal of the disk on the next screen.

Then add the disk back by clicking on the ellipsis (the three vertical dots) near the down arrow and selecting ‘Add Disks’:


 

When Might a Rebuild of Components Not Occur?

There are a couple of reasons why a rebuild of components might not occur. Start by checking the vSAN Health Check UI ([vSAN cluster] > Monitor > vSAN > Health) for any alerts or failures.

Lack of Resources

Verify that there are enough resources to rebuild components before testing with the simulation tests detailed in the previous section.

Of course, if you are testing with a cluster size that cannot satisfy the ‘failures to tolerate’ defined in the hosted storage policies, and a failure is introduced, there will be no rebuilding of objects as the policies cannot be satisfied.
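A quick pre-test check of host counts can make this concrete. The sketch below assumes RAID-1 mirroring, where tolerating n failures requires 2n + 1 hosts; erasure-coded policies have different minimums, and capacity is deliberately not considered here. The function names are hypothetical.

```python
def min_hosts_for_mirroring(ftt):
    """Minimum hosts needed to place a RAID-1 object that tolerates `ftt` failures."""
    if ftt < 0:
        raise ValueError("ftt must be zero or greater")
    return 2 * ftt + 1


def can_rebuild_after_host_failures(total_hosts, failed_hosts, ftt):
    """True if the surviving hosts can still satisfy the policy (host count only)."""
    if not 0 <= failed_hosts <= total_hosts:
        raise ValueError("failed_hosts must be between 0 and total_hosts")
    remaining = total_hosts - failed_hosts
    return remaining >= min_hosts_for_mirroring(ftt)


# A three-host cluster cannot rebuild an FTT=1 mirror after losing a host,
# whereas a four-host cluster can.
print(can_rebuild_after_host_failures(total_hosts=3, failed_hosts=1, ftt=1))
print(can_rebuild_after_host_failures(total_hosts=4, failed_hosts=1, ftt=1))
```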

Underlying Failures

Another cause of a rebuild not occurring is an underlying failure already present in the cluster. Verify there are none before testing by checking the health status of the cluster.

 

 

Air-gapped Network Failures

Air-gapped vSAN network design is built around the idea of redundant, yet completely isolated storage networks. It is used in conjunction with multiple VMkernel interfaces tagged for vSAN traffic, where each VMkernel interface is on different VLANs/subnets. Thus, there is physical and logical separation of network. A primary use case is to separate the IO data flow onto redundant data paths. Each path is then independent, and failure of one does not affect the other.

Note: In vSphere 8.0 and above, the system will attempt to (round-robin) balance the traffic between the two VMkernel adapters. Prior to this release, the purpose is solely to tolerate link failure across redundant data paths.

The figure below shows the vmnic uplinks on each host, physically separated by connecting to different switches (and thus networks). The VMkernel ports are logically separated on separate VLANs (in different port groups on the distributed switch). Therefore, each host has a separate, redundant network path:


 

 

The table below shows the IP, VLAN and uplink details. Again, note that there is one uplink per VMkernel adapter. Each VMkernel adapter is on a separate VLAN.

 

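| vSAN VMkernel | IP address | Port group name | VLAN | Port group uplinks |
|---------------|------------|-----------------|------|--------------------|
| vmk1 | 192.168.201.0/24 | VLAN-201-vSAN-1 | 201 | Uplink 1 |
| vmk2 | 192.168.202.0/24 | VLAN-202-vSAN-2 | 202 | Uplink 2 |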
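
The separation described in the table can be sanity-checked with a few lines of Python. This is only a sketch: the subnet, VLAN and uplink values are copied from the table, while the dictionary layout and the function name are illustrative assumptions, not something vSAN or vCenter provides.

```python
import ipaddress

# Values copied from the table above; the structure itself is illustrative.
VSAN_VMKERNELS = {
    "vmk1": {"subnet": "192.168.201.0/24", "vlan": 201, "uplink": "Uplink 1"},
    "vmk2": {"subnet": "192.168.202.0/24", "vlan": 202, "uplink": "Uplink 2"},
}


def is_air_gapped(vmks: dict) -> bool:
    """Return True if no two adapters share a subnet, VLAN, or uplink."""
    entries = list(vmks.values())
    for i, a in enumerate(entries):
        for b in entries[i + 1:]:
            net_a = ipaddress.ip_network(a["subnet"])
            net_b = ipaddress.ip_network(b["subnet"])
            if net_a.overlaps(net_b):
                return False
            if a["vlan"] == b["vlan"] or a["uplink"] == b["uplink"]:
                return False
    return True


if __name__ == "__main__":
    print("Air-gapped layout:", is_air_gapped(VSAN_VMKERNELS))
```

A layout in which both adapters shared a VLAN, a subnet, or an uplink would fail this check, because a single fault could then affect both paths.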

 

 

Failover Test Scenario using DVS Portgroup Uplink Priority

Before we initiate a path failover, we need to generate some background workload to maintain a steady network flow through the two VMkernel adapters. You may choose your own workload tool or initiate an HCIBench workload set.

Using the functionality in DVS, we can simulate a physical switch failure or physical link down by moving an "active" uplink for a port group to "unused" as shown below. This affects all VMkernel ports that are assigned to the port group.


 

Expected outcome on vSAN IO traffic failover

Prior to vSAN 6.7, when a data path went down in an air-gapped network topology, VM IO traffic could pause for up to 120 seconds to complete the path failover while waiting for the TCP timeout signal. Starting in vSAN 6.7, failover time improves significantly, to no more than 15 seconds, as vSAN proactively monitors the data paths and takes corrective action as soon as a failure is detected.

Monitoring network traffic failover

To verify the traffic failover from one VMkernel interface to another, and to capture the timeout window, we open an SSH session to each host and use the esxtop utility. Press "n" to switch to the network view and monitor host network activity before and after a failure is introduced.

The screenshot below illustrates that the data path through vmk2 is down when the "unused" state is set for the corresponding uplink ("void" status is reported for that physical uplink). Notice that TCP packet flow has suspended on that VMkernel interface (as zeroes are reported under the Mb/s transmit (TX) and receive (RX) columns).
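
To capture the timeout window more precisely than by watching the screen, throughput samples can be recorded at a fixed interval (esxtop's batch mode, `esxtop -b -d <seconds> -n <iterations>`, is one way to collect them) and the longest period with no traffic on either vSAN VMkernel interface can then be measured. The sketch below assumes the samples have already been reduced to (timestamp, vmk1 Mb/s, vmk2 Mb/s) tuples; that reduction step, the function name, and the sample values are all illustrative and are not real measurements.

```python
from typing import Iterable, Tuple

# (seconds since start, vmk1 Mb/s, vmk2 Mb/s)
Sample = Tuple[float, float, float]


def longest_stall(samples: Iterable[Sample]) -> float:
    """Longest time, in seconds, from the first sample where both interfaces
    are idle to the next sample that shows traffic on either interface."""
    longest = 0.0
    stall_start = None
    last_ts = None
    for ts, vmk1, vmk2 in samples:
        idle = vmk1 == 0 and vmk2 == 0
        if idle and stall_start is None:
            stall_start = ts
        elif not idle and stall_start is not None:
            longest = max(longest, ts - stall_start)
            stall_start = None
        last_ts = ts
    # A stall still in progress at the end of the capture also counts.
    if stall_start is not None:
        longest = max(longest, last_ts - stall_start)
    return longest


if __name__ == "__main__":
    # Invented samples: vmk2 loses its uplink at t=2, traffic resumes on vmk1 at t=10.
    demo = [(0, 450, 430), (2, 460, 0), (4, 0, 0), (6, 0, 0),
            (8, 0, 0), (10, 880, 0), (12, 890, 0)]
    stall = longest_stall(demo)
    print(f"Longest stall: {stall:.0f} s (expected to be no more than 15 s on vSAN 6.7+)")
```

The result is only as precise as the sampling interval, so a short interval gives a tighter estimate of the failover time.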


 

It is expected that the vSAN health check reports failed pings on vmk2, as we set its uplink (uplink 2) to "unused".

 

To restore the failed data path after a failover test, modify the affected uplink from "unused" back to "active". Network traffic should be restored through both VMkernel interfaces (though not necessarily load-balanced).

Before testing other scenarios, be sure to remove the second VMkernel interface on each host, then run a vSAN health check and ensure that all tests pass.

 

APPENDIX A: PCI HOTPLUG

NVMe has helped usher in all-new levels of performance capabilities for storage systems. vSphere 7 introduced hotplug support for NVMe devices. Consult the vSAN HCL to verify supportability and the required driver and firmware versions.

vSphere 7.0 and above follow the standard hot-plug controller process, which can be categorized into two processes: surprise and planned PCIe device hot-add.

Surprise Hot-Add

The device is inserted into the hot-plug slot without prior notification, that is, without using the attention button or a software interface (UI/CLI) mechanism.

| Step | User Action | ESXi Action | Power Indicator |
|------|-------------|-------------|-----------------|
| 1 | User selects an empty, disabled slot and inserts a PCIe device | Platform/PCI hotplug layer detects the newly added hardware and notifies the ESXi device manager to scan for hot-added devices. In case of any failure, the Power Indicator goes OFF. | BLINKS |
| 2 | User waits for the slot to be enabled | PCI bus driver enumerates the hot-added device and registers it with the ESXi device manager | ON |

Planned Hot-Add

| Step | User Action | ESXi Action | Power Indicator |
|------|-------------|-------------|-----------------|
| 1 | User selects an empty, disabled slot and inserts a PCIe device | | OFF |
| 2 | User presses the attention button or issues a software UI/CLI command to enable the slot | With the software interface (UI/CLI), there is no provision to abort a hot-add request, so once the command is issued, control jumps directly to Step 4. With the attention button, the PCIe hotplug layer waits for the abort interval (5 seconds). | BLINKS |
| 3 | User cancels the operation by pressing the attention button a second time within the ‘abort interval’ | If canceled, the Power Indicator goes back to the previous state, OFF | OFF |
| 4 | No user action in the ‘abort interval’ | PCIe hotplug layer validates the hot-add operation and powers the slot. On success, it notifies the ESXi device manager to scan for the hot-added device(s). In case of any failure, the Power Indicator goes back to the previous state, OFF. | BLINKS |
| 5 | User waits for the slot to be enabled | PCI bus driver enumerates the hot-added device and registers it with the ESXi device manager | ON |
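
The planned hot-add sequence can be read as a small state machine over the power indicator. The sketch below is only a model of the table for illustration: the function, its parameters, and the treatment of "within the abort interval" as strictly less than 5 seconds are assumptions, and it does not call any ESXi interface.

```python
from enum import Enum


class Indicator(Enum):
    OFF = "OFF"
    BLINKS = "BLINKS"
    ON = "ON"


ABORT_INTERVAL_SEC = 5  # abort interval stated in the table


def planned_hot_add(via_button, cancel_after_sec=None, validation_ok=True):
    """Return the power indicator states seen during one planned hot-add.

    via_button: True for the attention button, False for the UI/CLI.
    cancel_after_sec: seconds until a second button press, or None for no cancel.
    validation_ok: whether the hotplug layer validates and powers the slot.
    """
    states = [Indicator.OFF]         # step 1: device inserted, slot still disabled
    states.append(Indicator.BLINKS)  # step 2: enable requested
    cancelled = (
        via_button                   # only the attention button path can be aborted
        and cancel_after_sec is not None
        and cancel_after_sec < ABORT_INTERVAL_SEC
    )
    if cancelled:
        states.append(Indicator.OFF)  # step 3: back to the previous state
        return states
    if not validation_ok:
        states.append(Indicator.OFF)  # step 4 failure: back to the previous state
        return states
    states.append(Indicator.ON)       # steps 4 and 5: slot powered, device registered
    return states


if __name__ == "__main__":
    print([s.value for s in planned_hot_add(via_button=True)])
    print([s.value for s in planned_hot_add(via_button=True, cancel_after_sec=2)])
    print([s.value for s in planned_hot_add(via_button=False, cancel_after_sec=2)])
```

Note that the third call ignores the cancel attempt, reflecting the table's statement that a request issued through the software interface cannot be aborted.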

 

Note: After these steps, the ESXi device manager attaches the devices to the driver and the storage stack, presents the HBA, and the associated disk(s) to the upper layer, for example vSAN/VMFS.
 

Surprise Hot-Remove

In this case, the drive is removed without any prior notification through the attention button or UI/CLI. If the user did not run the preparatory steps, data consistency cannot be guaranteed. A failed drive is equivalent to abrupt removal without the preparatory steps, so data consistency cannot be guaranteed in that case either.

| Step | User Action | ESXi Action | Power Indicator |
|------|-------------|-------------|-----------------|
| 1 | User selects an enabled slot with a PCIe device to be removed | ESXi executes the requested preparatory steps for the drive corresponding to this device and flags an error if unable to perform any step. The user can choose to skip the preparatory steps and remove the device directly, in which case data consistency cannot be guaranteed. | ON |
| 2 | User removes the PCIe device | Platform/PCIe hot-unplug layer detects the device removal and notifies the ESXi device manager to remove the device. In case of any failure, the Power Indicator goes OFF. The ESXi device manager issues a series of quiesce instructions, detaches the device from all the drivers (storage stack, device driver, etc.), and finally removes the PCI bus driver. In case of any failure, the Power Indicator goes back to the previous state, ON, indicating that the device cannot be removed. | BLINKS |
| 3 | User waits for the slot to become disabled | PCIe bus driver removes the device from the system and powers down the PCI slot | OFF |

Planned Hot-Remove

The user is expected to run the preparatory steps to ensure data consistency before initiating the hot-remove operation via the attention button or software interface (UI/CLI). Even in this case, if the user does not run the preparatory steps, data consistency cannot be guaranteed.

 

| Step | User Action | ESXi Action | Power Indicator |
|------|-------------|-------------|-----------------|
| 1 | User selects an enabled slot with a PCIe device to be removed and initiates the preparatory steps | ESXi executes the requested preparatory steps for the drive corresponding to this PCIe device and flags an error if unable to perform any step | ON |
| 2 | User presses the attention button or issues a software UI command to disable the slot | With the software interface (UI/CLI), there is no provision to abort the hot-remove request, so once the command is issued, control jumps directly to Step 5. With the attention button, the PCI hot-unplug layer gets an interrupt and waits for the abort interval (5 seconds). | BLINKS |
| 3 | User can cancel the operation by pressing the attention button a second time | The Power Indicator goes back to the previous state, ON | ON |
| 4 | No user action in the ‘abort interval’ | PCI bus driver removes the device from the system and powers down the slot | OFF |
| 5 | User waits for the slot to be disabled | PCI bus driver removes the device from the system and powers down the slot | OFF |
| 6 | User removes the PCIe device | | OFF |

 

For more information on PCI hotplug, visit:

https://kb.vmware.com/s/article/2030818
https://kb.vmware.com/s/article/74726
https://kb.vmware.com/s/article/78390
https://kb.vmware.com/s/article/78297
