vSAN 70u3 Proof of Concept Guide

What’s New in 7.0U3 Proof-of-Concept Guide

vSAN 7.0U3 brings some exciting new changes! This guide includes information on new features such as vSAN over RDMA, vSAN File Services snapshot support, HCI Mesh expansion, native KMS for encryption, Shared Witness, and Witness vLCM, among others. Also included are details on enhancements to vLCM, capacity reporting, planning, alerting, time-based health checks, and advanced networking metrics.

 

 

Introduction and prerequisites

Decision-making considerations for vSAN architecture.

Plan on testing a reasonable hardware configuration resembling a production-ready environment that suits your business requirements. Refer to the VMware vSAN Design and Sizing Guide for information on design configurations and considerations when deploying vSAN. The hardware you plan to use should be listed on the VMware Compatibility Guide (VCG): this is critical to ensure success and supportability. If you're using a vSAN ReadyNode or VxRail appliance, the factory-installed hardware is guaranteed to be compatible with vSAN; however, BIOS, firmware, and device driver versions may be out of date and should be checked for alignment with the VCG. For the vSAN software layer specifically, pay particular attention to the following areas of the VCG:

BIOS

Choose "Systems / Servers" from "What are you looking for":  http://www.vmware.com/resources/compatibility/search.php

 

Network cards

Choose "IO Devices" from "What are you looking for" and select "Network" from "I/O Device Type" field: https://www.vmware.com/resources/compatibility/search.php?deviceCategory=io

 

vSAN Storage I/O controllers & disks

Choose "vSAN" from "What are you looking for".
Scroll to 'STEP 3' and look for the link to "Build Your Own based on Certified Components": 
https://www.vmware.com/resources/compatibility/search.php?deviceCategory=vsan

 

From the Build Your Own page, choose the appropriate device type (e.g., I/O controller, SSD, or HDD) to search for:

 

The following commands are useful to help identify firmware and drivers in ESXi for comparison with the VCG.  First, log in to an ESXi host via SSH, then run the following commands to obtain the information from the server:

  • To see the controller details:

 esxcli vsan debug controller list

  • To list VID DID SVID SSID of a storage controller or network adapter:

 vmkchdev -l | egrep 'vmnic|vmhba'

  • To show which NIC driver is loaded:

 esxcli network nic list

  • To show which storage controller driver is loaded:

 esxcli storage core adapter list

  • To display driver version information:

 vmkload_mod -s <driver-name> | grep -i version 

  • For NVMe devices (replace X with the appropriate value):

 esxcli nvme device get -A vmhbaX | egrep "Serial Number|Model Number|Firmware Revision"
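
If you prefer to capture all of this in one pass, the commands above can be combined into a quick collection run from the ESXi shell. This is a minimal sketch using only the commands listed above; the output file path is an arbitrary placeholder:

 # Hedged example: collect controller, NIC, and driver details into one file for VCG comparison
 esxcli vsan debug controller list > /tmp/vcg-check.txt
 vmkchdev -l | egrep 'vmnic|vmhba' >> /tmp/vcg-check.txt
 esxcli network nic list >> /tmp/vcg-check.txt
 esxcli storage core adapter list >> /tmp/vcg-check.txt
 cat /tmp/vcg-check.txt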

All-Flash or Hybrid

There are several factors to consider if you plan to deploy an All-Flash vSAN solution:

  • All-Flash vSAN requires a 10Gb Ethernet network for the vSAN traffic; it is not supported with 1Gb NICs
  • Flash devices are used for both cache and capacity
  • Deduplication and Compression are space-efficiency features available in all-flash configuration and not available with hybrid configuration 
  • Erasure Coding (RAID 5/ RAID 6) is a space efficiency feature available on all-flash configuration only
  • Flash read cache reservation is not used with all-flash configurations; reads come directly from the capacity tier SSDs
  • Endurance and performance classes now become important considerations for both cache and capacity layers

vSAN POC Setup Assumptions and Pre-Requisites

Prior to starting the proof of concept, the following pre-requisites must be completed.  The following assumptions are being made with regards to the deployment:

  • N+1 servers are available and compliant with the vSAN HCL
  • All servers have had ESXi 7.0u3a (build number 18825058) or newer deployed
  • vCenter Server 7.0u3a (build number 18778458) or newer has been deployed to manage these ESXi hosts (vCenter deployment procedures will not be covered in this POC guide)
  • If possible, configure internet connectivity for vCenter such that the HCL database can be updated automatically. Internet connectivity is also a requirement to enable Customer Experience Improvement Program (CEIP), which is enabled by default to benefit customers with faster issue resolution by VMware support
  • Services such as DHCP, DNS, and NTP are available in the environment where the POC is taking place
  • All but one of the ESXi hosts should be placed in a new cluster container in vCenter (this host is set aside for cluster expansion testing)
  • The cluster must not have any features enabled, such as DRS, HA or vSAN. These will be configured throughout the course of the POC
  • Each host must have a management network VMkernel and a vMotion network VMkernel already configured. Initially, vSAN network VMkernel adapters are not configured. These will be configured later
  • For the purposes of testing Storage vMotion operations, an additional datastore type, such as NFS or VMFS, should be presented to all hosts. (This is an optional POC exercise)
  • A set of IP addresses, one per ESXi host, will be needed for the vSAN network VMkernel ports. The recommendation is that these are all on the same VLAN and the same IPv4 or IPv6 network segment.

vSAN POC Overview

Hardware Selection

Choosing the appropriate hardware for a POC is one of the most important factors in the successful validation of vSAN.  Below is a list of the more common options for vSAN POCs:

 

  • Build Your Own: This method allows customers to repurpose a subset of their existing infrastructure hardware to evaluate vSAN. This method helps expedite the process and reduce the time taken in procuring the relevant hardware. 
  • Virtual POCs: Organizations solely interested in seeing vSAN functionality may be interested in the Virtual POC. This is a virtual environment and is not a true test of performance or hardware compatibility but can help stakeholders feel more comfortable using vSAN. Please contact your VMware HCI specialist to take advantage of our “Test Drive” environment.
  • Hosted POCs: Many resellers, partners, distributors, and OEMs recognize the power of vSAN and have procured hardware to make it available to their customers in order to be able to conduct vSAN proof of concept tests.
  • Try and Buy: Whether a VxRail or a vSAN ReadyNode, many partners will provide hardware for a vSAN POC as a “try and buy” option.

 

Choosing the appropriate hardware for a POC is vitally important. There are many variables with hardware (drivers, controller firmware versions), so be sure to choose hardware that is listed on the VMware Compatibility Guide (VCG).

Once the appropriate hardware is selected it is time to define the POC use case, goals, expected results and success metrics.

POC Validation Overview

The most important aspects to validate in a Proof of Concept are:

  • Successful vSAN configuration and deployment
  • VMs successfully deployed to vSAN Datastore
  • Reliability: VMs and data remain available in the event of failure (host, disk, network, power)
  • Serviceability: Maintenance of hosts, disk groups, disks, clusters
  • Performance: vSAN and the selected hardware can meet application as well as business needs
  • Validation: vSAN features are working as expected (File Service, Deduplication and Compression, RAID-5/6, checksum, encryption)
  • Day-2 Operations: Monitoring, management, troubleshooting, and upgrades

 

These can be grouped into three common vSAN POCs: resiliency testing, performance testing, and operational testing.


Operational Testing

Operational testing is a critical part of a vSAN POC. Understanding how the solution behaves during normal (or “day two”) operations is an important part of the evaluation. Fortunately, because vSAN is embedded within the ESXi hypervisor, many vSAN operations tasks are simply extensions of normal vSphere operations. Adding hosts, migrating VMs between nodes, and cluster creation are some of the many operations that are consistent between vSphere and vSAN.

 

Some of the operational tests include:
  • Adding hosts to a vSAN Cluster
  • Adding disks to a vSAN node
  • Create/Delete a Disk Group
  • Clone/vMotion VMs
  • Create/edit/delete storage policies
  • Assign storage policies to individual objects (VMDK, VM Home)
  • Monitoring vSAN
    • Embedded vROps (vSAN 6.7 and above)
    • Performance Dashboard on H5 client
    • Monitor Resync components
    • Monitor via vRealize Log Insight
  • Put vSAN nodes in Maintenance Mode
  • Evacuate Disks
  • Evacuate Disk Groups

 

For more information about operational tests, please refer to the corresponding sections of this vSAN POC guide.

 

 

Performance Testing

Performance testing often receives a lot of attention during vSAN POCs, so it's important to understand the performance requirements of the environment and pay close attention to details such as workload I/O profiles. Prior to conducting a performance test, first develop clear objectives, and understand whether a synthetic performance test is an appropriate benchmark. For a detailed description of recommended performance testing methods, please refer to the ‘Performance and Failure Testing’ section of this guide.

Resiliency Testing

vSAN is designed to protect data availability. By default, vSAN protects availability with 2 replicas of data, based on the vSAN default storage policy. As the number of nodes increases, you are presented with the option to further protect your data from multiple failures by increasing the number of data replicas. With a minimum of 7 nodes, you can have up to 4 data replicas, protecting against up to 3 failures at once, while still maintaining VM availability. For the sake of simplicity, we will keep the vSAN default storage policy in mind for any examples given in this guide (unless otherwise specified).

 

As with any other storage solution, failures can occur on different components at any time due to age, temperature, firmware, etc. Such failures can occur among storage controllers, disks, nodes, and network devices among other components. A failure on any of these components may manifest itself as a failure in vSAN if redundancies (e.g., multiple network paths) are not implemented. 

 

When a failure occurs, vSAN components that comprise objects on the datastore will go into an “absent” or a “degraded” state. Depending on the component state, they will either rebuild immediately (if degraded), or wait for the object repair timer to expire (if absent). By default, the repair delay value is set to 60 minutes. This timer exists because in situations where components are simply ‘absent’ (e.g. due to a host reboot), it may be more advantageous to wait for the absent components to return to the cluster, allowing vSAN to simply perform a ‘delta’ sync to catch these objects up to the present state, rather than fully rebuilding a copy of these components elsewhere.

 

One common test is to physically remove a drive from a live vSAN node. In this scenario, vSAN sees the drive is missing, issues an alert about drive failure, but doesn’t know if the missing drive will return. In this scenario, objects on the drive are put in an absent state. vSAN notes the "absent" state and initiates a 60-minute repair timer countdown. If the drive does not come back within the time specified, vSAN will rebuild the objects to get back into policy compliance. If the drive was pulled by mistake, and is replaced within the 60 minutes, there is no rebuild, and after a quick sync of data, the objects will be marked as healthy again.

 

In cases of a drive failure (permanent device loss, or PDL), the disk and its associated vSAN components are marked as degraded. vSAN will receive error codes from the hardware layer, mark the drive as degraded, and begin the repair of vSAN objects immediately. 

 

Each type of object failure may be tested, and the associated object and component states observed during the POC. There is a Python script available within ESXi that allows injection of various error codes to generate both absent and degraded component states. This Python script is called vsanDiskFaultInjection.pyc. You can see its usage below.
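
The script ships with ESXi; a hedged sketch of locating it and printing its usage is shown below. The path is an assumption that may vary between ESXi builds, so verify it on your hosts before use:

 # Assumed location; confirm on your build before running any injection options
 find / -name "vsanDiskFaultInjection.pyc" 2>/dev/null
 python /usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc -h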

 

Apart from disk failure testing, we also recommend including the following tests to better understand the resiliency features provided by vSAN:

 

  • Simulate node failure with HA enabled
  • Introduce network outage
    • With and without network path redundancy
  • Physical cable pull
  • Network switch failure
  • vCenter failure

 

 

 

vSAN Network Setup

How to configure vSAN Network Settings

Note: Optionally skip to 'Using Quickstart' in the next chapter to quickly configure a new cluster and enable vSAN

Before vSAN can be enabled, all but one of the hosts must be added to the cluster and assigned management IP addresses; the final host is reserved for later testing of adding a host to the cluster. All ESXi hosts in a vSAN cluster communicate over a vSAN network. For network design and configuration best practices please refer to the VMware vSAN Design Guide on https://core.vmware.com/resource/vmware-vsan-design-guide#section1 .

The following example demonstrates how to configure vSAN networking services on an ESXi host.

Creating a VMkernel Port for vSAN

In many deployments, vSAN may be sharing the same uplinks as the vSphere management and vMotion traffic, especially when 10GbE NICs are utilized.  When sharing vSAN traffic with other traffic types, VMware recommends using virtual distributed switches.  Using a virtual distributed switch (vDS) provides Quality of Service (QoS) using a feature called Network I/O Control. Licensing for distributed virtual switches is included with all versions of vSAN.

The assumption for this POC is that there is already a standard vSwitch created and connected to the physical network uplinks that will be used for vSAN traffic. In this example, a separate vSwitch (vSwitch1) with two dedicated 10GbE NICs has been created for vSAN traffic, while the management and vMotion network use different uplinks on a separate standard vSwitch.

To create a vSAN VMkernel port, follow these steps:

Select an ESXi host in the inventory, then navigate to Configure > Networking > VMkernel Adapters. Click on the icon for Add Networking, as highlighted below:

Ensure that VMkernel Network Adapter is chosen as the connection type.

The next step gives you the opportunity to build a new standard vSwitch for the vSAN network traffic.  In this example, an already existing vSwitch1 contains the uplinks for the vSAN traffic. If you do not have this already configured in your environment, you can use an already existing vSwitch, or select the option to create a new standard vSwitch.

If your hosts are limited to 2 x 10GbE uplinks, it often makes sense to use the same vSwitch for all traffic types.

As there is an existing vSwitch in our environment that contains the network uplinks for the vSAN traffic, the “BROWSE” button is used to select it as shown below.

Select an existing standard vSwitch via the "BROWSE" button:

Choose a vSwitch.

vSwitch1 is displayed once selected.

The next step is to set up the VMkernel port properties, and choose the services, such as vSAN traffic. This is what the initial port properties window looks like:

Here is what it looks like when enabling the vSAN service on the VMkernel port:

In the above example, the network label has been designated “vSAN Network”, and the vSAN traffic does not run over a VLAN. If there is a VLAN used for the vSAN traffic in your POC, change this from “None (0)” to an appropriate VLAN ID. Note that the usual practice is to dedicate a specific VLAN for vSAN usage.

The next step is to provide an IP address and subnet mask for the vSAN VMkernel interface. As per the assumptions and prerequisites section earlier, you should have these available before you start. At this point, add one per host by clicking on Use static IPv4 settings as shown below. Alternatively, if you plan on using DHCP IP addresses, leave the default setting which is Obtain IPv4 settings automatically.

The final window shows a summary of the VMkernel configuration. Here you can check that everything will be configured as expected. If anything is incorrect, you can navigate back through the wizard to make corrections. If everything looks correct, click on the FINISH button to apply the configuration change.

If the creation of the VMkernel port is successful, it will appear in the list of VMkernel adapters, as shown below.

This completes the vSAN networking setup for that host. This configuration process must be repeated for all other ESXi hosts that will participate in the vSAN cluster, including the host outside the cluster that will be added later.
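
For reference, the same VMkernel port creation and vSAN traffic tagging can be scripted from the ESXi shell. The following is a hedged sketch; the interface name (vmk2), port group name, and IP addressing are placeholders that must match your environment:

 # Create a VMkernel interface on an existing port group (names and IPs are placeholders)
 esxcli network ip interface add --interface-name=vmk2 --portgroup-name="vSAN Network"
 esxcli network ip interface ipv4 set -i vmk2 -I 192.168.100.11 -N 255.255.255.0 -t static
 # Tag the new interface for vSAN traffic
 esxcli vsan network ip add -i vmk2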

2-Node Direct Connect

In cases where a 2-node vSAN cluster (with witness appliance) is deployed, a separately tagged VMkernel interface may be used for witness traffic transit instead of extending the vSAN network to the witness host. This feature allows for a more flexible network configuration by allowing for separate networks for node-to-node vs. node-to-witness communication. Note that this capability can only be enabled from the command line.

Witness Traffic Separation provides the ability to directly connect vSAN data nodes in a 2-node configuration; traffic destined for the Witness host can be tagged on an alternative physical interface separate from the directly connected network interfaces carrying vSAN traffic. Direct Connect eliminates the need for a 10Gb switch at remote offices/branch offices where the additional cost of the switch could be cost-prohibitive to the solution.


Enabling Witness Traffic Separation is not available from the vSphere Web Client. For the example illustrated above, to enable Witness Traffic on vmk1, execute the following on Host1 and Host2:

esxcli vsan network ip add -i vmk1 -T=witness

Any VMkernel port not used for vSAN traffic can be used for Witness traffic. In a more simplistic configuration, the Management VMkernel interface, vmk0, could be tagged for Witness traffic. The VMkernel port tagged for Witness traffic will be required to have IP connectivity to the vSAN traffic tagged interface on the vSAN Witness Appliance.
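
To confirm which VMkernel interfaces are tagged for vSAN and witness traffic on a host, the tagging can be listed from the ESXi shell; a brief hedged example:

 # Lists tagged interfaces; the Traffic Type field shows 'vsan' or 'witness'
 esxcli vsan network list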


Simplified vSAN Witness setup

The minimum requirement for this setup is a single vNIC uplink, with vmk0 tagged for vSAN traffic.

vmk1 static routes are no longer required, which simplifies the setup to a single vmk/vmnic and uses the vmk0 default gateway to reach any endpoint in the network fabric.

 


 

Note: The ideal network setup uses routed L3 connectivity between the vSAN Witness and the ESXi branch office hosts.

 

Shared Witness for 2-Node ROBO and vSAN Stretched Clusters

vSAN 7.0u2 introduced a shared vSAN witness for 2-node setups, and support was extended to vSAN stretched clusters in 7.0u3. Depending on the vSAN Witness size chosen, a single witness VM can be shared by multiple vSAN 2-node or stretched clusters, reducing host resource requirements.

Graphical example


 

 

Note: See the vSAN Witness VM size recommendations for tiny/small/large/X-large deployments (link).

Enabling vSAN

Steps to enable vSAN

Using Quickstart

The 'Quickstart' feature streamlines vSAN setup. Either follow this section to configure the cluster or use the next two sections to do so manually.  Note that if you deployed vSAN using the bootstrap feature of vCenter deployment, you will not be able to use quickstart to configure your environment.

After creating a new cluster, you are presented with a dialog to edit settings. Provide a name for the cluster and select vSAN from the list of services: 


 

The Quickstart screen is then displayed.


 

The next step is to add hosts. Clicking on the 'Add' button on the 'Add hosts' section presents the dialog below. Multiple hosts can be added at the same time (by IP or FQDN). Additionally, if the credentials of every host are the same, tick the checkbox above the list to quickly complete the form:


 

Once the host details have been entered, click next. You are then presented with a dialog showing the thumbprints of the hosts. If these are as expected, tick the checkbox(es) and then click next:


 

A summary will be displayed, showing the vSphere version on each host and other details. Verify the details are correct and click next:


 

Finally, review and click Finish if everything is in order:


 

After the hosts have been added, validation will be performed automatically on the cluster. Check for any errors and inconsistencies and re-validate if necessary: 


 

The final step is to configure the cluster. After clicking on 'Configure' on step 3, the following dialog allows for the configuration of the distributed switch(es) for the cluster. Leave the default 'Number of distributed switches' set to 1 and assign a name to the switch:


 

Scroll down and configure the port groups and physical adapters as needed, then click next:  


 

On the next screen, set the VLAN and IP addresses to be used, then click next:


 

Select the type of deployment: standard or stretched cluster. Enable any extra features, such as deduplication and compression. Check everything is correct and click next:


 

If possible, disks will be automatically selected as cache or capacity. Check the selection and click next:


 

Configure the fault domains, as required, then click next:


 

If stretched cluster was selected, the fault domain selection will look slightly different. Select the appropriate hosts for each fault domain and click next:


 

For a stretched cluster, choose the witness host, then click next:


 

Select the disks for the witness host. Check and click next:


 

Finally, check everything is as expected and click finish:


 

Wait for the cluster and hosts to be configured correctly then proceed to the next chapter.

Enabling vSAN

Once all the pre-requisites have been met, vSAN can be configured. To enable vSAN complete the following steps:

  1. Open the vSphere HTML5 Client at https://<vcenter-ip>/ui.
  2. Click Menu > Hosts and Clusters.
  3. Select the cluster on which you wish to enable vSAN.
  4. Click the Configure tab.
  5. Under vSAN, select Services and click the CONFIGURE button to start the configuration wizard.
  6. If desired, Stretched Cluster or 2-Node cluster options can be created as part of the workflow. As part of the basic configuration keep the default selection of Single site cluster and click NEXT. 
     
  7. When using an All-Flash configuration, you have the option to enable deduplication and compression features. Deduplication and compression are covered in a later section of this guide.
  8. If encryption of data at rest is a requirement, here is where encryption can be enabled from the start. We will address encryption later in this POC guide.

Note: The process of later enabling deduplication and compression or encryption of data at rest can take quite some time, depending on the amount of data that needs to be migrated during the rolling reformat. In a production environment, if deduplication and compression is a known requirement, it is advisable to enable this while enabling vSAN to avoid multiple occurrences of rolling reformat.

  9. Click NEXT
  10. On the next screen, you can claim all the disks of the same type for either the vSAN caching tier or the capacity tier. For each listed disk, make sure it is listed correctly as a Flash/HDD and Caching/Capacity drive. Click NEXT
  11. If desired, create fault domains
  12. Verify the configuration and click FINISH
     

This completes the configuration process, which can be validated by navigating to [Configure > vSAN > Services].
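
Enablement can also be confirmed from any host in the cluster via the ESXi shell; a hedged example:

 # Shows whether vSAN is enabled, the cluster UUID, and this host's membership state
 esxcli vsan cluster get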

Enable HCI Mesh (If using multiple vSAN enabled clusters)

HCI Mesh (Datastore Sharing) is a new feature in vSAN 7U1, enabled with Enterprise edition or higher licensing.  vSAN storage can now be shared between two clusters, utilizing vSAN’s native data path for cross-cluster connections. 

Each client cluster can mount a maximum of 5 remote vSAN datastores, and a server cluster can export its datastore up to a maximum of 5 client clusters.

MTU size must be kept consistent across datastore connections.

All vSAN features are supported except for Data-in-Transit encryption, Cloud Native Storage (including vSAN Direct), Stretched Clusters, and 2-Node Clusters.  Additionally, HCI Mesh will not support remote provisioning of File Services Shares, iSCSI volumes, or First-Class Disks (FCDs).  File Services, FCDs, and the iSCSI service can be provisioned locally on clusters participating in a mesh topology but may not be provisioned on a remote vSAN datastore.

The same MTU sizing is required for both the Client and Server clusters.

New in vSAN 7.0U2, HCI Mesh supports compute-only clusters. Compute-only clusters do not require a vSAN license to consume a vSAN datastore on another cluster; only the cluster providing the storage needs the vSAN license. HCI Mesh now supports up to 128 connected hosts, up from the previous maximum of 64.

To enable this feature:

  1. Click the cluster to which you want to add the remote datastore.
  2. Click Configure > Datastore Sharing > Mount Remote Datastore.
     
  3. Select the remote datastore that you want to use, then click Next

     
  4. On the Check Compatibility screen click Finish
  5. You should now see your new remote datastore available for use.

     

Check Your Network Thoroughly

Once the vSAN network has been created and vSAN is enabled, you should check that each ESXi host in the vSAN cluster is able to communicate to all other ESXi hosts in the cluster. The easiest way to achieve this is via the vSAN Health Check.
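
If you prefer to spot-check connectivity manually first, vmkping can be run from one host against the vSAN VMkernel IP of another. A hedged sketch follows; vmk2 and the target IP are placeholders, and the -d and -s options test that the configured MTU passes end to end without fragmentation:

 # Ping a peer's vSAN VMkernel IP out of the local vSAN interface without fragmentation
 vmkping -I vmk2 192.168.100.12 -d -s 1472
 # For a 9000-byte MTU network, test with a payload of 8972 instead
 vmkping -I vmk2 192.168.100.12 -d -s 8972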

Why Is This Important?

vSAN is dependent on the network: its configuration, reliability, performance, and so on. One of the most frequent causes of support requests is an incorrect network configuration or a network not performing as expected.

Use Health Check to Verify vSAN Functionality

Running individual commands from one host to all other hosts in the cluster can be tedious and time-consuming. Fortunately, since vSAN 6.0, vSAN has a health check system, part of which tests the network connectivity between all hosts in the cluster. One of the first tasks to do after setting up any vSAN cluster is to perform a vSAN Health Check. This will reduce the time to detect and resolve any networking issue, or any other vSAN issues in the cluster.
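
The same health checks can also be queried from the ESXi shell, which is useful when the vSphere Client is unavailable; a hedged example:

 # Summarizes the cluster health checks, including the network checks, from this host's view
 esxcli vsan health cluster list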

To run a vSAN Health Check, navigate to [vSAN cluster] > Monitor > vSAN > Health and click the RETEST button.

In the screenshot below, one can see that each of the health checks for networking has successfully passed.

If any of the network health checks fail, select the appropriate check and examine the details pane on the right for information on how to resolve the issue. Each detailed view under the Info tab also contains an AskVMware button where appropriate, which will take you to a VMware Knowledge Base article detailing the issue, and how to troubleshoot and resolve it.

Before going any further with this POC, download the latest version of the HCL database and run a RETEST on the Health check screen. Do this by selecting vSAN HCL DB up-to-date health check under the "Hardware compatibility" category, and choosing GET LATEST VERSION ONLINE or UPDATE FROM FILE...  if there is no internet connectivity.

The Performance Service is enabled by default. You can check its status from [vSAN cluster] > Configure > vSAN > Services. If it needs to be enabled, click the EDIT button next to Performance Service and turn it on using the defaults. The Performance Service provides vSAN performance metrics to vCenter and other tools like vRealize Operations Manager.


 

To ensure everything in the cluster is optimal, the Health service will also check the hardware against the VMware Compatibility Guide (VCG) for vSAN. Verify that the networking is functional, that there are no underlying disk problems or vSAN integrity issues. The desired goal is to have all the health checks succeed.

At this point, vSAN is successfully deployed. The remainder of this POC guide will involve various tests and error injections to show how vSAN will behave under these circumstances.

Enable vSAN RDMA

RoCEv2 support for vSAN was introduced in vSphere 7.0U2 and improves performance across hosts in the vSAN cluster.

The RDMA RoCEv2 protocol is used instead of the default TCP/IP flow to achieve the lowest possible latency between two physical endpoints.

NIC adapters can use both RDMA and TCP/IP at the same time, which allows the NIC to do more work.

RDMA also avoids the TCP/IP three-way handshake, lowering the network transport latency between two physical endpoints.

Requirements:

  • Network card support for RDMA RoCEv2

  • Switch support for RDMA RoCEv2

  • Switch configuration for PFC

  • Not supported with vSAN Stretched Clusters or 2-node clusters

  • No LACP configured on the vSAN network uplinks

  • Enable RDMA function for vSAN

 

Verify Network cards for RDMA support:


The RDMA support flag must be present before the vSAN RDMA transport can be enabled in the final setup.

Note: If the RDMA flag is not visible, verify the NIC driver/firmware against the vSphere HCL and the hardware vendor specification. The NIC driver automatically detects RDMA capabilities on the physical switch; if they are not available, please contact your switch vendor for assistance.

vSAN RDMA uses RoCEv2 as its protocol layer. When RDMA support is not available on the physical link or setup, communication automatically falls back to standard TCP/IP.

 

Setup Process:

  • Switch setup

  • Verify RDMA functionality on the ESXi hosts

  • Enable vSAN RDMA

 

  1. Switch configuration example – setup:

Mellanox Switch SN2100:

  • PFC (RDMA) support to be enabled
  • PFC (RDMA) support across switch (ISLs) to be enabled
  • PFC priority 3 or 4

 

Enable PFC on switch

switch01 [standalone: master] (config) # dcb priority-flow-control enable

This action might cause traffic loss while shutting down a port with priority-flow-control mode on
Type 'yes' to confirm enable pfc globally: yes

 

Enable Switch PFC for priority 4

dcb priority-flow-control priority 4 enable

 

Assign PFC to port (ESXi uplink) – example eth1/9:

switch01 [standalone: master] (config) # interface ethernet 1/9 dcb priority-flow-control mode on force

 

Verify the available RDMA adapters through the ESXi shell:

esxcli rdma device list

Name     Driver      State   MTU   Speed     Paired Uplink  Description
-------  ----------  ------  ----  --------  -------------  -----------
vmrdma0  nmlx5_rdma  Active  4096  100 Gbps  vmnic4         MT27700 Family
vmrdma1  nmlx5_rdma  Active  4096  100 Gbps  vmnic5         MT27700 Family

 

The vmrdmaX virtual RDMA adapter details provide information on the state, MTU size (see the hardware-specific documentation), and the paired physical adapter.

Note: To take advantage of RDMA, jumbo frames must be enabled on the physical switch. The RDMA adapter supports an MTU size of up to 4096.
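
If jumbo frames are not yet configured on the vSphere side, the MTU can be raised on the standard vSwitch and the vSAN VMkernel interface from the ESXi shell. A hedged sketch follows; vSwitch1 and vmk2 are placeholders, and the physical switch ports must also be configured for jumbo frames:

 # Raise the MTU on the standard vSwitch carrying vSAN traffic (placeholder name)
 esxcli network vswitch standard set -v vSwitch1 -m 9000
 # Raise the MTU on the vSAN VMkernel interface (placeholder name)
 esxcli network ip interface set -i vmk2 -m 9000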

 

Verify ESXi RDMA PFC status in esxcli – example vmnic4:

esxcli network nic dcb status get -n vmnic4

Nic Name: vmnic4
Mode: 3 - IEEE Mode
Enabled: true
Capabilities:
   Priority Group: true
   Priority Flow Control: true
   PG Traffic Classes: 8
   PFC Traffic Classes: 8
PFC Enabled: true
PFC Configuration: 0 0 0 0 1 1 0 0

 

If an error is returned during command execution, verify the driver/firmware combination against the vSphere HCL (link).

 

esxcli network nic dcb status get --nic-name=vmnic5

 

DCB not supported for NIC vmnic5: Unable to complete Sysinfo operation.  Please see the VMkernel log file for more details.: Not supported

 

Note: vSAN Health check invokes a similar process to query the device DCB status

 

Verify the available RDMA protocols on the ESXi host:

esxcli rdma device protocol list

Device    RoCE v1   RoCE v2   iWARP
-------   -------   -------   -----
vmrdma0   true      true      false
vmrdma1   true      true      false

 

Verify vSAN health check in Virtual Center:

Example: PFC priority is not set to 3 or 4


Example: PFC is not enabled on Switch



Enable vSAN Network for RDMA

Once the prerequisites above are satisfied, RDMA support for vSAN can be enabled in the vSphere Client under [vSAN Cluster] > Configure > vSAN > Services > Network, by editing the network settings and turning on the RDMA support option.

Verify virtual RDMA adapter performance with esxtop

  • Log in to ESXi via SSH
  • Run esxtop
    • Press “r” (RDMA view)

During load we can identify vmrdma0 performance:


  • Press “n” (network view):

 

Verify that there is no VMkernel TCP/IP traffic during the I/O workload (vSAN traffic is flowing over RDMA instead).


 

This comparison verifies that vSAN is making full use of RDMA during the I/O workload.

Verify functionality of RDMA and TCP/IP on the same physical vmnic

Setup:

  • vSAN RDMA enabled
  • Prepare DVS / vSwitch portgroup for VMs using RDMA enabled vmnicX
  • Prepare two VMs with iperf/iperf3 package installed
    • Both VMs require IP settings
  • Place both VMs on different hosts
  • Run one VM as the iperf server with iperf3 -s
  • Run the second VM as the client with iperf3 -c a.b.c.d
    • a.b.c.d – IP of the iperf3 server
  • Cross-verify with esxtop difference between “n” network and “r” RDMA during iperf3 workload

This setup verifies that the RDMA transport layer is not used for standard TCP/IP traffic, which is handled separately at the vmnic layer.

 

RDMA troubleshooting:

  • esxtop

In the RDMA view, esxtop provides additional fields that can be enabled through the “f” key.

* A:  NAME      = Name of device
  B:  DRIVER    = Driver
  C:  STATE     = State
* D:  TEAM-PNIC = Team Uplink Physical NIC Name
* E:  PKTTX/s   = Packets Tx/s
* F:  MbTX/s    = Megabits Tx/s
* G:  PKTRX/s   = Packets Rx/s
* H:  MbRX/s    = Megabits Rx/s
  I:  %PKTDTX   = % Packets Dropped (Tx)
  J:  %PKTDRX   = % Packets Dropped (Rx)
* K:  QP        = Number of Queue Pairs Allocated
  L:  CQ        = Number of Completion Queue Pairs Allocated
  M:  SRQ       = Number of Shared Receive Queues Allocated
* N:  MR        = Memory Regions Allocated

 

Toggle fields with a-n, any other key to return:

The default field selection enables only the minimum set needed for performance monitoring: megabits per second, queue pairs (QP), and allocated memory regions (MR). For in-depth RDMA diagnostics, please contact your hardware vendor.

  • esxcli rdma device statistics

Detailed adapter statistics show adapter health during RDMA operation. Errors are not expected. Queue pairs are adjusted automatically as required.

esxcli rdma device stats get -d vmrdma0


Basic vSphere Functionality on vSAN

Deploy your first Virtual Machine

In this section, a VM is deployed to the vSAN datastore using the default storage policy. This default policy is preconfigured and does not require any intervention unless you wish to change the default settings, which we do not recommend.

To examine the default policy settings, navigate to Menu > Shortcuts > VM Storage Policies.


 

From there, select vSAN Default Storage Policy. Look under the Rules tab to see the settings on the policy:


 

We will return to VM Storage Policies in more detail later, but suffice it to say that when a VM is deployed with the default policy, a mirror copy of the VM data is created. This second copy of the VM data is placed on storage on a different host, enabling the VM to tolerate any single failure. Also note that object space reservation is set to 'Thin provisioning', meaning that the object is deployed as “thin”. After we have deployed the VM, we will verify that vSAN adheres to both of these capabilities.

One final item to check before we deploy the VM is the current free capacity on the vSAN datastore. This can be viewed from the [vSAN cluster] > Monitor > vSAN > Capacity view. In this example, it is 4.37 TB.


 

Make a note of the free capacity in your POC environment before continuing with the deploy VM exercise.

To deploy the VM, simply follow the steps provided in the wizard.

Select New Virtual Machine from the Actions Menu.

Select Create a new virtual machine.

At this point, a name for the VM must be provided, and then the vSAN Cluster must be selected as a compute resource.

Enter a Name for the VM and select a folder:


 

Select a compute resource:

Up to this point, the virtual machine deployment process is identical to all other virtual machine deployments that you have done on other storage types. It is the next section that might be new to you. This is where a policy for the virtual machine is chosen. 

From the next menu, "4. Select storage", select the vSAN datastore, and the Datastore Default policy will actually point to the vSAN Default Storage Policy.

Once the policy has been chosen, datastores are split into those that are either compliant or non-compliant with the selected policy. As seen below, only the vSAN datastore can utilize the policy settings in the vSAN Default Storage Policy, so it is the only one that shows up as Compatible in the list of datastores.


 

The rest of the VM deployment steps in the wizard are quite straightforward, and simply entail selecting ESXi version compatibility (leave at default), a guest OS (leave at default) and customize hardware (no changes). Essentially you can click through the remaining wizard screens without making any changes.

Verifying Disk Layout of a VM stored in vSAN 

Once the VM is created, select the new VM in the inventory, navigate to the Configure tab, and then select Policies. There should be two objects shown, "VM home" and "Hard disk 1". Both should show a compliance status of Compliant meaning that vSAN was able to deploy these objects in accordance with the policy settings.


 

To verify this, navigate to the Cluster's Monitor tab, and then select Virtual Objects. Once again, both the “VM home” and “Hard disk 1” should be displayed. Select “Hard disk 1” followed by View Placement Details. This should display a physical placement of RAID 1 configuration with two components, each component representing a mirrored copy of the virtual disk. It should also be noted that different components are located on different hosts. This implies that the policy setting to tolerate 1 failure is being adhered to.


 

The witness item shown above is used to maintain a quorum on a per-object basis. For more information on the purpose of witnesses, and objects and components in general, refer to the VMware vSAN Design Guide on core.vmware.com
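
The same component layout, including the witness component, can also be inspected from the ESXi shell; a hedged example (the output can be lengthy, so it is piped through a pager):

 # Lists vSAN objects with their component placement, states, and owning hosts
 esxcli vsan debug object list | more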

The “object space reservation” policy setting defines how much space is initially reserved on the vSAN datastore for a VM's objects. By default, it is set to "Thin provisioning", implying that the VM’s storage objects are entirely “thin” and consume no unnecessary space. Noting the free capacity of the vSAN datastore after deploying the VM, we see that it is very close to what it was before the VM was deployed, as displayed:


 

Because we have not installed anything in the VM (such as a guest OS), only a tiny portion of the vSAN datastore has been used so far, verifying that the object space reservation setting of "Thin provisioning" is working correctly (observe that the "Virtual disks" and "VM home" objects consume less than 1GB in total, as highlighted in the "Used Capacity Breakdown" section).

Do not delete this VM as we will use it for other POC tests going forward.

Creating a Snapshot

Using the virtual machine created previously, take a snapshot of it. The snapshot can be taken when the VM is powered on or powered off. The objectives are to see that:

  • no setup is needed to make vSAN handle snapshots
  • the process for creating a VM snapshot is unchanged with vSAN
  • a successful snapshot delta object is created
  • the policy settings of the delta object are inherited directly from the base disk object

From the VM object in vCenter, click Actions > Snapshots > Take Snapshot... 

Take a Snapshot of the virtual machine created in the earlier step.

Provide a name for the snapshot and optional description.

Once the snapshot has been requested, monitor tasks and events to ensure that it has been successfully captured. Once the snapshot creation has completed, additional actions will become available in the snapshot drop-down window. For example, there is a new action to Revert to Latest Snapshot and another action to Manage Snapshots.


 

Choose the Manage Snapshots option. The following is displayed. It includes details regarding all snapshots in the chain, the ability to delete one or all of them, as well as the ability to revert to a particular snapshot.


 

To see snapshot delta object information from the UI, navigate to [vSAN Cluster] > Monitor > vSAN > Virtual Objects.


 

There are now three objects that are associated with that virtual machine. First is the "VM Home" namespace. "Hard disk 1" is the base virtual disk, and "Hard disk 1 - poc-test-vm1.vmdk" is the snapshot delta. Notice that the snapshot delta inherits its policy settings from the base disk, which adheres to the vSAN Default Storage Policy.

The snapshot can now be deleted from the VM. Monitor the VM’s tasks and ensure that it deletes successfully. When complete, snapshot management should look like this.


 

This completes the snapshot section of this POC. Snapshots in a vSAN datastore are very intuitive because they utilize vSphere native snapshot capabilities.

Clone a Virtual Machine

The next POC test is cloning a VM. We will continue to use the same VM as before. This time, make sure the VM is powered on first. There are several different cloning operations available in vSphere 7; they can be found under the VM's Actions > Clone menu.

The one that we will be running as part of this POC is the “Clone to Virtual Machine”. The cloning operation is a straightforward click-through operation. This next screen is the only one that requires human interaction. Simply provide the name for the newly cloned VM, and a folder if desired.


 

We are going to clone the VM in the vSAN cluster, so this must be selected as the compute resource.


 

On the “Select Storage” screen select the source datastore for the VM, “vsanDatastore”. This will all be pre-selected for you if the VM being cloned also resides on the vsanDatastore.

Select from the available options (leave unchecked - default)

This will take you to the “Ready to complete” screen. If everything is as expected, click FINISH to commence the clone operation. Monitor the VM tasks for the status of the clone operation.


 

Do not delete the newly cloned VM, we will be using it in subsequent POC tests.

This completes the cloning section of this POC.

vMotion a Virtual Machine Between Hosts

The first step is to power on the newly cloned virtual machine. We will migrate this VM from one vSAN host to another vSAN host using vMotion.

Note: Take a moment to revisit the network configuration and ensure that the vMotion network is distinct from the vSAN network. If these features share the same network, performance will not be optimal.

First, determine which ESXi host the VM currently resides on. Selecting the Summary tab of the VM shows this. In this POC task, the VM that we wish to migrate is on host poc2.vsanpe.vmware.com.


 

Right-click on the VM and select Migrate. 


 

"Migrate" allows you to migrate to a different compute resource (host), a different datastore or both at the same time. In this initial test, we are simply migrating the VM to another host in the cluster, so this initial screen should be left at the default of “Change compute resource only”.

Select Change compute resource only 

 

Select a destination host and click Next.

Select a destination network and click Next.

 

The vMotion priority can be left as High (default); click Next.

At the “Ready to complete” window, click on FINISH to initiate the migration. If the migration is successful, the summary tab of the virtual machine should show that the VM now resides on a different host.

 

 

Verify that the VM has been migrated to a new host.

 

Do not delete the migrated VM. We will be using it in subsequent POC tests.

This completes the “VM migration using vMotion” section of this POC. As you can see, vMotion works seamlessly with vSAN.

Storage vMotion a VM Between Datastores

This test will only be possible if you have another datastore type available to your hosts, such as NFS/VMFS. If so, then the objective of this test is to successfully migrate a VM from another datastore type into vSAN and vice versa. The VMFS datastore can even be a local VMFS disk on the host.

Mount an NFS Datastore to the Hosts

The steps to mount an NFS datastore to multiple ESXi hosts are described in the vSphere 7 Administrators Guide.  See the Create NFS Datastore in the vSphere Client topic for detailed steps.

Storage vMotion a VM from vSAN to another Datastore Type

Currently, the VM resides on the vSAN datastore. As we did before, launch the migrate wizard, however, on this occasion move the VM from the vSAN datastore to another datastore type by selecting Change storage only.

In this POC environment, we have an NFS datastore presented to each of the ESXi hosts in the vSAN cluster. This is the intended destination datastore for the virtual machine.

Select destination storage of NFS-DS.

One other item of interest in this step is that the VM Storage Policy should also be changed to Datastore Default as the NFS datastore will not understand the vSAN policy settings.

 

At the “Ready to complete” screen, click FINISH to initiate the migration:

Once the migration completes, the VM Summary tab can be used to examine the datastore on which the VM resides.


 

Verify that the VM has been moved to the new storage.

Scale Out vSAN

Scaling out vSAN by adding a host into the cluster

One nice feature of vSAN is its simple scale-out model. If you need more compute or storage resources in the cluster, simply add another host to the cluster.

Before initiating the task, revisit the current state of the cluster. There are currently three hosts in the cluster, and there is a fourth host not in the cluster. We also created two VMs in the previous exercises.


 

Let us also remind ourselves of how big the vSAN datastore is.


 

In the current state, the size of the vSAN datastore is 3.52TB, with 3.47TB of free capacity.

Add the Fourth Host to vSAN Cluster

We will now proceed with adding a fourth host to the vSAN Cluster.

 Note:  Back in section 2 of this POC guide, you should have already set up a vSAN network for this host. If you have not done that, revisit section 2, and set up the vSAN network on this fourth host.

Having verified that the networking is configured correctly on the fourth host, select the new host in the inventory, right-click on it and select the option Move To…  as shown below.


 

You will then be prompted to select the location to which the host will be moved. In this POC environment, there is only one vSAN cluster. Select that cluster.

Select a cluster as the destination for the host to move into.

The next screen is related to resource pools. You can leave this at the default, which is to use the cluster’s root resource pool, then click OK.


 

This moves the host into the cluster. Next, navigate to the Hosts and Clusters view and verify that the cluster now contains the new node.


 

As you can see, there are now 4 hosts in the cluster. However, you will also notice from the Capacity view that the vSAN datastore has not changed with regards to total and free capacity. This is because vSAN does not claim any of the new disks automatically. You will need to create a disk group for the new host and claim disks manually. At this point, it would be good practice to re-run the health check tests. If there are any issues with the fourth host joining the cluster, use the vSAN Health Check to see where the issue lies. Verify that the host appears in the same network partition group as the other hosts in the cluster.
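
Cluster membership can also be cross-checked from the ESXi shell of the new host; a hedged example:

 # The unicast agent list should contain the vSAN IPs of the other cluster members
 esxcli vsan cluster unicastagent list
 # The Sub-Cluster Member Count should now reflect four hosts
 esxcli vsan cluster get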

Creating a Disk Group on a New Host

Navigate to [vSAN Cluster] > Configure > vSAN > Disk Management, select the new host and then click on the highlighted icon to claim unused disks for a new disk group:


 

As before, we select a flash device as a cache disk and three flash devices as capacity disks. This is so that all hosts in the cluster maintain a uniform configuration.

Select flash and capacity devices.

Verify vSAN Disk Group Configuration on New Host

Once the disk group has been created, the disk management view should be revisited to ensure that it is healthy.
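
Disk group membership on the new host can also be confirmed from its ESXi shell; a hedged example:

 # Lists the devices claimed by vSAN on this host; 'Is Capacity Tier: true' marks capacity devices
 esxcli vsan storage list | egrep "Device:|Is SSD:|Is Capacity Tier:"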


 

Verify New vSAN Datastore Capacity

The final step is to ensure that the vSAN datastore has now grown in accordance with the capacity devices in the disk group that was just added on the fourth host. Return to the Capacity view and examine the total and free capacity fields.


 

As we can clearly see, the vSAN datastore has now grown in size to 4.69 TB. Free space is shown as 4.62 TB as the amount of space used is minimal.  The original datastore capacity with three hosts (in the example POC environment) was 3.52TB.

This completes the “Scale-Out” section of this POC. As seen, scale-out on vSAN is simple but very powerful.

Monitoring vSAN

When it comes to monitoring vSAN, there are several areas that need particular attention. 

These are the key considerations when it comes to monitoring vSAN:

  • Overall vSAN Health
  • vSAN Capacity
  • Resynchronization & rebalance operations in the vSAN cluster
  • Performance monitoring through the vCenter UI and the command-line utility (vsantop); see the example after this list
  • Advanced monitoring through integrated vROPS dashboards
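
For command-line performance monitoring, vsantop can be run directly on an ESXi host (available with vSAN 7.0 and later); a brief example:

 # Interactive, top-like view of vSAN performance entities on this host
 vsantop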

Overall vSAN Health

The first item to monitor is the overall health of the cluster. vSAN Skyline Health provides a consolidated list of health checks that correlate to the resiliency and performance of a vSAN cluster. From the vCenter, navigate to the cluster object, then go to [vSAN Cluster] > Monitor > vSAN > Skyline Health. This provides a holistic view of the health states pertaining to hardware and software components that constitute a vSAN cluster. There is an exhaustive validation of components states, configuration, and compatibility.

SCSI Device Controller: (7.0u3 and prior)

 

NVMe Device Controller: (7.0u3+)

 

More information about this is available here - Working with vSAN Health checks.

vSAN Capacity

vSAN storage capacity usage may be examined by navigating to [vSAN Cluster] > Monitor > vSAN > Capacity. This view provides a summary of current vSAN capacity usage, and also displays historical capacity usage information when Capacity History is selected. From the default view, a breakdown of capacity usage per object type is presented. In addition, a capacity analysis tool is available that shows the effective free space remaining with respect to an individual storage policy.

 

Note that beginning with vSAN 7.0, the vSAN UI now distinguishes vSphere replication objects within the capacity view.

Prior to vSAN 7u1, VMware recommended reserving 25-30% of total capacity for use as “slack space”.  This space is utilized during operations that temporarily consume additional storage space, such as host rebuilds, maintenance mode, or when VMs change storage policies.  Beginning with vSAN 7u1, this concept has been replaced with “capacity reservation”.

An improved methodology for calculating the amount of capacity set aside for vSAN operations yields significant gains in capacity savings (up to 18% in some cases). Additionally, the vSAN UI makes it simple to understand what amount of capacity is being reserved for vSAN’s temporary operations associated with normal usage, versus for host rebuilds (one host of capacity reserved for maintenance and host failure events).

This feature should be enabled during normal vSAN operations. To enable this new feature:

Click Reservations and Alerts.
 

Tick the Operations Reserve and the Host Rebuild Reserve options. 

 

Note that when Operations Reserve and Host Rebuild Reserve are enabled, “soft” thresholds are implemented that will attempt to prevent over-consumption of vSAN datastore capacity. In addition to triggering warnings/alerts in vSphere when capacity utilization is in danger of consuming space set aside as reserved, once the capacity threshold is met, operations such as provisioning new VMs, virtual disks, FCDs, clones, iSCSI targets, snapshots, file shares, or other new objects consuming vSAN datastore capacity will not be allowed.

Note that I/O activity for existing VMs and objects will continue even if the threshold is exceeded, ensuring that current workloads remain available and continue functioning as expected.

As VMs will continue to be able to write to provisioned space, it is important that administrators monitor for capacity threshold alerts, and take action to free up (or add) capacity to the vSAN cluster before capacity consumption significantly exceeds the set thresholds.

vSAN 7.0u2 introduces additional monitoring capabilities for oversubscription on the vSAN datastore. Within the vCenter UI, an estimate of the capacity required if thin-provisioned objects were fully provisioned has been added to the monitoring summary at vSAN Datastore > Monitor > vSAN > Capacity:


This update also introduced a more user-friendly method to customize thresholds for triggering capacity warnings and errors in vSAN Health. To view this information and modify alerts, navigate to vSAN Datastore > Monitor > vSAN > Capacity, and click [Reservations and Alerts] on the bottom-right of the ‘Capacity Overview’ summary.

 

Resync Operations

Another very useful view is [vSAN Cluster] > Monitor > vSAN > Resyncing Objects. This displays any resyncing or rebalancing operation that might be taking place on the cluster. For example, if there were a device failure, resyncing or rebuilding activity could be observed here. Resync can also happen if a device was removed or a host failed and the CLOMd (Cluster Level Object Manager daemon) repair delay timer expired. The Resyncing Objects dashboard provides details of the resync status, the amount of data in transit, and the estimated time to completion.
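
Resync activity can also be checked from the ESXi shell, which is handy for scripting or when the vSphere Client is unavailable; a hedged example:

 # Summarizes the bytes left to resync and the number of objects currently resyncing
 esxcli vsan debug resync summary get
 # Lists the individual objects being resynchronized
 esxcli vsan debug resync list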

With regards to rebalancing, vSAN attempts to keep all physical disks at less than 80% capacity. If any physical disks’ capacity passes this threshold, vSAN will move components from this disk to other disks in the cluster in order to rebalance the physical storage.

In an ideal state, no resync activity should be observed, as shown below. 


 

Resyncing activity usually indicates:

  • a failure of a device or host in the cluster
  • a device has been removed from the cluster
  • a physical disk has greater than 80% of its capacity consumed
  • a policy change has been implemented which necessitates rebuilding a VM’s object layout. In this case, a new object layout is created and synchronized with the source object, and the source object is then discarded

vSAN 7.0 also introduces visibility of vSphere replication object types within the Virtual Objects view, allowing administrators to clearly distinguish replica data from other data types.

Demonstrating the appearance of replica disk objects within the vSAN UI
 

Performance Monitoring through vCenter UI

The performance monitoring service can be used to verify performance as well as to quickly troubleshoot performance-related issues. Performance charts are available at many different levels:

  • Cluster
  • Hosts
  • Virtual Machines and Virtual Disks
  • Disk groups
  • Physical disks

A detailed list of performance graphs and descriptions can be found here.

The performance monitoring service is enabled by default. If it has been disabled, it can be re-enabled through the following steps:

Navigate to the vSAN Cluster.

  • Click the Configure tab.
  • Select Services from the vSAN section.
  • Navigate to Performance Service and click EDIT to edit the performance settings.


 

Once the service has been enabled, performance statistics can be viewed from the performance menus in vCenter.  The following example is meant to provide an overview of using the performance service. For the purposes of this exercise, we will examine IOPS, throughput, and latency at the Virtual Machine level and the vSAN Backend level.

The cluster-level view shows metrics aggregated across the whole cluster, including all virtual machines. To access cluster-level performance graphs:

  • From the Cluster level in vCenter, Click on the Monitor tab.
  • Navigate to the vSAN section and click on Performance 


 

For this portion of the example, we will step down a level and examine the performance statistics for the vSAN Backend. To access the vSAN Backend performance metrics, select the BACKEND tab from the menu on the left.


 

File Service is a new feature included with vSAN 7 that helps unify block and file storage. When enabled, you can monitor the performance of the File Shares in the FILE SHARE tab.  More information on vSAN File Services can be found here.


 

The performance service allows administrators to view not only real-time data but historical data as well. By default, the performance service looks at the last hour of data.  This time window can be increased or changed by specifying a custom range.

vSAN 7U1 brings with it some new features around performance monitoring.  First, it is easier to compare VM performance.  From the cluster level, click Monitor and then Performance.  Now we can look at the cluster level or show specific VMs (Up to 10 at a time). 

 

This makes it easy to compare IOPS, Throughput, and Latency for multiple VMs. 

 

The next major improvement to performance monitoring is the inclusion of IOInsight.  Click IOInsight and then New Instance.  You can select entire hosts or specific VMs to monitor.  IOInsight can monitor for a period ranging from 1 minute to 24 hours.  The system limits IOInsight monitoring overhead to 1% CPU and memory, but when running at high IOPS (around 200K/host), you might see a 2-3% drop in IOPS while the capture is running. 

 

The capture shows you detailed information coming from each VM.  Key metrics include IOPS, Throughput, Latency, Random/Sequential, Alignment, Read/Write %, and IO Size (Block Size) Distribution. 
 

 

Network Monitoring

vSAN is reliant upon upstream networking resources to transmit data between cluster nodes, making network health and performance a critical aspect that influences vSAN performance and reliability.

vSAN 7.0u2 introduces new network monitoring capabilities that are useful in isolating potential network issues at the TCP/IP and physical layers that may adversely impact vSAN performance.

Newly introduced metrics, visible in the vCenter UI at [ESXi Host] > Monitor > vSAN > Performance > Physical Adapters, include:

Each metric is listed with its default alert thresholds (Yellow Warning / Red Error):

  • pNIC Flow Control (AKA RX/TX Pauses): >1% / >10%
  • pNIC CRC Error: >0.1% / >1%
  • pNIC TX Carrier Error: >0.1% / >1%
  • pNIC RX Generic Error: >0.1% / >1%
  • pNIC TX Generic Error: >0.1% / >1%
  • pNIC RX Missed Error: >0.1% / >1%
  • pNIC Buffer Overflow Error: >0.1% / >1%
  • pNIC RX FIFO Error: >0.1% / >1%

 


Additional useful host networking metrics were introduced in vSAN 7.0u1 and earlier, visible in the vCenter UI at [ESXi Host] > Monitor > vSAN > Performance > Host Network:


 

Monitoring vSAN through integrated vRealize Operations Manager in vCenter

Monitoring vSAN has become simpler and accessible from the vCenter UI. This is made possible through the integration of vRealize Operations Manager plugin in vCenter.

The feature is enabled through the HTML5 based vSphere client and allows an administrator to either install a new instance or integrate with an existing vRealize Operations Manager.

You can initiate the workflow by navigating to Menu > vRealize Operations as shown below:


 

Once the integration is complete, you can access the predefined dashboards as shown below:


 

The following out-of-the-box dashboards are available for monitoring purposes:

  • vCenter - Overview
  • vCenter - Cluster View
  • vCenter - Alerts
  • vSAN - Overview
  • vSAN - Cluster View
  • vSAN - Alerts

From a vSAN standpoint, the Overview, Cluster View, and Alerts dashboards give an administrator a snapshot of the vSAN cluster. Specific performance metrics such as IOPS, Throughput, Latency, and Capacity-related information are available as depicted below:

VM Storage Policies and vSAN

VM Storage Policies form the basis of VMware’s Software-Defined Storage vision. Rather than deploying VMs directly to a datastore, a VM Storage Policy is chosen during initial deployment. The policy contains the characteristics and capabilities of the storage required by the virtual machine. Based on the policy contents, the correct underlying storage is chosen for the VM.

If the underlying storage meets the VM storage Policy requirements, the VM is said to be in a compliant state.

If the underlying storage fails to meet the VM storage Policy requirements, the VM is said to be in a non-compliant state.

In this section of the POC Guide, we shall look at various aspects of VM Storage Policies. The virtual machines that have been deployed thus far have used the vSAN Default Storage Policy, which has the following settings:

  • Storage Type: vSAN
  • Site disaster tolerance: None (standard cluster)
  • Failures to tolerate: 1 failure - RAID-1 (Mirroring)
  • Number of disk stripes per object: 1
  • IOPS limit for object: 0
  • Object space reservation: Thin provisioning
  • Flash read cache reservation: 0%
  • Disable object checksum: No
  • Force provisioning: No

In this section of the POC, we will walk through the process of creating additional storage policies. 

Create a New VM Storage Policy

In this part of the POC, we will build a policy that creates a stripe width of two for each storage object deployed with this policy. The VM Storage Policies can be accessed from the 'Shortcuts' page on the vSphere client (HTML 5) as shown below.

There will be some existing policies already in place, such as the vSAN Default Storage Policy (which we’ve already used to deploy VMs in section 4 of this POC guide). 

To create a new policy, click on Create VM Storage Policy.


 

The next step is to provide a name and an optional description for the new VM Storage Policy. Since this policy will contain a stripe width of 2, we have given it a name to reflect this. You may also give it a name to reflect that it is a vSAN policy.

The next section sets the policy structure. We select Enable rules for "vSAN" storage to create a vSAN-specific policy.

Now we get to the point where we create a set of rules. The first step is to select the Availability of the objects associated with this rule, i.e. the failures to tolerate.

We then set the Advanced Policy Rules. Once this is selected, the six customizable capabilities associated with vSAN are exposed.  Since this VM Storage Policy is going to have a requirement where the stripe width of an object is set to two, this is what we select from the list of rules. It is officially called “Number of disk stripes per object”.


 

Clicking NEXT moves on to the Storage Compatibility screen. Note that this displays which storage “understands” the policy settings. In this case, the vsanDatastore is the only datastore that is compatible with the policy settings.

Note: This does not mean that the vSAN datastore can successfully deploy a VM with this policy; it simply means that the vSAN datastore understands the rules or requirements in the policy.


 

At this point, you can click on NEXT to review the settings. On clicking FINISH, the policy is created.


 

Let’s now go ahead and deploy a VM with this new policy, and let’s see what effect it has on the layout of the underlying storage objects.

Note: vSAN 7 includes a pre-defined storage policy for File Service called "FSVM_Profile_DO_NOT_MODIFY". This is intended for File Service specific entities and should not be modified.

Deploy a new VM with a new Storage Policy

The workflow to deploy a New VM remains the same until we get to the point where the VM Storage Policy is chosen. This time, instead of selecting the default policy, select the newly created StripeWidth=2 policy as shown below.


 

As before, the vsanDatastore should show up as the compatible datastore, and thus the one to which this VM should be provisioned.


 

Now let's examine the layout of this virtual machine, and see if the policy requirements are met; i.e. do the storage objects of this VM have a stripewidth of 2? First, ensure that the VM is compliant with the policy by navigating to [VM] > Configure > Policies, as shown here.

 

The next step is to select the [vSAN Cluster] > Monitor > vSAN > Virtual Objects and check the layout of the VM’s storage objects. The first object to check is the "VM Home" namespace. Select it, and then click on the View Placement Details icon.

 

This shows that there is only a mirrored component per replica and no striping (striping would be displayed as a RAID 0 configuration). Why? The reason is that the "VM home" namespace object does not benefit from striping, so it ignores this policy setting. Therefore, this behavior is normal and to be expected.


 

Now let’s examine “Hard disk 1” and see if that layout is adhering to the policy. Here we can clearly see a difference. Each replica or mirror copy of the data now contains two components in a RAID 0 configuration. This implies that the hard disk storage objects are indeed adhering to the stripe width requirement that was placed in the VM Storage Policy.


 

Note that each striped component must be placed on its own physical disk. There are enough physical disks to meet this requirement in this POC. However, a request for a larger stripe width would not be possible in this configuration. Keep this in mind if you plan a POC with a large stripe width value in the policy.

It should also be noted that snapshots taken of this base disk continue to inherit the policy of the base disk, implying that the snapshot delta objects will also be striped.

Edit VM Storage Policy of an existing VM

You can choose to modify the VM Storage Policy of an existing VM deployed on the vSAN datastore. The configuration of the objects associated with the VM will be modified to comply with the newer policy. For example, if NumberOfFailuresToTolerate is increased, newer components would be created, synchronized with the existing object, and subsequently, the original object is discarded. VM Storage policies can also be applied to individual objects.

In this case, we will add the new StripeWidth=2 policy to one of the VMs which still only has the default policy (NumberOfFailuresToTolerate=1, NumberOfDiskStripesPerObject=1, ObjectSpaceReservation=0) associated with it.

To begin, select the VM that is going to have its policy changed from the vCenter inventory, then select the Configure > Policies view. This VM should currently be compliant with the vSAN Default Storage Policy. Now click on the EDIT VM STORAGE POLICIES button as highlighted below.


 

This takes you to the edit screen, where the policy can be changed.


 

Select the new VM Storage Policy from the drop-down list. The policy that we wish to add to this VM is the StripeWidth=2 policy.

Once the policy is selected, click the OK button as shown above to ensure the policy gets applied to all storage objects. The VM Storage Policy should now appear updated for all objects.

Now when you revisit the Configure > Policies view, you should see the changes in the process of taking effect (Reconfiguring) or completed, as shown below.


 

This is useful when you only need to modify the policy of one or two VMs, but what if you need to change the VM Storage Policy of a significant number of VMs?

That can be achieved by simply changing the policy itself. All VMs using that policy can then be “brought to compliance” by reconfiguring their storage object layout to make them compliant with the updated policy. We shall look at this next.

Note: Modifying or applying a new VM Storage Policy leads to additional backend IO as the objects are being synchronized.

Modify a VM Storage Policy

In this task, we shall modify an existing VM Storage policy to include ObjectSpaceReservation=25%. This means that each storage object will now reserve 25% of the VMDK size on the vSAN datastore. Since all VMs were deployed with 40GB VMDKs and Failures to tolerate=1 failure - RAID-1 (Mirroring), the reservation will be 10 GB per VMDK, or 20 GB per VM once the mirror copy is included.

As the first step, note the amount of free space in the vSAN datastore. This would help ascertain the impact of the change in the policy.

Select StripeWidth=2 policy from the list of available policies, and then the Edit Settings option. Navigate to vSAN > Advanced Policy Rules and modify the Object space reservation setting to 25%, as shown below

 

Proceed to complete the wizard with default values and click FINISH. A pop-up message requiring user input appears with details of the number of VMs using the policy being modified. This is to ascertain the impact of the policy change. Typically, such changes are recommended to be performed during a maintenance window. You can choose to enforce a policy change immediately or defer it to be changed manually at a later point. Leave it at the default, which is “Manually later”, by clicking Yes as shown below:


 

Next, select the Storage policy - StripeWidth=2 and click on the VM Compliance tab in the bottom pane. It will display the two VMs along with their storage objects, and the fact that they are no longer compliant with the policy. They are in an “Out of Date” compliance state as the policy has now been changed.

You can now enforce a policy change by navigating to [VM Storage Policies] and clicking on Reapply VM Storage Policy


 

When this button is clicked, the following popup appears.


 

When the reconfigure activity completes against the storage objects, and the compliance state is once again checked, everything should show as Compliant.


 

Since we have now included an ObjectSpaceReservation value in the policy, you may notice corresponding capacity reduction from the vSAN datastore.

For example, the two VMs with the new policy have 40GB storage objects. With a 25% ObjectSpaceReservation, 10GB is reserved per VMDK. That is 10GB per VMDK, 1 VMDK per VM, 2 VMs, which equals 20GB of reserved space. However, since each VMDK is also mirrored, a total of 40GB is reserved on the vSAN datastore.
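
As a hedged illustration of this arithmetic, the short sketch below uses made-up helper names (not a vSAN API) to compute the reserved capacity from the VMDK size, the Object Space Reservation percentage, the number of mirror copies, and the number of VMDKs:

# Illustrative sketch of the Object Space Reservation arithmetic above (not a vSAN API).
def reserved_capacity_gb(vmdk_size_gb, osr_percent, mirror_copies, vmdk_count):
    """Capacity reserved on the vSAN datastore by Object Space Reservation."""
    per_vmdk = vmdk_size_gb * osr_percent / 100      # 25% of 40GB = 10GB per VMDK
    return per_vmdk * mirror_copies * vmdk_count     # multiplied by replicas and VMDK count

# Two VMs, one 40GB VMDK each, OSR=25%, RAID-1 (two mirror copies):
print(reserved_capacity_gb(40, 25, mirror_copies=2, vmdk_count=2))  # 40.0 GB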

IOPS Limits and Checksum

vSAN incorporates a quality-of-service feature that can limit the number of IOPS an object may consume. IOPS limits are enabled and applied via a policy setting. The setting can be used to ensure that a particular virtual machine does not consume more than its fair share of resources or negatively impact the performance of the cluster as a whole.

This blog provides an insight into the feature - Performance Metrics when using IOPS Limits with vSAN

IOPS Limit 

To create a new policy with an IOPS limit complete the following steps:

  • Create a new Storage Policy as done previously
  • In the Advanced Policy Rules set IOPS limit for object.


 

  • Note that this value is calculated as a number of IOs normalized to a weighted size of 32KB. In this example, we will use a value of 1000. Applying this rule to an object results in a limit of 1000 normalized IOPS, roughly equivalent to 1000 x 32KB = 32MB/s of bandwidth.

It is important to note that not only is read and write I/O counted in the limit, but any I/O incurred by a snapshot is counted as well. If I/O against this VM or VMDK should rise above the 1000 threshold, the additional I/O will be throttled.
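
The sketch below is an illustrative take on the 32KB weighting described above; the helper names are assumptions, not vSAN internals:

import math

# Illustrative only: how an IOPS limit maps to a bandwidth ceiling when I/O is
# normalized using a weighted size of 32KB.
def bandwidth_cap_mb_s(iops_limit, weighted_io_kb=32):
    return iops_limit * weighted_io_kb / 1024            # bandwidth ceiling in MB/s

def normalized_io_count(io_count, io_size_kb, weighted_io_kb=32):
    # Assumption: I/Os larger than 32KB count as multiple normalized I/Os against the limit.
    return io_count * math.ceil(io_size_kb / weighted_io_kb)

print(bandwidth_cap_mb_s(1000))        # 31.25 (roughly the 32MB/s cited above)
print(normalized_io_count(100, 64))    # 200 normalized I/Os for one hundred 64KB I/Os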

Checksum

In addition, by default end-to-end software checksum is enabled to ensure data integrity. In certain scenarios, an application or operating system within the VM has an inbuilt checksum mechanism. In such instances, you may choose to disable Object Checksum at the vSAN layer. 

Follow these steps to disable Object Checksum: 

  1. From the vSphere Client, navigate to Menu > Policies and Profiles and select VM Storage Policies.
  2. Select a Storage Policy to modify.
  3. Select Edit Settings
  4. Click NEXT and Navigate to Advanced Policy Rules
  5. Toggle Disable object checksum option as shown below:

 

vSAN POC Performance and Failure Testing Overview

Overview of a vSAN Performance POC process

POC or Proof-of-Concept testing demonstrates a conceptual proof of a desired solution. In the case of vSAN, a POC includes ESXi host setup, vCenter and cluster configuration, and resiliency and performance testing.

If multiple solutions are being compared, each solution should follow the exact same process during the POC to allow a clear, like-for-like comparison.

Definitions of day-operations in a Proof-of-Concept

In colloquial terms, the POC lifecycle is frequently divided into three phases:

 

Day-0

 

The post-design phase dedicated to the “rack & stack” of a hardware-based solution, including installation of the hypervisor (in the case of ESXi) and the control plane (vCenter Server). Physical network uplinks and upstream network devices often require physical configuration, for example to support VLANs and their defined subnets for cluster services.

 

Day-1

 

Setup and configuration of the required solution (in the case of vSAN).

 

Day-2

 

The operational phase of running the solution. Here the full set of functionality the solution offers is verified, and typical administration tasks are carried out during the PoC. A proof-of-concept cluster should not run any production workload (i.e. workloads facing end users or customers), to avoid disrupting usual business operations, but testing should closely mirror actual production operations.

Proof-of-Concept flow diagram

This flow diagram summarizes the POC process:

Performance Testing

For vSAN POCs, performance testing is often one of the most important factors to define the success of the PoC effort.

As with all enterprise storage solutions, there are many variables that may impact performance: hardware type, network infrastructure, cluster design, and workload characteristics all contribute to performance testing results. Identifying the IO profile of workloads in the existing environment is one of the most important factors in a proof-of-concept, so that the future production workload is reflected accurately. Performance test results can then be processed further to validate the design of the solution (in this case vSAN).

To ensure proper interpretation of results, understanding of the typical metrics used in storage performance testing is required:

Reference for interdependencies between IOPS, MByte/s, blocksize in KByte, latency in milliseconds:

 

  • IOPS = (MByte/s throughput / KByte per IO) * 1024
  • MByte/s = (IOPS * KByte per IO) / 1024
  • KByte per IO = (MByte/s * 1024) / IOPS
  • IO latency reflects the latency for reads and/or writes of each IO
     

OIO or Outstanding IO = parallel IO queues against a single storage device
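
A minimal sketch of these conversions, using illustrative helper names and the same 1024 convention as the formulas above:

# Minimal helpers for the IOPS / MByte/s / blocksize relationships listed above.
def iops(mbyte_per_s, kbyte_per_io):
    return mbyte_per_s / kbyte_per_io * 1024

def mbyte_per_s(iops_value, kbyte_per_io):
    return iops_value * kbyte_per_io / 1024

def kbyte_per_io(mbyte_per_s_value, iops_value):
    return mbyte_per_s_value * 1024 / iops_value

# Example: 400 MByte/s at a 32KByte IO size corresponds to 12,800 IOPS.
print(iops(400, 32))             # 12800.0
print(mbyte_per_s(12800, 32))    # 400.0
print(kbyte_per_io(400, 12800))  # 32.0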

 

Hardware design and sizing for high performance systems

 

High performance systems require undivided and unshared resources, including CPU, memory, and network. For the best possible performance, consider the following design patterns:

  • Low or no CPU/memory overcommitment
  • Ensure appropriate Host Power Management settings
  • Use only NVMe or better devices with multiple disk groups per host
    • Overcommitted PCI-Bus should be avoided
  • Unconstrained network bandwidth between hosts and the top-of-rack switch
    • Utilize switches with deep buffers.
    • Ensure that the network fabric leaf/spine/super spine or core switch layer are designed to avoid overcommitment

Cache and capacity tier design choice

 

An all-flash vSAN design is necessary to achieve the lowest possible latencies. Devices from higher performance classes generally result in higher throughput in terms of IOPS and/or MB/s.

Performance class choice:

NVMe 3D Xpoint -> NVMe MLC high spec -> NVMe low spec

NVMe -> SAS SSD -> SATA SSD -> Magnetic Disk

In high-performance configurations, capacity devices are usually chosen one performance category lower than the caching tier to achieve an ideal balance of latency and throughput during the de-staging phase.

Network fabric and hardware design choice

 

For network-based storage solutions, network design is one of the most critical factors contributing to stability and IO performance; this is not unique to vSAN. Any type of storage requires the same care in design and sizing, especially when performance is a critical factor. Deep-buffer switches should always be considered, as vSAN benefits greatly from undisrupted network flow. Network transport issues introduced by shallow switch buffers, or other switch configurations that distort traffic flow (e.g. packet deprioritization), can result in higher latency for each IP flow (in the case of vSAN) and correspondingly impact IO performance in a negative manner.

Though not especially common among enterprise hardware, switches with backplanes that cannot support full link utilization across all ports carrying vSAN traffic are not recommended. Further, devices that introduce bottlenecks to an upstream switching device for cross-port communications (such as a fabric extender) should not be utilized for vSAN traffic.

 

Calculating ideal switch buffer per port and total switch

 

Example: 25Gbit/s link speed, RTT latency with 1ms between hosts

  • 25Gbit/s for TX & RX full duplex = 2x 3125 MByte/s at line speed
  • 1ms round trip time
  • 1x TX 3125MByte/s x 0.001 seconds = 3.125 MByte buffer per port

Assuming we have multiple hosts attached to one switch, multiply by the total ports consumed to obtain total required buffer size.

Example: 48port switch, 12 hosts consuming the same switch (assumptions from prior example remain unchanged)

  • 12 x 3.125 MByte = 37.5MByte switch buffers (TX) minimum required
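
A hedged sketch of this buffer estimate, assuming full-duplex line rate, the RTT expressed in seconds, and the decimal Gbit-to-MByte conversion used in the example:

# Estimate the switch buffer needed to absorb one RTT of TX traffic per port,
# then multiply by the number of vSAN host ports consuming the switch.
def buffer_per_port_mbyte(link_gbit_s, rtt_s):
    line_rate_mbyte_s = link_gbit_s * 1000 / 8      # 25 Gbit/s -> 3125 MByte/s
    return line_rate_mbyte_s * rtt_s

def switch_buffer_mbyte(link_gbit_s, rtt_s, host_ports):
    return buffer_per_port_mbyte(link_gbit_s, rtt_s) * host_ports

print(buffer_per_port_mbyte(25, 0.001))      # 3.125 MByte per port
print(switch_buffer_mbyte(25, 0.001, 12))    # 37.5 MByte minimum TX buffer for 12 hosts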

 

Note that some models of switch may be configured in ways that impact the amount of buffer memory available to any given port. To ensure that a switch meets these estimated requirements, consult the switch documentation, review the switch configuration, and if necessary, obtain assistance from the switch vendor.

Inter-switch link/uplink (ISL) capacity must be considered in cases where vSAN traffic will flow between hosts connected to distinct physical switches.

For example, continuing the above example with 12 hosts, we can estimate the peak level of possible traffic northbound from the leaf switch to the spine/super-spine or core switch layer:

 

  • Extremely high IO may result in full link utilization for RX/TX on each host
  • The bandwidth requirement for the inter-switch link may be found by multiplying:

Number of hosts x their network link speed x the number of ports per host; in this example 12 x 25Gbit x 2 = 600Gbit/s bandwidth (TX+RX) capacity required to support theoretical peak utilization.
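
Under the same assumptions, a short sketch of the inter-switch link estimate:

# Illustrative ISL sizing: assumes every vSAN host can saturate both TX and RX across the ISL.
def isl_peak_gbit_s(hosts, link_gbit_s, directions=2):
    return hosts * link_gbit_s * directions

print(isl_peak_gbit_s(12, 25))   # 600 Gbit/s (TX+RX) for 12 hosts at 25Gbit/s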

 

Note that though inter-switch links may be overcommitted, IP latency resulting from congestion during peak utilization will directly increase IO latency for vSAN.

Additionally, vSAN cluster designs such as 2-node, stretched cluster, and use of manually defined fault domains can introduce additional network hops and traffic requirements that result in additional IO latency; take limitations regarding factors such as inter-site links into account when planning to test performance in these scenarios.

 

Synthetic vs. Real-Workload setup

 

vSAN performance testing usually includes two to three different workload profiles for evaluation.

Ideally, evaluating the performance of a clone of the actual workload would provide the most accurate insight into potential workload performance. However, this is not always possible, so it is important to understand the use case and IO profile of potential workloads in cases where synthetic benchmark testing will be used.

Software vendors often provide guidance on a storage setup and design in order to deliver ideal performance and application reliability.  These recommendations should be considered both when configuring the workload itself (e.g. the quantity of disks attached to a VM and the use case for those disks by the application), as well as the vSAN Storage Policy that is applied to the VM/virtual disk objects.

Some typical IO profiles resemble the following:

  • DB server
    • Ideally specified by the application vendor with RAID1
    • Large blocksize >32KB can be expected, high read %
    • High IO demand
  • Front-End applications
    • RAID1/5/6 depending on the application vendor
    • Medium to low blocksize <32KB
    • Low IO demand, high read %
  • Other standard application
    • RAID1/5/6 depending on the application vendor
    • Medium block size and can vary largely
    • Low to medium IO demand, medium read %

Note: Revisit our Troubleshooting vSAN Performance Guide (link)

For example, if testing the performance of a SQL Server solution, you would want to follow VMware’s SQL best practices for vSAN, including using a RAID-1 storage policy, while also setting an Object Space Reservation of 100% for objects such as the log drive(s). Furthermore, for performance-related tests, consider initializing all assigned disks first to reduce the first-write penalty incurred during block allocation.

In cases where a real application cannot be deployed, a synthetic benchmark test is a useful tool to analyze vSAN performance. However, such tests require that you understand the workload profile to be tested (and should ideally be modeled from I/O characteristics observed in current ‘real-world’ scenarios). Some key characteristics of a workload profile include block size, read/write percentage, and sequential/random I/O percentage among others.

One noteworthy tool you can use to obtain your current workload I/O profiles is LiveOptics (previously known as DPACK). This tool is free to use.

Conducting a synthetic benchmark test requires knowledge of the testing tools, how to deploy and configure them, and how to interpret the results. To expedite synthetic benchmarking tests, VMware publishes a ‘fling’ called HCIBench, which automates the deployment of Linux VMs with vdbench or Flexible IO tester (FIO) to generate storage I/O load on the cluster. The results can be monitored through the HCIBench and vCenter UIs. HCIBench also utilizes vSAN Observer on the back end and makes the relevant output available in a summary of test results.

For more information about HCIBench, please refer to the following blog posts:

A high-level view of Performance Testing:

  • Characterize the target workload(s)
    • LiveOptics
    • IOInsight
  • Simulate target workloads
    • HCIBench
  • Change Storage Policies and/or vSAN Services as needed
  • Compare result reports & vSAN Observer output

                                                                                
 

Characterize target workload with LiveOptics

The key aspects of workload performance that constitute a workload IO profile are easily identifiable in the output from a performance auditing tool such as LiveOptics:

Example: LiveOptics Report


 

When collecting performance data, long-term capture periods provide more accurate statistics for peak IO load and 95th-percentile performance. Performance statistics should not be collected for less than 24 hours if possible, and 7-10 days is ideal. In any case, it is critically important to capture performance statistics that highlight peak load during working hours, along with any IO load generated by off-hours operations such as batch data processing or system backup jobs. Note also that LiveOptics is not limited to virtualized environments; physical hosts can be included in the data capture.

In the above example, you can identify the following details to define a workload for testing with HCIbench:

  • VM amount
  • IOPS peak and 95%ile
  • Read / Write ratio
  • IO block size independently for reads and writes average
  • IO latency for reads and writes average
  • Runtime

To create a custom FIO workload profile from the above example, apply some simple calculations:

  • # VMs to deploy
  • Local disks / # VMs = 410 / 90 = ~4.5 vmdk per VM, rounded to ~5 vmdk per VM
  • Read %
  • Read/write blocksize
  • IO Latency for latency probing approach or as reference
  • IOdepth or outstanding IO can be calculated:

  • Peak IOPS = 175,885
  • Combined read + write latency = 0.0029 sec
  • Number of VMDKs = 410
  • OIO setting = peak IOPS (reads + writes) x (read + write latency) / # VMDKs = 175,885 x 0.0029 sec / 410 = ~1.2, rounded up to ~2 OIO
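
The same estimate expressed as a short sketch (values taken from the example report; the function name is illustrative):

import math

# Outstanding IO per VMDK, estimated from peak IOPS, combined read+write latency,
# and the number of VMDKs generating IO.
def oio_per_vmdk(peak_iops, rw_latency_s, vmdk_count):
    return peak_iops * rw_latency_s / vmdk_count

oio = oio_per_vmdk(peak_iops=175885, rw_latency_s=0.0029, vmdk_count=410)
print(round(oio, 2))     # ~1.24
print(math.ceil(oio))    # round up to 2 OIO per VMDK for the FIO profile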

 

This provides the parameters for the custom FIO profile definition for HCIbench, with a runtime of 3600 seconds:

 

[global]
runtime=3600
time_based=1
# mixed random read/write workload derived from the report above
rw=randrw
rwmixread=67
rwmixwrite=33
percentage_random=100
# read,write block sizes
blocksize=62k,28k
ioengine=libaio
buffered=0
direct=1
fsync=1
group_reporting
log_avg_msec=1000
continue_on_error=all

[job0]
filename=/dev/sda
size=100%
iodepth=2

[job1]
filename=/dev/sdb
size=100%
iodepth=2

 

When testing any storage solution, be sure to set a sufficiently long runtime to accurately assess the performance characteristics of the system. Short benchmark runtimes often present inaccurate metrics that are artificially enhanced by caching and cannot be sustained during longer-term operations. This is usually reflected in longer-term performance tests regardless of whether the storage solution tested is an HCI solution or a traditional array.

Because of caching effects, when testing with small IO blocksizes or low IO write %, a longer runtime is required to confirm the storage performance. Please also consult FIO documentation for additional information: (link)

Performance test approach flow diagram

 

IO load definitions

Some helpful hints for interpreting performance test results:

  • Peak workload

Peak workload performance reflects the observed maximum of a storage system for all hosts/workloads participating in a performance test. Spikes to peak workload performance are generally associated with IO intensive application operations.


 

  • 95%ile workload

95%ile workload performance identifies 95% of all IO processed in a specific time frame (capture time):


 

  • Average performance

Average performance provides the total average IO performance across all VMs within a cluster or host and time frame and is less ideal for workload profile definition. IO burst or unusually high IO outlier workloads may be obscured by other more normalized workloads contributing to the average.

Note: The HCIbench easy-run feature provides a simple way to run this test with different types of workloads.

Peak and 95% percentile workload are the most reliable values to be used to develop an IO profile for synthetic benchmark testing, and most effectively demonstrate the cluster’s sustainable workload performance when tested for an adequate period of time (> 1hr).

  • Comparison between all performance measurements

Maximum performance test with single/multi-tier solutions

Storage solutions most often use some type of caching function to pre-buffer IO before it reaches the backend. Any test performed should highlight shortcomings of this design during an IO test run.

HCIbench provides the choice to upload a customized FIO profile to replay on a storage system.

The recommended procedure is to probe the amount of parallel outstanding IO against the observed IO latency, setting latency thresholds for the IO test period and letting the IO-generating application (FIO) automatically adjust the outstanding IO. In this scenario, IO is held to a defined maximum latency without overwhelming the underlying hardware; the focus is on latency-controlled IO rather than uncontrolled synthetic tests.

In this example, we use a 50ms IO latency target for reads and writes combined, and control reads and writes separately with FIO. To achieve the best possible performance outcome, the number of VMDKs attached to each VM to generate IO becomes a critical factor.

In the common case, use the same average number of VMDKs per VM as in production, or limit it to 4 (on the higher side). Consider more VMs per host (4-16 VMs) to achieve a balanced and heavy workload.

Example:

  • Read/write ratio 70/30%, 100% randomness, 4k block reads and writes
  • Unbuffered and verified
  • Generated IO to be 2:1 dedupable and compressible (or 50%)
  • Latency target of 50000us (50ms) for IO reads + writes combined
  • Latency window of 5 minutes within which the outstanding IO value is increased
  • 95%ile of all IO required to meet the latency target
  • Test must run for 3600s (must also be set in HCIbench)
  • The example test requires 4x VMDKs, and the total size must exceed the caching tier (or memory) capacity of the Tier-1 solution
  • OIO starts at 1, with a maximum of 256 for each VMDK

 

 

[global]

runtime=3600
time_based=1
rw=randrw
rwmixread=70
rwmixwrite=30
percentage_random=100

# IO split example
#bssplit=4k/20:64k/40:32k/40

blocksize=4k,4k
ioengine=libaio
buffered=0
direct=1
fsync=1
group_reporting
log_avg_msec=1000
continue_on_error=all
# 50% buffer dedup and compress = 2:1, if value set to 80% = 5:1
dedupe_percentage=50
buffer_compress_percentage=50
refill_buffers=1

 

[job-sda]
# /dev/sda in hcibench is not the OS disk
filename=/dev/sda

size=100%
iodepth=256
iodepth_low=1
latency_target=50000
latency_window=300000
latency_percentile=95
random_distribution=zoned:25/25:25/25:25/25:25/25

[job-sdb]
filename=/dev/sdb
size=100%
iodepth=256
iodepth_low=1
latency_target=50000
latency_window=300000
latency_percentile=95
random_distribution=zoned:25/25:25/25:25/25:25/25

 

[job-sdc]
filename=/dev/sdc
size=100%
iodepth=256
iodepth_low=1
latency_target=50000
latency_window=300000
latency_percentile=95
random_distribution=zoned:25/25:25/25:25/25:25/25

 

[job-sdd]
filename=/dev/sdd
size=100%
iodepth=256
iodepth_low=1
latency_target=50000
latency_window=300000
latency_percentile=95
random_distribution=zoned:25/25:25/25:25/25:25/25

 

Over the course of the IO test, latency will increase toward the maximum and might overshoot the threshold, at which point the FIO IO engine reduces the outstanding IO accordingly.

When testing a traditional storage system for comparison, use one LUN per VM, with all of that VM’s VMDKs placed on the same LUN.

Comparisons between different types of storage solutions can be achieved through this approach by controlling IO through latency.

Performance Monitoring through vCenter UI

vSAN includes a performance monitoring service that can be used to observe performance and quickly troubleshoot related issues. Performance charts are available for many different components of the vSAN solution:

  • Cluster
  • Hosts
  • Virtual Machines and Virtual Disks
  • Disk groups
  • Physical disks

A detailed list of performance graphs and descriptions can be found here.

The vSAN performance monitoring service is enabled by default. If it has been disabled, it can be re-enabled through the following steps:

Navigate to the vSAN Cluster.

  • Click the Configure tab.
  • Select Services from the vSAN section.
  • Navigate to Performance Service and click EDIT to edit the performance settings.


 

Once the service has been enabled (default), performance statistics can be viewed from the performance menus in vCenter.  The following example is meant to provide an overview of using the performance service. For purposes of this exercise, we will examine IOPS, throughput, and latency from the Virtual Machine level (reflecting IO performance observed by VMs) and the vSAN Backend level (vSAN operations such as replication and resync traffic that are not directly observable at the VM layer).

The cluster level shows performance metrics from a cluster level perspective and includes all virtual machine operations.

 To observe IOPS from a cluster level:

  • From the Cluster level in vCenter, Click on the Monitor tab.
  • Navigate to the vSAN section and click on Performance 

We will now examine performance statistics for the vSAN Backend.

To access the vSAN - Backend performance metrics select the BACKEND tab from the menu on the left.

File Service is a new feature included with vSAN 7 that provides file share capabilities. When enabled, you can monitor the performance of the File Shares in the FILE SHARE tab.

More information on vSAN File Services can be found here.

The performance service allows administrators to view not only real-time data, but historical data as well. By default, the performance service view displays the last hour of data.  This time window can be changed by specifying a custom range (for example, the last 24 hours).

vCenter real-time data

If more detailed information regarding outstanding IO (OIO), read/write latency, IOPS or other performance counters is required, real-time data collected by vCenter allows us to analyze performance for the previous hour.

 

Performance counters are available for Cluster / ESXi host / VMs and their associated disks including vSAN backend data.

Advanced statistics through Support Insight

Beginning with version 6.7U3, vCenter provides advanced support statistics to facilitate troubleshooting the vSphere and vSAN solution stacks.

Additionally, in 6.7U3 and above, advanced network statistics are available to monitor physical network uplinks or VMkernel ports.


The ‘Network diagnostic mode’ option for the vSAN performance service should only be enabled if required by VMware support.

 

Advanced and debug information can be accessed via Monitor -> Support -> Performance for Support

 

 

IOInsight

vSphere 7.0U1 introduces IOInsight for deep-dive performance analytics.  IOInsight is available as a fling for vSphere versions earlier than 7.0U1.

From the cluster level, click Monitor and then Performance.  Now we can examine performance from the cluster level or observe specific VMs (up to 10 at a time). 

 

 

This makes it easy to compare IOPS, throughput, and latency for multiple VMs. 

 

The next major improvement to performance monitoring is the inclusion of IOInsight in 7.0U1. 

Click IOInsight and then New Instance.  You can select entire hosts or specific VMs to monitor.  IOInsight can capture statistics for a time period ranging from 1 minute to 24 hours.  The system will limit IOInsight monitoring overhead to 1% CPU and memory consumption but note that when running with high IOPS of 200K/host or greater, you might see 2-3% drop in IOPS while the service is running. 
 

The capture shows detailed metrics for each VM.  Key metrics include IOPS, Throughput, Latency, Random/Sequential, Alignment, Read/Write %, and IO Size (Block Size) Distribution. 

 

 

Performance monitoring through vsantop utility

Beginning in vSphere 6.7 Update 3 a new command-line utility, vsantop, was introduced. This utility is focused on monitoring vSAN performance metrics at an individual ESXi host level. Traditionally with ESXi, an embedded utility called esxtop was used to view real-time performance metrics. This utility assisted in ascertaining the resource utilization and performance of the system. However, vSAN required a custom utility with awareness of its distributed architecture. It collects and persists statistical data in a RAM disk. Based on the configured interval rate, the metrics are displayed on the secure shell console. This interval is configurable and can be reduced or increased depending on the amount of detail required. The workflow is illustrated below for a better understanding:

 

To initiate vsantop, log in to the ESXi host through a secure shell (ssh) with root user privileges and run the command vsantop on the ssh console.

vsantop provides detailed insights into vSAN component level metrics at a low interval rate. This helps in understanding resource consumption and utilization patterns. vsantop is primarily intended for advanced vSAN users and VMware support personnel.

Monitoring vSAN through integrated vRealize Operations Manager in vCenter

One option for monitoring vSAN is through the integration of vRealize Operations Manager plugin in vCenter.

The feature is enabled in the HTML5 based vSphere client and allows an administrator to either install a new vRealize Operations Manager instance or integrate with an existing deployment.

You can initiate the workflow by navigating to Menu > vRealize Operations as shown below:


 

Once the integration is complete, you can access the predefined dashboards as shown below:


 

The following out-of-the-box dashboards are available for monitoring purposes:

  • vCenter - Overview
  • vCenter - Cluster View
  • vCenter - Alerts
  • vSAN - Overview
  • vSAN - Cluster View
  • vSAN - Alerts

For vSAN, the Overview, Cluster View, and Alerts dashboards allow an administrator to have a snapshot of a vSAN cluster’s current state. Specific performance metrics such as IOPS, Throughput, Latency, and Capacity related information are also available as depicted below:

 

Performance Testing Using HCIbench

HCIbench aims to simplify and accelerate customer Proof of Concept (POC) performance testing in a consistent and controlled methodology. The tool fully automates the end-to-end process of deploying test VMs, coordinating workload runs, aggregating test results, and collecting necessary data for troubleshooting purposes. Evaluators choose the profiles they are interested in and HCIbench does the rest quickly and easily.

This section provides an overview and recommendations for successfully using HCIbench. For complete documentation, refer to the HCIbench User Guide.

HCIbench and complete documentation can be downloaded from https://labs.vmware.com/flings/hcibench

This tool is provided free of charge and with no restrictions. Support will be provided solely on a best-effort basis as time and resources allow, by the VMware vSAN Community Forum.

Deploying HCIbench

Step 1 – Deploy the OVA 

Firstly, download and deploy the HCIbench appliance. The process for deploying the HCIbench appliance is no different from deploying any other appliance on the vSphere platform.

Step 2 – HCIbench Configuration 

After deployment, navigate to http://<Controller VM IP>:8080/ to configure the appliance 

There are three main sections to consider:

  • vSAN cluster and host information
  • Guest VM Specification
  • Workload Definitions

For detailed steps on configuring and using HCIbench refer to the  HCIbench User Guide. 

vSAN Hybrid vs. vSAN All-flash

A vSAN hybrid cluster is defined by SSD/NVMe caching devices combined with magnetic capacity disks, and the performance difference between the caching and capacity tiers is significant. SSD/NVMe devices can usually perform at >300MB/s for reads and writes in parallel, whereas magnetic disks usually reach less than 50MByte/s with latencies below 20ms (typically less than 250 IOPS). If reads can be served by the caching tier (i.e. the data has not been de-staged), performance can reach the maximum values of the caching tier. Ideally, >90% of all IO is served from the caching tier under normal circumstances.

Ideal Test-workflow for vSAN Hybrid:

Working set sized so that IO writes fit within the 30% of the caching tier allocated to writes

  • VM vmdk disks are initialized with zeros or random data (ideal), covering physical block allocation and the first-write penalty
  • Enforced de-staging with HCIbench is not required
  • Long test runs to reach the “hot cache” phase and multiple de-staging phases, to identify max/95%ile/average and sustained workload

All-flash, on the other hand, uses the caching tier only for IO writes; once IO has been de-staged, read IO can be served from the capacity tier in parallel.

Ideal Test-workflow for vSAN all-flash:

  • Working set over-sized relative to the caching tier size of the solution
  • VM vmdk disks are initialized with zeros or random data (ideal), covering physical block allocation and the first-write penalty
  • Enforced de-staging with HCIbench to clearly identify IO write performance from the start
  • Long test runs to reach multiple de-staging phases, to identify max/95%ile/average and sustained workload across the cluster

Considerations for Defining Test Workloads

Either FIO (default) or vdbench can be chosen as the testing engine. Here, we recommend using FIO due to the exhaustive list of parameters that can be set. Pre-defined parameter files can be uploaded to HCIbench to be executed, which opens up a wider variety of options (such as different read/write block sizes beyond what can be defined within the configuration page). For a full list of FIO options, consult the FIO documentation (link).

Although 'Easy Run' can be selected, we recommend explicitly defining a workload pattern to ensure that tests are tailored to the performance requirements of the POC. Below, we walk through some of the important considerations.

Working set 

Defining the appropriate working set is one of the most important factors for correctly running performance tests and obtaining accurate results; the working set is the portion of the VM vmdk subject to IO change. For the best performance, a virtual machine’s working set should be mostly in the cache. Care should be taken when sizing your vSAN caching tier to account for all the virtual machines’ working sets residing in the cache. A general rule of thumb in hybrid environments is to size cache as 10% of your consumed virtual machine storage (not including replica objects). While this is adequate for most workloads, understanding your workload’s working set before sizing is a useful exercise.

For all-flash environments, consult the table below

The following process is an example of sizing an appropriate working set for performance testing with HCIbench: 

Consider a four-node cluster with one 400GB SSD per node. This gives the cluster a total cache size of 1.6TB. For a Hybrid cluster, the total cache available in vSAN is split 70% for read cache and 30% for write cache. This gives the cluster in our example 1120GB of available read cache and 480GB of available write cache.

To correctly fit the HCIbench within the available cache, the total capacity of all VMDKs used for I/O testing should not exceed 1,120GB. For All-Flash, 100% of the cache is allocated for writes (thus the total capacity of all VMDKs is 1.6TB).

We create a test scenario with four VMs per host, each VM having 5 X 10GB VMDKs, resulting in a total size of 800GB -- this will allow the test working set to fit within the cache.

The number and size of the data disk, along with the number of threads should be adjusted so that the product of the test set is less than the total size of the cache tier.
Thus:

# of VMs x # of Data Disk x Size of Data Disk x # of Threads <   Size of Cache Disk x Disk Groups per Host x Number of Hosts x [70% read cache (hybrid)]

For example:

4 VMs x 5 Data Disks x 10GB x 1 Thread = 800GB,
400GB SSDs x 70% x 1 Disk Group per Host x 4 Hosts = 1,120GB

Therefore, 800GB working set size is less than the 1,120GB read cache in cluster, i.e there is more read cache available than our defined working set size. Therefore, this is an acceptable working set size.

Note: the maximum working set size per cache disk is 600GB. If your cache disk size is greater, use this value in the above calculations.
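
The sizing check above can be expressed as a short sketch; the 70% read-cache fraction and the 600GB per-device cap come from the text, an all-flash cluster would use a read-cache fraction of 1.0, and the helper names are illustrative:

# Verify that the HCIbench test working set fits within the cluster's usable cache.
def working_set_gb(vms, data_disks_per_vm, disk_size_gb, threads=1):
    return vms * data_disks_per_vm * disk_size_gb * threads

def usable_cache_gb(cache_disk_gb, disk_groups_per_host, hosts,
                    read_cache_fraction=0.7, max_per_cache_disk_gb=600):
    per_disk = min(cache_disk_gb, max_per_cache_disk_gb)   # per-device working-set cap
    return per_disk * read_cache_fraction * disk_groups_per_host * hosts

ws = working_set_gb(vms=16, data_disks_per_vm=5, disk_size_gb=10)    # 4 VMs/host x 4 hosts
cache = usable_cache_gb(cache_disk_gb=400, disk_groups_per_host=1, hosts=4)
print(ws, cache, ws < cache)   # 800 1120.0 True -> the working set fits in the read cache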

Sequential workloads versus random workloads 

Before doing performance tests, it is important to understand the performance characteristics of the production workload to be tested: different applications have different performance characteristics. Understanding these characteristics is crucial to successful performance testing. When it is not possible to test with the actual application or application-specific testing tool it is important to design a test that matches the production workload as closely as possible. Different workload types will perform differently on vSAN.

  • Sustained sequential write workloads (such as VM cloning operations) running on vSAN will simply fill the cache and future writes will need to wait for the cache to be de-staged to the capacity tier before more I/Os can be written to the cache. Thus, in a hybrid environment, performance will be a reflection of the spinning disk(s) and not of flash. The same is true for sustained sequential reads. If the block is not in the cache, it will have to be fetched from the spinning disk. Mixed workloads will benefit more from vSAN’s caching design.
  • HCIbench allows you to change the percentage read and the percentage random parameters; a good starting point here is to set the percentage read parameter to 70% and the percentage random parameter to 30%.

Prepare Virtual Disk Before Testing

To achieve a 'clean' performance run, the disks should be wiped before use. To achieve this, select a value for the 'Prepare Virtual Disk Before Testing'. This option will either zero or randomize the data (depending on the selection) on the disks for each VM being used in the test, helping to alleviate a first write penalty during the performance testing phase. We recommend that the disks are randomized if using the Deduplication & Compression feature.

Warm up period

As a best practice, performance tests should include at least a 15-minute warm-up period. Also, keep in mind that the longer testing runs the more accurate the results will be. Warm-up period is used mainly in hybrid solutions.

Testing Runtime

HCIbench tests should be configured for at least one hour to observe the effects of de-staging from the cache to the capacity tier. Runtime is determined by the amount of block change in the caching tier and the chosen workload profile. Small block sizes require more time to achieve a full block change in the cache than large block sizes.

Blocksize

It is important to match the block size of the test to that of the workload being simulated, as this will directly affect the throughput and latency of the cluster. Therefore, it is paramount that this information be gathered before the start of the tests (for instance, from a Live Optics assessment).

Results

After testing is completed, you can view the results at http://<Controller VM IP>:8080/results in a web browser. A summary file of the tests will be present inside the subdirectory corresponding to the test run. To export the results to a ZIP file, click on the 'save result' option on the HCIbench configuration page (and wait for the ZIP file to be fully populated).

As HCIbench is integrated with the vSAN performance service, the performance data can also be reviewed within the vCenter HTML5 UI, under [vSAN cluster] > Monitor > vSAN > Performance. 

Testing Hardware Failures

Understanding Expected Behaviors

When conducting any failure testing, it is important to consider the expected outcome before the test is conducted. With each test described in this section, you should read the preceding description first to understand how the test will affect the system.

Note: It is important to test one scenario at a time and to restore the cluster completely before introducing the next test condition.

As with any system design, a configuration is built to tolerate a certain level of availability and performance. It is important that each test is conducted within the limit of the design systematically. By default, VMs deployed on vSAN inherit the default storage policy, with the ability to tolerate one failure. When a second failure is introduced without resolving the first, the VMs will not be able to tolerate the second failure and may become inaccessible. It is important that you resolve the first failure or test within the system limits to avoid such unexpected outcomes.

VM Behavior when Multiple Failures Encountered

Previously we discussed VM operational states and availability. To recap, a VM remains accessible when the full mirror copy of the objects are available, as well as greater than 50% of the components that make up the VM; the witness components are there to assist with the latter requirement.

Let’s talk a little about VM behavior when there are more failures in the cluster than the NumberOfFailuresToTolerate setting in the policy associated with the VM.

VM Powered on and VM Home Namespace Object Goes Inaccessible

If a running VM has its VM Home Namespace object go inaccessible due to failures in the cluster, a number of different things may happen. Once the VM is powered off, it will be marked "inaccessible" in the vSphere web client UI. There can also be other side effects, such as the VM getting renamed in the UI to its “.vmx” path rather than VM name, or the VM being marked "orphaned".

VM Powered on and Disk Object is inaccessible

If a running VM has one of its disk objects become inaccessible, the VM will keep running, but its VMDK’s I/O is stalled. Typically, the Guest OS will eventually time out I/O. Some operating systems may crash when this occurs. Other operating systems, for example, some Linux distributions, may downgrade the filesystems on the impacted VMDK to read-only. The Guest OS behavior and even the VM behavior is not vSAN specific. It can also be seen on VMs running on traditional storage when the ESXi host suffers an APD (All Paths Down).

Once the VM becomes accessible again, the status should resolve, and things go back to normal. Of course, data remains intact during these scenarios.

What happens when a server fails or is rebooted?

A host failure can occur in numerous ways, it could be a crash, or it could be a network issue (which is discussed in more detail in the next section). However, it could also be something as simple as a reboot, and that the host will be back online when the reboot process completes. Once again, vSAN needs to be able to handle all of these events.

If there are active components of an object residing on the host that is detected to be failed (due to any of the stated reasons) then those components are marked as ABSENT. I/O flow to the object is restored within 5-7 seconds by removing the ABSENT component from the active set of components in the object.

The ABSENT state is chosen rather than the DEGRADED state because in many cases a host failure is a temporary condition. A host might be configured to auto-reboot after a crash, or the host’s power cable was inadvertently removed, but plugged back in immediately. vSAN is designed to allow enough time for a host to reboot before starting to rebuild objects on other hosts so as not to waste resources. Because vSAN cannot tell if this is a host failure, a network disconnect, or a host reboot, the 60-minute timer is once again started. If the timer expires, and the host has not rejoined the cluster, a rebuild of components on the remaining hosts in the cluster commences.

If a host fails or is rebooted, this event will trigger a "Host connection and power state" alarm. If vSphere HA is enabled on the cluster, it will also cause a "vSphere HA host status" alarm and a “Host cannot communicate with all other nodes in the vSAN Enabled Cluster” message.

If NumberOfFailuresToTolerate=1 or higher in the VM Storage Policy, and an ESXi host goes down, VMs not running on the failed host continue to run as normal. If any VMs with that policy were running on the failed host, they will get restarted on one of the other ESXi hosts in the cluster by vSphere HA, as long as it is configured on the cluster.

 Caution: If VMs are configured in such a way as to not tolerate failures, (NumberOfFailuresToTolerate=0), a VM that has components on the failing host will not have objects protected on another host and might not survive a failure.

Simulating Failure Scenarios

It can be useful to run simulations on the loss of a particular host or disk, to see the effects of planned maintenance or hardware failure. The Data Migration Pre-Check feature can be used to check object availability for any given host or disk. These can be run at any time without affecting VM traffic.

Loss of a Host

Navigate to: 

[vSAN Cluster] > Monitor > vSAN > Data Migration Pre-check


 

From here, you can select the host to run the simulations on. After a host is selected, the pre-check can be run against three available options, i.e., Full data migration, Ensure accessibility, No data migration:


 

Select the desired option and click the Pre-Check button. This gives us the results of the simulation. From the results, three sections are shown, i.e.: Object Compliance and Accessibility, Cluster Capacity and Predicted Health.

The Object Compliance and Accessibility view shows how the individual objects will be affected:


 

Cluster Capacity shows how the capacity of the other hosts will be affected. Below we see the effects of the 'Full data migration' option:


 

Predicted Health shows how the health of the cluster will be affected:

Loss of a Disk

Navigate to: 

[vSAN Cluster]  Configure  vSAN  Disk Management

From here, select a host or disk group to bring up a list of disks. Simulations can then be run on a selected disk or the entire disk group:


 

Once the Pre-check data migration button is selected, we can run different simulations to see how the objects on the disk are affected. Again, the options are Full data migration, Ensure accessibility (default), and No data migration:


 

Selecting 'Full data migration' will run a check to ensure that there is sufficient capacity on the other hosts:


 

Host Failures

Simulate Host Failure without vSphere HA

Without vSphere HA, any virtual machines running on the host that fails will not be automatically started elsewhere in the cluster, even though the storage backing the virtual machine in question is unaffected.

Let’s take an example where a VM is running on a host (10.159.16.118):


 

It would also be a good test if this VM also had components located on the local storage of this host. However, it does not matter as the test will still highlight the benefits of vSphere HA.

Next, the host, namely 10.159.16.118 is rebooted. As expected, the host is not responding in vCenter, and the VM becomes disconnected. The VM will remain in a disconnected state until the ESXi host has fully rebooted, as there is no vSphere HA enabled on the cluster, so the VM cannot be restarted on another host in the cluster.


 

If you now examine the policies of the VM, you will see that it is non-compliant. This VM should be able to tolerate one failure but due to the failure currently in the cluster (i.e. the missing ESXi host that is rebooting) this VM cannot tolerate another failure, thus it is non-compliant with its policy.

From this we can deduce that not only was the VM’s compute running on the host that was rebooted, but the VM also had some components residing on that host’s storage. We can see the effects of this on the other VMs in the cluster, which show reduced availability:

 

Once the ESXi host has rebooted, we see that the VM is no longer disconnected but left in a powered off state.

If the physical disk placement is examined, we can clearly see that the storage on the host that was rebooted, i.e. 10.159.16.118, was used to store components belonging to the VM.


Simulate Host Failure With vSphere HA 

Let’s now repeat the same scenario, but with vSphere HA enabled on the cluster. First, power on the VM from the last test.

Next, navigate to [vSAN cluster] > Configure > Services > vSphere Availability. vSphere HA is turned off currently.


 

Click on the EDIT button to enable vSphere HA. When the wizard pops up, toggle the vSphere HA option as shown below, then click OK.


 

Similarly, enable DRS under Services > vSphere DRS.

This will launch several tasks on each node in the cluster. These can be monitored via the "Recent Tasks" view near the bottom. When the vSphere HA configuration tasks are complete, select [vSAN cluster] > Summary, expand the vSphere HA window, and ensure it is configured and monitoring. The cluster should now have vSAN, DRS and vSphere HA enabled.


 

Verify the host the test VM is residing on. Now repeat the same test as before by rebooting the host. Examine the differences with vSphere HA enabled.


 

On this occasion, a number of HA-related events should be displayed on the "Summary" tab of the host being rebooted (you may need to refresh the UI to see these):


 

This time, rather than the VM becoming disconnected for the duration of the host reboot as was seen in the last test, the VM is instead restarted on another host, in this case, 10.159.16.115:


 

Earlier we saw that there were some components belonging to the objects of this VM residing on a disk of the host that was rebooted. These components now show up as “Absent” under [vSAN Cluster] > Monitor > vSAN > Virtual Objects > View Placement Details, as shown below:


 

Once the ESXi host completes rebooting, assuming it is back within 60 minutes, these components will be rediscovered, resynchronized and placed back in an "Active" state.

Should the host be disconnected for longer than 60 minutes (the CLOMD timeout delay default value), the “Absent” components will be rebuilt elsewhere in the cluster.
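The timer is governed by an advanced setting on each host; newer releases also expose it cluster-wide as the Object Repair Timer in the vSphere Client, which is the preferred place to change it. A sketch for checking and, if needed for testing, adjusting it per host:

  • To view the current repair delay (in minutes):

 esxcli system settings advanced get -o /VSAN/ClomRepairDelay

  • To change it, for example to 90 minutes (POC testing only; keep the value consistent across all hosts in the cluster):

 esxcli system settings advanced set -o /VSAN/ClomRepairDelay -i 90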

Disk Failures

Drive is removed unexpectedly from ESXi Host

When a drive contributing storage to vSAN is removed from an ESXi host without decommissioning, all the vSAN components residing on the disk go ABSENT and are inaccessible.

The ABSENT state is chosen over DEGRADED because vSAN knows the disk is not lost, but just removed. If the disk is placed back in the server before a 60-minute timeout, no harm is done and vSAN syncs it back up. In this scenario, vSAN is back up with full redundancy without wasting resources on an expensive rebuild task.
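The state of vSAN-claimed disks and any affected components can also be checked from the ESXi command line; a brief sketch:

  • To list the disks claimed by vSAN on this host, including whether they are still in CMMDS:

 esxcli vsan storage list

  • To show the vSAN operational health of each disk:

 esxcli vsan debug disk list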

Expected Behaviors

  • If the VM has a policy that includes NumberOfFailuresToTolerate=1 or greater, the VM’s objects will still be accessible from another ESXi host in the vSAN Cluster.
  • The disk state is marked as ABSENT and can be verified via vSphere client UI.
  • At this point, all in-flight I/O is halted while vSAN reevaluates the availability of the object (e.g. VM Home Namespace or VMDK) without the failed component as part of the active set of components.
  • If vSAN concludes that the object is still available (based on a full mirror copy and greater than 50% of the components being available), all in-flight I/O is restarted.
  • The typical time from physical removal of the disk, vSAN processing this event, marking the component ABSENT, halting and restoring I/O flow is approximately 5-7 seconds.
  • If the same disk is placed back on the same host within 60 minutes, no new components will be rebuilt.
  • If 60 minutes pass and the original disk has not been reinserted in the host, components on the removed disk will be built elsewhere in the cluster (if capacity is available) including any newly inserted disks claimed by vSAN.
  • If the VM Storage Policy has NumberOfFailuresToTolerate=0, the VMDK will be inaccessible if one of the VMDK components (think one component of a stripe or a full mirror) resides on the removed disk. To restore the VMDK, the same disk must be placed back in the ESXi host. There is no other option for recovering the VMDK.

SSD is Pulled Unexpectedly from ESXi Host

When a solid-state disk drive is pulled without decommissioning it, all the vSAN components residing in that disk group will go ABSENT and are inaccessible. In other words, if an SSD is removed, it will appear as a removal of the SSD as well as all associated magnetic disks backing the SSD from a vSAN perspective.

Expected Behaviors

  • If the VM has a policy that includes NumberOfFailuresToTolerate=1 or greater, the VM’s objects will still be accessible.
  • Disk group and the disks under the disk group states will be marked as ABSENT and can be verified via the vSphere web client UI.
  • At this point, all in-flight I/O is halted while vSAN reevaluates the availability of the objects without the failed component(s) as part of the active set of components.
  • If vSAN concludes that the object is still available (based on a full mirror copy and greater than 50% of components being available), all in-flight I/O is restarted.
  • The typical time from physical removal of the disk, vSAN processing this event, marking the components ABSENT, halting and restoring I/O flow is approximately 5-7 seconds.
  • When the same SSD is placed back on the same host within 60 minutes, no new objects will be re-built.
  • When the timeout expires (default 60 minutes), components on the impacted disk group will be rebuilt elsewhere in the cluster, provided enough capacity is available.
  • If the VM Storage Policy has NumberOfFailuresToTolerate=0, the VMDK will be inaccessible if one of the VMDK components (think one component of a stripe or a full mirror) exists in the disk group to which the pulled SSD belongs. To restore the VMDK, the same SSD must be placed back in the ESXi host. There is no other option for recovering the VMDK.

What Happens When a Disk Fails?

If a disk drive has an unrecoverable error, vSAN marks the disk as DEGRADED as the failure is permanent.

Expected Behaviors

  • If the VM has a policy that includes NumberOfFailuresToTolerate=1 or greater, the VM’s objects will still be accessible.
  • The disk state is marked as DEGRADED and can be verified via vSphere web client UI.
  • At this point, all in-flight I/O is halted while vSAN reevaluates the availability of the object without the failed component as part of the active set of components.
  • If vSAN concludes that the object is still available (based on a full mirror copy and greater than 50% of components being available), all in-flight I/O is restarted.
  • The typical time from physical removal of the drive, vSAN processing this event, marking the component DEGRADED, halting, and restoring I/O flow is approximately 5-7 seconds.
  • vSAN now looks for any hosts and disks that can satisfy the object requirements. This includes adequate free disk space and placement rules (e.g. two mirrors may not share the same host). If such resources are found, vSAN will create new components there and start the recovery process immediately.
  • If the VM Storage Policy has NumberOfFailuresToTolerate=0, the VMDK will be inaccessible if one of the VMDK components (think one component of a stripe) exists on the pulled disk. This will require a restore of the VM from a known good backup.

What Happens When an SSD Fails?

An SSD failure follows a similar sequence of events to that of a disk failure with one major difference: vSAN will mark the entire disk group as DEGRADED. vSAN marks the SSD and all disks in the disk group as DEGRADED as the failure is permanent (e.g. the disk is offline or no longer visible).

Expected Behaviors

  • If the VM has a policy that includes NumberOfFailuresToTolerate=1 or greater, the VM’s objects will still be accessible from another ESXi host in the vSAN cluster.
  • Disk group and the disks under the disk group states will be marked as DEGRADED and can be verified via the vSphere web client UI.
  • At this point, all in-flight I/O is halted while vSAN reevaluates the availability of the objects without the failed component(s) as part of the active set of components.
  • If vSAN concludes that the object is still available (based on available full mirror copy and witness), all in-flight I/O is restarted.
  • The typical time from physical removal of the drive, vSAN processing this event, marking the component DEGRADED, halting, and restoring I/O flow is approximately 5-7 seconds.
  • vSAN now looks for any hosts and disks that can satisfy the object requirements. This includes adequate free SSD and disk space and placement rules (e.g. two mirrors may not share the same host). If such resources are found, vSAN will create new components there and start the recovery process immediately.
  • If the VM Storage Policy has NumberOfFailuresToTolerate=0, the VMDK will be inaccessible if one of the VMDK components (think one component of a stripe) exists in the disk group to which the pulled SSD belongs. There is no option to recover the VMDK. This may require a restore of the VM from a known good backup.

 Warning: Test one thing at a time during the following POC steps. Failure to resolve the previous error before introducing the next error will introduce multiple failures into vSAN which it may not be equipped to deal with, based on the NumberOfFailuresToTolerate setting, which is set to 1 by default.

vSAN Disk Fault Injection Script for POC Failure Testing

A python script to help with POC disk failure testing is available on all ESXi hosts. The script is called vsanDiskFaultInjection.pyc and can be found on the ESXi hosts in the directory /usr/lib/vmware/vsan/bin. To display the usage, run the following command:

[root@cs-ie-h01:/usr/lib/vmware/vsan/bin] python ./vsanDiskFaultInjection.pyc -h

 Usage: 

injectError.py -t -r error_durationSecs -d deviceName

injectError.py -p -d deviceName

injectError.py -z -d deviceName

injectError.py -c -d deviceName

 Options: 

-h, --help show this help message and exit

-u Inject hot unplug

-t Inject transient error

-p Inject permanent error

-z Inject health error

-c Clear injected error

-r ERRORDURATION Transient error duration in seconds

-d DEVICENAME, --deviceName=DEVICENAME

[root@cs-ie-h01:/usr/lib/vmware/vsan/bin]

 Warning: This command should only be used in pre-production environments during a POC. It should not be used in production environments. Using this command to mark disks as failed can have a catastrophic effect on a vSAN cluster.

Readers should also note that this tool provides the ability to do “hot unplug” of drives. This is an alternative way of creating a similar type of condition. However, in this POC guide, this script is only being used to inject permanent errors.

Note: With the release of vSAN 6.7 P02 and vSAN 7.0 P01, vSAN introduced Full Rebuild Avoidance. In some circumstances, transient device or storage errors could cause vSAN objects to be marked as degraded and, as a result, vSAN might unnecessarily mark the device as failed. vSAN can now differentiate between transient and permanent storage errors and avoid marking a device as FAILED, thus avoiding unnecessary rebuilds of objects if the device recovers from a transient failure.
However, for the purposes of a POC it may be required to simulate failures. The procedure below outlines toggling this feature on or off.

 

As the setting is enabled on a per-vSAN-node basis, to view the current value, issue the following from an ESXi host:

esxcli system settings advanced list -o /LSOM/lsomEnableFullRebuildAvoidance

To disable (0) issue:

esxcli system settings advanced set -o /LSOM/lsomEnableFullRebuildAvoidance -i 0

Once the POC failure simulations have completed, re-enable this important feature (1):

esxcli system settings advanced set -o /LSOM/lsomEnableFullRebuildAvoidance -i 1

 

Pull Magnetic Disk/Capacity Tier SSD and Replace before Timeout Expires

In this first example, we shall remove a disk from the host using the vsanDiskFaultInjection.pyc python script rather than physically removing it from the host.

 It should be noted that the same tests can be run by simply removing the disk from the host. If physical access to the host is convenient, physically pulling a disk tests the exact conditions rather than emulating them in software. 

 Also, note that not all I/O controllers support hot unplugging drives. Check the vSAN Compatibility Guide to see if your controller model supports the hot unplug feature. 

We will then examine the effect this operation has on vSAN, and virtual machines running on vSAN. We shall then replace the component before the CLOMD timeout delay expires (default 60 minutes), which will mean that no rebuilding activity will occur during this test.

Pick a running VM. Next, navigate to [vSAN Cluster] > Monitor > Virtual Objects and find the running VM from the list shown and select a Hard Disk.


 

Select View Placement Details:


 

Identify a component object. The column that we are most interested in is HDD Disk Name, as it contains the NAA SCSI identifier of the disk. The objective is to remove one of these disks from the host (other columns can be shown or hidden by right-clicking on the column headers).


 

From the figure above, let us say that we wish to remove the disk containing the component residing on 10.159.16.117. That component resides on a physical disk with an NAA ID string of naa.5000cca08000d99c. Make a note of your NAA ID string. Next, SSH into the host with the disk to pull. Inject a hot unplug event using the vsanDiskFaultInjection.pyc python script:

[root@10.159.16.117:~] python /usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc -u -d naa.5000cca08000d99c

Injecting hot unplug on device vmhba2:C0:T5:L0

vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x1

vsish -e set /storage/scsifw/paths/vmhba2:C0:T5:L0/injectError 0x004C0400000002

Let’s now check out the VM’s objects and components and as expected, the component that resided on that disk on host 10.159.16.117 shows up as absent:


 

To put the disk drive back in the host, simply rescan the host for new disks. Navigate to the [vSAN host] > Configure > Storage > Storage Adapters and click the Rescan Storage button.

 

Look at the list of storage devices for the NAA ID that was removed. If, for some reason, the disk doesn’t return after refreshing the screen, try rescanning the host again. If it still doesn’t appear, reboot the ESXi host. Once the NAA ID is back, clear any hot unplug flags set previously with the -c option:

[root@10.159.16.117:~] python /usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc -c -d naa.5000cca08000d99c

Clearing errors on device vmhba2:C0:T5:L0

vsish -e set /storage/scsifw/paths/vmhba2:C0:T5:L0/injectError 0x00000

vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x00000

 

Pull Magnetic Disk/Capacity Tier SSD and Do not Replace before Timeout Expires

In this example, we shall remove the magnetic disk from the host, once again using the vsanDiskFaultInjection.pyc script. However, this time we shall wait longer than 60 minutes before scanning the HBA for new disks. After 60 minutes, vSAN will rebuild the components on the missing disk elsewhere in the cluster.

The same process as before can now be repeated. However, this time we will leave the disk drive out for more than 60 minutes and see the rebuild activity take place. Begin by identifying the disk on which the component resides.


 

[root@10.159.16.117:~] date

 Thu Apr 19 11:17:58 UTC 2018

[root@cs-ie-h01:~] python /usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc -u -d naa.5000cca08000d99c

Injecting hot unplug on device vmhba2:C0:T5:L0

vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x1

vsish -e set /storage/scsifw/paths/vmhba2:C0:T5:L0/injectError 0x004C0400000002 

At this point, we can once again see that the component has gone "Absent". This time, we wait for the 60-minute timer to expire so that the component is rebuilt elsewhere.


 

After 60 minutes have elapsed, the component should be rebuilt on a different disk in the cluster. That is what is observed. Note the component resides on a new disk (NAA id is different).


 

The removed disk can now be re-added by scanning the HBA:

Navigate to the [vSAN host] > Configure > Storage Adapters and click the Rescan Storage button.

Look at the list of storage devices for the NAA ID that was removed. If, for some reason, the disk doesn’t return after refreshing the screen, try rescanning the host again. If it still doesn’t appear, reboot the ESXi host. Once the NAA ID is back, clear any hot unplug flags set previously with the -c option:

[root@10.159.16.117:~] python /usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc -c -d naa.5000cca08000d99c

Clearing errors on device vmhba2:C0:T5:L0

vsish -e set /storage/scsifw/paths/vmhba2:C0:T5:L0/injectError 0x00000

vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x00000

Pull Cache Tier SSD and Do Not Reinsert/Replace

For the purposes of this test, we shall remove an SSD from one of the disk groups in the cluster. Navigate to the [vSAN cluster] > Configure > vSAN > Disk Management. Select a disk group from the top window and identify its SSD in the bottom window. If All-Flash, make sure it’s the Flash device in the “Cache” Disk Role. Make a note of the SSD’s NAA ID string.


 

In the above screenshot, we have located an SSD on host 10.159.16.116 with an NAA ID string of naa.5000cca04eb0a4b4. Next, SSH into the host with the SSD to pull. Inject a hot unplug event using the vsanDiskFaultInjection.pyc python script:

[root@10.159.16.116:~] python /usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc -u -d naa.5000cca04eb0a4b4

Injecting hot unplug on device vmhba2:C0:T0:L0

vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x1

vsish -e set /storage/scsifw/paths/vmhba2:C0:T0:L0/injectError 0x004C0400000002

Now we observe the impact that losing an SSD (flash device) has on the whole disk group.


 

And finally, let’s look at the components belonging to the virtual machine. This time, any components that were residing on that disk group are "Absent".


 

If you search all your VMs, you will see that each VM that had a component on the disk group 10.159.16.116 now has absent components. This is expected since an SSD failure impacts the whole of the disk group.

After 60 minutes have elapsed, new components should be rebuilt in place of the absent components. If you manage to refresh at the correct moment, you should be able to observe the additional components synchronizing with the existing data.


 

To complete this POC, re-add the SSD logical device back to the host by rescanning the HBA:

Navigate to the [vSAN host] > Configure > Storage > Storage Adapters and click the Rescan Storage button.

Look at the list of storage devices for the NAA ID of the SSD that was removed. If, for some reason, the SSD doesn’t return after refreshing the screen, try rescanning the host again. If it still doesn’t appear, reboot the ESXi host. Once the NAA ID is back, clear any hot unplug flags set previously with the -c option:

[root@10.159.16.116:~] python /usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc -c -d naa.5000cca04eb0a4b4

Clearing errors on device vmhba2:C0:T0:L0

vsish -e set /storage/scsifw/paths/vmhba2:C0:T0:L0/injectError 0x00000

vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x00000

 Warning: If the removed device was a logical RAID 0 device that had been marked as an SSD, and the logical device was rebuilt as part of this test, you may have to mark the drive as an SSD once more.

Checking Rebuild/Resync Status

To display details on resyncing components, navigate to [vSAN cluster] >   Monitor > vSAN > Resyncing Objects.
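The same information is available from the ESXi command line, which can be handy for watching rebuild progress during the tests in this section; for example:

  • To summarize the objects currently resyncing and the data left to resync:

 esxcli vsan debug resync summary get

  • To list the individual resyncing objects:

 esxcli vsan debug resync list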


 

Injecting a Disk Error

The first step is to select a host and then select a disk that is part of a disk group on that host. The –d DEVICENAME argument requires the SCSI identifier of the disk, typically the NAA id. You might also wish to verify that this disk does indeed contain VM components. This can be done by selecting the [vSAN Cluster] > Monitor > Virtual Objects > [select VMs/Objects] > View Placement Details > Group components by host placement button.


 

The objects on each host can also be seen via [vSAN Cluster] > Monitor > vSAN > Physical Disks and selecting a host:


 

The error can only be injected from the command line of the ESXi host. To display the NAA ids of the disks on the ESXi host, you will need to SSH to the ESXi host, log in as the root user, and run the following command:

[root@10.159.16.117:~] esxcli storage core device list| grep ^naa

naa.5000cca08000ab0c

naa.5000cca04eb03560

naa.5000cca08000848c

naa.5000cca08000d99c

naa.5000cca080001b14

Once a disk has been identified, and has been verified to be part of a disk group, and that the disk contains some virtual machine components, we can go ahead and inject the error as follows:

[root@10.159.16.117:~] python /usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc -p -d naa.5000cca08000848c

Injecting permanent error on device vmhba2:C0:T2:L0

vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x1

vsish -e set /storage/scsifw/paths/vmhba2:C0:T2:L0/injectError 0x03110300000002

 

Before too long, the disk should display an error and the disk group should enter an unhealthy state, as seen in [vSAN cluster] > Configure > vSAN > Disk Management 


 

Notice that the disk group is in an "Unhealthy" state and the status of the disk is “Permanent disk failure”. This should place any components on the disk into a degraded state (which can be observed via the "Physical Placement" window) and initiate an immediate rebuild of components. Navigating to [vSAN cluster] > Monitor > vSAN > Resyncing Components should reveal the components resyncing.

Clear a Permanent Error

At this point, we can clear the error. We use the same script that was used to inject the error, but this time we provide a –c (clear) option:

[root@10.159.16.117:~] python /usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc -c -d naa.5000cca08000848c

vsish -e get /reliability/vmkstress/ScsiPathInjectError

vsish -e set /storage/scsifw/paths/vmhba2:C0:T2:L0/injectError 0x00000

vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x00000

Note however that since the disk failed, it will have to be removed from, and re-added to, the disk group. This is very simple to do. Simply select the disk in the disk group and remove it by clicking on the icon highlighted below.


 

This will display a pop-up window asking which action to take for the components on the disk. You can choose to migrate the components or not. By default, it is shown as Evacuate all data to other hosts.


 

For the purposes of this POC, you can select the No data evacuation option as you are adding the disk back in the next step. When the disk has been removed and re-added, the disk group will return to a healthy state. That completes the disk failure test.

When Might a Rebuild of Components Not Occur?

There are a couple of reasons why a rebuild of components might not occur. Start by looking at vSAN Health Check UI [vSAN cluster] > Monitor > vSAN > Health for any alerts indicative of failures. 

You could also check specifically for resource constraints or failures through RVC as described below.

Lack of Resources

Verify that there are enough resources to rebuild components before testing with the following RVC command:

  •  vsan.whatif_host_failures 

Of course, if you are testing with a 3-node cluster, and you introduce a host failure, there will be no rebuilding of objects. Once again, if you have the resources to create a 4-node cluster, then this is a more desirable configuration for evaluating vSAN.
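RVC runs on the vCenter Server Appliance and connects with your SSO credentials. A minimal session sketch, where the datacenter and cluster names are placeholders for your own environment:

 rvc administrator@vsphere.local@localhost

 > cd /localhost/<Datacenter>/computers/<vSAN-Cluster>

 > vsan.whatif_host_failures .

This shows current capacity utilization and the simulated utilization after a single host failure, which indicates whether a rebuild would have enough resources to complete.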

Underlying Failures

Another cause of a rebuild not occurring is due to an underlying failure already present in the cluster. Verify there are none before testing with the following RVC command:

  •  vsan.hosts_info 
  •  vsan.check_state 
  •  vsan.disks_stats 

If these commands reveal underlying issues (ABSENT or DEGRADED components for example), rectify these first or you risk inducing multiple failures in the cluster, resulting in inaccessible virtual machines.
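These commands take the cluster as an argument and can be run from the same RVC session as above, for example:

 > vsan.check_state .

 > vsan.disks_stats .

Any ABSENT or DEGRADED components or unhealthy disks reported here should be resolved before injecting further failures.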

PCIe Hotplug

NVMe has helped usher in all-new levels of performance capabilities for storage systems. vSphere 7 introduces a feature that meets or exceeds the capability associated with older SAS and SATA devices: hotplug support for NVMe devices in vSphere and vSAN. This introduces a new level of flexibility and serviceability to hosts populated with NVMe devices, improving uptime by simplifying maintenance tasks around adding, removing, and relocating storage devices in hosts. Modern hosts can potentially have dozens of NVMe devices, and the benefits of hotplug help environments large and small.

Minimum requirement:

  • vSphere 7.0 or above
  • Hardware support by hardware vendor (Server system)

Hotplug support for any PCIe device (network or storage card) requires the appropriate driver and firmware in order to function; please visit the vSAN HCL (link) to verify supportability. 

  • Verify ESXi hypervisor support

PCIe hotplug is enabled by default as a kernel option and can be verified via the command line on the ESXi host:

zcat /var/log/boot.gz |grep Hotplug

Example output:
TSC: 560230 cpu0:1)BootConfig: 711: enablePCIEHotplug = TRUE (1)

TSC: 563288 cpu0:1)BootConfig: 711: forceOSCGrantPCIEHotplug = FALSE (0)

TSC: 566560 cpu0:1)BootConfig: 711: enableACPIPCIeHotplug = FALSE (0)

2021-11-01T13:09:43.019Z cpu0:2097152)PCI: 209: enablePCIErrors: 0, enableValidPCIDevices: 1, pciSetBusMaster: 1, disableACSCheck: 0, disableACSCheckForRP: 1, enablePCIEHotplug: 1

2021-11-01T13:09:43.019Z cpu0:2097152)PCI: 212: pciBarAllocPolicy: 2, disablePciPassthrough: 0, enableACPIPCIeHotplug: 0

  • Verify PCI-E devices and support

PCIe hotplug requires support from the PCIe slot itself to allow the device this capability. The PCI slot ID can be used to verify the device and slot identifier, for example:

dmesg | grep -i pcie | less

2020-04-10T10:28:41.113Z cpu0:131072)PCIE: 480: 0000:00:18.7: claimed by PCIe port module.
2020-04-10T10:28:41.113Z cpu0:131072)PCIEHP: 1952: 0000:00:18.7: hotplug slot:0x107 has NO adapter.
2020-04-10T10:28:41.114Z cpu0:131072)PCIEHP: 1837: 0000:00:18.7: Enabled HP events (0x1029) for hotplug slot:0x107

"esxcli hardware pci list" assists with checking the hardware status and identifying the PCIe slot and card.
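For example, to show the address, physical slot, and driver module for each device (a sketch; field names may vary slightly between ESXi releases):

 esxcli hardware pci list | egrep -i "Address|Physical Slot|Device Name|Module Name"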

  • Hot-add and Hot-removal process

    Hot-add procedure

ESXi 7.0 and later follow the standard hot-plug controller process, which can be categorized into two flows: surprise and planned PCIe device hot-add.

Surprise hot-add

The device is inserted into the hot-plug slot without prior notification via the Attention button or software interface (UI/CLI) mechanism.

Step 1
User Action: User selects an empty, disabled slot and inserts a PCIe device.
ESXi Action: The platform/PCI hotplug layer detects the new hardware and notifies the ESXi device manager to scan for hot-added devices. In case of any failure, the Power Indicator goes OFF.
Power Indicator: BLINKS

Step 2
User Action: User waits for the slot to be enabled.
ESXi Action: The PCI bus driver enumerates the hot-added device and registers it with the vSphere device manager.
Power Indicator: ON

Planned hot-add

Step 1
User Action: User selects an empty, disabled slot and inserts a PCIe device.
ESXi Action: None.
Power Indicator: OFF

Step 2
User Action: User presses the Attention button or issues a software UI/CLI command to enable the slot.
ESXi Action: In the case of the software interface (UI/CLI), there is no provision to abort the hot-add request, so once the command is issued control jumps directly to Step 4. In the case of the Attention button, the PCIe hotplug layer waits for the ABORT INTERVAL (5 seconds).
Power Indicator: BLINKS

Step 3
User Action: User cancels the operation by pressing the Attention button a second time within the ABORT INTERVAL.
ESXi Action: If canceled, the Power Indicator goes back to its previous state (OFF).
Power Indicator: OFF

Step 4
User Action: No user action within the ABORT INTERVAL.
ESXi Action: The PCIe hotplug layer validates the hot-add operation and powers the slot. On success, it notifies the ESXi device manager to scan for the hot-added device(s). In case of any failure, the Power Indicator goes back to its previous state (OFF).
Power Indicator: BLINKS

Step 5
User Action: User waits for the slot to be enabled.
ESXi Action: The PCI bus driver enumerates the hot-added device and registers it with the ESXi device manager.
Power Indicator: ON

Note: After these steps, the ESXi device manager attaches the device to the driver and the storage stack, and presents the HBA and the associated disk(s) to the upper layers, for example vSAN/VMFS.
 

  • Hot-remove procedure

Surprise hot-remove

In this case, the drive is removed without any prior notification through the Attention button or UI/CLI. If the user did not run the preparatory steps, data consistency cannot be guaranteed. The scenario for failed drives is the same as an abrupt removal without the preparatory steps: no data consistency can be guaranteed.

Step 1
User Action: User selects an enabled slot with a PCIe device to be removed.
ESXi Action: ESXi executes the requested preparatory steps for the drive corresponding to this device and flags an error if unable to perform any step. The user can choose to skip the preparatory steps and directly remove the device, in which case data consistency cannot be guaranteed.
Power Indicator: ON

Step 2
User Action: User removes the PCIe device.
ESXi Action: The platform/PCIe hot-unplug layer detects the device removal and notifies the ESXi device manager to remove the device. In case of any failure, the Power Indicator goes OFF. The ESXi device manager issues a series of quiesce instructions, detaches the device from all the drivers (storage stack, device driver, etc.), and finally removes it from the PCI bus driver. In case of any failure, the Power Indicator goes back to its previous state (ON), indicating that the device cannot be removed.
Power Indicator: BLINKS

Step 3
User Action: User waits for the slot to become disabled.
ESXi Action: The PCIe bus driver removes the device from the system and powers down the PCI slot.
Power Indicator: OFF

Planned hot-remove

It is expected that the user runs the preparatory steps to ensure data consistency before initiating the hot-remove operation via the Attention button or software interface (UI/CLI). Even in this case, if the user does not run the preparatory steps, data consistency cannot be guaranteed.

Step 1
User Action: User selects an enabled slot with a PCIe device to be removed and initiates the preparatory steps.
ESXi Action: ESXi executes the requested preparatory steps for the drive corresponding to this PCIe device and flags an error if unable to perform any step.
Power Indicator: ON

Step 2
User Action: User presses the Attention button or issues a software UI/CLI command to disable the slot.
ESXi Action: In the case of the software interface (UI/CLI), there is no provision to abort the hot-remove request, so once the command is issued control jumps directly to Step 5. In the case of the Attention button, the PCI hot-unplug layer gets an interrupt and waits for the ABORT INTERVAL (5 seconds).
Power Indicator: BLINKS

Step 3
User Action: User can cancel the operation by pressing the Attention button a second time within the ABORT INTERVAL.
ESXi Action: The Power Indicator goes back to its previous state (ON).
Power Indicator: ON

Step 4
User Action: No user action within the ABORT INTERVAL.
ESXi Action: The PCI bus driver removes the device from the system and powers down the slot.
Power Indicator: OFF

Step 5
User Action: User waits for the slot to be disabled.
ESXi Action: The PCI bus driver removes the device from the system and powers down the slot.
Power Indicator: OFF

Step 6
User Action: User removes the PCIe device.
ESXi Action: None.
Power Indicator: OFF

Note: KBs:

 

Air-gapped Network Failures

Air-gapped vSAN network design is built around the idea of redundant, yet completely isolated, storage networks. It is used in conjunction with multiple vmknics tagged for vSAN traffic, with each vmknic segregated on a different subnet. There is physical and/or logical separation of network switches. A primary use case is to have two segregated uplink paths and non-routable VLANs to separate the I/O data flow onto redundant data paths.

 Note:  This feature does not guarantee the load-balancing of network traffic between VMkernel ports. Its sole function is to tolerate link failure across redundant data paths.

Air-gapped vSAN Networking and Graphical Overview

The figure below shows all the vmnic uplinks per host, including mapped VMkernel ports that are completely separated by physical connectivity and have no communication across each data path. VMkernel ports are logically separated either by different IP segments and/or separate VLANs in different port groups.


 

Distributed switch configuration for the failure scenarios

With air-gapped network design, we need to place each of the two discrete IP subnets on a separate VLAN. With this in mind, we would need the two highlighted VLANs/port groups created in a standard vSwitch or distributed vSwitch as shown below before the failover testing.


 

Separating IP segments on different VLANs

The table below shows the detailed vmknic/port group settings that allow two discrete IP subnets to reside on two separate VLANs (201 and 202). Notice each vmknic port group is configured with two uplinks, only one of which is set to active.

VMkernel interface tagged for vSAN traffic | IP address segment and subnet mask | Port group name | VLAN ID | Port group uplinks
vmk1 | 192.168.201.0 / 255.255.255.0 | VLAN-201-vSAN-1 | 201 | Uplink 1: Active, Uplink 2: Unused
vmk2 | 192.168.202.0 / 255.255.255.0 | VLAN-202-vSAN-2 | 202 | Uplink 1: Unused, Uplink 2: Active
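Before starting the failover test, it can be useful to confirm the vmknic and uplink configuration on each host from the CLI; a quick sketch (interface names will vary with your environment):

  • To list the VMkernel interfaces tagged for vSAN traffic:

 esxcli vsan network list

  • To verify the IP configuration of a given interface, for example vmk1:

 esxcli network ip interface ipv4 get -i vmk1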

 

Failover test scenario using DVS portgroup uplink priority

Before we initiate a path failover, we need to generate some background workload to maintain a steady network flow through the two VMkernel adapters. You may choose your own workload tool or simply refer to the previous section to execute an HCIbench workload.

Using the functionality in DVS, we can simulate a physical switch failure or physical link down by moving an "Active" uplink for a port group to "Unused" as shown below. This affects all vmk ports that are assigned to the port group.


 

Expected outcome on vSAN IO traffic failover

Prior to vSAN 6.7, when a data path went down in an air-gapped network topology, VM I/O traffic could pause for up to 120 seconds to complete the path failover while waiting for the TCP timeout. Starting in vSAN 6.7, failover time improves significantly to no more than 15 seconds, as vSAN proactively monitors the data paths and takes corrective action as soon as a failure is detected.

Monitoring network traffic failover

To verify the traffic failover from one vmknic to another and capture the timeout window, we can start esxtop on each ESXi host and press "n" to actively monitor host network activities before and after a failure is introduced. The screenshot below illustrates that the data path through vmk2 is down when the "Unused" state is set for the corresponding uplink and a "void" status is reported for that physical uplink. TCP packet flow has been suspended on that vmknic, as zeroes are reported under the Mb/s transmit (TX) and receive (RX) columns.


 

It is expected that vSAN health check reports failed pings on vmk2 as we set the vmnic1 uplink to "Unused".
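Connectivity over a specific vmknic can also be checked manually with vmkping. For example, from one host, substituting a peer host's vSAN IP address on the affected subnet:

 vmkping -I vmk2 <peer-host-vmk2-IP-address>

While the corresponding uplink is set to "Unused", these pings are expected to fail; they should succeed again once the uplink is restored.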

 

To restore the failed data path after a failover test, modify the affected uplink from "Unused" back to "Active". Network traffic should be restored through both vmknics (though not necessarily load-balanced). This completes this section of the POC guide. Before moving on to other sections, remove vmk2 on each ESXi host (as the vmknic is also used for other purposes in Stretched Cluster testing in a later section), and perform a vSAN health check and ensure all tests pass.
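If preferred, vmk2 can be removed from the command line on each host instead of through the vSphere Client; a sketch (ensure no other services depend on the interface first):

 esxcli network ip interface remove --interface-name=vmk2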

vSAN Management Tasks

Common management tasks in vSAN and how to complete them.

Maintenance Mode

In this section, we shall look at a number of management tasks, such as the behavior when placing a host into maintenance mode, and the evacuation of a disk and a disk group from a host. We will also look at how to turn on and off the identifying LEDs on a disk drive.

There are a number of options available when placing a host into maintenance mode. The first step is to identify a host that has a running VM, as well as components belonging to virtual machine objects.

Select the Summary tab of the virtual machine to verify which host it is running on.


 

Then select the [vSAN cluster] > Monitor > Virtual Objects, then select the appropriate VM (with all components) and click View Placement Details. Selecting Group components by host placement will show which hosts have been used. Verify that there are components also residing on the same host.


 

From the screenshots shown here, we can see that the VM selected is running on host 10.159.17.3 and also has components residing on that host. This is the host that we shall place into maintenance mode.

Right-click on the host, select Maintenance Mode from the drop-down menu, then select the option Enter Maintenance Mode as shown here.


 

Three options are displayed when maintenance mode is selected:

  • Full data migration
  • Ensure accessibility
  • No data migration


 

When the default option of "Ensure accessibility" is chosen, a popup is displayed regarding migrating running virtual machines. Since this is a fully automated DRS cluster, the virtual machines should be automatically migrated.


 

After the host has entered maintenance mode, we can now examine the state of the components that were on the local storage of this host. What you should observe is that these components are now in an “Absent” state. However, the VM remains accessible as we chose the option “Ensure Accessibility” when entering maintenance mode.


 

Beginning in vSAN 7U1, vSAN introduced the RAID_D (Delta) component to increase resiliency. When a host goes into maintenance mode, the Delta component captures the latest writes on another host. When the host comes out of maintenance mode, those writes are applied to the stale components. This is used with the Ensure accessibility option and helps protect against another failure while a host is in maintenance mode.

 

 

The host can now be taken out of maintenance mode. Simply right click on the host as before, select Maintenance Mode > Exit Maintenance Mode.


 

After exiting maintenance mode, the “Absent” component becomes "Active" once more. This is assuming that the host exited maintenance mode before the vsan.ClomdRepairDelay expires (default 60 minutes).


 

We shall now place the host into maintenance mode once more, but this time instead of Ensure accessibility, we shall choose Full data migration. This means that although components on the host in maintenance mode will no longer be available, those components will be rebuilt elsewhere in the cluster, implying that there is full availability of the virtual machine objects.

 Note: This is only possible when NumberOfFailuresToTolerate=1 and there are 4 or more hosts in the cluster. It is not possible with 3 hosts and NumberOfFailuresToTolerate=1, as another host needs to be available to rebuild the components. This is true for higher values of NumberOfFailuresToTolerate also.


 

Now if the VM components are monitored, you will see that no components are placed in an “Absent” state, but rather they are rebuilt on the other hosts in the cluster. When the host enters maintenance mode, you will notice that all components of the virtual machines are "Active", but none resides on the host placed into maintenance mode.


 

Safeguards are in place so that warnings are shown when multiple hosts are placed into maintenance mode (MM) at the same time, or when a host is about to enter MM while another host is already in MM or while resync activity is in progress. This avoids multiple unintended outages that may cause vSAN objects to become inaccessible. The screenshot below illustrates an example of the warning if we attempt to place host 10.159.16.116 into MM while 10.159.16.115 is already in MM. Simply select CANCEL to abort the decommission operation.
 

Ensure that you exit maintenance mode of all the hosts to restore the cluster to a fully functional state. This completes this part of the POC.

Remove and Evacuate a Disk

In this example, we show the ability to evacuate a disk prior to removing it from a disk group.

Navigate to [vSAN cluster] > Configure > vSAN > Disk Management and select a disk group in one of the hosts as shown below. Then select one of the capacity disks from the disk group, also shown below. Note that the disk icon with the red X becomes visible. This is not visible if the cluster is in automatic mode.


 

Make a note of the devices in the disk group, as you will need these later to rebuild the disk group. There are a number of icons on this view of disk groups in vSAN. It is worth spending some time understanding what they mean:

  • Add a disk to the selected disk group
  • See the expected result of the disk or disk group evacuation
  • Remove a disk from a disk group (and optionally evacuate its data)
  • Turn on the locator LED on the selected disk
  • Turn off the locator LED on the selected disk

To continue with the option of removing a disk from a disk group and evacuating the data, click on the icon to remove a disk highlighted earlier. This pops up the following window, which gives you the option to Evacuate all data to other hosts (selected automatically). Click DELETE to continue:


 

When the operation completes, there should be one less disk in the disk group, but if you examine the components of your VMs, there should be none found to be in an “Absent” state. All components should be “Active”, and any that were originally on the disk that was evacuated should now be rebuilt elsewhere in the cluster.

Evacuate a Disk Group

Let’s repeat the previous task for the rest of the disk group. Instead of removing the original disk, let’s now remove the whole of the disk group. Make a note of the devices in the disk group, as you will need these later to rebuild the disk group.


 

As before, you are prompted as to whether you wish to evacuate the data from the disk group or not. The amount of data is also displayed, and the option is selected by default. Click DELETE to continue.


 

Once the evacuation process has completed, the disk group should no longer be visible in the "Disk Group" view.


 

Once again, if you examine the components of your VMs, there should be none found to be in an “Absent” state. All components should be “Active”, and any that were originally on the disk that was evacuated should now be rebuilt elsewhere in the cluster.

Add Disk Groups Back Again

At this point, we can recreate the deleted disk group. This was already covered in section 5.2 of this POC guide. Simply select the host that the disk group was removed from and click on the icon to create a new disk group. Once more, select a flash device and the two magnetic disk devices that you previously noted were members of the disk group. Click CREATE to recreate the disk group.

 

Turning On/Off Disk LEDs

Our final maintenance task is to turn on and off the locator LEDs on the disk drives. When using HP controllers, a utility such as hpssacli is required to control the disk locator LEDs. Refer to vendor documentation for information on how to locate and install this utility.

 Note: This is not an issue for LSI controllers, and all necessary components are shipped with ESXi for these controllers.

The icons for turning on and off the disk locator LEDs are shown in section 10.2. To turn on a LED, select a disk in the disk group and then click on the icon highlighted below.


 

This will launch a task to “Turn on disk locator LEDs”. To see if the task was successful, go to the "Monitor" tab and check the "Events" view. If there is no error, the task was successful. At this point, you can also look at the data center and visually check if the LED of the disk in question is lit.

Once completed, the locator LED can be turned off by clicking on the “Turn off disk locator LEDs” as highlighted in the screenshot below. Once again, this can be visually checked in the data center if you wish.


 

This completes this section of the POC guide. Before moving on to other sections, perform a final vSAN health check and ensure that all tests pass.

Lifecycle Management

Lifecycle management is a time-consuming task. It is common for admins to maintain their infrastructure with many tools that require specialized skills. VMware customers currently use two different interfaces for day-two operations: vSphere Update Manager (VUM) for software and drivers, and a server vendor-provided utility for firmware updates. VMware HCI sets the foundation for a new, unified mechanism for software and firmware management that is native to vSphere, called vSphere Lifecycle Manager (vLCM).

vLCM is built on a desired-state model that provides lifecycle management for the hypervisor and the full stack of drivers and firmware for the servers powering your data center. vLCM can be used to apply an image, monitor the compliance, and remediate the cluster if there is a drift. This reduces the effort to monitor compliance for individual components and helps maintain a consistent state for the entire cluster in adherence to the VMware Compatibility Guide (VCG). vLCM is a powerful new approach to creating simplified, consistent server lifecycle management at scale.

Beginning with vSAN 7U1 vLCM is aware of fault domains and supports both 2 node and stretched cluster deployments.

Note: Check VMware Compatibility Guide to ensure that the hardware used in the POC is supported with vSphere Lifecycle Manager(vLCM).

Using vLCM to set the desired image for a vSAN cluster

vLCM is based on a desired-state or declarative model which allows the user to define a desired image (ESXi version, drivers, firmware) and apply it to an entire vSphere or HCI cluster. Once defined and applied, all hosts in the cluster will be imaged with the desired state. 

A vLCM Desired Image consists of a base ESXi image (required), vendor add-ons, and firmware and driver addons. 

  • Base Image: The desired ESXi version that can be pulled from vSphere depot or manually uploaded.
  • Vendor Addons: Packages of vendor specified components such as firmware and drivers.

Note: Firmware and driver add-ons are not distributed through the official VMware online depot or as offline bundles available at my.vmware.com. For a given hardware vendor, firmware updates are available in a special vendor depot, whose content you access through a software module called a hardware support manager. The hardware support manager is a plug-in that registers itself as a vCenter Server extension. Each hardware vendor provides and manages a separate hardware support manager that integrates with vSphere.  Consult vendor documentation based on the hardware to deploy and integrate hardware support manager appliances. As of vSAN 7U1 there are three hardware vendors supported: Dell, HP, and Lenovo. 

In this section, we enable vSphere Lifecycle Manager(vLCM) to establish the desired state image for the cluster:

Click on MANAGE WITH A SINGLE IMAGE to initiate vSphere Lifecycle Manager on the cluster

 

You can choose to set up an image or import an existing image (such as a custom ISO), Click on SETUP IMAGE

 

Define Image

  1. Choose the appropriate ESXi Version
  2. (Optional) Select the relevant vendor add-ons
  3. (Optional) Select Firmware and Drivers Addon
  4. Click on VALIDATE and SAVE

Click on CHECK COMPLIANCE to check if the components in the image are compliant with the server
 

Click on FINISH IMAGE SETUP to complete setting the desired image to the cluster

 

 

Note: This workflow enables vLCM on the cluster and disables any VUM-based updates and baselines. This process is irreversible.

vLCM using Hardware Support Manager (HSM) 

In the previous section, an image was created to be used by vLCM to continuously check against and reach the desired state. However, this step only covers the configuration of the ESXi image. To fully take advantage of vLCM, repositories can be configured to obtain firmware and drivers, among others, by leveraging the vendor's HSM. 

In this example, Dell OpenManage Integration for VMware vCenter (OMIVV) will be mentioned. Deploying and configuring HSM will not be covered in this guide, as this varies by vendor.

Overview of steps within HSM prior to vLCM integration (steps may vary)

  • Deploy HSM appliance
  • Register HSM plug-in with vCenter
  • Configure hosts' credentials through a cluster profile
  • Configure Repository Profile (where vLCM will get Firmware and drivers)

Firmware and Drivers Addon

To configure Firmware and Drivers Addon within vLCM, follow the steps below:

In vCenter select Cluster -> Updates

Edit Image


 

Edit Firmware and Drivers Addon


 

Select the desired HSM, then select the firmware and driver addon (the previously created profile in HSM), and then save the image settings.


 

Image compliance check will initiate and the option to remediate will be available

Stretched Cluster

Design and Overview

Good working knowledge of how vSAN Stretched Cluster is designed and architected is assumed. Readers unfamiliar with the basics of vSAN Stretched Cluster are urged to review the relevant documentation before proceeding with this part of the proof-of-concept. Details on how to configure a vSAN Stretched Cluster are found in the vSAN Stretched Cluster Guide.

vSAN 7 introduces new intelligence to minimize impact due to capacity strained conditions. When an imbalance is detected, vSAN checks multiple parameters based on which it limits the IO to the capacity-constrained site and redirects active IO to the healthy site. Additionally, vSAN 7 enhances the replacement and resynchronizing logic of a vSAN Witness Host for Stretched Cluster and 2-node topologies.

With the release of vSAN 7U3, witnesses can now be shared across multiple clusters. Each witness appliance supports up to 64 clusters. Before vSAN 7U3 it was a 1:1 ratio of witness to cluster. 

7.0U3 introduced LCM integration for vSAN Witness (link)

Deployment of the witness is a simple deployment of an OVF:


After deployment you need to add the witness as a new host into your datacenter; it cannot be added to a cluster.  Once the host has been added you are ready to configure vSAN on your cluster. 

vSAN 7.0U2 has some additional enhancements to stretched clusters.  Now vSAN will:

  • Prioritize I/O read locality over VM site affinity rules
  • Instruct DRS not to migrate VMs to the desired site until resyncs complete
  • Reduce I/O across the ISL (inter-site link/uplink) in recovery conditions, which should improve read performance and free up the ISL for resyncs to regain compliance
  • Support larger stretched clusters of 20+20+1

 

The first time you configure vSAN with a witness host, you will claim the disks used by the witness; when adding additional clusters, this step is skipped. 


 


 

This is covered in greater detail in the vSAN Stretched Cluster Guide. 

Stretched Cluster Network Topology

As per the vSAN Stretched Cluster Guide, several different network topologies are supported for vSAN Stretched Cluster. The network topology deployed in this lab environment is a full layer 3 stretched vSAN network. L3 IP routing is implemented for the vSAN network between data sites, and L3 IP routing is implemented for the vSAN network between data sites and the witness site. VMware also supports stretched L2 between the data sites. The VM network should be a stretched L2 between both data sites.

When it comes to designing a stretched cluster topology, there are options to configure layer 2 (same subnet) or layer 3 (routed) connectivity between the three sites (two active/active data sites and a witness site) for the different traffic types (i.e. vSAN data, witness, and VM traffic), with or without Witness Traffic Separation (WTS), depending on the requirements. You may consider some of the common designs listed below. Options 1a and 1b are configurations without WTS; the only difference between them is whether L2 or L3 is deployed for vSAN data traffic. Option 2 utilizes WTS but is otherwise identical to option 1a. For simplicity, all options use L2 for VM traffic. 

In the next sections, we will cover configurations and failover scenarios using option 1a (without WTS) and option 2 (with WTS). During a POC, you may choose to test one or another, or both options if you wish. 

For more information on network design best practices for the stretched cluster, refer to the vSAN Stretched Cluster Guide on core.vmware.com. 

Stretched Cluster Hosts

There are four ESXi hosts in this cluster, two ESXi hosts on data site A (the “preferred” site), and two hosts on data site B (the “secondary” site). There is one disk group per host. The witness host/appliance is deployed in a 3rd remote data center. The configuration is referred to as 2+2+1.

 


 

VMs are deployed on both the “Preferred” and “Secondary” sites of the vSAN Stretched Cluster. VMs are running/active on both sites.

Below is a diagram detailing the POC environment used for the Stretched Cluster testing.


 

  • This configuration uses L3 IP routing for the vSAN network between all sites.
  • Static routes are required to enable communication between sites.
  • The vSAN network VLAN for the ESXi hosts on the preferred site is VLAN 4. The gateway is 172.4.0.1.
  • The vSAN network VLAN for the ESXi hosts on the secondary site is VLAN 3. The gateway is 172.3.0.1.
  • The vSAN network VLAN for the witness host on the witness site is VLAN 80.
  • The VM network is stretched L2 between the data sites. This is VLAN 30. Since no VMs are run on the witness, there is no need to extend this network to the third site.

Stretched Cluster Network Configuration 

As per the vSAN Stretched Cluster Guide, several different network topologies are supported for vSAN Stretched Cluster. The options below describe some of the different stretched cluster network configurations.

Option 1a:

  • L3 for witness traffic (without Witness Traffic Separation)
  • L2 for vSAN data traffic between 2 data sites
  • L2 for VM traffic

 

Option 1b:

  • L3 for witness traffic (without WTS)
  • L3 for vSAN data traffic between 2 data sites
  • L2 for VM traffic

 

Option 2:

  • L3 for witness traffic with WTS
  • L2 for vSAN data traffic between 2 data sites
  • L2 for VM traffic


 

vSAN Stretched Cluster (Without Witness Traffic Separation) Topology and Configuration

As per the vSAN Stretched Cluster Guide, several different network topologies are supported for vSAN Stretched Cluster.

The network topology deployed in this lab environment for our POC test case is layer 2 between the vSAN data sites and L3 between data sites and witness. ESXi hosts and vCenter are in the same L2 subnet for this setup. The VM network should be a stretched L2 between both data sites as the unique IP used by the VM can remain unchanged in a failure scenario.

Stretched Cluster Hosts

There are four ESXi hosts in this cluster, two ESXi hosts on data site A (the “preferred” site) and two hosts on data site B (the “secondary” site). There is one disk group per host. The witness host/appliance is deployed in a 3rd, remote data center. The configuration is referred to as 2+2+1.

 

VMs are deployed on both the “Preferred” and “Secondary” sites of the vSAN Stretched Cluster. VMs are running/active on both sites.

vSAN Stretched Cluster Diagram

Below is a diagram detailing the POC environment used for the Stretched Cluster testing with L2 across Preferred and Secondary data sites.

  • This configuration uses L2 across data sites for vSAN data traffic, host management, and VM traffic. L3 IP routing is implemented between the witness site and the two data sites.
  • Static routes are required to enable communication between data sites and witness appliance.
  • The vSAN data network VLAN for the ESXi hosts on the preferred and secondary sites is VLAN 201. The gateway is 192.168.201.162.
  • The vSAN network VLAN for the witness host on the witness site is VLAN 203. The gateway is 192.168.203.162.
  • The VM network is stretched L2 between the data sites. This is VLAN 106. Since no production VMs are run on the witness, there is no need to extend this network to the third site.

Preferred / Secondary Site Details

In vSAN Stretched Clusters, the “preferred” site simply means the site that the witness will ‘bind’ to in the event of an inter-site link failure between the data sites. Thus, this will be the site with the majority of VM components, so this will also be the site where all VMs will run when there is an inter-site link failure between data sites.

In this example, vSAN traffic is enabled on vmk1 on the hosts on the preferred site, which is using VLAN 201. For our failure scenarios, we create two DVS port groups and add the appropriate vmkernel port to each port group to test the failover behavior in a later stage.

 

Static routes need to be manually configured on these hosts. This is because the default gateway is on the management network, and if the preferred site hosts tried to communicate to the secondary site hosts, the traffic would be routed via the default gateway and thus via the management network. Since the management network and the vSAN network are entirely isolated, there would be no route.

L3 routing between the vSAN data sites and the witness host requires an additional static route. While the default gateway is used for the management network on vmk0, vmk1 has no route to subnet 192.168.203.0, which needs to be added manually.

 

Commands to Add Static Routes 

The command to add static routes is as follows:

 esxcli network ip route ipv4 add -g LOCAL-GATEWAY -n REMOTE-NETWORK 

To add a static route from a preferred host to the witness host in this POC:

esxcli network ip route ipv4 add -g 192.168.201.162 -n 192.168.203.0/24
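To verify that the route is in place and that the witness network is reachable over the vSAN VMkernel port, the following commands can be run on the host (WITNESS-VSAN-IP is a placeholder for the witness host's vSAN IP address in your environment):

 esxcli network ip route ipv4 list

 vmkping -I vmk1 WITNESS-VSAN-IP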

Note: Prior to vSAN version 6.6, multicast is required between the data sites but not to the witness site. If L3 is used between the data sites, multicast routing is also required. With the advent of vSAN 6.6, multicast is no longer needed.

Witness Site details

The witness site only contains a single host for the Stretched Cluster, and the only VM objects stored on this host are “witness” objects. No data components are stored on the witness host. In this POC, we are using the witness appliance, which is an “ESXi host running in a VM”. If you wish to use the witness appliance, it should be downloaded from VMware. This is because it is preconfigured with various settings and comes with a preinstalled license. Note that this download requires a login to  My VMware. 

With the release of vSAN 7.0U1, a witness can now be shared across multiple 2-node deployments, and each witness appliance supports up to 64 clusters. Before vSAN 7.0U1, there was a 1:1 ratio between witness and cluster. Deployment of the witness is a simple OVF deployment:


 

After deployment you need to add the witness as a new host into your datacenter; it cannot be added to a cluster.  Once the host has been added you are ready to configure vSAN on your cluster. 


 

The first time you configure vSAN with a witness host, you will claim the disks used by the witness; when adding additional clusters, this step is skipped. 

 

 

This is covered in greater detail in the vSAN Stretched Cluster Guide. 

Alternatively, customers can use a physical ESXi host for the witness.

While two VMkernel adapters can be deployed on the Witness Appliance (vmk0 for management and vmk1 for vSAN traffic), it is also supported to tag both vSAN and management traffic on a single VMkernel adapter (vmk0), as we do in this case. With this approach, vSAN traffic must be disabled on vmk1, since only one VMkernel adapter can have vSAN traffic enabled.

 

 

Once again, static routes should be manually configured on vSAN vmk0 to route to “Preferred site” and “Secondary Site" (VLAN 201).  The image below shows the witness host routing table with static routes to remote sites.

 

Commands to Add Static Routes 

The command to add static routes is as follows:

 esxcli network ip route ipv4 add -g LOCAL-GATEWAY -n REMOTE-NETWORK 

To add a static route from the witness host to hosts on the preferred and secondary sites in this POC:

esxcli network ip route ipv4 add -g 192.168.203.162 -n 192.168.201.0/24
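To confirm the routes on the witness host from the command line, list its routing table and ping a data-site vSAN address over vmk0 (DATA-SITE-VSAN-IP is a placeholder for one of the data-site hosts' VLAN 201 addresses):

 esxcli network ip route ipv4 list

 vmkping -I vmk0 DATA-SITE-VSAN-IP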

Note:  The Witness Appliance is a nested ESXi host and requires the same treatment as a standard ESXi host (e.g., patch updates). Keep all ESXi hosts in a vSAN cluster at the same update level, including the Witness appliance.

vSphere HA Settings

vSphere HA plays a critical part in Stretched Cluster. HA is required to restart virtual machines on other hosts and even the other site depending on the different failures that may occur in the cluster. The following section covers the recommended settings for vSphere HA when configuring it in a Stretched Cluster environment.

Response to Host Isolation

The recommendation is to “Power off and restart VMs” on isolation, as shown below. In cases where the virtual machine can no longer access the majority of its object components, it may not be possible to shut down the guest OS running in the virtual machine. Therefore, the “Power off and restart VMs” option is recommended.


 

Admission Control

If a full site fails, the desire is to have all virtual machines run on the remaining site. To allow a single data site to run all virtual machines if the other data site fails, the recommendation is to set Admission Control to 50% for CPU and Memory as shown below.


 

Advanced Settings

The default isolation address uses the default gateway of the management network. This will not be useful in a vSAN Stretched Cluster when the vSAN network is broken. Therefore, the default isolation address should be turned off by setting the advanced option das.usedefaultisolationaddress to false.

To deal with failures occurring on the vSAN network, VMware recommends setting at least one isolation address that is local to each of the data sites. In this POC, we only use stretched L2 on VLAN 201, which is reachable from the hosts on the preferred and secondary sites. Use the advanced setting das.isolationaddress0 to set the isolation address to the IP gateway address used to reach the witness host.

  • Since vSphere 6.5, there is no need to add VM anti-affinity rules or VM-to-host affinity rules in the HA advanced settings

These advanced settings are added in the Advanced Options > Configuration Parameter section of the vSphere HA UI. The other advanced settings get filled in automatically based on additional configuration steps. There is no need to add them manually.
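For reference, these two options can also be set with PowerCLI. Below is a minimal sketch, assuming the cluster is named VSAN-Cluster and reusing the vSAN gateway 192.168.201.162 from this POC as the isolation address; adjust both to your environment:

# Turn off the default (management) isolation address and add a vSAN-reachable one
$cluster = Get-Cluster -Name "VSAN-Cluster"
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.usedefaultisolationaddress" -Value "false" -Confirm:$false
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.isolationaddress0" -Value "192.168.201.162" -Confirm:$false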

VM Host Affinity Groups 

The next step is to configure VM/Host affinity groups. This allows administrators to automatically place a virtual machine on a particular site when it is powered on. In the event of a failure, the virtual machine will remain on the same site, but placed on a different host. The virtual machine will be restarted on the remote site only when there is a catastrophic failure or a significant resource shortage.

To configure VM/Host affinity groups, the first step is to add hosts to the host groups. In this example, the Host Groups are named Preferred and Secondary, as shown below.

 

 

The next step is to add the virtual machines to the host groups. Note that these virtual machines must be created in advance.

 

Note that these VM/Host affinity rules are “should” rules and not “must” rules. “Should” rules mean that every attempt will be made to adhere to the affinity rules. However, if this is not possible (due to a lack of resources), the other site will be used for hosting the virtual machine.

Also, note that the vSphere HA rule setting is set to “should”. This means that if there is a catastrophic failure on the site to which the VM has an affinity, HA will restart the virtual machine on the other site. If this was a “must” rule, HA would not start the VM on the other site.


 

The same settings are necessary on both the primary VM/Host group and the secondary VM/Host group.

DRS Settings 

In this POC, the partially automated mode has been chosen. However, this could be set to Fully Automated if desired; note that it should be changed back to Partially Automated when a full site failure occurs. This is to avoid failback of VMs occurring while rebuild activity is still taking place. More on this later.
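If you want to script the switch between automation levels during testing, a minimal PowerCLI sketch follows (the cluster name VSAN-Cluster is an assumption; substitute your own):

# Keep DRS in partially automated mode during the site failure tests
Get-Cluster -Name "VSAN-Cluster" | Set-Cluster -DrsAutomationLevel PartiallyAutomated -Confirm:$false
# Return to fully automated once resyncs have completed
Get-Cluster -Name "VSAN-Cluster" | Set-Cluster -DrsAutomationLevel FullyAutomated -Confirm:$false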


 

vSAN Stretched Cluster Local Failure Protection

In vSAN 6.6, we build on resiliency by including local failure protection, which provides storage redundancy within each site and across sites. Local failure protection is achieved by implementing local RAID-1 mirroring or RAID-5/6 erasure coding within each site. This means that we can protect the objects against failures within a site, for example, if there is a host failure on site 1, vSAN can self-heal within site 1 without having to go to site 2 if properly configured.

 

Local Failure Protection in vSAN 6.6 is configured and managed through a storage policy in the vSphere Web Client. The figure below shows rules in a storage policy that is part of an all-flash stretched cluster configuration. The "Site disaster tolerance" is set to Dual site mirroring (stretched cluster), which instructs vSAN to mirror data across the two main sites of the stretched cluster. The "Failures to tolerate" specifies how data is protected within the site. In the example storage policy below, 1 failure - RAID-5 (Erasure Coding) is used, which can tolerate the loss of a host within the site.

 

Local failure protection within a stretched cluster further improves the resiliency of the cluster to minimize unplanned downtime. This feature also reduces or eliminates cross-site traffic in cases where components need to be resynchronized or rebuilt. vSAN lowers the total cost of ownership of a stretched cluster solution as there is no need to purchase additional hardware or software to achieve this level of resiliency.

vSAN Stretched Cluster Site Affinity

vSAN 6.6 improves the flexibility of storage policy-based management for stretched clusters by introducing the “Affinity” rule. You can specify a single site to locate VM objects in cases where cross-site redundancy is not necessary. Common examples include applications that have built-in replication or redundancy, such as Microsoft Active Directory and Oracle Real Application Clusters (RAC). This capability reduces costs by minimizing the storage and network resources used by these workloads.

Site affinity is easy to configure and manage using storage policy-based management. A storage policy is created and the Affinity rule is added to specify the site where a VM’s objects will be stored using one of these "Site disaster tolerance" options:  None - keep data on Preferred (stretched cluster)  or  None - keep data on Non-preferred (stretched cluster).
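If you prefer to script such a policy, the vSAN capability that backs the Affinity rule can be discovered with the PowerCLI SPBM cmdlets before building the policy. The sketch below only lists the available vSAN capabilities and their allowed values, since the exact capability names and value strings can vary between releases:

# Requires an SPBM-capable PowerCLI connection to vCenter (Connect-VIServer)
Get-SpbmCapability | Where-Object { $_.Name -like "VSAN.*" } | Format-List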

 

vSAN Stretched Cluster Preferred Site Override

Preferred and secondary sites are defined during cluster creation. If it is desired to switch the roles between the two data sites, you can navigate to  [vSAN cluster] > Configure > vSAN > Fault Domains, select the "Secondary" site in the right pane and click the highlighted button as shown below to switch the data site roles.


 

vSAN Stretched Cluster and 2 Node (Without Witness Traffic Separation)

Failover Scenarios

In this section, we will look at how to inject various network failures in a vSAN Stretched Cluster configuration. We will see how the failure manifests itself in the cluster, focusing on the vSAN health check and the alarms/events as reported in the vSphere client.

Network failover scenarios for Stretched Cluster with or without Witness Traffic separation and ROBO with/without direct connect are the same because the Witness traffic is always connected via L3.


Scenario #1: Network Failure between Secondary Site and Witness host

 

Trigger the Event

To make the secondary site lose access to the witness site, one can simply remove the static route on the witness host that provides a path to the secondary site.

On secondary host(s), remove the static route to the witness host. For example:

esxcli network ip route ipv4 remove -g 192.168.201.162 -n 192.168.203.0/24

 

Cluster Behavior on Failure

In such a failure scenario, when the witness is isolated from one of the data sites, it implies that it cannot communicate to both the master node AND the backup node. In stretched clusters, the master node and the backup node are placed on different fault domains [sites]. This is the case in this failure scenario. Therefore, the witness becomes isolated, and the nodes on the preferred and secondary sites remain in the cluster. Let's see how this bubbles up in the UI.

To begin with, the Cluster Summary view shows one configuration issue related to "Witness host found".


 

This same event is visible in the  [vSAN Cluster] > Monitor > Issues and Alarms > All Issues view.

 

Note that this event may take some time to trigger. Next, looking at the health check alarms, a number of them get triggered.


 

On navigating to the [vSAN Cluster] > Monitor > vSAN > Health view, there are a lot of checks showing errors.


 

One final place to examine is virtual machines. Navigate to [vSAN cluster] > Monitor > vSAN > Virtual Objects > View Placement Details. It should show the witness absent from the secondary site perspective. However, virtual machines should still be running and fully accessible.


 

Returning to the health check, selecting Data > vSAN object health, you can see the error 'Reduced availability with no rebuild - delay timer'.


 

Conclusion

The loss of the witness does not impact the running virtual machines on the secondary site. There is still a quorum of components available per object, available from the data sites. Since there is only a single witness host/site, and only three fault domains, there is no rebuilding/resyncing of objects.

Repair the Failure

Add back the static routes that were removed earlier and rerun the health check tests. Verify that all tests are passing before proceeding.  Remember to test one thing at a time.

 

Scenario #2: Network Failure between Preferred Site and Witness host

 


 

Trigger the Event

To make the preferred site lose access to the witness site, one can simply remove the static route on the witness host that provides a path to the preferred site.

On preferred host(s), remove the static route to the witness host. For example:

esxcli network ip route ipv4 remove -g 192.168.201.162 -n 192.168.203.0/24

Cluster Behavior on Failure

As per the previous test, this results in the same partitioning as before. The witness becomes isolated, and the nodes in both data sites remain in the cluster. It may take some time for alarms to trigger when this event occurs. However, the events are similar to those seen previously.

 

 

One can also see various health checks fail, and their associated alarms being raised.


 

Just like the previous test, the witness component goes absent.


 

This health check behavior appears whenever components go ‘absent’ and vSAN is waiting for the 60-minute clomd timer to expire before starting any rebuilds. If an administrator clicks on “Repair Objects Immediately”, the objects switch state and now the objects are no longer waiting on the timer and will start to rebuild immediately under general circumstances. However, in this POC, with only three fault domains and no place to rebuild witness components, there is no syncing/rebuilding.

Conclusion

Just like the previous test, a witness failure has no impact on the running virtual machines on the preferred site. There is still a quorum of components available per object, as the data sites can still communicate. Since there is only a single witness host/site, and only three fault domains, there is no rebuilding/resyncing of objects.

Repair the Failure

Add back the static routes that were removed earlier and rerun the health check tests. Verify that all tests are passing before proceeding.  Remember to test one thing at a time. 

 

Scenario #3: Network Failure from both Data sites to Witness host

 


 

Trigger the Event

To introduce a network failure between the preferred and secondary data sites and the witness site, we remove the static route on each preferred/secondary host to the Witness host. The same behavior can be achieved by shutting down the Witness Host temporarily.

On Preferred / Secondary host(s), remove the static route to the witness host. For example:

esxcli network ip route ipv4 remove -g 192.168.201.162 -n 192.168.203.0/24

Cluster Behavior on Failure

The events observed are for the most part identical to those observed in failure scenarios #1 and #2.

Conclusion

When the vSAN network fails between the witness site and both the data sites (as in the witness site fully losing its WAN access), it does not impact the running virtual machines. There is still a quorum of components available per object, available from the data sites. However, as explained previously, since there is only a single witness host/site, and only three fault domains, there is no rebuilding/resyncing of objects.

Repair the Failure

Add back the static routes that were removed earlier and rerun the health check tests. Verify that all tests are passing.  Remember to test one thing at a time. 

 

Scenario #4: Network Failure (Data and Witness Traffic) in Secondary site


 

Trigger the Event

To introduce a network failure on the secondary data site, we need to disable network traffic on the vSAN VMkernel port vmk1, as both vSAN data and witness traffic are served over this single VMkernel port.
Earlier, we created two DVS port groups: hosts .23/.24 on the secondary site use "VLAN201-vSAN-Secondary", and hosts .21/.22 on the preferred site use "VLAN201-vSAN-Preferred".

1) On secondary host(s), remove the static route to the witness host. For example:

esxcli network ip route ipv4 remove -g 192.168.201.162 -n 192.168.203.0/24

2) During the failure scenario, using the DVS capability (a standard vSwitch can also be used), set the "active" Uplink 1 to "Unused" via the "Teaming and failover" policy in the port group "VLAN201-vSAN-Secondary":


 

Cluster Behavior on Failure

To begin with, the Cluster Summary view shows the "HA host status" error, which is expected since, in our case, vmk1 is used for HA traffic.


 

All VMs from the secondary data site will be restarted via HA on the preferred data site.


 

vSAN Health Service will show errors, such as "vSAN cluster partition", which is expected as one full site has failed.

Verify on each host, or via [vSAN cluster] -> VMs, whether all VMs were restarted on the preferred site. Adding the "Host" column will show if the VMs are now running on the preferred site.

 

Conclusion

When the vSAN network fails in one of the data sites, it does not impact the running virtual machines on the available data site because quorum still exists. VMs on the unavailable data site will be restarted via HA on the available data site. However, as explained previously, since there is only a single witness host/site, and only three fault domains, there is no rebuilding/resyncing of objects.

 Note:  DRS is set to "Partially Automated", so if "Uplink 1" is changed from "Unused" back to "Active", the VMs will not automatically move back to the secondary site. DRS in "Fully Automated" mode will move the VMs via vMotion back to their recommended site as configured earlier via the VM/Host groups and rules.

Repair the Failure

Change the Uplink 1 from "Unused" back to "Active" on the DVS Portgroup and verify that all tests are passing.  Remember to test one thing at a time. 

 

Scenario #5: Network Failure (Data and Witness Traffic) in Preferred site

Trigger the Event

To introduce a network failure on the preferred data site, we need to disable network traffic on the vSAN VMkernel port vmk1, as both vSAN data and witness traffic are served over this single VMkernel port.
Earlier, we created two DVS port groups: hosts .23/.24 on the secondary site use "VLAN201-vSAN-Secondary", and hosts .21/.22 on the preferred site use "VLAN201-vSAN-Preferred".

1) On preferred host(s), remove the static route to the witness host. For example:

esxcli network ip route ipv4 remove -g 192.168.201.162 -n 192.168.203.0/24

2) During the failure scenario, using the DVS capability (a standard vSwitch can also be used), set the "active" Uplink 1 to "Unused" via the "Teaming and failover" policy in the port group "VLAN201-vSAN-Preferred":


 

Cluster Behavior on Failure

To begin with, the Cluster Summary view shows the "HA host status" error, which is expected since, in our case, vmk1 is used for HA traffic. Quorum is formed on the secondary site together with the witness.

 


 

vSAN Health Service will show errors such as "vSAN cluster partition", which is expected as one full site has failed.

Verify on each host, or via [vSAN cluster] -> VMs, whether all VMs were restarted on the secondary site. Adding the "Host" column will show if the VMs are now running on the secondary site.


 

Conclusion

When the vSAN network fails in one of the data sites, it does not impact the running virtual machines on the available data site because quorum still exists. VMs on the unavailable data site will be restarted via HA on the available data site. However, as explained previously, since there is only a single witness host/site, and only three fault domains, there is no rebuilding/resyncing of objects.

 Note:  DRS is set to "Partially Automated", so if "Uplink 1" is changed from "Unused" back to "Active", the VMs will not automatically move back to the secondary site. DRS in "Fully Automated" mode will move the VMs via vMotion back to their recommended site as configured earlier via the VM/Host groups and rules.

Repair the Failure

Change the Uplink 1 from "Unused" back to "Active" on the DVS Portgroup and verify that all tests are passing.  Remember to test one thing at a time. 

 

Scenario #6: Network Failure between both Data Sites but Witness Host still accessible

 

 

Trigger the Event

Link failure between preferred and secondary data sites simulates a datacenter link failure while Witness Traffic remains up and running (i.e. router/firewall are still accessible to reach the Witness host).

To trigger the failure scenario, we can either disable the network link between both data centers physically or use the DVS traffic filter function in the POC. In this scenario, each link needs to remain active and the static route(s) need to be intact.

 Note:  This IP filter functionality only exists in the DVS. The use of IP filters is only practical with a few hosts, as separate rules need to be created between each source and destination host.


 

Create a filter rule using "ADD" for each host per site. In our 2+2+1 setup, four filter rules are created as follows to simulate an IP flow disconnect between the preferred and secondary sites:


 

Note: Verify the settings as highlighted above, especially that the protocol is set to "any", to ensure no traffic of any kind flows between hosts .21/.22 and .23/.24.

 

Enable the newly created DVS filters:

 

Cluster Behavior on Failure

The events observed are similar to earlier scenarios; the IP filter policies we created result in a cluster partition, which is expected. HA will restart all VMs from the secondary site on the preferred site.

VMs are restarted by HA on preferred site:

 

Conclusion

In the failure scenario, if the data link between data centers is disrupted, HA will start the VMs on the preferred site. There is still a quorum of components available per object, available from the data sites. However, as explained previously, since there is only a single witness host/site, and only three fault domains, there is no rebuilding/resyncing of objects.

Repair the Failure

Disable the DVS filter rules and rerun the health check tests. Verify that all tests are passing.  Remember to test one thing at a time. 

Stretched Cluster with WTS Configuration

Starting with vSphere 6.7, several new vSAN features were included. One of the features expands the vSAN Stretched Cluster configuration by adding the Witness traffic separation functionality.

A good working knowledge of how vSAN Stretched Cluster is designed and architected is assumed. Readers unfamiliar with the basics of vSAN Stretched Cluster are urged to review the relevant documentation before proceeding with this part of the proof-of-concept. Details on how to configure a vSAN Stretched Cluster are found in the  vSAN Stretched Cluster Guide  .

Stretched Cluster Network Topology 

As per the vSAN Stretched Cluster Guide, several different network topologies are supported for vSAN Stretched Cluster.

The network topology deployed in this lab environment for our POC test case is layer 2 between the vSAN data sites and L3 between data sites and witness. ESXi hosts and vCenter are in the same L2 subnet for this setup. The VM network should be a stretched L2 between both data sites as the unique IP used by the VM can remain unchanged in a failure scenario.

With WTS, we leave Management vmk0 and vSAN enabled vmk1 unchanged, while adding vmk2 to all hosts on data sites to isolate witness traffic on vmk2. In vSAN ROBO with WTS, vmk0 is used for Witness Traffic and vmk1 for vSAN Data in a direct connect scenario.

Stretched Cluster Hosts with WTS

There are four ESXi hosts in this cluster, two ESXi hosts on data site A (the “preferred” site) and two hosts on data site B (the “secondary” site). There is one disk group per host. The witness host/appliance is deployed in a 3rd, remote data center. The configuration is referred to as 2+2+1.

 

VMs are deployed on both the “Preferred” and “Secondary” sites of the vSAN Stretched Cluster. VMs are running/active on both sites.

vSAN Stretched Cluster with WTS Diagram

Below is a diagram detailing the POC environment used for the Stretched Cluster testing with L2 across Preferred and Secondary data sites. vmk2 is added to expand the functionality for WTS.

  • This configuration uses L2 across data sites for vSAN data traffic, host management and VM traffic. L3 IP routing is implemented between the witness site and the two data sites.
  • Static routes are required to enable communication between data sites and witness appliance.
  • The vSAN data network VLAN for the ESXi hosts on the preferred and secondary sites is VLAN 201 in a pure L2 configuration.
  • The vSAN network VLAN for the witness host on the witness site is VLAN 203. The gateway is 192.168.203.162.
  • Witness traffic (WTS) uses VLAN 205, which is L3-routed to the witness host.
  • The VM network is stretched L2 between the data sites. This is VLAN 106. Since no production VMs are run on the witness, there is no need to extend this network to the third site.

Preferred / Secondary Site Details

In vSAN Stretched Clusters, “preferred” site simply means the site that the witness will ‘bind’ to in the event of an inter-site link failure between the data sites. Thus, this will be the site with the majority of VM components, so this will also be the site where all VMs will run when there is an inter-site link failure between data sites.

In this example, vSAN traffic is enabled on vmk1 on the hosts on the preferred site, which is using VLAN 201. For our failure scenarios, we create two DVS port groups and add the appropriate vmkernel port to each port group to test the failover behavior in a later stage.


 

vmk2 is configured as a VMkernel port without any services assigned and will be tagged for witness traffic using the command line. Only vmk1, used for vSAN data traffic, is tagged for vSAN.

 

 Commands to Configure the VMkernel Port for WTS 

The command to tag a VMkernel port for WTS is as follows:

 esxcli vsan network ipv4 add -i vmkX -T=witness  

To tag vmk2 for WTS on each of the POC hosts .21-24:

esxcli vsan network ipv4 add -i vmk2 -T=witness
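The tagging can also be confirmed from the command line on each host; the output lists each vSAN-enabled VMkernel interface along with its traffic type (vmk1 should show vsan, and vmk2 should show witness):

 esxcli vsan network list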

Upon successful execution of the command, vSAN Witness traffic should be tagged for vmk2 in the UI as shown below.


 

Static routes need to be manually configured on these hosts. This is because the default gateway is on the management network, and if the preferred site hosts tried to communicate to the secondary site hosts, the traffic would be routed via the default gateway and thus via the management network. Since the management network and the vSAN network are entirely isolated, there would be no route.

L3 routing between the vSAN data sites and the witness host requires an additional static route. While the default gateway is used for the management network on vmk0, vmk2 has no route to subnet 192.168.203.0, which needs to be added manually.


 

Commands to Add Static Routes 

The command to add static routes is as follows:

 esxcli network ip route ipv4 add -g LOCAL-GATEWAY -n REMOTE-NETWORK 

To add a static route from a preferred host to the witness host in this POC:

esxcli network ip route ipv4 add -g 192.168.205.162 -n 192.168.203.0/24

Note:  Prior to vSAN version 6.6, multicast is required between the data sites but not to the witness site. If L3 is used between the data sites, multicast routing is also required. With the advent of vSAN 6.6, multicast is no longer needed.

 Witness Site details

The witness site only contains a single host for the Stretched Cluster, and the only VM objects stored on this host are “witness” objects. No data components are stored on the witness host. In this POC, we are using the witness appliance, which is an “ESXi host running in a VM”. If you wish to use the witness appliance, it should be downloaded from VMware. This is because it is preconfigured with various settings, and comes with a preinstalled license. Note that this download requires a login to  My VMware.

With the release of vSAN 7.0U1, a witness can now be shared across multiple 2-node deployments, and each witness appliance supports up to 64 clusters. Before vSAN 7.0U1, there was a 1:1 ratio between witness and cluster. Deployment of the witness is a simple OVF deployment:


 

After deployment you need to add the witness as a new host into your datacenter; it cannot be added to a cluster.  Once the host has been added you are ready to configure vSAN on your cluster. 


 

The first time you configure vSAN with a witness host, you will claim the disks used by the witness; when adding additional clusters, this step is skipped. 

 

Alternatively, customers can use a physical ESXi host for the witness.

While two VMkernel adapters can be deployed on the Witness Appliance (vmk0 for management and vmk1 for vSAN traffic), it is also supported to tag both vSAN and management traffic on a single VMkernel adapter (vmk0), as we do in this case. With this approach, vSAN traffic must be disabled on vmk1, since only one VMkernel adapter can have vSAN traffic enabled.
Note:  In our POC example, we add a manual route for completeness; it is not required if the default gateway can reach the witness traffic separation subnet on the data site hosts.


 

Once again, static routes should be manually configured on vSAN vmk0 to route to “Preferred site” and “Secondary Site" Witness Traffic (VLAN 205).  The image below shows the witness host routing table with static routes to remote sites.

Commands to Add Static Routes 

The command to add static routes is as follows:

 esxcli network ip route ipv4 add -g LOCAL-GATEWAY -n REMOTE-NETWORK 

To add a static route from the witness host to hosts on the preferred and secondary sites in this POC:

esxcli network ip route ipv4 add -g 192.168.203.162 -n 192.168.205.0/24

Note:  The Witness Appliance is a nested ESXi host and requires the same treatment as a standard ESXi host (e.g., patch updates). Keep all ESXi hosts in a vSAN cluster at the same update level, including the Witness appliance.

To confirm if the new static route and gateway function properly to reach the Witness host subnet 192.168.203.0/24 via vmk2, navigate to vSAN Health service and verify that all tests show “green” status. 
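As an additional command-line spot check from a data-site host, the witness subnet can be pinged over the witness traffic VMkernel port (WITNESS-VSAN-IP is a placeholder for the witness host's VLAN 203 address):

 vmkping -I vmk2 WITNESS-VSAN-IP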

Witness Traffic Separation is established, and this type of traffic can communicate between VLAN 205 and VLAN 203. Only VLAN 201 serves vSAN data IO traffic across both data sites.

 vSphere HA Settings

vSphere HA plays a critical part in Stretched Cluster. HA is required to restart virtual machines on other hosts and even the other site depending on the different failures that may occur in the cluster. The following section covers the recommended settings for vSphere HA when configuring it in a Stretched Cluster environment.

Response to Host Isolation

The recommendation is to “Power off and restart VMs” on isolation, as shown below. In cases where the virtual machine can no longer access the majority of its object components, it may not be possible to shut down the guest OS running in the virtual machine. Therefore, the “Power off and restart VMs” option is recommended.


 

Admission Control

If a full site fails, the desire is to have all virtual machines run on the remaining site. To allow a single data site to run all virtual machines if the other data site fails, the recommendation is to set Admission Control to 50% for CPU and Memory as shown below.


 

Advanced Settings

The default isolation address uses the default gateway of the management network. This will not be useful in a vSAN Stretched Cluster when the vSAN network is broken. Therefore, the default isolation address should be turned off by setting the advanced option das.usedefaultisolationaddress to false.

To deal with failures occurring on the vSAN network, VMware recommends setting at least one isolation address that is local to each of the data sites. In this POC, we only use stretched L2 on VLAN 201, which is reachable from the hosts on the preferred and secondary sites. Use the advanced setting das.isolationaddress0 to set the isolation address to the IP gateway address used to reach the witness host.

  • Since vSphere 6.5, there is no need to add VM anti-affinity rules or VM-to-host affinity rules in the HA advanced settings

These advanced settings are added in the Advanced Options > Configuration Parameter section of the vSphere HA UI. The other advanced settings get filled in automatically based on additional configuration steps. There is no need to add them manually.

VM Host Affinity Groups 

The next step is to configure VM/Host affinity groups. This allows administrators to automatically place a virtual machine on a particular site when it is powered on. In the event of a failure, the virtual machine will remain on the same site, but placed on a different host. The virtual machine will be restarted on the remote site only when there is a catastrophic failure or a significant resource shortage.

To configure VM/Host affinity groups, the first step is to add hosts to the host groups. In this example, the Host Groups are named Preferred and Secondary, as shown below.

 

The next step is to add the virtual machines to the host groups. Note that these virtual machines must be created in advance.

 

Note that these VM/Host affinity rules are “should” rules and not “must” rules. “Should” rules mean that every attempt will be made to adhere to the affinity rules. However, if this is not possible (due to a lack of resources), the other site will be used for hosting the virtual machine.

Also, note that the vSphere HA rule setting is set to “should”. This means that if there is a catastrophic failure on the site to which the VM has an affinity, HA will restart the virtual machine on the other site. If this was a “must” rule, HA would not start the VM on the other site.


 

The same settings are necessary on both the primary VM/Host group and the secondary VM/Host group.

DRS Settings 

In this POC, the partially automated mode has been chosen. However, this could be set to Fully Automated if desired; note that it should be changed back to Partially Automated when a full site failure occurs. This is to avoid failback of VMs occurring while rebuild activity is still taking place. More on this later.

 

vSAN Stretched Cluster Local Failure Protection

In vSAN 6.6, we build on resiliency by including local failure protection, which provides storage redundancy within each site and across sites. Local failure protection is achieved by implementing local RAID-1 mirroring or RAID-5/6 erasure coding within each site. This means that we can protect the objects against failures within a site, for example, if there is a host failure on site 1, vSAN can self-heal within site 1 without having to go to site 2 if properly configured.

 

Local Failure Protection in vSAN 6.6 is configured and managed through a storage policy in the vSphere Web Client. The figure below shows rules in a storage policy that is part of an all-flash stretched cluster configuration. The "Site disaster tolerance" is set to Dual site mirroring (stretched cluster), which instructs vSAN to mirror data across the two main sites of the stretched cluster. The "Failures to tolerate" specifies how data is protected within the site. In the example storage policy below, 1 failure - RAID-5 (Erasure Coding) is used, which can tolerate the loss of a host within the site.

 

Local failure protection within a stretched cluster further improves the resiliency of the cluster to minimize unplanned downtime. This feature also reduces or eliminates cross-site traffic in cases where components need to be resynchronized or rebuilt. vSAN lowers the total cost of ownership of a stretched cluster solution as there is no need to purchase additional hardware or software to achieve this level of resiliency.

vSAN Stretched Cluster Site Affinity

vSAN 6.6 improves the flexibility of storage policy-based management for stretched clusters by introducing the “Affinity” rule. You can specify a single site to locate VM objects in cases where cross-site redundancy is not necessary. Common examples include applications that have built-in replication or redundancy, such as Microsoft Active Directory and Oracle Real Application Clusters (RAC). This capability reduces costs by minimizing the storage and network resources used by these workloads.

Site affinity is easy to configure and manage using storage policy-based management. A storage policy is created and the Affinity rule is added to specify the site where a VM’s objects will be stored using one of these "Site disaster tolerance" options:  None - keep data on Preferred (stretched cluster)  or  None - keep data on Non-preferred (stretched cluster).

vSAN Stretched Cluster Preferred Site Override

Preferred and secondary sites are defined during cluster creation. If it is desired to switch the roles between the two data sites, you can navigate to  [vSAN cluster] > Configure > vSAN > Fault Domains, select the "Secondary" site in the right pane and click the highlighted button as shown below to switch the data site roles.


 

Stretched Cluster with WTS Failover Scenarios

In this section, we will look at how to inject various network failures in a vSAN Stretched Cluster configuration. We will see how the failure manifests itself in the cluster, focusing on the vSAN health check and the alarms/events as reported in the vSphere client.

Network failover scenarios for Stretched Cluster with or without Witness Traffic separation and ROBO with/without direct connect are the same because the Witness traffic is always connected via L3.

Scenario #1: Network Failure between Secondary Site and Witness host

 

 

Trigger the Event

To make the secondary site lose access to the witness site, remove the static route on the secondary ESXi host(s) that provides the IP path to the witness site.

On secondary host(s), remove the static route to the witness host. For example:

esxcli network ip route ipv4 remove -g 192.168.201.162 -n 192.168.203.0/24

 

Cluster Behavior on Failure

In such a failure scenario, when the witness is isolated from one of the data sites, it implies that it cannot communicate to both the master node AND the backup node. In stretched clusters, the master node and the backup node are placed on different fault domains [sites]. This is the case in this failure scenario. Therefore, the witness becomes isolated, and the nodes on the preferred and secondary sites remain in the cluster. Let's see how this bubbles up in the UI.

To begin with, the Cluster Summary view shows one configuration issue related to "Witness host found".


 

This same event is visible in the [vSAN Cluster] > Monitor > Issues and Alarms > All Issues view.

 

Note that this event may take some time to trigger. Next, looking at the health check alarms, a number of them get triggered.


 

On navigating to the [vSAN Cluster] > Monitor > vSAN > Health view, there are a lot of checks showing errors.


 

One final place to examine is virtual machines. Navigate to [vSAN cluster] > Monitor > vSAN > Virtual Objects > View Placement Details. It should show the witness absent from the secondary site perspective. However, virtual machines should still be running and fully accessible.


 

Returning to the health check, selecting Data > vSAN object health, you can see the error 'Reduced availability with no rebuild - delay timer'.


 

Conclusion

The loss of the witness does not impact the running virtual machines on the secondary site. There is still a quorum of components available per object, available from the data sites. Since there is only a single witness host/site, and only three fault domains, there is no rebuilding/resyncing of objects.

Repair the Failure

Add back the static routes that were removed earlier and rerun the health check tests. Verify that all tests are passing before proceeding.  Remember to test one thing at a time.

 

Scenario #2: Network Failure between Preferred Site and Witness host

 

Trigger the Event

To make the preferred site lose access to the witness site, one can simply remove the static route on the witness host that provides a path to the preferred site.

On preferred host(s), remove the static route to the witness host. For example:

esxcli network ip route ipv4 remove -g 192.168.201.162 -n 192.168.203.0/24

 Cluster Behavior on Failure

As per the previous test, this results in the same partitioning as before. The witness becomes isolated, and the nodes in both data sites remain in the cluster. It may take some time for alarms to trigger when this event occurs. However, the events are similar to those seen previously.

 

 

One can also see various health checks fail, and their associated alarms being raised.


 

Just like the previous test, the witness component goes absent.


 

This health check behavior appears whenever components go ‘absent’ and vSAN is waiting for the 60-minute clomd timer to expire before starting any rebuilds. If an administrator clicks on “Repair Objects Immediately”, the objects switch state and now the objects are no longer waiting on the timer and will start to rebuild immediately under general circumstances. However, in this POC, with only three fault domains and no place to rebuild witness components, there is no syncing/rebuilding.

Conclusion

Just like the previous test, a witness failure has no impact on the running virtual machines on the preferred site. There is still a quorum of components available per object, as the data sites can still communicate. Since there is only a single witness host/site, and only three fault domains, there is no rebuilding/resyncing of objects.

Repair the Failure

Add back the static routes that were removed earlier and rerun the health check tests. Verify that all tests are passing before proceeding.  Remember to test one thing at a time. 

 

Scenario #3: Network Failure from both Data sites to Witness host

 

Trigger the Event

To introduce a network failure between the preferred and secondary data sites and the witness site, we remove the static route on each preferred/secondary host to the Witness host. The same behavior can be achieved by shutting down the Witness Host temporarily.

On Preferred / Secondary host(s), remove the static route to the witness host. For example:

esxcli network ip route ipv4 remove -g 192.168.201.162 -n 192.168.203.0/24

Cluster Behavior on Failure

The events observed are for the most part identical to those observed in failure scenarios #1 and #2.

Conclusion

When the vSAN network fails between the witness site and both the data sites (as in the witness site fully losing its WAN access), it does not impact the running virtual machines. There is still a quorum of components available per object, available from the data sites. However, as explained previously, since there is only a single witness host/site, and only three fault domains, there is no rebuilding/resyncing of objects.

Repair the Failure

Add back the static routes that were removed earlier and rerun the health check tests. Verify that all tests are passing.  Remember to test one thing at a time. 

 

Scenario #4: Network Failure (Data and Witness Traffic) in Secondary site


 

Trigger the Event

To introduce a network failure on the secondary data site, we need to disable network traffic on the vSAN VMkernel port vmk1, as both vSAN data and witness traffic are served over this single VMkernel port.
Earlier, we created two DVS port groups: hosts .23/.24 on the secondary site use "VLAN201-vSAN-Secondary", and hosts .21/.22 on the preferred site use "VLAN201-vSAN-Preferred".

1) On secondary host(s), remove the static route to the witness host. For example:

esxcli network ip route ipv4 remove -g 192.168.201.162 -n 192.168.203.0/24

2) During the failure scenario, using the DVS capability (a standard vSwitch can also be used), set the "active" Uplink 1 to "Unused" via the "Teaming and failover" policy in the port group "VLAN201-vSAN-Secondary":


 

Cluster Behavior on Failure

To begin with, the Cluster Summary view shows the "HA host status" error, which is expected since, in our case, vmk1 is used for HA traffic.


 

All VMs from the secondary data site will be restarted via HA on the preferred data site.


 

vSAN Health Service will show errors, such as "vSAN cluster partition", which is expected as one full site has failed.

Verify on each host, or via [vSAN cluster] -> VMs, whether all VMs were restarted on the preferred site. Adding the "Host" column will show if the VMs are now running on the preferred site.

 

Conclusion

When the vSAN network fails in one of the data sites, it does not impact the running virtual machines on the available data site because quorum still exists. VMs on the unavailable data site will be restarted via HA on the available data site. However, as explained previously, since there is only a single witness host/site, and only three fault domains, there is no rebuilding/resyncing of objects.

 Note:  DRS is set to "Partially Automated", so if "Uplink 1" is changed from "Unused" back to "Active", the VMs will not automatically move back to the secondary site. DRS in "Fully Automated" mode will move the VMs via vMotion back to their recommended site as configured earlier via the VM/Host groups and rules.

Repair the Failure

Change the Uplink 1 from "Unused" back to "Active" on the DVS Portgroup and verify that all tests are passing.  Remember to test one thing at a time. 

 

Scenario #5: Network Failure (Data and Witness Traffic) in Preferred site


 

Trigger the Event

To introduce a network failure on the preferred data site, we need to disable network traffic on the vSAN VMkernel port vmk1, as both vSAN data and witness traffic are served over this single VMkernel port.
Earlier, we created two DVS port groups: hosts .23/.24 on the secondary site use "VLAN201-vSAN-Secondary", and hosts .21/.22 on the preferred site use "VLAN201-vSAN-Preferred".

1) On preferred host(s), remove the static route to the witness host. For example:

esxcli network ip route ipv4 remove -g 192.168.201.162 -n 192.168.203.0/24

2) During the failure scenario, using the DVS capability (a standard vSwitch can also be used), set the "active" Uplink 1 to "Unused" via the "Teaming and failover" policy in the port group "VLAN201-vSAN-Preferred":


 

Cluster Behavior on Failure

To begin with, the Cluster Summary view shows the "HA host status" error, which is expected since, in our case, vmk1 is used for HA traffic. Quorum is formed on the secondary site together with the witness.

 


 

vSAN Health Service will show errors, such as "vSAN cluster partition", which is expected as one full site has failed.
Verify on each host, or via [vSAN cluster] -> VMs, whether all VMs were restarted on the secondary site. Adding the "Host" column will show if the VMs are now running on the secondary site.


 

Conclusion

When the vSAN network fails in one of the data sites, it does not impact the running virtual machines on the available data site because quorum still exists. VMs on the unavailable data site will be restarted via HA on the available data site. However, as explained previously, since there is only a single witness host/site, and only three fault domains, there is no rebuilding/resyncing of objects.

 Note:  DRS is set to "Partially Automated", so if "Uplink 1" is changed from "Unused" back to "Active", the VMs will not automatically move back to the secondary site. DRS in "Fully Automated" mode will move the VMs via vMotion back to their recommended site as configured earlier via the VM/Host groups and rules.

Repair the Failure

Change the Uplink 1 from "Unused" back to "Active" on the DVS Portgroup and verify that all tests are passing.  Remember to test one thing at a time. 

Scenario #6: Network Failure between both Data Sites, Witness host accessible via WTS

 

 

Trigger the Event

Link failure between preferred and secondary data sites simulates a datacenter link failure while Witness Traffic remains up and running (i.e. router/firewall are still accessible to reach the Witness host).

To trigger the failure scenario, we can either disable the network link between both data centers physically or use the DVS traffic filter function in the POC. In this scenario, each link needs to remain active and the static route(s) need to be intact.

 Note:  This IP filter functionality only exists in the DVS. The use of IP filters is only practical with a few hosts, as separate rules need to be created between each source and destination host.


 

Create a filter rule using "ADD" for each host per site. In our 2+2+1 setup, four filter rules are created as follows to simulate an IP flow disconnect between the preferred and secondary sites:

Note: Verify the settings as highlighted above, especially that the protocol is set to "any", to ensure no traffic of any kind flows between hosts .21/.22 and .23/.24.

 

Enable the newly created DVS filters:

 

Cluster Behavior on Failure

The events observed are similar to earlier scenarios; the IP filter policies we created result in a cluster partition, which is expected. HA will restart all VMs from the secondary site on the preferred site.


 

VMs are restarted by HA on preferred site:

 

Conclusion

In the failure scenario, if the data link between data centers is disrupted, HA will start the VMs on the preferred site. There is still a quorum of components available per object, available from the data sites. However, as explained previously, since there is only a single witness host/site, and only three fault domains, there is no rebuilding/resyncing of objects.

Repair the Failure

Disable the DVS filter rules and rerun the health check tests. Verify that all tests are passing.  Remember to test one thing at a time.

vSAN Space Efficiency Features

Compression

Beginning with vSphere 7U1, vSAN now supports “Compression Only”. Compression is applied directly at the cluster-level and implemented per disk group. The compression algorithm will take a 4K block and try to compress it to 2KB or less. If this is successful, the compressed block is then written to the capacity tier. If the compression algorithm cannot compress the block by this amount, the full 4KB will be written to the capacity tier.

Deduplication and Compression

In addition to just compressing the data, further savings may be achieved with deduplication. When data is destaged from the cache to capacity tier, vSAN will check to see if a match for that block exists. If true, vSAN does not write an additional copy of the block, and metadata is updated. However, if the block does not exist, vSAN will attempt to compress the block.

To demonstrate the effects of Deduplication and Compression, this exercise shows the capacity after deploying a set of identical virtual machines. Before starting this exercise, ensure that the Deduplication and Compression service is enabled on the cluster. Note that when enabling the Deduplication and Compression service, vSAN will go through a rolling update process: vSAN will evacuate data from each disk group in turn and the disk group will be reconfigured with the features enabled. Depending on the number of disk groups on each host and the amount of data, this can be a lengthy process. 

To enable Deduplication and Compression complete the following steps:

Navigate to [vSAN Cluster] > Configure > vSAN > Services, then select the top EDIT button that corresponds to the Deduplication and compression service:

 

Toggle Compression only or Deduplication and Compression and select APPLY:  

 

Once the process of enabling Compression Only or Deduplication and Compression is complete, we can upload HCIbench, which can then be used to create 32 identical VMs. Before deploying the VMs, we can see the capacity used by navigating to [vSAN Cluster] > Monitor > Capacity:


 

We can see here that we have around 2TB of savings with a ratio of 1.9x.
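The same capacity and efficiency figures can also be captured with PowerCLI between test runs; below is a minimal sketch, assuming the cluster is named VSAN-Cluster:

Get-VsanSpaceUsage -Cluster (Get-Cluster -Name "VSAN-Cluster") | Format-List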

For the next part of the exercise, we will use HCIbench to create the 32 VMs, each with 2x100GB VMDKs. For the first test, we will zero the disks on the VMs and observe the results:


 

As expected, because the disks are zeroed, the data is highly compressible. Our compression ratio is now an impressive 4.36x. 

Next, we create an FIO profile with 100% writes over the whole VMDK, with 50% random data:
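For reference, a minimal fio job file expressing this kind of workload might look like the sketch below. The option names are standard fio parameters, but the device path and values are placeholders, and buffer_compress_percentage=50 is used here only as an approximation of "50% random data"; it is not the exact profile HCIBench generates.

[global]
ioengine=libaio
direct=1
rw=write
bs=64k
size=100%
# Placeholder device presented to the test VM; adjust to your environment
filename=/dev/sdb
# Roughly half of each write buffer is compressible, approximating 50% random data
buffer_compress_percentage=50

[fill-vmdk]
numjobs=1
iodepth=8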


 

Again, as expected, as we now have 50% random data on the disks, the deduplication and compression savings reduce to 3.2x:

 

In our final test, we use 100% random data. Additionally, we set the object space reservation to 100%:


 

We create a new FIO profile, as above, but now with 100% random data:


 

As expected, the savings are now much reduced, down to 1.12x.

RAID-5/RAID-6 Erasure Coding

With erasure coding, instead of the 2x (FTT=1) or 3x (FTT=2) capacity overhead of traditional RAID-1 mirroring, RAID-5 requires only 33% additional capacity for FTT=1, and RAID-6 requires only 50% additional capacity for FTT=2.
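For example, to store 100 GB of data with FTT=1, RAID-1 mirroring consumes roughly 200 GB of raw capacity, while RAID-5 (3+1) consumes roughly 133 GB. With FTT=2, RAID-1 consumes roughly 300 GB, while RAID-6 (4+2) consumes roughly 150 GB.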

To support RAID-5 and RAID-6, the following host requirements must be met:

  • RAID-5: minimum of four hosts, with a 3+1 configuration (3 data blocks and 1 parity block per stripe). 
  • RAID-6: minimum of six hosts, with a 4+2 configuration (4 data blocks and 2 parity blocks per stripe). 

RAID-5/6 Erasure Coding levels are made available via Storage Policy-Based Management (SPBM). In the following exercise, we will enable RAID-5 by creating a new storage policy and applying that storage policy to a virtual machine.
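For reference, a comparable policy can also be created and assigned with PowerCLI. The following is only a sketch: it assumes an existing Connect-VIServer session, and the policy name, VM name, and exact capability value strings should be verified against your environment (for example with Get-SpbmCapability):

# Define the vSAN rules: 1 failure to tolerate, using RAID-5/6 erasure coding
$fttRule  = New-SpbmRule -Capability (Get-SpbmCapability -Name "VSAN.hostFailuresToTolerate") -Value 1
$raidRule = New-SpbmRule -Capability (Get-SpbmCapability -Name "VSAN.replicaPreference") -Value "RAID-5/6 (Erasure Coding) - Capacity"

# Create the storage policy from a single rule set
$ruleSet = New-SpbmRuleSet -AllOfRules $fttRule, $raidRule
$policy  = New-SpbmStoragePolicy -Name "vSAN-RAID5-FTT1" -AnyOfRuleSets $ruleSet

# Apply the policy to Hard disk 1 of an example VM (VM and disk names are placeholders)
Get-VM "Test-VM" | Get-HardDisk -Name "Hard disk 1" |
    Get-SpbmEntityConfiguration | Set-SpbmEntityConfiguration -StoragePolicy $policy

The equivalent steps in the vSphere Client are shown below.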

Open the VM Storage Policies window in vCenter:  Menu > Policies and Profiles > VM Storage Policies:


 

Next, navigate to 'VM Storage Policies' and 'Create VM Storage Policy':


 

Select the appropriate vCenter Server and enter a name for the policy:


 

Ensure 'Enable rules for "vSAN" storage' is checked:


 

Select 'standard cluster' from the 'Site disaster tolerance' drop-down, and '1 failure - RAID-5 (Erasure Coding)' from the 'Failures to tolerate' drop-down:


 

We see that the vSAN datastore is compatible with this policy:


 

Finally, review and click 'Finish':

 

Next, we apply this storage policy to an existing VM. Navigate to the VM, select 'Configure' and 'Edit VM storage policies':


 

Here, we can either apply the policy to all of the VM disks at once, or per-disk. We select 'Configure per disk' and update Hard disk 1 to use the newly created RAID-5 policy:


 

After this has been set, vSAN will move the data components as per the policy. Once this has been completed, the VM's disks will show as compliant to the policy:


 

Navigating to Monitor > Physical disk placement shows that the components for Hard disk 1 are now spread over four hosts (compared to three hosts for Hard disk 2):

 

Trim/Unmap

vSAN supports space reclamation on VMDKs using trim/unmap commands issued from guest VMs. To enable this feature, unmap must first be enabled cluster-wide via RVC or PowerCLI. Once unmap is enabled on the cluster, guest VMs can issue commands (such as fstrim) to free space occupied by previously deleted data.

To demonstrate this, we first observe how much space is currently in use by navigating to [vSAN Cluster] > [Monitor] > [Capacity]. Note that around 931GB of space is currently in use, with around 493GB of VM data.

 

Next, we create or copy a large file on our guest VM. In this case a Windows 2016 VM is used and a large (~76GB) file has been created: 


 

As expected, our space utilization increases by around 76GB x 2 (roughly 152GB), as this is a RAID-1 object. Thus, 152GB + 493GB gives us around 644GB of VM data, as we see below: 


 

We now delete the file:


 

Looking back at the capacity view, we can see that the space consumed is still the same:

 

To enable unmap on the cluster, we can either use RVC or PowerCLI.

PowerCLI method:

After logging into vCenter, run the following command, replacing <cluster> with the name of the cluster:

$ Get-Cluster -name <cluster> | Set-VsanClusterConfiguration -GuestTrimUnmap $true

 

RVC method: 

First, login to vCenter via RVC:

# rvc administrator@vsphere.local@localhost

We then navigate to the path <datacenter>/computers:

/localhost> cd /localhost/VSAN-DC/computers

Unmap support can then be enabled using the 'vsan.unmap_support' command:

/localhost/VSAN-DC/computers> vsan.unmap_support VSAN-Cluster -e

2020-04-07 19:13:07 +0000: Enabling unmap support on cluster

VSAN-Cluster: success

 

Now, to reclaim the space, we execute a trim command from our guest VM. (Note that the guest operating system will normally run trim reclamation automatically; for the purposes of this demonstration, that feature has been turned off.)
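As a sketch, the following commands can be used to trigger the reclaim manually; the drive letter and mount point are placeholders.

From a Windows guest (PowerShell):

fsutil behavior query DisableDeleteNotify
Optimize-Volume -DriveLetter E -ReTrim -Verbose

From a Linux guest:

fstrim -v /mnt/data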


 

Looking back at vCenter, we see that the used space has been freed:

 

vSAN Encryption

vSAN encryption overview, requirements, enabling encryption and expected behavior

vSAN Data-at-Rest Encryption Requirements

Encryption for data at rest became available in vSAN 6.6. This feature does not require self-encrypting drives (SEDs)* and uses the AES-256 cipher. Encryption is supported on both all-flash and hybrid configurations of vSAN.

*Self-encrypting drives (SEDs) are not required. Some drives on the vSAN Compatibility Guide may have SED capabilities, but the use of those SED capabilities are not supported.

vSAN datastore encryption is enabled and configured at the datastore level. In other words, every object on the vSAN datastore is encrypted when this feature is enabled. Data is encrypted when it is written to persistent media in both the cache and capacity tiers of a vSAN datastore. Encryption occurs just above the device driver layer of the storage stack, which means it is compatible with all vSAN features such as deduplication and compression, RAID-5/6 erasure coding, and stretched cluster configurations among others. All vSphere features including VMware vSphere vMotion®, VMware vSphere Distributed Resource Scheduler™ (DRS), VMware vSphere High Availability (HA), and VMware vSphere Replication™ are supported. A Key Management Server (KMS) is required to enable and use vSAN encryption. Nearly all KMIP-compliant KMS vendors are compatible, with specific testing completed for vendors such as HyTrust®, Gemalto® (previously Safenet), Thales e-Security®, CloudLink®, and Vormetric®. These solutions are commonly deployed in clusters of hardware appliances or virtual appliances for redundancy and high availability.

Requirements for vSAN Encryption:

  • Deploy KMS cluster/server of your choice
  • Add/trust KMS server to vCenter UI
  • vSAN encryption requires on-disk format version 5
    • If the current on-disk format is below version 5, a rolling on-disk upgrade will need to be completed prior to enabling encryption
  • When vSAN encryption is enabled, all disks are reformatted
    • This is achieved in a rolling manner

Adding KMS to vCenter

KMS for vSAN 

Given the multitude of KMS vendors, the setup and configuration of a KMS server/cluster is not part of this document; however, it is a pre-requisite prior to enabling vSAN encryption.

The initial configuration of the KMS server is done in the VMware vCenter Server® user interface of the vSphere Client. The KMS cluster is added to the vCenter Server and a trust relationship is established. The process for doing this is vendor-specific, so please consult your KMS vendor documentation prior to adding the KMS cluster to vCenter.

To add the KMS cluster to vCenter in the vSphere Client, click on the vCenter server, click on Configure > Key Management Servers > ADD. Enter the information for your specific KMS cluster/server.

 Figure 1. Add KMS cluster to vCenter 
 

Once the KMS cluster/server has been added, you will need to establish trust with the KMS server. Follow the instructions from your KMS vendor as they differ from vendor to vendor.

 Figure 2. Establish trust with KMS 
 

After the KMS has been properly configured, you will see that the connection status and the certificate have green checks, meaning we are ready to move forward with enabling vSAN encryption.

 Figure 3. Successful connection and certificate status.
 

Native Key Provider

vSAN 7.0 Update 2 adds the capability to use a Native Key Provider. You can think of this as an internal KMS server.

Adding the Native Key Provider is achieved in a similar manner as adding an external KMS.

From the vSphere Client, select the vCenter Server object, then Configure > Key Providers > ADD > Add Native Key Provider.


When naming the Key Provider, you are presented with the recommended option to use the key provider only with TPM-protected ESXi hosts.


Once the Native Key Provider has been added to vCenter, you have the option to back up the Key Provider, which you should do in order to keep a copy of the keys.


 

When enabling vSAN Encryption, the added Native Key Provider will be available to select. You can add both external KMS servers and a Native Key Provider to vCenter and use them in different clusters.

For a POC, utilizing the Native Key Provider is a quick and easy way to test the vSAN Encryption service. If you require redundancy across KMS servers, possibly located in different sites, consider utilizing an external KMS in a cluster configuration.


 

Enabling vSAN Encryption

Prior to enabling vSAN encryption, all the following pre-requisites must be met:

  • Deploy KMS cluster/server of your choice
  • Add/trust KMS server to vCenter UI
  • vSAN encryption requires on-disk format version 5
  • When vSAN encryption is enabled all disks are reformatted

To enable vSAN encryption, click on [vSAN cluster] > Configure > vSAN > Services, and click EDIT next to the "Encryption" service. Here we have the option to erase all disks before use. This will increase the time it will take to do the rolling format of the devices, but it will provide better protection.

 Note:  vSAN encryption does work with Deduplication and Compression.

  Figure 1. Enabling vSAN encryption  

 

After you click APPLY, vSAN will remove one disk group at a time, format each device, and recreate the disk group once the format has completed. It will then move on to the next disk group until all disk groups are recreated, and all devices formatted and encrypted. During this period, data will be evacuated from the disk groups, so you will see components resyncing.
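The resync activity can be monitored from the vSphere Client under [vSAN Cluster] > Monitor > vSAN > Resyncing Objects, or from RVC using the resync dashboard, for example:

/localhost/VSAN-DC/computers> vsan.resync_dashboard VSAN-Cluster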

 

  Figure 2. Disk group removal, disk format, disk group creation  

 

 Note:  This process can take quite some time depending on the amount of data that needs to be migrated during the rolling reformat.  If you know encryption at rest is a requirement, go ahead and enable encryption while enabling vSAN.

Disabling vSAN Encryption

Disabling vSAN encryption follows a similar procedure as its enablement. Since the encryption is done at the disk group level, a disk reformat will also be conducted while disabling encryption.

Keep in mind that vSAN will conduct a rolling reformat of the devices: each disk group is evacuated, deleted, and re-created without encryption, at which point it is ready to host data again. The same process is conducted on all remaining disk groups until the vSAN datastore is no longer encrypted.

Since the disk groups are evacuated, all data will be moved between disk groups, so the process may take a considerable amount of time depending on the amount of data present on the vSAN datastore.

vSAN Encryption Rekey

You have the capability of generating new keys for encryption. There are two modes for rekeying. The first is a shallow rekey, where the data encryption key is wrapped by a new key encryption key. The second is a complete re-encryption of all data, selected with the option "Also re-encrypt all data on the storage using the new keys". This second rekey (deep rekey) may take significant time to complete, as all the data has to be re-written, and may decrease performance.

 Note:  It is not possible to specify a different KMS server when selecting to generate new keys during a deep rekey; however, this option is available during a shallow rekey.

 

vSAN Data-in-Transit Encryption

Beginning in vSAN 7U1, vSAN supports Data-in-Transit encryption. Data-in-Transit encryption can be enabled independently or together with Data-at-Rest encryption to fully protect vSAN data. Data-in-Transit encryption uses the FIPS 140-2 validated VMware VMkernel Cryptographic Module. Both data and metadata are encrypted. Unlike Data-at-Rest encryption, Data-in-Transit encryption does not require an external KMS; keys are managed internally. If you are using HCI Mesh mounts or exports, you cannot use Data-in-Transit encryption.

New in vSAN 7.0 Update 1, traffic between data nodes and a shared witness can be encrypted by leveraging Data-in-Transit encryption, simply by turning this service on.

When a shared witness node is part of multiple vSAN clusters, the Data-in-Transit encryption feature can be enabled only when all nodes in all clusters using that shared witness have been upgraded to version 7.0 U2 or later.

 

Enabling Data-in-Transit Encryption

Enabling Data-in-Transit encryption is easy to do.  Select your cluster, then click Configure, and then under vSAN select Services, then click Edit.  Tick Data-in-Transit encryption and select your Rekey Interval.  The default for rekey interval is one day.  Click Apply.
 

Native KMS Integration In vSphere

Native KMS integration was introduced in vSphere 7.0 U2; no external key management server is required. For the recommended configuration, ESXi hosts should support TPM 2.0 with the function enabled in the BIOS.

Choose “Add Native Key Provider”



The integrated key provider reduces the dependency on external third-party key management servers. One of the most important steps is creating a backup of all keys and storing it somewhere safe, external to the environment.

 


Set a secure password and store it in a safe place, for example a physical vault.


Note: Multiple Native Key Providers can be added in order to secure each vSAN cluster separately.

For further information and advanced setups, please visit the vSphere documentation (link) and video blog (link).

vSAN File Service

Native file services integrated within vSAN simplifies storage management, as it helps reduce the dependency on external solutions.

Enabling file services in vSAN is similar to enabling other cluster-level features such as iSCSI services, deduplication and compression, and encryption.  File shares can be presented to both virtual machines as well as containers. The entire life cycle of provisioning and managing file services can be seamlessly performed through the vCenter UI.

 

Use Cases

The addition of vSAN File Service enables customers to use vSAN capacity for NFS and SMB workloads without the need to install or manage a dedicated file service appliance. Customers can enable vSAN FS and use spare vSAN capacity to host file share workloads. Initial use cases will be for simple files like logs and application binaries.

Cloud Native Use Cases

File services in its first instance is designed to support Cloud-Native workloads. Cloud-Native workloads built on micro-services architecture require data access to be concurrent. Multiple micro-services read and update the same data repository at the same time from different nodes. Updates should be serialized, with no blocking, locking, or exclusivity. This approach differs from the current offering for Cloud-Native storage on vSAN. In the current model, vSAN backed VMDKs are presented to VMs and mounted into a single container.

Hence, file storage is a critical need for cloud-native applications, and most of these applications are based on the latest NFS v4.1 protocol. Workloads such as web and application servers like Apache, Nginx, and Tomcat require shared file access in order to support a distributed application. Rather than replicating this data to every instance, a single NFS share can be mounted into all containers running these workloads. NFS v3/v4.1 are supported.

vSAN 7U1 now supports SMB v2.1/v3, Kerberos for NFS, and Active Directory for SMB.  Also, vSAN File Services now supports 32 hosts per cluster.

The addition of SMB services and Active Directory integration allows customers to explore the following use cases.

  • File share servicing users and groups in remote locations (e.g. Remote/branch offices etc.) where dedicated Windows servers are not available.
  • VDI instances redirecting user/home directory to a file share location.

In this section we will focus on enabling vSAN File Service, creating and mounting shares, viewing file share properties, and failure scenarios.

Pre-Requisites

Before enabling file services, you will need the following information:

  • A unique IP address for each host
  • DNS (forward and reverse lookup)
  • Subnet Mask
  • Gateway
     

In addition, you will need the following information for the cluster:

  • File Services Domain – A unique namespace for the cluster that will be used across shares
  • DNS Servers – Multiple DNS entries can be added for redundancy
  • DNS Suffix
  • Active Directory Domain information
  • Active Directory UserID and Password 

Enabling File Services

To enable file service, begin by selecting Configure at the cluster level.
Select Services within the vSAN section. In the list of services, we see that File Service is currently disabled. Begin by clicking Enable.


 

The Introduction screen introduces file services and provides a checklist of requirements for enabling file services. After verifying all the required information is available click NEXT.

 

File Service is implemented as a set of File Server Agents managed by the vSphere ESX Agent Manager. Each agent is a lightweight virtual appliance running Photon OS and Docker. Agents behave as NFS servers and provide access to file shares.

From the File services agent screen, we have the option to automatically download the agent. If an internet connection is unavailable a manual approach is also provided. For this example, the automatic approach is used.

Select the Automatic approach button, make sure the Trust the certificate box is checked and click NEXT.

 

The next step is to provide domain information. In this example, we define a domain name for file service as vsan-fs. Ensure appropriate inputs are provided for the file service name, DNS server, and the DNS suffix.

In the Directory service section, select Active Directory. Active Directory is used for SMB shares and for NFS 4.1 shares with Kerberos authentication. Once the directory service has been selected, enter the Active Directory domain, along with the AD username and password that are required for SMB services. Once all boxes have been completed, click NEXT.


 

From the network screen, select the VM Network on which the file service agents will be deployed. File shares are accessed from this network as well. After selecting the network enter subnet mask and gateway information. Once the correct information is populated click NEXT.

 

On the IP Pool screen fill out the IP address pool to be used by the file server agents. As a best practice add an IP address for each host in the cluster. In this example, four IP addresses are used. Each IP address should be associated with a DNS name. The wizard has an option to auto-fill IP addresses if a range of contiguous addresses is used. In addition to entering IP addresses, a primary IP address is selected. The primary IP address is used for accessing NFS v4.1 shares. On the back end, the connection will be redirected to a different agent using NFSv4 referrals. Once the IP addresses are entered there is an option to Lookup DNS names to populate and validate DNS entries for the associated IP addresses.
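Before continuing, it is worth confirming that both forward and reverse DNS records resolve correctly for each file server agent address; for example (the hostname below is a placeholder):

nslookup vsan-fs01.vsanpoc.local
nslookup 10.159.23.90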


 

From the review screen, review the information for correctness and click FINISH to start the deployment.
 


 

During the deployment phase, several tasks will be displayed in the recent tasks pane. Once the tasks are completed, we can see the newly created File Service agents by expanding ESX Agents. In this example, one agent is deployed for each host in the vSAN cluster.

Creating an NFS File Share

Once file services are enabled, a new menu called File Service Shares appears under Configure > vSAN.

From the File Service Shares screen click ADD to start the File Share wizard. In the wizard, we provide a name for the file share. In this example, we are creating a file share called app-share.  Beginning in vSAN 7U1 we have the option to use NFS or SMB (After configuring Active Directory Authentication). 

The storage policy is also applied to the file share. Any vSAN storage policy can be chosen to allow the association of availability and performance characteristics to be applied to file shares. Storage space quotas can also be applied to file shares. 

In this example, 90 GB is set as a warning threshold and 100 GB is set as a hard quota for the share. Once the general information is populated click NEXT.

                           

 

The next step specifies which networks can access the share. The share can be made wide open (accessible from any network), or access can be restricted to a range of networks. Read Only or Read/Write permissions can also be set for the file share. The root squash checkbox applies a technique used in NFS which 'squashes' the permissions of any root user who mounts and uses the file share. In this example, no network restrictions are applied to the share. Click NEXT once the appropriate access control selections are made.

                          

 

Once the file share has been created, it will be displayed in the File Service Shares list as shown below.

After the share has been created, changes to quotas, labels, or network permissions can be made by selecting the file share and clicking EDIT.

Creating an SMB File Share

Creating an SMB file share is similar to creating an NFS file share. From the File Service Shares screen click ADD to start the File Share wizard. In the wizard, we provide a name for the file share. In this example, we are creating an SMB share called group-share. When creating an SMB share, we have the option to enable SMB protocol encryption by selecting Protocol encryption.

As with NFS shares, quotas and warning thresholds can be set. In this example, we are setting a hard quota of 100GB.

Once the required information has been filled in click Next.


 

Once the file shares have been created, they are visible from the file share list. Information such as file share name, protocol and quota/actual usage is available from the list. Existing shares can be edited by selecting the share and choosing edit.


 

Mounting an NFS File Share

Once the file share is created, we can mount the file share. For this example, we are using a Photon VM. To mount the share, we can copy the connection string from the File Share Services UI by selecting the file share and selecting COPY URL. In this case, we are selecting NFS v4.1.

                                          
 

From the Photon VM use the copied URL to connect to the share. 

mount 10.159.23.90:/app-share /mnt/app-share 

After the share is mounted, the mount | grep app-share command can be used to display details of the mounted share. In this case, we can see the share is connected using NFS v4.1. 

type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.159.24.19,local_lock=none,addr=10.159.23.90)

Mounting an SMB File Share

To mount the SMB file share, choose the file share from the list and select COPY URL. Once the URL is copied, it can be pasted into Windows Explorer to access the share.
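Alternatively, the share can be mapped to a drive letter from a command prompt on the Windows client; the server FQDN, domain, and username below are placeholders:

net use Z: \\vsan-fs.vsanpoc.local\group-share /user:VSANPOC\fileuser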

Quotas and Health Events

During file share creation, quotas and hard limits can be set. For this sample file share, a warning threshold of 90GB and a hard limit of 100GB were specified. As part of this test, copy some data to the file share to fill enough space to trigger the quota. Once the warning threshold is exceeded, the 'Usage over Quota' field in the UI will turn red.

                                

 

Once the hard quota is reached writes to the share will fail with a disk quota exceeded error as shown below. 

cp: error writing 'file13.txt': Disk quota exceeded 

If the quota is reached an alarm in Skyline Health is also triggered. The details of the alarm can be viewed by expanding the File Services section.

                                
 

Once data has been removed from the share, the health alert is cleared and the File Service health reports as normal.

Failure Scenarios

Storage policies apply to file service objects just as they do other virtual disk objects. Health and placement details of file shares are shown in the Virtual Objects view.

                         
 

By clicking VIEW PLACEMENT DETAILS the layout of the underlying vSAN object can be viewed. This view shows component status, and on which hosts components of the share reside.

 

To test host failure, we will power off one of the hosts containing an active copy of the file share data. Once the host is powered off, we see that the component on the corresponding host is displayed as absent.
 

Now that the host has been shut down, you can use a file browser or application logs on any of the client virtual machines to verify that the file share is still accessible.

 

 

File Services Snapshots

 

vSAN 7 Update 2 includes a new snapshotting mechanism allowing for point-in-time recovery of files. Snapshots for file shares can be created through the UI. Recovery of files is available through the API, allowing backup partners to extend current backup platforms to protect vSAN file shares.

 

File Services Support for Stretched Clusters and 2-Node Topologies

 

File services can now be used in vSAN stretched clusters and 2-node topologies. The site affinity setting for file shares defines where the presentation layer resides. Site affinity for file shares is defined by the storage policy associated with the individual file shares. The storage policy and site affinity settings to be applied to the file share are defined as part of the creation process.

The image below is an example of the site affinity setting available when creating a file share in a stretched cluster.

 


 

Cloud-Native Storage

Overview

Cloud-Native Storage (CNS), introduced in vSphere 6.7 Update 3, offers a platform for stateful cloud-native applications to persist state on vSphere backed storage. The platform allows users to deploy and manage containerized applications using cloud-native constructs such as Kubernetes persistent volume claims and maps these to native vSphere constructs such as storage policies. CNS integrates with vSphere workflows and offers the ability for administrators to perform tasks such as defining storage policies that could be mapped to storage classes, list/search and monitor health and capacity for persistent volumes (PV). vSphere 6.7 Update 3 supports PVs backed by block volumes on vSAN as well as VMFS and NFS datastores. However, some of the monitoring and policy-based management support may be limited to vSAN deployments only. With the release of vSAN 7.0, PVs can also be backed by the new vSAN File Service (NFS). 

 

Cloud Native Storage Prerequisites

While Cloud Native Storage is a vSphere built-in feature that is enabled out-of-the-box starting in vSphere 6.7 Update 3, the Container Storage Interface (CSI) driver must be installed in Kubernetes to take advantage of the CNS feature. CSI is a standardized API for container orchestrators to manage storage plugins. CNS support requires Kubernetes v1.14 or later. The configuration procedure for the CSI driver in Kubernetes is beyond the scope of this guide. To learn more about the installation of the CSI driver, refer to the vSphere CSI driver documentation.
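A couple of quick checks from the Kubernetes side can confirm these prerequisites, for example:

kubectl version --short
kubectl get csidriver

The first command confirms the cluster is running v1.14 or later; the second lists the registered CSI drivers, where the vSphere CSI driver should appear once installed.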

Deploy a Persistent Volume for Container (Block Type)

Assuming your Kubernetes cluster has been deployed in the vSAN cluster and the CSI driver has been installed, you are ready to check out the Cloud Native Storage functionality.

First, we need to create a yaml file like below to define a storage class in Kubernetes. Notice the storage class name is “cns-test-sc” and it is associated with the vSAN Default Storage Policy.  Note also that the “provisioner” attribute specifies that the storage objects are to be created using the CSI driver for vSphere block service.

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: cns-test-sc
provisioner: block.vsphere.csi.vmware.com
parameters:
  StoragePolicyName: "vSAN Default Storage Policy"

 

Apply the storage class by executing the following command on your Kubernetes master node:

kubectl apply -f cns-test-sc.yaml

Run the command below to confirm the storage class has been created.

kubectl get storageclass cns-test-sc

NAME          PROVISIONER                    AGE

cns-test-sc   block.vsphere.csi.vmware.com   20s

 

Next, we need to create another yaml file like the one below to define a Persistent Volume Claim (PVC). For illustration purposes in this POC, we simply create a persistent volume without attaching it to an application. Notice that "storageClassName" references the storage class that was created earlier. There are two labels assigned to this PVC: app and release. We will see later how they are propagated to CNS in the vSphere Client UI.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cns-test-pvc
  labels:
    app: cns-test
    release: cns-test

spec:
  storageClassName: cns-test-sc
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi

 

Create the PVC by executing the following command on your Kubernetes master node:

kubectl apply -f cns-test-pvc.yaml

Run the command below to confirm the PVC has been created and its status is listed as “Bound”.

kubectl get pvc cns-test-pvc

NAME           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE

cns-test-pvc   Bound    pvc-ebc2e95c-b98f-11e9-808d-005056bd960e   2Gi        RWO            cns-test-sc    15s

 

To examine what container volume has been created in the vSphere Client UI, navigate to [vSAN cluster] > Monitor > Cloud Native Storage > Container Volumes. You should see a container volume with a name that matches the output from the "get pvc" command. The two labels (app and release) correspond to those specified in the PVC yaml file. These labels allow the Kubernetes admin and the vSphere admin to refer to a common set of volume attributes, which makes troubleshooting easier. If there are many container volumes in the cluster, you can select the filter icon for the "Label" column and search the container volumes by multiple label names to quickly narrow down the list of volumes that need to be investigated.
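The same labels can be used on the Kubernetes side to filter volumes, which makes it easy to correlate the two views, for example:

kubectl get pvc -l app=cns-test,release=cns-test --show-labels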

Deploy a Persistent Volume for Container (File Type)

Assuming your Kubernetes cluster has been deployed in the vSAN cluster and the CSI driver (version 2.0 or above is required to support ReadWriteMany access mode) has been installed, and vSAN file service is enabled, you are ready to check out the Cloud Native Storage functionality for File.

First, we need to create a yaml file like below to define a storage class in Kubernetes. Notice the storage class name is “vsan-file” and it is associated with the vSAN Default Storage Policy.  Note also that the “fstype” attribute specifies that the storage objects are to be created using the CSI driver for vSAN file service.

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: vsan-file
provisioner: csi.vsphere.vmware.com
parameters:
  storagePolicyName: "vSAN Default Storage Policy"
  csi.storage.k8s.io/fstype: nfs4

 

Apply the storage class by executing the following command on your Kubernetes master node:

kubectl apply -f file-sc.yaml

Run the command below to confirm the storage class has been created.

kubectl get storageclass vsan-file

NAME        PROVISIONER              AGE

vsan-file   csi.vsphere.vmware.com   20s

 

Next, we need to create another yaml file like the one below to define a Persistent Volume Claim (PVC). For illustration purposes in this POC, we simply create a persistent volume without attaching it to an application. Notice that "storageClassName" references the storage class that was created earlier and that the access mode is ReadWriteMany.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: file-pvc

spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: vsan-file

 

Create the PVC by executing the following command on your Kubernetes master node:

kubectl apply -f file-pvc.yaml

Run the command below to confirm the PVC has been created, its status is listed as "Bound", and its access mode is set to "RWX", which allows multiple pods to read from and write to the PV simultaneously.

kubectl get pvc file-pvc

 

To examine what container volume has been created in vSphere Client UI, navigate to [vSAN cluster] > Monitor > Cloud Native Storage > Container Volumes. You should see a container volume with a name that matches the output from the “get pvc” command. 

 

To drill down into a particular volume under investigation, you can select the details icon and obtain more information such as the Kubernetes cluster name, PVC name, namespace, and other storage properties from the vSphere perspective.

 

 

If you want to investigate the overall usage of all container volumes in the cluster, navigate to [vSAN cluster] > Monitor > vSAN > Capacity. The Capacity view breaks down the storage usage at a granular level of container volumes that are either attached or not attached to a VM.

 

This concludes the Cloud Native Storage section of the POC. You may delete the PVC “file-pvc” from Kubernetes and verify that its container volume is removed from the vSphere Client UI.
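For example, from the Kubernetes master node:

kubectl delete pvc file-pvc

After a short delay, the corresponding container volume should disappear from the Container Volumes view in the vSphere Client.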

 

 

 

 
