vSAN Operations Guide
Introduction
VMware vSAN provides enterprise-class storage that is robust, flexible, powerful, and easy to use. vSAN aggregates locally attached storage devices to create a storage solution that can run at the edge, the core, or the cloud—all easily managed by vCenter Server. vSAN is integrated directly into the hypervisor, providing capabilities and integration that traditional three-tier architectures cannot match.
While vSAN-powered clusters share many similarities with vSphere clusters in a three-tier architecture, the unique abilities and architecture of vSAN mean that some operational practices and recommendations may differ from those of traditional environments.
This document provides concise, practical guidance for the day-to-day operations of vSAN-powered clusters. It augments the step-by-step instructions found in VMware Docs, KB articles, and other detailed guidance found on core.vmware.com. This operations guide is not intended to be "how-to" documentation. It offers general guidance and recommendations applicable to a large majority of environments. Requirements unique to a specific environment may dictate slightly different operational practices, which is why there is no single "best practice." New topics may be added periodically. Please check to ensure the latest copy is used.
The guidance provided in this document reflects recommendations in accordance with the latest version of vSAN at the time of this writing: vSAN 8 Update 1 (U1). Some of the guidance will apply exclusively to the Original Storage Architecture (OSA) of vSAN, exclusively to the Express Storage Architecture (ESA) introduced in vSAN 8 and enhanced in vSAN 8 U1, or both. Efforts have been made to clarify which architecture(s) the guidance applies to. New features in vSAN will often impact operational recommendations. When guidance differs based on recent changes introduced to vSAN, it will be noted. The guidance will not retain an ongoing history of practices for previous versions of vSAN.
Section 1: Cluster
Create a vSAN Cluster
Since vSAN is a cluster-based solution, creating a cluster is the first logical step in the deployment of the solution. Unlike traditional three-tier architectures, vSAN storage is treated as a resource of the cluster, which offers unique capabilities in cluster design. More information on these concepts can be found at: vSAN Cluster Design – Large Clusters Versus Small Clusters. The guidance generally applies to both the Original Storage Architecture (OSA) and Express Storage Architecture (ESA) in vSAN, with perhaps some slight changes in the user interface.
Create a vSphere cluster
The first step of creating a vSAN cluster is creating a vSphere cluster.
- Right-click a data center and select New Cluster.
- Type a name for the cluster in the Name box.
- Configure VMware Distributed Resource Scheduler (DRS), vSphere High Availability (HA), and vSAN for the cluster, and click OK.
FIGURE 1-1: Available configuration options when creating a new vSAN cluster
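For administrators who prefer to script this step, the same result can be achieved with PowerCLI. The following is a minimal sketch only; the vCenter address, datacenter name, and cluster name are placeholder values, and an existing PowerCLI installation is assumed.

# Illustrative PowerCLI sketch: create a cluster with DRS, HA, and vSAN enabled at creation time.
# The vCenter address, datacenter name, and cluster name below are placeholders.
Connect-VIServer -Server "vcenter.example.com"
$dc = Get-Datacenter -Name "Datacenter-01"
New-Cluster -Location $dc -Name "vSAN-Cluster-01" `
    -DrsEnabled -DrsAutomationLevel FullyAutomated `
    -HAEnabled `
    -VsanEnabled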
Note that all vSAN clusters participating in HCI Mesh (a feature introduced in vSAN 7 U1 and enhanced in vSAN 7 U2) must also configure the “All Paths Down (APD) Failure Response” found in “HA Cluster Settings” in vCenter Server, and ensure it is enabled and configured for proper APD handling. When enabled and configured correctly, isolation events related to the connectivity between the client and the server cluster, or within the client cluster, will result in the VMs on the client cluster being terminated and restarted by HA. The APD failure response can be set to “Power off and restart VMs - Aggressive restart policy” or “Power off and restart VMs - Conservative restart policy.” This HA cluster setting is not required for vSAN clusters that do not participate in HCI Mesh. For more information on HCI Mesh, see the HCI Mesh Tech Note.
For enhanced levels of availability and resilience, vSAN 7 U2 provides support for "vSphere Proactive HA." This feature allows for detection (through an OEM plugin supplied by the vendor) of impending failures on a vSAN host and will proactively evacuate the VMs and the vSAN object data off of the host. It can not only improve application uptime, but also increase the amount of time an application remains fully compliant with the prescribed storage policy. For more information, see the post: Proactive HA, Health Checks and Alarms for a Healthy vSAN Datastore.
Adding hosts to a vSphere cluster
The second step is to add hosts to the newly created cluster. There are two methods available. The traditional method is to right-click on the cluster and select Add hosts. The new streamlined method is to use the Cluster Quickstart wizard. The Cluster Quickstart wizard can be found by clicking on the existing cluster and selecting: Configure → Configuration → Quickstart. Hosts can be added by using the Add hosts wizard.
- On the Add hosts page, enter information for new hosts, or click Existing hosts and select from hosts listed in the inventory.
- On the Host summary page, verify the host settings.
- On the Ready to complete page, click Finish.
FIGURE 1-2: Adding more than one host at a time using the Add hosts wizard
The selected hosts are placed into maintenance mode and added to the cluster. When you complete the Quickstart configuration, the hosts exit maintenance mode. Note that if you are running vCenter Server on a host in the cluster, you do not need to place the host into maintenance mode as you add it to a cluster using the Quickstart workflow. The host that contains the vCenter Server virtual machine (VM) must be running VMware ESXi 6.5 EP2 or later. The same host can also be running a Platform Services Controller. All other VMs on the host must be powered off.
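The host addition step can also be scripted. The sketch below is illustrative only; the host names and credentials are placeholders, and it assumes an existing connection to vCenter Server.

# Illustrative PowerCLI sketch: add several hosts to the newly created cluster.
$cluster = Get-Cluster -Name "vSAN-Cluster-01"
$hostNames = "esx01.example.com", "esx02.example.com", "esx03.example.com", "esx04.example.com"
foreach ($name in $hostNames) {
    # -Force suppresses the certificate prompt when the host presents a self-signed certificate.
    Add-VMHost -Name $name -Location $cluster -User "root" -Password "VMware1!" -Force
}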
Recommendation: Take advantage of vSAN’s flexibility. The initial sizing of a cluster does not need to be perfect. The value of vSAN is that you have the flexibility to scale up, scale out, or reconfigure as needed.
Verify vSAN health checks
Once the hosts are added to the cluster, the vSAN health checks verify that the hosts have the necessary drivers and firmware. Note that if time synchronization fails, the next step allows you to bulk configure Network Time Protocol (NTP) on the hosts.
FIGURE 1-3: vSAN health checks performed during the time of adding hosts to a cluster
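The same health tests can also be triggered from PowerCLI, which is useful when validating many clusters. This is a sketch under the assumption that the vSAN PowerCLI module is available; the cluster name is a placeholder, and the exact properties returned vary by PowerCLI release.

# Illustrative PowerCLI sketch: run the vSAN health tests for a cluster and review the summary.
$cluster = Get-Cluster -Name "vSAN-Cluster-01"
$health = Test-VsanClusterHealth -Cluster $cluster
# Inspect the returned summary object; its properties vary by PowerCLI release.
$health | Format-List *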
Cluster configuration
The third and final step to Quickstart is cluster configuration. On the Cluster configuration card, click Configure to open the Cluster configuration wizard.
- On the Configure the distributed switches page, enter networking settings, including distributed switches, port groups, and physical adapters. Network I/O Control is automatically enabled on all distributed switches that are created. Make sure to upgrade existing switches if using a brownfield vDS.
- In the port groups section, select a distributed switch to use for VMware vSphere vMotion and a distributed switch to use for the vSAN network.
- In the physical adapters section, select a distributed switch for each physical network adapter. You must assign each distributed switch to at least one physical adapter. This mapping of physical network interface cards (NICs) to the distributed switches is applied to all hosts in the cluster. If you are using an existing distributed switch, the physical adapter selection can match the mapping of the distributed switch.
- On the vMotion and vSAN traffic page, it is strongly encouraged to provide dedicated VLANs and broadcast domains for added security and isolation of these traffic classes.
- On the Advanced options page, enter information for cluster settings, including DRS, HA, vSAN, host options, and Enhanced vMotion Compatibility (EVC). Setup is the ideal time to configure data-at-rest or data-in-transit encryption, deduplication & compression, or compression only. Configuring these settings upfront avoids having to enable them later, which would require moving or disrupting data.
- Enable EVC for the most current generation of processors that the hosts in the cluster support. The EVC and CPU Compatibility FAQ contains more information on this topic.
- On the Claim disks page, select disks on each host for cache and capacity.
- (Optional) On the Create fault domains page, define fault domains for hosts that can fail together. For more information about fault domains, see “Managing Fault Domains in vSAN Clusters” in “Administering VMware vSAN.”
- On the Ready to complete page, verify the cluster settings, and click Finish.
Summary
Creating a vSAN cluster is not unlike creating a vSphere cluster in a three-tier architecture. Both use the “Cluster Quickstart” feature built into vCenter, which offers the ability to easily scale the cluster as needed.
Pre-Flight Check Prior to Introducing Cluster into Production
Introducing a new vSAN cluster into production is technically a very simple process. Features such as Cluster Quickstart and vSAN health checks help provide guidance to ensure proper configuration, while VM migrations to a production cluster can be transparent to the consumers of those VMs. Supplement the introduction of a new vSAN cluster into production with additional steps to ensure that, once the system is powering production workloads, you get the expected outcomes.
Preparation
Preparation helps reduce potential issues when VMs rely on the services provided by the cluster. It also helps establish a troubleshooting baseline. The following may be helpful in a cluster deployment workflow:
- Have the steps in the “vSAN Performance Evaluation Checklist” in the Proof of Concept (PoC) guide been followed? While this document focuses on adhering to recommended practices during the evaluation of the performance of vSAN during a PoC, it provides valuable guidance for any cluster entering into production.
- Is your cluster using the OSA or the ESA? The operational guidance may be quite different depending on the circumstances. For example, the Performance Recommendations for vSAN ESA may be very different than those for the OSA. For more information on the vSAN ESA, see our vSAN ESA dedicated landing page.
- Will VMs in this vSAN cluster require different storage policies than are used in other clusters? See “Using Storage Policies in Environments with More Than One vSAN Cluster” for more information. This is especially important for clusters using the OSA.
- What is the intention of this cluster? And what data services reflect those intentions? Are its VMs primarily focused on performance or space efficiency? Do VMs need to be encrypted on this cluster? Generally, cluster-wide data services are best enabled or disabled at the time the cluster is provisioned. This pertains primarily to the OSA.
- Has the host count of the cluster been sufficiently considered? Perhaps you planned on introducing a new 24-node cluster to the environment. You may want to evaluate whether a single cluster or multiple clusters are the correct fit. While this can be changed later, evaluating at initial deployment is most efficient. See “vSAN Cluster Design—Large Clusters Versus Small Clusters” on core.vmware.com.
Recommendation: Always run a synthetic test (HCIBench) as described in the vSAN Performance Evaluation Checklist prior to introducing the system into production. This can verify that the cluster behaves as expected and can be used for future comparisons should an issue arise, such as network card firmware hampering performance. See step 1 in the “Troubleshooting vSAN Performance” document for more information.
Summary
Design verification and vSAN cluster configuration can help reduce post-deployment issues or unnecessary changes. Follow the guidance found in the vSAN Performance Evaluation Checklist in the PoC guide for any cluster entering production. It contains the information necessary to deploy the cluster with confidence, and for potential troubleshooting needs.
Maintenance Work on L2/L3 Switching on Production Cluster
Redundant configuration
VMware recommends configuring redundant switches and either NIC teaming or failover so that the loss of one switch or path does not cause an outage of the vSAN network.
FIGURE 1-4: Virtual Switch and port group configuration
Health Findings
Prior to performing maintenance, review the vSAN networking health findings (renamed from "health checks" in vSphere 8 U1 and later). Health findings tied to connectivity, latency, or cluster partitions can help identify situations where one of the two paths is not configured correctly, or is experiencing a health issue.
FIGURE 1-5: Network-related health checks in the vSAN health UI in vCenter
Understanding the nature of the maintenance can also help you understand what health alarms to expect. Basic switch patching can sometimes be performed non-disruptively. Switch upgrades that can be performed as an in-service software upgrade (ISSU) may not be noticeable, while physically replacing a switch may lead to a number of connectivity alarms. Discuss the options with your networking vendor.
Testing failure impacts
It is a good idea to simulate a path failure on a single host (disable a single port) before taking a full switch offline. If VMs on that host become unresponsive, or if HA is triggered, this may imply an issue with pathing that should be resolved prior to switch removal or reboot.
Controlled maintenance
If fault domains are used with multiple racks of hosts using different switches, consider limiting maintenance to a single fault domain and verify its health before continuing on. For stretched clusters, limit maintenance to one side at a time to reduce potential impacts.
Summary
In a vSAN environment, configuration of virtual switches and their respective uplinks follows practices commonly recommended in traditional three-tier architectures. With the added responsibility of serving as the storage fabric, ensuring that the proper configuration is in place will help vSAN perform as expected.
Configuring Fault Domains
Each host in a vSAN cluster is an implicit fault domain by default. vSAN distributes data across fault domains (hosts) to provide resilience against drive and host failure. This is sufficient to provide the right combination of resilience and flexibility for data placement in a cluster in the majority of environments. There are use cases that call for fault domain definitions spanning across multiple hosts. Examples include protection against server rack failure, such as rack power supplies and top-of-rack networking switches.
vSAN includes an optional ability to configure explicit fault domains that include multiple hosts. vSAN distributes data across these fault domains to provide resilience against larger domain failure—an entire server rack, for example.
vSAN requires a minimum of three fault domains. At least one additional fault domain is recommended to ease data resynchronization in the event of unplanned downtime, or planned downtime such as host maintenance and upgrades. The diagram below shows a vSAN cluster with 24 hosts. These hosts are evenly distributed across six server racks.
FIGURE 1-6: A conceptual illustration of a vSAN cluster using 24 hosts and 6 explicit fault domains
With the example above, you would configure six fault domains—one for each rack—to help maintain access to data in the event of an entire server rack failure. This process takes only a few minutes using the vSphere Client. “Managing Fault Domains in vSAN Clusters” contains detailed steps for configuring fault domains in vSAN, and recommendations for "Designing and Sizing vSAN Fault Domains" are also available. The “Design and Operation Considerations When Using vSAN Fault Domains” post offers practical guidance for some of the most commonly asked questions when designing for vSAN fault domains.
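Fault domains can also be created with PowerCLI, which is convenient when mapping many hosts to racks. The sketch below is illustrative only; the rack-to-host mapping and all names are placeholders for the 24-host example above.

# Illustrative PowerCLI sketch: create one vSAN fault domain per server rack.
$cluster = Get-Cluster -Name "vSAN-Cluster-01"
$racks = @{
    "Rack-01" = "esx01.example.com", "esx02.example.com", "esx03.example.com", "esx04.example.com"
    "Rack-02" = "esx05.example.com", "esx06.example.com", "esx07.example.com", "esx08.example.com"
    # ...repeat for the remaining racks...
}
foreach ($rack in $racks.Keys) {
    $vmHosts = Get-VMHost -Name $racks[$rack] -Location $cluster
    New-VsanFaultDomain -Name $rack -VMHost $vmHosts
}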
Recommendation: Prior to deploying a vSAN cluster using explicit fault domains, ensure that rack-level redundancy is a requirement of the organization. Fault domains can increase the considerations in design and management, thus determining the actual requirement up front can result in a design that reflects the actual needs of the organization.
vSAN is also capable of delivering multi-level replication or “nested fault domains.” This is already fully supported with vSAN stretched cluster architectures. Nested fault domains provide an additional level of resilience at the expense of higher-capacity consumption. Redundant data is distributed across fault domains and within fault domains to provide this increased resilience to drive, host, and fault domain outages. Note that some features are not available when using vSAN's explicit fault domains. For example, the new reserved capacity functionality in the UI of vSAN 7 U1 is not supported in a topology that uses fault domains such as a stretched cluster, or a standard vSAN cluster using explicit fault domains.
Summary
Standard vSAN clusters using the explicit “Fault Domains” feature offer tremendous levels of flexibility to meet the levels of resilience required by an organization. They can introduce different operational and design considerations than those of a standard vSAN cluster not using this feature. Becoming familiar with these considerations will help you determine if they are a good fit for your organization.
Migrate to a Different vSAN Cluster Type
In some cases, an administrator may want to migrate a vSAN cluster built initially with spinning disks to an all-flash based vSAN cluster. The information below describes some of the considerations for an in-place migration.
Migrating from the OSA to ESA
As customers purchase new hardware that supports the ESA, questions arise about how best to migrate their environment to clusters running the ESA. For a detailed explanation of the options available, see the document "Migrating to the Express Storage Architecture in vSAN 8."
Migrating hybrid cluster (OSA) to All-Flash cluster (OSA)
While this is becoming much more rare due to the dominance of all-flash clusters, occasionally customers may need to transition an OSA-based hybrid cluster to an OSA-based all-flash cluster. Review the supported process steps to cover this action. Identify if the disk controllers and cache devices currently in use can be reused for all-flash. Note that there may be newer driver/firmware certified for the controller for all-flash usage. Check the VMware Compatibility Guide (VCG) for vSAN for more information.
Confirm that the cluster will have sufficient capacity once the migration is complete without requiring the use of deduplication and compression (DD&C), "compression only," or RAID-5/6. If space efficiency features are required, consider migrating some VMs outside the cluster until the conversion is completed. It is recommended to replace disk groups with the same or more capacity as part of the migration process if done in place.
Identify if you will be converting disk group by disk group, or host by host. If there is limited free capacity on the existing cluster, migrating disk group by disk group requires less free space. If migrating host by host, other actions (such as patching controller firmware and patching ESXi) can be included in this workflow to reduce the number of host evacuations required. Review existing Storage Policy Based Management (SPBM) policies for flash read cache reservation usage. This policy rule is not supported on all-flash configurations and leads to health alarms and failed provisioning. See “Unable to provision linked-clone pools on a vSAN all-flash cluster” for an example of this behavior.
FIGURE 1-7: An unsupported policy rule when transitioning from hybrid to all-flash vSAN
After the migration, identify what new data services will be enabled. You will need to do the full migration first before you enable any data services (cluster level services like DD&C, or object data placement schemes like RAID-5/6). The creation of new policies and migrating VMs is recommended over changing existing RAID-1 policies.
Recommendation: If possible, create a new cluster and simply perform a vMotion and Storage vMotion for the workloads that you'd like to migrate. This is a much cleaner approach, and applies to almost all scenarios.
Summary
Transitioning to different cluster types is not unusual as new hardware and technologies become available. The two paths above give a simple overview of common migrations, and how they can best be achieved.
Section 2: Network
Configuring NIOC for vSAN Bandwidth Management Using Shared Uplinks
vSphere Network I/O Control (NIOC) version 3 introduces a mechanism to reserve bandwidth for system traffic based on the capacity of the physical adapters on a host. It enables fine-grained resource control at the VM network adapter level, similar to the model used for allocating CPU and memory resources. NIOC is only supported on the VMware Distributed Switch (vDS) and is enabled per switch.
Planning the process
It is recommended to not enable limits. Limits artificially restrict vSAN traffic even when bandwidth is available. Reservations should also be avoided because reservations do not yield free bandwidth back for non-VMkernel port uses. On a 10Gbps uplink, a 9Gbps vSAN reservation would leave only 1Gbps available for VM traffic even when vSAN is not passing traffic. Limits are also a poor fit because the ESA has much higher networking requirements than the OSA, so limits would not adapt well to these conditions. For more information, see the post: "Designing vSAN Networks - 2022 Edition."
FIGURE 2-1: Setting shares in NIOC to balance network resources under contention
Shares are the recommended way to prioritize traffic for VMware vSAN. Raise the vSAN shares to “High.”
FIGURE 2-2: An example of a configuration of shares for a vSAN-powered cluster (OSA)
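Core PowerCLI does not expose a dedicated cmdlet for NIOC share values, but the current allocation can be inspected through the vSphere API objects that PowerCLI surfaces. The following read-only sketch assumes a vDS named "vDS-01"; the property names follow the public distributed switch API and should be verified against your PowerCLI and vSphere versions.

# Illustrative, read-only PowerCLI sketch: report the NIOC allocation for vSAN system traffic.
$vds = Get-VDSwitch -Name "vDS-01"
# InfrastructureTrafficResourceConfig lists the NIOC v3 system traffic types (vsan, vmotion, etc.).
$vsanAlloc = $vds.ExtensionData.Config.InfrastructureTrafficResourceConfig |
    Where-Object { $_.Key -eq "vsan" }
# Per the guidance above, the shares level should be "high", with no limit or reservation set.
$vsanAlloc.AllocationInfo.Shares | Format-List Level, Shares
$vsanAlloc.AllocationInfo | Format-List Limit, Reservation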
Other network quality of service (QoS) options
It is worth noting that NIOC only provides shaping services on the host’s physical interfaces. It does not provide prioritization on switch-to-switch links and has no awareness of contention caused by oversaturated leaf/spine uplinks, or data center–to–data center links for stretched clustering. Tagging a dedicated vSAN VLAN with class of service or DSCP can provide end-to-end prioritization. Discuss these options with your networking teams and switch vendors for optimal configuration guidance.
Summary
Storage traffic needs low-latency reliable transport end to end. NIOC can provide a simple setup and powerful protection for vSAN traffic.
Creating and Using Jumbo Frames in vSAN Clusters
Jumbo frames are Ethernet frames with more than 1,500 bytes of payload. The most common jumbo configuration is a payload size of 9,000 bytes, although modern switches can often go up to 9,216 bytes.
Planning the process
Consult with your switch vendor and identify if jumbo frames are supported and what maximum transmission units (MTUs) are available. If multiple switch vendors are involved in the configuration, be aware they measure payload overhead in different ways in their configuration. Also identify if a larger MTU is needed to handle encapsulation such as VxLAN. Identify all configuration points that must be changed to support jumbo frames end to end. If Witness Traffic Separation is in use, be aware that an MTU of 1,500 may be required for the connection to the witness.
Implementing the change
Start the changes with the physical switch and distributed switch. To avoid dropped packets, make the change last to the VMkernel port adapters used for vSAN.
FIGURE 2-3: Changing the MTU size of virtual distributed switch (VDS)
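As one example of the ordering described above, the following PowerCLI sketch raises the MTU on the vDS first and on the vSAN VMkernel adapters last. The switch name, cluster name, and "vmk2" adapter name are placeholders; the physical switches must already be carrying jumbo frames.

# Illustrative PowerCLI sketch: change the vDS MTU first, then the vSAN VMkernel adapters.
Get-VDSwitch -Name "vDS-01" | Set-VDSwitch -Mtu 9000
# Change the vSAN VMkernel adapters last to minimize the chance of dropped packets.
foreach ($vmHost in Get-Cluster -Name "vSAN-Cluster-01" | Get-VMHost) {
    Get-VMHostNetworkAdapter -VMHost $vmHost -VMKernel -Name "vmk2" |
        Set-VMHostNetworkAdapter -Mtu 9000 -Confirm:$false
}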
Validation
The final step is to verify connectivity. To assist with this, the "vSAN: MTU check (ping with large packet size)" health check performs a ping test with large packet sizes from each host to all other hosts to verify connectivity end to end.
FIGURE 2-4: Verifying connectivity using the vSAN MTU check health check.
Summary
Jumbo frames can reduce overhead on NICs and switch application-specific integrated circuits (ASICs). While modern NIC offload technologies already reduce much of this overhead, jumbo frames can further reduce the CPU overhead associated with high throughput and improve performance. The largest gains should be expected on older, more basic NICs with fewer offload capabilities.
Create and Manage Broadcast Domains for Multiple vSAN Clusters
It is recommended, when possible, to dedicate unique broadcast domains (or collections of routed broadcast domains for Layer 3 designs) for vSAN. Benefits to unique broadcast domains include:
- Fault isolation—Spanning tree issues, configuration mistakes, duplicate IP addresses, and other problems can cause a broadcast domain to fail, or can cause failures to propagate across a broadcast domain.
- Security—While vSAN hosts have automatic firewall rules created to reduce the attack surface, data over the vSAN network is not encrypted unless a higher-level solution (VM encryption, for example) is used. To reduce the attack surface, restrict the broadcast domain to only contain VMkernel ports dedicated to the vSAN cluster. Dedicating isolated broadcast domains per cluster helps ensure security barriers between clusters.
Planning the process
There are a number of ways to isolate broadcast domains:
- The most basic is physically dedicated and isolated interfaces and switching.
- The most common choice is to tag VLANs on the port groups used by the vSAN VMkernel ports. Prior to this, configure the switches between the hosts to carry this VLAN for these ports.
- Other encapsulation methods for carrying VLANs between routed segments (ECMP fabrics, VxLAN) are supported.
- NSX-V may not be used for vSAN or storage VMkernel port encapsulation.
- NSX-T may be used with VLAN-backed port groups (subject to version; NSX-T 2.2 offers notable improvements in support of vSAN environments).
Implementing the change
The first step is to configure the VLAN on the port group. This can also be set up when the VDS and port groups are created using the Cluster Quickstart.
FIGURE 2-5: Configuring a port group to use a new VLAN
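The VLAN tagging itself can also be scripted. The sketch below is illustrative only; the switch name, port group name, and VLAN ID are placeholders, and the upstream switches must already trunk the VLAN.

# Illustrative PowerCLI sketch: create (or retag) a port group dedicated to vSAN traffic.
$vds = Get-VDSwitch -Name "vDS-01"
New-VDPortgroup -VDSwitch $vds -Name "vSAN-PG-Cluster01" -VlanId 40
# To change the VLAN on an existing port group instead:
Get-VDPortgroup -Name "vSAN-PG-Cluster01" | Set-VDPortgroup -VlanId 40 -Confirm:$false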
Validation
A number of built-in health checks can help identify if a configuration problem exists, preventing the hosts from connecting. To ensure proper functionality, all vSAN hosts must be able to communicate. If they cannot, a vSAN cluster splits into multiple partitions (i.e., subgroups of hosts that can communicate with each other but not with other subgroups). When that happens, vSAN objects might become unavailable until the network misconfiguration is resolved. To help troubleshoot host isolation, the vSAN network health checks can detect these partitions and ping failures between hosts.
Recommendation: VLAN design and management does require some levels of discipline and structure. Discuss with your network team the importance of having discrete VLANs for your vSAN clusters up front, so that it lays the groundwork for future requests.
FIGURE 2-6: Validating that changes pass network health findings
Summary
Configuring discrete broadcast domains for each respective cluster is a recommended practice for vSAN deployment and management. This helps meet levels of fault isolation and security with no negative trade-off.
Change IP Addresses of Hosts Participating in vSAN Cluster
vSAN requires networking between all hosts in the cluster for VMs to access storage and maintain the availability of storage. Operationally, migrating IP addresses of storage networks needs extensive care to prevent loss of connectivity to storage or loss of quorum to objects.
Planning the process
Identify whether you will do this as an online process or as a disruptive offline process (powering off all VMs). If disruptive, make sure to power off all VMs following the cluster shutdown guidance.
Implementing the change
If new VMkernel ports are used prior to removing old ones, a number of techniques can be used to validate networking and test hosts before removing the original VMkernel ports.
- Use vmkping to source pings between the new VMkernel ports.
- Put hosts into maintenance mode, or evacuate VMs, before removing the original vSAN VMkernel port.
- Check the vSAN object health alarms to confirm that the cluster is at full health once the original VMkernel port has been removed.
- Once the host has left maintenance mode, use vSphere vMotion to migrate a test VM to the host and confirm that no health alarms are triggered before continuing to the next host.
Note that in vSAN 7 U3 and newer, there is a Skyline health check for vSAN that will detect duplicate IP addresses.
Validation
Before restoring the host to service, confirm that networking and object health have returned to normal.
Migrate vSAN traffic to different VMkernel port
There are cases where the vSAN network needs to be migrated to a different network segment. For example, the implementation of a new network infrastructure or the migration of a vSAN standard cluster (non-routed network) to a vSAN stretched cluster (routed network). Recommendations and guidance on this procedure are given below.
Prerequisites
Check Skyline Health for vSAN to verify there are no issues. This is recommended before performing any planned maintenance operations on a vSAN cluster. Any issues discovered should be resolved before proceeding with the planned maintenance.
Set up the new network configuration on your vSAN hosts. This procedure will vary based on your environment. Consult "vSphere Networking" in the vSphere section of VMware Docs https://docs.vmware.com for the version of vSphere you are running.
Ensure that the new vSAN network subnet does not overlap with the existing one. vSphere will not allow the vSAN service to run simultaneously on two VMkernel ports on the same subnet. Attempting to do this using esxcli will produce an error like the one shown below.
esxcli vsan network ip add -i vmk2
Failed to add vmk2 to CMMDS: Unable to complete Sysinfo operation. Please see the VMkernel log file for more details.
Vob Stack: [vob.vsan.net.update.failed.badparam]: Failed to ADD vmknic vmk2 with vSAN because a parameter is incorrect.
Note that you might see warnings in Skyline Health as you add new VMkernel adapters with the vSAN service, specifically the "vSAN: Basic (unicast) connectivity check" and "vSAN: MTU check (ping with large packet size)" health checks, as shown below. This is expected if the vSAN service on one host is not able to communicate with other hosts in the vSAN cluster. These warnings should be resolved after the new VMkernel adapters for vSAN have been added and configured correctly on all hosts in the cluster. Use the "Retest" button in vSAN Skyline Health to refresh the health check status.
FIGURE 2-7: vSAN Skyline Health warnings
Use vmkping to verify the VMkernel adapter for the new vSAN network can ping the same VMkernel adapters on other hosts. This VMware Knowledge Base article provides guidance on using vmkping to test connectivity: https://kb.vmware.com/s/article/1003728
- Shut down all running virtual machines that are using the vSAN datastore. This will minimize traffic between vSAN nodes and ensure all changes are committed to the virtual disks before the migration occurs.
- After configuring the new vSAN network on every host in the vSAN cluster, verify the vSAN service is running on both VMkernel adapters. This can be seen by checking the Port Properties for both VMkernel adapters in the UI or by running esxcli vsan network list. You should see output similar to the text below.
[root@host01:~] esxcli vsan network list
Interface
   VmkNic Name: vmk1
   ...
   Traffic Type: vsan
Interface
   VmkNic Name: vmk2
   ...
   Traffic Type: vsan
- Click the "Retest" button in vSAN Skyline Health to verify there are no warnings while the vSAN service is enabled on both VMkernel adapters on every host. If there are warnings, it is most likely because one of more hosts do not have the vSAN service enabled on both VMkernel adapters. Troubleshoot the issue and use the "Retest" option in vSAN Skyline Health until all issues are resolved.
- Disable the vSAN service on the old VMkernel adapters. (A PowerCLI sketch of enabling and disabling the vSAN service on VMkernel adapters follows this list.)
- Click the "Retest" button in vSAN Skyline Health to verify there are no warnings.
- Power on the virtual machines.
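For reference, the vSAN service can be toggled on VMkernel adapters with PowerCLI as well as esxcli. This is a hedged sketch only; the cluster name and the vmk1/vmk2 adapter names are placeholders, and the old adapter should only be disabled after the health checks above pass on every host.

# Illustrative PowerCLI sketch: tag the new VMkernel adapter for vSAN traffic on every host,
# then (after validation) remove the vSAN service from the old adapter.
foreach ($vmHost in Get-Cluster -Name "vSAN-Cluster-01" | Get-VMHost) {
    Get-VMHostNetworkAdapter -VMHost $vmHost -VMKernel -Name "vmk2" |
        Set-VMHostNetworkAdapter -VsanTrafficEnabled $true -Confirm:$false
}
# Only after Skyline Health shows no warnings on any host:
foreach ($vmHost in Get-Cluster -Name "vSAN-Cluster-01" | Get-VMHost) {
    Get-VMHostNetworkAdapter -VMHost $vmHost -VMKernel -Name "vmk1" |
        Set-VMHostNetworkAdapter -VsanTrafficEnabled $false -Confirm:$false
}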
Recommendation: While it is possible to perform this migration when VMs on the vSAN datastore are powered on, it is NOT recommended and should only be considered in scenarios where shutting down the workloads running on vSAN is not possible.
Summary
Migrating the vSAN VMkernel port is a supported practice that, when done properly, can be accomplished quickly and with a predictable outcome.
Also see VMware Knowledge Base article 76162: How to Non-Disruptively Change the VLAN for vSAN Data in a Production Environment
Introducing RDMA into a vSAN Environment
Running vSAN over RDMA introduces new levels of efficiency and performance. The support of RoCE v2 introduced in vSAN 7 U2 means that customers can explore this extremely fast method of network connectivity between vSAN hosts on a per-cluster basis.
FIGURE 2-8: vSAN over RDMA
Recommendation: Ensure that a host added to a vSAN cluster running RDMA is, in fact, fully compatible with RDMA. Adding a single host that is not compatible with RDMA will make the cluster fail back to using TCP over Ethernet.
Introducing RDMA into an environment requires the use of certified hardware (RDMA NIC adapters and switches). See the vSAN VCG for RDMA Network Adapters for more information. vSAN clusters using RDMA may be subject to additional limitations of supported features or functionality, including, but not limited to:
- vSAN cluster sizes are limited to 32 hosts
- vSAN cluster must not be running the vSAN iSCSI services
- vSAN cluster must not be running in a stretched cluster configuration
- vSAN cluster must be using RDMA over Layer 2. RDMA over Layer 3 is not supported.
- vSAN cluster running vSAN over RDMA is not supported with HCI Mesh.
- vSAN cluster must not be using a teaming policy based on IP Hash or any active/active connection where sessions are balanced across two or more uplinks.
For clusters running the vSAN ESA, RDMA may be even more beneficial, as oftentimes the bottleneck in an ESA environment is the network, and not the storage stack in a server.
Summary
When configured properly, workloads running on these clusters will enjoy new levels of performance and efficiency when compared to the same workloads running in a vSAN cluster using traditional TCP over Ethernet. Due diligence must be taken to ensure that the environment and cluster are properly configured to run RDMA.
Section 3: Storage Devices
Adding Capacity Devices to Existing Disk Groups (OSA)
Expanding a vSAN cluster is a non-disruptive operation. Administrators can add new disks, replace capacity disks with larger disks, or simply replace failed drives without disrupting ongoing operations.
FIGURE 3-1: Adding a capacity device to an existing disk group
When you configure vSAN to claim disks in manual mode, you can add additional local devices to existing disk groups. Keep in mind vSAN only consumes local, empty disks. Remote disks, such as SAN LUNs, and local disks with partitions cannot be used and won’t be visible. If you add a used device that contains residual data or partition information, you must first clean the device. See the documentation on removing partition information from devices. You can also run the host_wipe_vsan_disks command in Ruby vSphere Console (RVC) to format the device.
If performance is a primary concern, avoid adding capacity devices without increasing the cache, which reduces your cache-to-capacity ratio. Consider adding the new storage devices to a new disk group that includes an additional cache device.
Adding storage devices to an existing disk group can be performed on clusters running DD&C, compression-only, or no cluster-based space efficiency features. However, specifically with DD&C, we do recommend the following:
Recommendation: For optimal results on all-flash vSAN clusters with DD&C enabled, remove the disk group first and then recreate it to include the new storage devices. This step is not necessary when using the "compression-only" feature.
After adding disks to an existing cluster, verify that the vSAN Disk Balance health check is green. If the Disk Balance health check issues a warning, perform a manual rebalance during off-peak hours.
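Adding a capacity device to an existing disk group can also be done with PowerCLI. The sketch below is illustrative only; the host name and the device canonical name are placeholders, and the device must be a local, empty disk as noted above.

# Illustrative PowerCLI sketch: add an empty local device to an existing disk group (OSA).
$vmHost = Get-VMHost -Name "esx01.example.com"
# Pick the disk group to expand (here, simply the first one reported for the host).
$diskGroup = Get-VsanDiskGroup -VMHost $vmHost | Select-Object -First 1
# Add the new device to that disk group as a capacity device.
New-VsanDisk -CanonicalName "naa.55cd2e404c531234" -VsanDiskGroup $diskGroup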
Summary
Scale up a vSAN cluster by adding new storage devices either to a new disk group or to an existing disk group. Always verify storage devices are on the VMware Compatibility Guide. If adding to an existing disk group, consider the cache-to-capacity ratio, and always monitor the Disk Balance health check to ensure the cluster is balanced.
Adding Additional Devices in New Disk Group (OSA)
vSAN architecture consists of two tiers:
- A cache tier for read caching and write buffering
- A capacity tier for persistent storage
This two-tier design offers supreme performance to VMs while ensuring data is written to devices in the most efficient way possible. vSAN uses a logical construct called disk groups to manage the relationship between capacity devices and their cache tier.
FIGURE 3-2: Disk groups in a single vSAN host
A few things to understand about disk groups:
- Each host that contributes storage in a vSAN cluster contains at least 1 disk group.
- Disk groups contain at most 1 cache device and between 1 to 7 capacity devices.
- A vSAN host can have at most 5 disk groups, each containing up to 7 capacity devices, resulting in a maximum of 35 capacity devices for each host.
- Whether the configuration is hybrid or all-flash, the cache device must be a flash device.
- In a hybrid configuration, the cache device is used by vSAN as both a read cache (70%) and a write buffer (30%).
- In an all-flash configuration, 100% of the cache device is dedicated as a write buffer.
When you create a disk group, consider the ratio of flash cache to consumed capacity. The ratio depends on the requirements and workload of the cluster. For a hybrid cluster, consider using a flash cache to consumed capacity ratio of at least 10% (not including replicas, such as mirrors). For guidance on determining the cache ratio for all-flash clusters, refer to the blog posts Designing vSAN Disk Groups - All Flash Cache Ratio Update and Write Buffer Sizing in vSAN when Using the Very Latest Hardware.
Recommendation: While vSAN requires at least one disk group per host contributing storage in a cluster, consider using more than one disk group per host.
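Creating an additional disk group can be scripted as shown in this hedged sketch. The host name and the device canonical names are placeholders; the cache device must be a flash device and the capacity devices must be empty local disks.

# Illustrative PowerCLI sketch: create a new disk group (OSA) with one cache and three capacity devices.
$vmHost = Get-VMHost -Name "esx01.example.com"
New-VsanDiskGroup -VMHost $vmHost `
    -SsdCanonicalName "naa.55cd2e404c53aaaa" `
    -DataDiskCanonicalName "naa.55cd2e404c53bbbb", "naa.55cd2e404c53cccc", "naa.55cd2e404c53dddd"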
Summary
Scale up a vSAN cluster by adding new storage devices either to a new disk group or to an existing disk group. Be sure to check the VMware Compatibility Guide for a list of supported PCIe flash devices, SSDs, and NVMe devices.
Recreating a Disk Group (OSA)
Disk groups form the basic construct that is pooled together to create the vSAN datastore. They may need to be recreated in some situations. It is most commonly done to remove stale data from the existing disks or as part of a troubleshooting effort.
The Recreate Disk Group process can be invoked by traversing to Cluster → Configure → Disk Management, as shown in FIGURE 3-3.
FIGURE 3-3: Recreating a vSAN disk group in the vCenter UI
The detailed procedure is described here.
vSAN automates the backend workflow of recreating the disk groups. Nonetheless, it is useful to understand the steps involved. Recreating a disk group involves:
- Evacuating data (wholly or partially) or deleting the existing data on disk
- Removing disk group from the vSAN cluster
- Rebuilding the disk groups and claiming the disks
The administrator can choose to migrate data from the disk group through Full data migration or the Ensure accessibility option. The third option, No data migration, simply purges the data and may cause some VMs to become inaccessible. For each of the chosen options, an assessment is performed to validate the impact on the objects’ compliance and determine if there is sufficient capacity to perform the intended migration.
Recommendation: “Ensure accessibility” validates that all objects are protected sufficiently and only moves components that are not protected to other disk groups in the cluster. This limits the migration to the minimal and “necessary” data to ensure VM availability elsewhere in the cluster. Selecting "Full data migration" ensures that all data is removed from the host or disk group(s) in question.
Summary
Recreating a disk group simplifies a multi-step process of removing a disk group, creating a new disk group, and adding disks back into one automated workflow. It also has guardrails in place to safely migrate data elsewhere in the cluster prior to rebuild.
Remove a Capacity Device (OSA)
vSAN architecture comprises a cache tier and a capacity tier to optimize performance. A combination of one cache device and up to seven capacity devices makes up a disk group. There are common scenarios, such as hardware upgrades or failures, where disks may need to be removed from a disk group for replacement. While replacing a device is relatively easy, exercising caution throughout will help avoid misunderstandings during the device replacement process. In particular, ensure the following:
- Ensure that the physical device to be removed is correctly identified in the host. Server vendors may have different methods for matching up their physical device with what is represented in the vCenter Server UI. This may even vary depending on the form factor of the server and/or chassis enclosure. This can be confusing, especially if there are different teams responsible for hardware and software.
- Enter the host into maintenance mode. This will ensure that there is no data being actively served by the host. If you also wish to ensure all VM objects meet their respective storage policy compliance, choose "Full data migration" when entering maintenance mode and wait until resynchronizations are complete. Selecting "Full data migration" will migrate all data to the other hosts in the cluster. Selecting "Ensure Accessibility" will also suffice if you wait for the resynchronizations that begin after the default 60-minute timeout window to complete. A PowerCLI sketch of entering maintenance mode with a chosen data migration mode follows this list.
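The maintenance mode step referenced above can be performed with PowerCLI, selecting the vSAN data migration mode explicitly. This is a sketch only; the host name is a placeholder, and a full evacuation can take considerable time depending on the amount of data on the host.

# Illustrative PowerCLI sketch: enter maintenance mode with full data evacuation before removing a device.
$vmHost = Get-VMHost -Name "esx01.example.com"
# -VsanDataMigrationMode accepts Full, EnsureAccessibility, or NoDataMigration.
Set-VMHost -VMHost $vmHost -State Maintenance -VsanDataMigrationMode Full
# ...replace or remove the device, then return the host to service...
Set-VMHost -VMHost $vmHost -State Connected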
vSAN also incorporates a mechanism to proactively detect unhealthy disks and unmount them. This can happen if a device exhibits some anomalies. It allows an administrator to validate the anomaly and remove or replace the affected device.
Remove a capacity device by going to Cluster → Configure → Disk Management. On clicking a disk group, the associated devices are listed in the bottom pane, as shown:
FIGURE 3-4: Removing a vSAN capacity disk in the vCenter UI
Recommendation: If the device is being removed permanently, perform Full data migration. This ensures that objects remain compliant with the respective storage policies. Use LED indicators to identify the appropriate device that needs to be removed from the physical hardware.
An abrupt device removal would cause an all-paths-down (APD) or permanent-device-loss (PDL) situation. In such cases, vSAN would trigger error-handling mechanisms to remediate the failure.
Recommendation: Maintain a runbook procedure that reflects the steps based on your server vendor. The guidance provided here does not include any step-by-step instructions for the replacement of devices based on the server hardware.
Summary
vSAN eases maintenance activities such as hardware upgrades by abstracting disk removal workflow in the UI. The guardrails to assess object impact and usage of LED indicators minimize the possibility of user errors. Effectively, the entire set of capacity devices can be removed, replaced, or upgraded in a cluster with zero downtime.
Remove a Disk Group (OSA)
vSAN enables an administrator to granularly control the addition and removal of a disk group from a vSAN datastore. This allows for greater flexibility and agility to carry out maintenance tasks non-disruptively. Removing a disk group effectively reduces the corresponding capacity from the vSAN datastore. Prior to removing a disk group, ensure there is sufficient capacity in the cluster to accommodate the migrated data.
Initiate removing a disk group by traversing to Cluster → Configure → Disk Management. On clicking a disk group, the Remove this disk group option is enabled in the UI, as shown:
FIGURE 3-5: Removing a vSAN disk group in the vCenter UI
The administrator can choose from:
- Full data migration
- Ensure accessibility
- No data migration
Full data migration evacuates the disk group completely. Ensure accessibility moves only unprotected components. No data migration does not migrate any data and removes the disk group directly.
Recommendation: Full data migration is recommended to evacuate the disk group. This ensures that objects remain compliant with the respective storage policies.
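The disk group removal can also be scripted. The sketch below is illustrative only; the host name is a placeholder, and the -DataMigrationMode parameter and its accepted values should be verified against your PowerCLI release.

# Illustrative PowerCLI sketch: remove a disk group with full data evacuation.
$vmHost = Get-VMHost -Name "esx01.example.com"
$diskGroup = Get-VsanDiskGroup -VMHost $vmHost | Select-Object -First 1
# Full data migration keeps all objects compliant with their storage policies during removal.
Remove-VsanDiskGroup -VsanDiskGroup $diskGroup -DataMigrationMode Full -Confirm:$false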
Modifying disk group composition or carrying out maintenance tasks would likely cause an imbalance in data distribution across the cluster. This is an interim condition because some hosts may contribute more capacity than others. To achieve optimal performance, restore the cluster to the identical hardware configuration across hosts.
Summary
The ability to manage a disk group individually provides a modular approach to sizing and capacity management in a vSAN cluster. The entire set of disk groups in a vSAN cluster can be removed, replaced, or upgraded without any intrusion to the workloads running on the cluster.
Secure Erase of Data on a Decommissioned vSAN Storage Device
Just as with many storage systems, discrete storage devices decommissioned from a storage system typically need an additional step to meet National Institute of Standards and Technology (NIST) guidelines ensuring that all data previously stored on a device can no longer be accessed. This involves a step often referred to as "secure erase" or "secure wipe." The goal of a secure wipe is to prevent data spillage, which could occur if a system or device was repurposed to a less sensitive environment. It also plays a critical role in a declassification procedure, which may involve the formal demotion of the hardware to a less secure environment. The method discussed here achieves a properly and securely erased device for both of those purposes.
vSAN 7 U1 introduces a new approach to this secure wipe process. It can be achieved through the API or PowerCLI, with the latter being a much more convenient option for administrators. It should be the final step in the decommissioning process if the requirements dictate this level of security. To protect against data loss resulting from an inadvertent command, the wipe option will only be supported if "Evacuate all data" was chosen at the time of removing the disk from the disk group.
Recommendation: Be patient. The secure wipe procedure may take some time. Claiming the device in vSAN must wait for the secure wipe process to complete.
PowerCLI command syntax
The PowerCLI commands for wiping a disk will include:
- Wipe-Disk – Given a list of disks, issues a wipe of those disks. Syntax: Wipe-Disk -Disk <Disk[]> -RunAsync
- Query-Wipe-Status – Given a list of disks, returns a list of wipe disk statuses. Syntax: Query-Wipe-Status <Disk[]>
- Abort-Wipe-Disk – Given a list of disks, cancels the sanitization of them and returns the status. Syntax: Abort-Wipe-Disk <Disk[]>
The following image shows an example of these secure wipe commands.
FIGURE 3-6: Example of the secure wipe commands
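To illustrate how these commands might be combined, the hedged sketch below retrieves a disk object with standard PowerCLI and passes it to the wipe commands listed above. The host name and canonical name are placeholders, and the exact way the disk object is retrieved may differ in your PowerCLI release.

# Illustrative sketch: wipe a decommissioned vSAN flash device and check its status.
# The disk must already have been evacuated ("Evacuate all data") and removed from its disk group.
$vmHost = Get-VMHost -Name "esx01.example.com"
$disk = Get-VsanDisk -VMHost $vmHost | Where-Object { $_.CanonicalName -eq "naa.55cd2e404c531234" }
Wipe-Disk -Disk $disk -RunAsync
Query-Wipe-Status $disk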
The disk wipe activity log will capture:
- Date/time wipe initiated
- Date/time wipe completed
- Status of the job
- Relevant information as to which host and cluster the activity occurred on
- Status of success or failure
Support and Compatibility
There is a heavy reliance on system and device capabilities in order to support the above commands. Therefore, some older generations of hardware/servers may not support this functionality. Please check the VMware Compatibility Guide (VCG) for vSAN to determine which ReadyNodes support the feature. Secure wipe may only be applicable to some systems using NVMe, SATA, and SAS based devices. The support of a secure wipe is limited to flash devices only. This functionality does not apply to spinning disks.
Summary
Security is more than just limiting access and encrypting data. Many organizations must follow the regulatory requirements of decommissioning hardware, including the scrubbing of all data from storage devices. The secure wipe commands described above help provide an easy and effective method for achieving this result.
Section 4: vSAN Datastore
Maintaining Sufficient Free Space for Resynchronizations
vSAN requires free space set aside for operations such as host maintenance mode data evacuation, component rebuilds and rebalancing operations. This free space also accounts for capacity needed in the event of a host outage. Activities such as rebuilds and rebalancing can temporarily consume additional raw capacity. While a host is in maintenance mode, it reduces the total amount of raw capacity a cluster has. The local drives do not contribute to vSAN datastore capacity until the host exits maintenance mode.
The requirements and operational guidance for free space fall into two categories.
- All vSAN versions prior to vSAN 7 U1. This free space required for these transient operations was referred to as "slack space." The limitations of vSAN in versions prior to vSAN 7 U1 meant that there was a generalized recommendation of free space as a percentage of the cluster (25-30%), regardless of the cluster size. See the post “Revisiting vSAN’s Free Capacity Recommendations” (in versions prior to vSAN 7 U1) for a more detailed understanding of slack space.
- All vSAN versions including and after vSAN 7 U1. The free space required for these transient operations is now referred to as "Reserved Capacity." This is composed of two elements: "Operations Reserve" and "Host Rebuild Reserve." (The term "slack space" is no longer applicable for vSAN 7 U1 and later.) Significant improvements were included in vSAN 7 U1 that reduce the capacity necessary for efficient vSAN operations. See the post "Effective Capacity Management with vSAN 7 U1" for a more detailed understanding of "Operations Reserve."
What is the recommended amount of free capacity needed for environments running vSAN 7 U1 and later? The actual amount is highly dependent on the configuration of the cluster. When sizing a new cluster, the vSAN Sizer has this logic built in. Do not use any manually created spreadsheets or calculators, as these will no longer accurately calculate the free capacity requirements for vSAN 7 U1 and later. For existing environments, turning on the "Enable Capacity Reserve" option (found in the "Configure" screen of the vSAN cluster capacity view) will provide the actual capacity needed for a cluster.
The Reserved Capacity functionality applies to and is included in the vSAN Express Storage Architecture (ESA), where it behaves in functionally the same way. There are some slight differences in capacity overheads for the ESA in vSAN 8, but at this time they are relatively minor.
Recommendation: The "reserved capacity" functionality is an optional toggle that is not enabled in a vSAN cluster by default for new or for existing clusters that were upgraded. To ensure sufficient free capacity to meet your requirements, it is recommended to turn it on if your vSAN topology and configuration supports it.
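To check where a cluster currently stands against these free capacity recommendations, capacity and free space can be queried with PowerCLI. A minimal sketch, assuming the cluster name below is replaced with your own:

# Illustrative PowerCLI sketch: review current capacity and free space for a vSAN cluster.
$cluster = Get-Cluster -Name "vSAN-Cluster-01"
# The returned object includes total and free capacity; property names vary slightly by release.
Get-VsanSpaceUsage -Cluster $cluster | Format-List *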
Transient space for policy changes
Two cases where storage policy changes can temporarily consume more capacity:
- When a new policy that requires a change in component number and/or layout is assigned to a VM
- When an existing storage policy that is assigned to one or more VMs is modified
In both cases, vSAN uses the additional capacity to make the necessary changes to components to comply with the assigned storage policy. Consider the following example.
A 100GB virtual disk is assigned a storage policy that includes a rule of Failures to Tolerate (FTT) = 1 using RAID-1 mirroring. vSAN creates two full mirrors (“replicas”) of the virtual disk and places them on separate hosts. Each replica consists of one component. There is also a Witness component created, but Witness components are very small—typically around 2MB. The two replicas for the 100GB virtual disk objects consume up to 200GB of raw capacity. A new storage policy is created: FTT=1 using RAID-5/6 erasure coding. The new policy is assigned to that same 100GB virtual disk. vSAN copies the mirrored components to a new set distributed in a RAID-5 erasure coding configuration. Data integrity and availability are maintained as the mirrored components continue to serve reads and writes while the new RAID-5 set is built.
This naturally consumes additional raw capacity as the new components are built. Once the new components are built, I/O is transferred to them and the mirrored components are deleted. The new RAID-5 components consume up to 133GB of raw capacity. This means all components for this object could consume up to 333GB of raw capacity before the resynchronization is complete and the RAID-1 mirrored components are deleted. After the RAID-1 components are deleted, the capacity consumed by these components is automatically freed for other use. Note that in an HCI Mesh environment, if the VMs experience resynchronizations due to a storage policy change or compliance activity, this temporary space used for the transient activity will occur on the datastore where the VM objects reside. In other words, for a VM running in a client vSAN cluster, the resynchronization activity and capacity adjustments will occur in the server cluster.
FIGURE 4-1: Illustrating the temporary use of free space as an object’s storage policy is changed
As you can imagine, performing this storage policy change on multiple VMs concurrently could cause a considerable amount of additional raw capacity to be consumed. Likewise, if a storage policy assigned to many VMs is modified, more capacity could be needed to make the necessary changes. This is one more reason to maintain sufficient free space in a vSAN cluster. Especially if changes occur frequently or impact multiple VMs at the same time.
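One practical way to limit this transient capacity usage is to apply a new policy to a small number of VMs at a time rather than editing a policy shared by many VMs. The sketch below is illustrative only; the VM and policy names are placeholders.

# Illustrative PowerCLI sketch: assign a new storage policy to one VM at a time to limit
# the transient capacity consumed by the resulting resynchronization.
$policy = Get-SpbmStoragePolicy -Name "FTT1-RAID5"
$vm = Get-VM -Name "app-vm-01"
# Apply the policy to the VM home object and to each of its virtual disks.
$vm | Get-SpbmEntityConfiguration | Set-SpbmEntityConfiguration -StoragePolicy $policy
$vm | Get-HardDisk | Get-SpbmEntityConfiguration | Set-SpbmEntityConfiguration -StoragePolicy $policy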
vSAN rebalancing
When one or more storage devices are more than 80% used, vSAN automatically initiates a reactive rebalance of the data across vSAN storage devices to bring it below 80%. This rebalancing generates additional I/O in the vSAN cluster. Maintaining the appropriate amount of free space minimizes the need for rebalancing while accommodating temporary fluctuations in use due to the activities mentioned above.
Summary
Running with a level of free space is not a new concept in infrastructure design. For all versions of vSAN up to and including vSAN 7, VMware recommends that disk capacity maintain 25–30% free space to avoid excessive rebuild and rebalance operations. For all clusters running vSAN 7 U1 and later, VMware recommends using the vSAN sizer to accurately calculate the required capacity needed for transient operations and host failures.
Maintaining Sufficient Space for Host Failures
vSAN needs free space for operations such as host maintenance mode data evacuation, component rebuilds, rebalancing operations, and VM snapshots. Activities such as rebuilds and rebalancing can temporarily consume additional raw capacity.
The ability to restore an object to its desired level of compliance for protection is a primary vSAN duty. When an object is reported as absent (e.g., disk or host failure), the object remains available but not in a redundant state. vSAN identifies components that go absent and begins a repair process to satisfy the original protection policy. Having enough free space is important for rebuilding failed hosts and devices.
The requirements and operational guidance for free space fall into two categories.
- All vSAN versions prior to vSAN 7 U1. This free space required for these transient operations was referred to as "slack space." The limitations of vSAN in versions prior to vSAN 7 U1 meant that there was a generalized recommendation of free space as a percentage of the cluster (25-30%), regardless of the cluster size. See the post “Revisiting vSAN’s Free Capacity Recommendations” (in versions prior to vSAN 7 U1) for a more detailed understanding of slack space.
- All vSAN versions including and after vSAN 7 U1. The free space required for these transient operations is now referred to as "Reserved Capacity." This is composed of two elements: "Operations Reserve" and "Host Rebuild Reserve." (The term "slack space" is no longer applicable for vSAN 7 U1 and later.) Significant improvements were included in vSAN 7 U1 that reduce the capacity necessary for efficient vSAN operations. See the post "Effective Capacity Management with vSAN 7 U1" for a more detailed understanding of "Operations Reserve."
With the "Reserved Capacity" function introduced in vSAN 7 U1, the "Host Rebuild Reserve" is responsible for ensuring the appropriate amount of N+1 free capacity should a sustained host failure occur. Unlike previous editions of vSAN, the Host Rebuild Reserve is proportional to the size of the vSAN cluster: larger vSAN clusters require proportionally less host rebuild reserve than smaller clusters. The vSAN Sizer will calculate this value for new clusters, and enabling the feature (found in the "Configure" screen of the vSAN cluster capacity view) will show the actual host rebuild reserve capacity needed for an existing cluster.
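As a rough illustration of this proportional relationship, the sketch below assumes the host rebuild reserve approximates the raw capacity share contributed by a single host in a uniform cluster. The actual value calculated by vSAN and the vSAN Sizer accounts for additional variables, so treat this only as a way to visualize why larger clusters reserve a smaller percentage.
# Rough illustration only: in a uniform cluster, the N+1 host rebuild reserve
# approximates the raw capacity share of one host, so the reserved percentage
# shrinks as the cluster grows. vSAN's own calculation uses more variables.

def host_rebuild_reserve_pct(host_count):
    return 100.0 / host_count

for hosts in (4, 8, 16, 32):
    print(f"{hosts} hosts: ~{host_rebuild_reserve_pct(hosts):.1f}% of raw capacity reserved")
# 4 hosts: ~25.0%, 8 hosts: ~12.5%, 16 hosts: ~6.3%, 32 hosts: ~3.1%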
Recommendation: The "Reserved Capacity" functionality is an optional toggle that is not enabled by default, either for new clusters or for existing clusters that have been upgraded. To ensure sufficient free capacity to meet your requirements, it is recommended to turn it on if your vSAN topology and configuration support it.
FIGURE 4-2: Illustrating how free space is critical for repairs, rebuilds, and other types of resynchronization traffic
Summary
Running with a level of free space is not a new concept in infrastructure design. For all versions of vSAN up to and including vSAN 7, VMware recommends maintaining 25–30% free capacity to avoid excessive rebuild and rebalance operations. For all clusters running vSAN 7 U1 and later, VMware recommends using the vSAN Sizer to accurately calculate the capacity required for transient operations and host failures.
Automatic Rebalancing in a vSAN Cluster
vSAN 6.7 U3 introduced a new method for automatically rebalancing data in a vSAN cluster. Some customers have found it curious that this feature is disabled by default in 6.7 U3 as well as vSAN 7. Should it be enabled in a VCF or vSAN environment, and if so, why is it disabled by default? Let's explore what this feature is, how it works, and learn if it should be enabled.
Rebalancing in vSAN Explained
The nature of a distributed storage system means that data will be spread across participating nodes. vSAN manages all of this for you. Its cluster-level object manager is responsible not only for the initial placement of data, but also for ongoing adjustments to ensure that the data continues to adhere to the prescribed storage policy. Data can become imbalanced for many reasons: storage policy changes, host or disk group evacuations, adding hosts, object repairs, or overall data growth.
vSAN's built-in logic is designed to take a conservative approach when it comes to rebalancing. It wants to avoid moving data unnecessarily, which would consume resources during the resynchronization process and may result in no material improvement. Similar to DRS in vSphere, the goal of vSAN's rebalancing is not to strive for perfect symmetry of capacity or load across hosts, but to adjust data placement to reduce the potential for resource contention. Balanced data placement results in better performance because it reduces the chance that contention for resources will degrade performance.
vSAN offers two basic forms of rebalancing:
- Reactive Rebalancing. This occurs when vSAN detects any storage device that is near or at 80% capacity utilization and will attempt to move some of the data to other devices that fall below this threshold. A more appropriate name for this might be "Capacity Constrained Rebalancing." This feature has always been an automated, non-adjustable capability.
- Proactive Rebalancing. This occurs when vSAN detects that a storage device is consuming a disproportionate amount of its capacity in comparison to other devices. By default, vSAN looks for any device that shows a delta of 30% or greater capacity usage than any other device. A more suitable name for this might be "Capacity Symmetry Rebalancing." Prior to vSAN 6.7 U3, this feature was a manual operation, but it has since been introduced as an automated, adjustable capability.
Rebalancing activity only applies to the discrete devices (or disk groups) in question, and not the entire cluster. In other words, if vSAN detects a condition that is above the described thresholds, it will move the minimum amount of data from those disks or disk groups to achieve the desired result. It does not arbitrarily shuffle all of the data across the cluster. Both forms of rebalancing are based entirely on capacity usage conditions, not the load or activity of the devices.
The described data movement by vSAN will never violate the storage policies prescribed to the objects. vSAN's cluster-level object manager handles all of this so that you don't have to.
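The two trigger conditions described above can be summarized in a short sketch. The 80% reactive threshold and the 30% default variance are the values described in this section; the function names and the representation of device utilization as simple fractions are illustrative assumptions.
# Illustrative sketch of the two default rebalancing triggers described above.
# Device utilization is expressed as a fraction of capacity used (0.0 - 1.0).

REACTIVE_THRESHOLD = 0.80   # per-device "capacity constrained" trigger
PROACTIVE_DELTA = 0.30      # default "Rebalancing Threshold %" variance

def needs_reactive_rebalance(device_utilizations):
    return any(u >= REACTIVE_THRESHOLD for u in device_utilizations)

def needs_proactive_rebalance(device_utilizations):
    return (max(device_utilizations) - min(device_utilizations)) >= PROACTIVE_DELTA

devices = [0.35, 0.42, 0.71, 0.38]
print(needs_reactive_rebalance(devices))   # False: no device at or above 80%
print(needs_proactive_rebalance(devices))  # True: 71% - 35% = 36% variance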
Manual Versus Automated Operations
Before vSAN 6.7 U3, Proactive Rebalancing was a manual operation. If vSAN detected a large variance, it would trigger a health alert condition in the UI, which would then present a "Rebalance Disks" button to remediate the condition. If clicked, a rebalance task would occur at an arbitrary time within the next 24 hours. Earlier editions of vSAN didn't have the proper controls in place to provide this as an automated feature. Clicking the "Rebalance Disks" button left some users uncertain if and when anything would occur. With the advancement of a new scheduler and Adaptive Resync introduced in 6.7, as well as all-new logic introduced in 6.7 U3 to calculate resynchronization completion times, VMware changed this feature to be an automated process.
The toggle for enabling or disabling this cluster-level feature can be found in vCenter, under Configure > vSAN > Services > Advanced options > "Automatic Rebalance" as shown in Figure 4-3.
FIGURE 4-3: Configuring "Automatic Rebalance" in the "Advanced Options" of the cluster
Recommendation: Keep the "Rebalancing Threshold %" entry to the default value of 30. Decreasing this value could increase the amount of resynchronization traffic and cause unnecessary rebalancing for no functional benefit.
The "vSAN Disk Balance" health check was also changed to accommodate this new capability. If vSAN detects an imbalance that meets or exceeds a threshold while automatic rebalance is disabled, it will provide the ability to enable the automatic rebalancing, as shown in Figure 4-4. The less-sophisticated manual rebalance operation is no longer available.
FIGURE 4-4: Remediating the health check condition when Automatic Rebalancing is disabled.
Once the Automatic Rebalance feature is enabled, the health check alarm for this balancing will no longer trigger, and rebalance activity will occur automatically.
Accommodating All Environments and Conditions
The primary objective of proactive rebalancing was to more evenly distribute the data across the discrete devices to achieve a balanced distribution of resources, and thus, improved performance. Whether the cluster is small or large, automatic rebalancing through the described hypervisor enhancements addresses the need for the balance of capacity devices in a scalable, sustainable way.
Other approaches are fraught with challenges that could easily cause the very issue that a user is trying to avoid. For example, implementing a time window for rebalancing tasks would assume that the associated resyncs would always impact performance – which is untrue. It would also assume the scheduled window would always be long enough to accommodate the resyncs, which would be difficult to guarantee. This type of approach would delay resyncs unnecessarily through artificial constraints, increase operational complexity, and potentially decrease performance.
Should Automatic Rebalancing Be Enabled?
Yes, it is recommended to enable the automatic rebalancing feature on your vSAN clusters. When the feature was added in 6.7 U3, VMware wanted to introduce the capability slowly to customer environments, and it remains disabled by default in vSAN 7. With the optimizations made to the scheduler and resynchronizations in recent editions, the feature will likely end up enabled by default at some point. There may be a few rare cases in which one might want to temporarily disable automatic rebalancing on the cluster. Adding a large number of additional hosts to an existing cluster in a short amount of time might be one of those possibilities, as well as perhaps nested lab environments that are used for basic testing. In most cases, automatic rebalancing should be enabled.
Viewing Rebalancing Activity
The design of vSAN's rebalancing logic emphasizes a minimal amount of data movement to achieve the desired result. How often do resynchronizations resulting from rebalancing occur in your environment? The answer can be easily found in the disk group performance metrics of the host. Rebalance activity will show up under the "rebalance read" and "rebalance write" metrics. An administrator can easily view the VM performance during this time to determine if there was any impact on guest VM latency. Thanks to Adaptive Resync, even under the worst of circumstances, the impact on the VM will be minimal. In production environments, you may find that proactive rebalancing does not occur very often.
Summary
The automatic rebalancing feature, found in VCF environments powered by vSAN 6.7 U3 and later, is a powerful way to ensure optimal performance through the proper balance of resources and can be enabled without hesitation.
Managing Orphaned Datastore Objects
vSAN is an object-based datastore. The objects typically represent entities such as virtual machines, the performance history database, iSCSI objects, persistent volumes, and vSphere Replication data. An object may inadvertently lose its association with a valid entity and become orphaned. Objects in this state are termed orphaned or unassociated objects. While orphaned objects do not critically impact the environment, they contribute to unaccounted capacity and skew reporting.
Common causes for orphaned objects include, but are not limited to:
- Objects that were created manually instead of using vCenter or an ESXi host
- Improper deletion of a virtual machine, such as deleting files through a command-line interface (CLI)
- Using the vSAN datastore to store non-standard entities such as ISO images
- Managing files directly through the vSAN datastore browser
- Residual objects caused by incorrect snapshot consolidation or removal by third-party utilities
Identification and Validation
Unassociated objects can be identified through command-line utilities such as the Ruby vSphere Console (RVC) and the Go-based vSphere CLI (GOVC). RVC is embedded as part of the vCenter Server Appliance (vCSA). GOVC is a single static binary that is available on GitHub and can be installed across different OS platforms.
Here are the steps to identify the specific objects:
RVC
Command Syntax: vsan.obj_status_report -t <pathToCluster>
Sample Command and Output:
>vsan.obj_status_report /localhost/vSAN-DC/computers/vSAN-Cluster/ -t
2020-03-19 06:05:29 +0000: Querying all VMs on vSAN .
Histogram of component health for possibly orphaned objects
+-------------------------------------+------------------------------+
|Num Healthy Comps / Total Num Comps | Num objects with such status |
+-------------------------------------+------------------------------+
+-------------------------------------+------------------------------+
Total orphans: 0
GOVC
Command Syntax: govc datastore.vsan.dom.ls -ds <datastorename> -l -o
Sample Command: govc datastore.vsan.dom.ls -ds vsanDatastore -l -o
<Command does not return an output if no unassociated objects are found>
Additional reference for this task can be found in KB 70726.
Recommendation: Contact VMware Technical Support to help validate and delete unassociated objects. Incorrect detection and deletion of unassociated objects may lead to loss of data.
Summary
Multiple reasons can cause objects to become unassociated from a valid entity. The existence of unassociated objects does not critically affect the production workloads. However, these objects could gradually consume significant capacity leading to operational issues. Command-line utilities help identify such objects and, to a certain extent, also help in understanding the root cause. While the CLI utilities also enable the deletion of unassociated objects, it is recommended to contact VMware Technical Support to assist with the process.
Capacity Management Guidance
Managing capacity in a distributed system like vSAN is a little different than that of a three-tier architecture. vSAN uses a set of discrete devices across hosts and presents it as a single vSAN datastore. The vCenter Server UI also abstracts the complexities of where the data is placed and presents capacity utilization as a single datastore, which simplifies the capacity management experience of vSAN.
vSAN uses free space for transient activities such as accommodating data placement changes and the rebalancing or repair of data. Free capacity is also used in the event of a sustained host failure, where the data residing on the failed host must be reconstructed somewhere else in the cluster.
With versions prior to vSAN 7 U1, properly managing the recommended level of free capacity meant the administrator had to pay close attention to the effective capacity consumed in a cluster through the vCenter UI, or some other monitoring and management mechanism (vRealize Operations, vCenter alerts, etc.). This was heavily reliant on good administrative discipline.
vSAN 7 U1 introduces new tools to help you safely manage and sustain services as the demand for resources grows. The new "Reserved Capacity" capacity management feature provides a dynamic calculation of the estimated free capacity required for transient operations and for the host rebuild reserve, and adjusts the UI to reflect these thresholds. It also allows vSAN to employ safeguards and health checks to help prevent the cluster from exceeding critical capacity conditions.
The amount of capacity that the UI allocates for the host rebuild reserve and operations reserve is a complex formula based on several variables and conditions. vSAN makes this calculation for you. However, if you would like to understand "what if" scenarios for new cluster configurations, use the VMware vSAN Sizer tool, which includes all of the calculations used by vSAN for sizing new environments, and estimating the required amount of free capacity necessary for the operations reserve, and host rebuild reserve.
Note that in vSAN 7 U1, the capacity reserves are disabled by default. This is to accommodate topologies that the feature does not support at this time, such as stretched clusters and clusters using explicit fault domains. It also allows for a soft introduction into existing environments that have been upgraded to vSAN 7 U1 or later.
FIGURE 4-5: Accommodating for Reserved Capacity in vSAN 7 U1 and later
If the Reserved Capacity feature is enabled in an environment and one wishes to enable an unsupported topology or feature (e.g., explicit fault domains), the vCenter Server UI may hide the option to enable the given feature. Keep this in mind if a feature in the UI is mysteriously missing.
Recommendation: In most cases, especially in on-premises environments, it is recommended to enable both the operations reserve and the host rebuild reserve. Some service provider environments may choose to use only the operations reserve toggle, as they may have different SLAs and operational procedures for host outage situations.
The thresholds that the Reserved Capacity feature activates are designed to be restrictive but accommodating. The thresholds will enforce some operational changes but allow critical activities to continue. For example, when the reserved capacity limits are met, health alerts will trigger to indicate the status, and provisioning of new VMs, virtual disks, clones, snapshots, etc., will not be allowed while the threshold is exceeded. I/O activity for existing VMs will continue without issue.
If an environment is using cluster-based deduplication and compression, or the compression-only service, vSAN will calculate the free capacity requirements based on the effective savings ratios in that cluster.
Capacity consumption is usually associated with bytes of data stored versus bytes of data remaining available. There are other capacity limits that may inhibit the full utilization of available storage capacity. vSAN has a soft limit of no more than 200 VMs per host, and a hard limit of no more than 9,000 object components per host. Certain topology and workload combinations, such as servers with high levels of compute and storage capacity that run low-capacity VMs, may run into these other capacity limits. Sizing for these capacity considerations should be part of a design and sizing exercise. See the topic "Monitoring and Management of vSAN Object Components" for more details.
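As a simple way to reason about these limits during a design exercise, the sketch below divides planned VM and component counts across the hosts in a cluster and flags the per-host limits noted above. The function and its example values are hypothetical; actual sizing should be done with the vSAN Sizer.
# Hypothetical sketch using the per-host limits noted above: a soft limit of
# 200 VMs per host and a hard limit of 9,000 object components per host.

VM_SOFT_LIMIT_PER_HOST = 200
COMPONENT_HARD_LIMIT_PER_HOST = 9000

def check_cluster_limits(host_count, total_vms, total_components):
    vms_per_host = total_vms / host_count
    components_per_host = total_components / host_count
    if vms_per_host > VM_SOFT_LIMIT_PER_HOST:
        print(f"Warning: ~{vms_per_host:.0f} VMs per host exceeds the soft limit")
    if components_per_host > COMPONENT_HARD_LIMIT_PER_HOST:
        print(f"Warning: ~{components_per_host:.0f} components per host exceeds the hard limit")

# Example: dense hosts running many small VMs
check_cluster_limits(host_count=4, total_vms=900, total_components=30000)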
Summary
The Reserved Capacity feature of vSAN makes ensuring that sufficient free space is available for transient activities and host failures a much easier task than in previous versions. Unless your topology dictates otherwise, the use of the new safeguarding feature is highly recommended.
Section 5: Storage Policy Operations
Operational Approaches of Using SPBM in an Environment
The flexibility of SPBM allows administrators to easily manage their data center in an outcome-oriented manner. The administrator determines the various storage requirements for the VM, and assigns them as rules inside a policy. vSAN takes care of the rest, ensuring compliance of the policy.
FIGURE 5-1: Multiple rules apply to a single policy, and a single policy applies to a group of VMs, a single VM, or a single VMDK
This form of management is quite different from what is commonly found with traditional storage. This level of flexibility introduces the ability to prescriptively address changing needs for applications. These new capabilities should be part of how IT meets the needs of the applications and the owners that request them.
When using SPBM for vSAN, the following guidance will help operationalize this new management technique in the best way possible.
- Don’t hesitate to be prescriptive with storage policies if needed (primarily OSA). If an SQL server—or perhaps just the virtual machine disk (VMDK) of the SQL server—serving transaction logs needs higher protection, create and assign a storage policy for this need. The storage policy model exists for this very reason. There is little need to use this approach with the ESA. See the post "RAID-5/6 with the Performance of RAID-1 using the vSAN ESA" for more information.
- Refrain from unnecessary complexity. Take an “as needed” approach for storage policy rules. Storage policy rules such as limits for input/output operations per second (IOPS) allow you to apply limits to a wide variety of systems quickly and easily, but may restrict performance unnecessarily. See “Using Workload Limiting Policies” in this section for more information.
- Be mindful of the physical capabilities of your hosts and network in determining what policy settings should be used as a default starting point for VMs in a cluster. The capabilities of the hosts and network play a significant part in vSAN’s performance. In an on-premises environment where hardware specifications may be modest, a more performance-focused RAID-1-based policy might make sense. In a VMware Cloud (VMC) on Amazon Web Services (AWS) environment, where host and network specifications are top-tier but capacity comes at a premium, it might make more sense to have a RAID-5-based policy to take advantage of its space efficiency.
- Be mindful of the physical capabilities of your cluster. Storage policies allow you to define various levels of protection and space efficiency for VMs. Some higher levels of resilience and space efficiency may require more hosts than are in the cluster. Review the cluster capabilities before assigning a storage policy that may not be achievable due to a limited number of hosts.
- Monitor system behavior before and after storage policy changes. With the vSAN performance service, you can easily monitor VM performance before and after a storage policy change to see if it meets the requirements of the application owners. This is how to quantify how much of a performance impact may occur on a VM. See the section “Monitoring vSAN Performance” for more information.
Recommendation: Do not change the vSAN policy known as the “default storage policy.” It represents the default policy for all vSAN clusters managed by that vCenter server. If the default policy specifies a higher layer of protection, smaller clusters may not be able to comply.
Storage policies can always be adjusted without interruption to the VM. Some storage policy changes will initiate resynchronization to adjust the data to adhere to the new policy settings. See the topic “Storage Policy Practices to Minimize Resynchronization Activities” for more information.
Storage policies are not additive. You cannot apply multiple policies to one object. Remember that a single storage policy is a collection of storage policy rules applied to a group of VMs, a single VM, or even a single VMDK.
Recommendation: Use some form of a naming convention for your storage policies. A single vCenter server houses storage policies for all clusters that it manages. As the usefulness of storage policies grows in an organization, naming conventions can help reduce potential confusion. See the topic “Managing a Large Number of Storage Policies.”
Summary
Become familiar with using vSAN storage policies in an environment so administration teams can use storage policies with confidence. Implement some of the recommended practices outlined here and in other storage policy related topics for a more efficient, predictable outcome for changes made to an infrastructure and the VMs it powers.
Creating a vSAN Storage Policy
SPBM from VMware enables precise control of storage services. Like other storage solutions, vSAN provides services such as availability levels, capacity consumption, and stripe widths for performance.
Each VM deployed to a vSAN datastore is assigned at least one storage policy that defines VM storage requirements, such as performance and availability. If you do not assign a storage policy when provisioning a VM, vSAN assigns the Default Storage Policy. This policy has a level of FTT set to 1, a single disk stripe per object, and a thin-provisioned virtual disk.
FIGURE 5-2: Setting policy rules within a vSAN storage policy
A detailed list of all possible vSAN storage policy rules can be found in VMware Docs.
When you know the storage requirements of your VMs, you can create a storage policy referencing capabilities the datastore advertises. Create several policies to capture different types or classes of requirements. When determining the use of RAID-1 versus RAID-5/6, consider the following:
- RAID-1 mirroring requires fewer I/O operations to the storage devices, so it can provide better performance. For example, a cluster resynchronization takes less time to complete with RAID-1. It is, however, a full mirror copy of the object, meaning it requires twice the size of the virtual disk (see the capacity sketch after this list).
- RAID-5 or RAID-6 erasure coding can provide the same level of data protection as RAID-1 mirroring while using less storage capacity.
- RAID-5 or RAID-6 erasure coding does not support an FTT = 3.
- Consider these guidelines when configuring RAID-5 or RAID-6 erasure coding in a vSAN cluster.
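To illustrate the capacity tradeoff referenced in the list above, the following sketch compares the raw capacity a single VMDK consumes under common vSAN data placement schemes. The multipliers assume the standard OSA layouts (RAID-5 as 3+1, RAID-6 as 4+2) and ignore witness components and any cluster-level space efficiency services.
# Illustrative comparison of raw capacity consumed by a single VMDK under
# common vSAN data placement schemes (OSA layouts assumed; witnesses ignored).

RAW_MULTIPLIERS = {
    "RAID-1, FTT=1": 2.0,        # two full copies
    "RAID-1, FTT=2": 3.0,        # three full copies
    "RAID-5, FTT=1": 4.0 / 3.0,  # 3 data + 1 parity
    "RAID-6, FTT=2": 1.5,        # 4 data + 2 parity
}

def raw_capacity_gb(vmdk_size_gb):
    return {scheme: vmdk_size_gb * m for scheme, m in RAW_MULTIPLIERS.items()}

for scheme, raw in raw_capacity_gb(100).items():
    print(f"{scheme}: ~{raw:.0f}GB raw for a 100GB VMDK")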
Summary
Before creating VM storage policies, it is important to understand how capabilities affect the consumption of storage in the vSAN cluster. Find more information about designing and sizing of storage policies on core.vmware.com.
Managing a Large Number of Storage Policies
The flexibility of SPBM allows administrators to easily manage their data center in an outcome-oriented manner. The administrator determines the storage requirements for the VM, assigns them as rules in a policy, and lets vSAN ensure compliance of the policy.
Depending on the need, an environment may require a few storage policies, or dozens. Before deciding what works best for your organization, let’s review a few characteristics of storage policies with SPBM.
- A maximum of 1,024 SPBM policies can exist per vCenter server.
- A storage policy is stored and managed per vCenter server but can be applied in one or more clusters.
- A storage policy can define one or many rules (around performance, availability, and space efficiency, for example).
- Storage policies are not additive. Apply only one policy (with one or more rules) per object.
- A storage policy can be applied to a group of VMs, a single VM, or even a single VMDK.
- A storage policy name can consist of up to 80 characters.
- A storage policy name is not the true identifier. Storage policies use a unique identifier for system management.
With a high level of flexibility, users are often faced with the decision of how best to name policies and apply them to their environments.
Storage policy naming considerations
Policy names are most effective when they include two descriptors: intention and scope.
- The intention is what the policy aims to achieve. Perhaps the intention is to apply high-performing mirroring using RAID-1, with an increased level of protection by using an FTT level of 2.
- The scope is where the policy will be applied. Maybe the scope is a server farm hosting the company ERP solution, or perhaps it is just the respective VMDKs holding databases in a specific cluster.
Let’s examine the policies in FIGURE 5-3.
FIGURE 5-3: A listing of storage policies managed by vCenter
- CL01-R1-FTT1: CL01 (Cluster 1) R1 (RAID-1 Mirror) FTT=1 (Failures to Tolerate = 1)
- CL01-R1-FTT2-SW6: CL01 (Cluster 1) R1 (RAID-1 Mirror) FTT=2 (Failures to Tolerate = 2) SW6 (Stripe Width = 6)
- CL01-R5-FTT2-SW6: CL01 (Cluster 1) R5 (RAID-5 Erasure Coding) FTT=2 (Failures to Tolerate = 2) SW6 (Stripe Width = 6)
Recommendation: Avoid using and changing the default vSAN storage policy. If a RAID-1 FTT=1 policy is desired, simply clone the default storage policy. Create and clone storage policies as needed.
Determine the realistic needs of the organization to find the best storage policy naming conventions for an environment. A few questions to ask yourself:
- What is the size of the environment?
- Are there multiple clusters? How many?
- Are there stretched clusters?
- Is the preference to indicate actual performance/protection settings within names, or to adopt a gold/silver/bronze approach?
- Are application-specific storage policies needed?
- Are VMDK-specific storage policies needed?
- What type of delimiter works best (spaces, hyphens, periods)? What works best in conjunction with scripting?
- Are there specific departments or business units that need representation in a storage policy name?
- Who is the intended audience? Virtualization administrators? Application owners? Automation teams? This can impact the level of detail you provide in a policy name.
The answers to these questions will help determine how to name storage policies, and the level of sophistication used.
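As a purely illustrative example of the intention-and-scope convention described above, a small helper like the one below can keep generated names consistent. The function, its parameters, and the delimiter are assumptions to adapt to your own standard.
# Purely illustrative: building policy names from a cluster scope, data
# placement scheme, failures to tolerate, and an optional stripe width.

def policy_name(cluster, raid, ftt, stripe_width=None):
    name = f"{cluster}-{raid}-FTT{ftt}"
    if stripe_width:
        name += f"-SW{stripe_width}"
    return name

print(policy_name("CL01", "R1", 1))      # CL01-R1-FTT1
print(policy_name("CL01", "R5", 2, 6))   # CL01-R5-FTT2-SW6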
Managing Policies when using the vSAN ESA
For the most part, using storage policies in the ESA is the same as in the OSA. However, there is typically less of a need to create as many storage policies, simply because the ESA is able to deliver maximum levels of space efficiency without compromising performance. In vSAN 8 U1, when using the ESA, an optional "Auto-Policy Management" capability is available. This feature allows vSAN to determine the optimal storage policy settings for a vSAN cluster, and it creates and sets this new policy as the cluster's default storage policy. For more information, see the post "Auto-Policy Management Capabilities with the ESA in vSAN 8 U1."
Summary
An administrator has tremendous flexibility in determining what policies are applied, where they are applied, and how they are named. Having an approach to naming conventions for policies that drive the infrastructure will allow you to make changes to your environment with confidence.
Storage Policy Practices to Improve Resynchronization Management in vSAN
SPBM allows administrators to change the desired requirements of VMs at any time without interrupting the VM. This is extremely powerful and allows IT to accommodate change more quickly.
Some vSAN storage policy rules change how data is placed across a vSAN datastore. This change in data placement temporarily creates resynchronization traffic so that the data complies with the new or adjusted storage policy. Storage policy rules that influence data placement include:
- Site disaster tolerance (any changes to the options below)
- None—Standard cluster
- None—Standard cluster with nested fault domains
- Dual site mirroring (stretched cluster)
- None—Keep data on preferred (stretched cluster)
- None—Keep data on non-preferred (stretched cluster)
- None—Stretched cluster
- FTT (any changes to the options below)
- No data redundancy
- 1 failure—RAID-1 (mirroring)
- 1 failure—RAID-5 (erasure coding)
- 2 failures—RAID-1 (mirroring)
- 2 failures—RAID-6 (erasure coding)
- 3 failures—RAID-1 (mirroring)
- Number of disk stripes per object
This means that if a VM’s storage policy is changed, or a VM is assigned a new storage policy with one of the rules above different from the current policy rules used, it may generate resynchronization traffic so that the data can comply with the new policy definition. When a large number of objects have their storage policy adjusted, the selection order is arbitrary and cannot be controlled by the end user.
As noted above, the type of policy rule change will be the determining factor in whether a resynchronization occurs. Below are some examples of storage policy changes and whether or not they impart a resynchronization effort on the system. Operationally, there is nothing else to be aware of other than ensuring that you have sufficient capacity and fault domains to support the desired storage policy settings.
Existing Storage Policy Rule | New Storage Policy Rule | Resynchronization?
RAID-1 | RAID-1 with increased FTT | Yes |
RAID-1 | RAID-1 with decreased FTT | No |
RAID-1 | RAID-5/6 | Yes |
RAID-5/6 | RAID-1 | Yes |
RAID-5 | RAID-6 | Yes |
RAID-6 | RAID-5 | Yes |
RAID-5 with stripe width=1 | RAID-5 with stripe width=2-4 | No |
RAID-5 with stripe width=4 | RAID-5 with stripe width=5 or greater | Yes |
RAID-6 with stripe width=1 | RAID-6 with stripe width=2-6 | No |
RAID-6 with stripe width=6 | RAID-6 with stripe width=7 or greater | Yes |
Checksum enabled | Checksum disabled | No |
Checksum disabled | Checksum enabled | Yes |
Object space reservations (OSR) = 0 | OSR > 0 | Possible* |
Object space reservations (OSR) >0 | OSR=0 | No |
*OSR may not always initiate a resynchronization when the value is increased, but it may, depending on the fullness of the storage devices. An OSR is a preemptive reservation that may require vSAN to adjust object placement to accommodate the newly assigned reserve.
Other storage policy rule changes such as read cache reservations do not impart any resynchronization activities. For more information on stripe width changes to RAID-5/6 erasure coding, see the blog post "Stripe Width Improvements in vSAN 7 U1." For topologies that are able to use a secondary level of resilience (vSAN stretched clusters and as of vSAN 7 U3, 2-node clusters), the rules noted above will also apply to any assigned secondary level of resilience.
Recommendation: Use the VMs view in vCenter to view storage policy compliance. When a VM is assigned a new policy, or has its existing policy changed, vCenter will report it as “noncompliant” during the period it is resynchronizing. This is expected behavior.
Recommendations for policy changes for VMs
Since resynchronizations can be triggered by adjustments to existing storage policies, or by applying a new storage policy, the following are recommended.
- Avoid changing an existing policy, unless the administrator is very aware of what VMs it affects. Remember that a storage policy is a construct of a vCenter server, so that storage policy may be used by other VMs in other clusters. See the topic “Using Storage Policies in Environments with More Than One vSAN Cluster” for more information.
- If there are host failures, or any other condition that may have generated resynchronization traffic, refrain from changing storage policies at that time.
Visibility of resynchronization activity can be found in vCenter or Aria Operations. vCenter presents it in the form of resynchronization IOPS, throughput, and latency, and does so per disk group for each host. This can offer a precise level of detail but does not provide an overall view of resynchronization activity across the vSAN cluster.
FIGURE 5-4: Resynchronization IOPS, throughput, and latency of a disk group in vCenter, courtesy of the vSAN performance service
Aria Operations can offer a unique, cluster-wide view of resynchronization activity by showing a burn down rate—or, rather, the amount of resynchronization activity (by data and by object count) that remains to be completed. This is an extremely powerful view to better understand the magnitude of resynchronization events occurring.
FIGURE 5-5: Resynchronization burn down rates across the entire vSAN cluster using an Aria Operations dashboard
For more information on the capabilities of Aria Operations to show the total amount of resynchronization activity in the cluster, see the blog post “Better Visibility with New vSAN Metrics in Aria ops 7.0.”
Recommendation: Do not attempt to throttle resynchronizations using the manual slider bar provided in the vCenter UI found in older editions of vSAN. This is a feature that predates Adaptive Resync and should only be used under the advisement of GSS in selected corner cases. In vSAN 7 U1 and later, this manual slider bar has been removed, as Adaptive Resync offers a much greater level of granularity, prioritization, and control of resynchronizations.
Summary
Resynchronizations are a natural result of applying new storage policies or changing an existing storage policy to one or more VMs. While vSAN manages much of this for the administrator, the recommendations above provide better operational understanding in how to best manage policy changes.
Using Workload Limiting Policies (IOPS Limits) on vSAN-Powered Workloads
The IOPS limits storage policy rule found in vSAN is a simple and flexible way to limit the amount of resources that a VMDK can use. IOPS limits can be applied to a few select VMs, or applied broadly to VMs in a cluster. While easy to enable, there are specific considerations in how performance metrics will be rendered when IOPS-limit rules are enforced.
Note that for VMs running in a vSAN environment, IOPS limits are enforced exclusively through storage policies. VMDK-specific IOPS limits through Storage I/O Control (SIOC) have no effect.
Understanding how IOPS limits are enforced
The rationale behind capping one or more VMDKs within a VM with an artificial IOPS limit is simple. Since the busiest VMs aren’t always the most important, IOPS limits can curtail a “noisy neighbor” consuming disproportionate resources. This can free these resources and help ensure more predictable performance across the cluster.
Measuring and throttling I/O payload using just the IOPS metric has its challenges. I/O sizes can vary dramatically, typically ranging from 4KB to 1MB in size. This means that one I/O could be 256 times the size of another, with one taking much more effort to process. When enforcing IOPS limits, vSAN uses a weighted measurement of I/O.
When applying an IOPS-limit rule to an object within vSAN, the vSAN I/O scheduler “normalizes” the size in 32KB increments. This means that an I/O under 32KB is seen as one I/O, an I/O under 64KB is seen as two, and so on. This provides a better-weighted representation of various I/O sizes in the data stream and is the same normalization increment used when imposing limits for VMs running on non-vSAN-based storage (SIOC v1).
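The weighting described above can be expressed in a short sketch. The exact boundary handling inside vSAN's scheduler is not documented here, so the use of a ceiling function is an assumption; the intent is only to show how larger I/Os count for more against an IOPS limit.
import math

NORMALIZATION_KB = 32  # normalization increment described above

def normalized_ios(io_size_kb):
    # An I/O up to 32KB counts as one normalized I/O, up to 64KB as two, etc.
    # (Exact boundary handling in vSAN is an assumption here.)
    return max(1, math.ceil(io_size_kb / NORMALIZATION_KB))

for size_kb in (4, 32, 60, 256, 1024):
    print(f"A {size_kb}KB I/O counts as {normalized_ios(size_kb)} normalized I/O(s)")
# A 1MB (1024KB) I/O counts as 32 normalized I/Os against an IOPS limit.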
Note that vSAN uses its own scheduler for all I/O processing and control, and thus does not use SIOC for any I/O control. For vSAN-powered VMs, normalized IOPS can be viewed adjacent to vSCSI IOPS at the VMDK level, as shown in FIGURE 5-6. When workloads use large I/O sizes, the normalized IOPS metric may be significantly higher than the IOPS observed at the vSCSI layer.
FIGURE 5-6: Viewing normalized IOPS versus vSCSI IOPS on a VMDK
This normalization measurement occurs just as I/Os are entering the top layer of the vSAN storage stack from the vSCSI layer. Because of this, I/Os coming from or going to the vSAN caching layer, the capacity tier, or the client cache on the host are accounted for in the same way. Enforcement of IOPS limits only applies to I/Os from guest VM activity. Traffic as the result of resynchronization and cloning is not subject to the IOPS-limit rule. Reads and writes are accounted for in an equal manner, which is why they are combined into a single normalized IOPS metric as shown in FIGURE 5-6.
When IOPS limits are applied to an object using a storage policy rule, there is no change in behavior if demand does not meet or exceed the limit defined. When the number of I/Os exceeds the defined threshold, vSAN enforces the rule by delaying the I/Os so the rate does not exceed the established threshold. Under these circumstances, the time to wait for completion (latency) of an I/O is longer.
Viewing enforced IOPS limits using the vSAN performance service
When a VM exceeds an applied IOPS-limit policy rule, any period that the IOPS limit is being enforced shows up as increased levels of latency on the guest VMDK. This is expected behavior. Figure 5-7 demonstrates the change in IOPS, and the associated latency under three conditions:
- No IOPS-limits rule
- IOPS limit of 200 enforced
- IOPS limit of 400 enforced
FIGURE 5-7: Observing enforced IOPS limits on a single VMDK, and the associated vSCSI latency
Note that, in FIGURE 5-7, the latency introduced reflects the degree to which IOPS need to be suppressed to achieve the limit. Suppressing the workload less results in lower latency. For this workload, suppressing the maximum IOPS to 200 introduces two to three times the amount of latency when compared to capping the IOPS to 400.
Latency introduced by IOPS limits shows up elsewhere. Observed latencies increase at the VM level, the host level, the cluster level, and even with applications like VMware Aria Operations. This is important to consider, especially if the primary motivation for using IOPS limits was to reduce latency for other VMs. When rendering latency, the vSAN performance service does not distinguish whether latency came from contention in the storage stack or latency from enforcement of IOPS limits. This is consistent with other forms of limit-based flow control mechanisms.
IOPS limits applied to some VMs can affect VMs that do not use the storage policy rule. FIGURE 5-8 shows a VM with no IOPS limits applied, yet its overall I/O was reduced during the same period as the VM shown in FIGURE 5-7. How does this happen? In this case, the VM shown in FIGURE 5-8 is copying files to and from the VM shown in FIGURE 5-7. Since it is interacting with a VM using IOPS limits, it is being constrained by that VM. Note here that unlike the VM shown in FIGURE 5-7, the VM shown in FIGURE 5-8 does not have any significant increase in vSCSI latency because the reduction in I/O is forced by the other VM in this interaction, and not by a policy applied to this VM.
FIGURE 5-8: A VM not using an IOPS-limit rule being affected by a VM using an IOPS-limit rule
It is easy to see how IOPS limits could have secondary impacts on multi-tiered applications or systems that regularly interact with each other. This reduction in performance could easily go undetected, as latency would not be the leading indicator of a performance issue.
Note that in vSAN 7 U1, latency as a result of enforced IOPS limits can be easily identified in the UI. The graphs will now show a highlighted yellow region for the time periods in which latency is a result of IOPS enforcement from a storage policy rule applied to the VM. The yellow region is shown only for the VM(s) that use a storage policy with the associated IOPS limit. Multi-tiered applications, or other applications that interact with a throttled VM, will not show this highlighted region if they use a different storage policy but are in fact affected by the storage policy rule of the primary VM.
Recommendation: Avoid using the IOPS-limit rule simply because of its ease of use. Use it prescriptively, conservatively, and test the results on the impact of the VM and any dependent VMs. Artificially imposing IOPS limits can introduce secondary impacts that may be difficult to monitor and troubleshoot.
Summary
IOPS limits can be applied across some or all VMs in a vSAN cluster as a way to cap resources and allow for growth in performance at a later date. However, VM objects controlled by IOPS-limit policy rules enforce the limit by introducing latency to the VM so that it does not exceed the limit. This can be misleading to a user viewing performance metrics who is unaware that IOPS limits may be in use. It is recommended to fully understand the implications of an enforced IOPS-limit storage policy rule on the VMs, and weigh that against allowing VMs to complete tasks more quickly at the cost of temporarily using a higher level of IOPS.
Much of this information was originally posted under “Performance Metrics When Using IOPS Limits with vSAN—What You Need to Know” and is included here to assist in the effort of operationalizing vSAN. For more information on understanding and troubleshooting performance in vSAN, see the white paper, “Troubleshooting vSAN Performance,” on core.vmware.com.
Using Space-Efficient Storage Policies (Erasure Coding) with Clusters Running DD&C (OSA)
Two types of space efficiency techniques are offered for all-flash vSAN clusters running the OSA. Both types can be used together or individually and have their own unique traits. Understanding the differences between behavior and operations helps administrators determine what settings may be the most appropriate for an environment.
Note that the following does NOT apply to the vSAN ESA, as its compression and erasure coding do not negatively impact performance.
DD&C and the "Compression-only" feature are opportunistic space efficiency features enabled as a service at the cluster level. The amount of savings is based on the type of data and the physical makeup of the cluster. vSAN automatically looks for opportunities to deduplicate and compress the data per disk group as it is destaged from the write buffer to the capacity tier of the disk group. DD&C and the "Compression-only" option could be best summarized by the following:
- Offers an easy “set it and forget it” option for additional space savings across the cluster
- Small bursts of I/O do not see an increase in latency in the guest VM
- No guaranteed level of space savings
- Comes at the cost of additional processing effort to destage the data
Recommendation: Since DD&C and the "Compression-only" feature are cluster-based services, make the decision for their use per cluster. They may be suitable for some environments and not others.
RAID-5 and RAID-6 erasure coding is a data-placement technique that stripes the data with parity across a series of nodes. This offers a guaranteed level of space efficiency while maintaining resilience when compared to simplistic RAID-1 mirroring. Unlike DD&C, RAID-5/6 can be assigned to a group of VMs, a single VM, or even a single VMDK through a storage policy. RAID-5/6 could be best summarized by the following:
- Guaranteed level of space savings
- Prescriptive assignment of space efficiency where it is needed most
- I/O amplification for all writes, impacting latency
- May strain network more than RAID-1 mirroring
Impacts when using the features together
The information below outlines considerations to be mindful of when determining the tradeoffs of using both space efficiency techniques together in a vSAN cluster using the OSA. The following is not applicable to the ESA.
- Reduced levels of advertised deduplication ratios. It is not uncommon to see vSAN’s advertised DD&C ratio reduced when combining RAID-5/6 with DD&C, versus combining RAID-1 with DD&C. This is because data placed in a RAID-5/6 stripe is inherently more space efficient, translating to fewer potential duplicate blocks. Note that while one might find the advertised DD&C ratios reduced when using RAID-5/6 erasure coding, many times the effective overall space saved may increase even more (a simple arithmetic illustration follows this list). This is described in detail in the “Analyzing Capacity Utilization with Aria Operations” section of “VMware Aria Operations and Log Insight in vSAN Environments.” This is one reason the DD&C ratio alone should not be used to understand space efficiency savings.
- Hardware specification changes the amount of impact. Whether the space efficiency options are used together or in isolation, each method can place additional demand on hardware resources. This is described in more detail below.
- Workload changes the amount of impact. For any storage system, write operations are more resource intensive to process than read operations. The difference between running no space efficiency techniques and running both space efficiency techniques is highly dependent on how write intensive the workloads are. Sustained writes can be a performance challenge for any storage system. Space efficiency techniques add to this challenge, and the effect compounds when they are used in combination. Infrastructure elements most affected by the features are noted below.
- DD&C. This process occurs during the destaging process from the buffer device to the capacity tier. vSAN consumes more CPU resources during the destaging process, slowing the destaging or drain rate. Slowing the rate for sustained write workloads effectively increases the buffer fill rate. When the buffer reaches its various fullness thresholds, it slows the rate of write acknowledgments sent back to the VM, increasing latency. Large buffers, multiple disk groups, and faster capacity devices in the disk groups can help reduce the level of impact.
- RAID-5/6 erasure coding. vSAN uses more CPU and network resources during the initial write from the guest VM to the buffer, as the I/O amplification increases significantly. The dependence on sufficient network performance increases when compared to RAID-1, as vSAN must wait for the completion of the writes to more nodes across the network prior to sending the write acknowledgment back to the guest. Physical network throughput and latency become critical. Fast storage devices for write buffers (NVMe-based), multiple disk groups, and high-quality switchgear offering higher throughput (25Gb or higher) and lower-latency networking can help alleviate this. vSAN 7 U2 and vSAN 7 U3 introduce some important performance improvements for VMs using RAID-5/6 erasure codes. For more information, see the post: "RAID-5/6 Erasure Coding Enhancements in vSAN 7 U2."
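The first point in this list can be illustrated with simple arithmetic. The deduplication ratios below are hypothetical; the sketch only shows how a lower advertised DD&C ratio combined with RAID-5 can still consume less raw capacity overall than a higher ratio combined with RAID-1.
# Hypothetical arithmetic: a lower advertised DD&C ratio with RAID-5 can still
# consume less raw capacity than a higher advertised ratio with RAID-1.

def raw_consumed_gb(data_gb, raid_multiplier, ddc_ratio):
    return (data_gb * raid_multiplier) / ddc_ratio

data_gb = 1000
raid1 = raw_consumed_gb(data_gb, 2.0, ddc_ratio=1.8)        # higher advertised ratio
raid5 = raw_consumed_gb(data_gb, 4.0 / 3.0, ddc_ratio=1.4)  # lower advertised ratio

print(f"RAID-1 with a 1.8x DD&C ratio: ~{raid1:.0f}GB raw")  # ~1111GB
print(f"RAID-5 with a 1.4x DD&C ratio: ~{raid5:.0f}GB raw")  # ~952GB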
Since the underlying hardware can influence the degree of performance impact, determining the use of RAID-1 mirroring versus RAID-5/6 erasure coding should be evaluated case by case. For instance, clusters running a large number of VMs using RAID-5/6 may see a much less significant impact on performance using 25Gb networking than the same VMs running with 10Gb networking. Clusters using disk groups composed of extremely fast storage devices can benefit all VMs, but in particular, VMs using RAID-1 mirroring. See the topic “Operational Approaches of Using SPBM in an Environment” for more information.
Recommendation: Observe the guest read/write I/O latency as seen by the VM after a VM or VMDK has had its storage policy changed to or from RAID-5/6. This provides a good “before vs. after” comparison to see if the physical hardware can meet the performance requirements. Be sure to observe latencies when there is actual I/O activity occurring. Latency measurements during periods of no I/O activity are not meaningful.
vSAN 7 U1 introduced a new cluster-based "Compression-only" space efficiency feature. This offers a level of space efficiency that is suitable for a wider variety of workloads, and minimizes performance impact. In most cases, the best starting point for a cluster configuration is to enable the "Compression-only" service as opposed to Deduplication and Compression, since the former will have minimal performance impact on a cluster. More information on this feature, and whether it is right for your environment, can be found in the blog post: "Space Efficiency Using the New "Compression only" Option in vSAN 7 U1"
vSAN 7 U2 improved the performance capabilities when VMs use a storage policy that is set to RAID-5/6 erasure coding. The improvement will be especially beneficial to workloads that issue bursts or consistent levels of sequential writes, or writes using large I/O sizes. If you tested the use of RAID-5/6 with a workload using a version prior to vSAN 7 U2, retesting its behavior in your environment after the upgrade may be a prudent step. For more information on this enhancement, see the blog post: "RAID-5/6 Erasure Coding Enhancements in vSAN 7 U2"
vSAN 7 U3 continues this improvement of RAID-5/6 erasure coding through the concept known as "strided writes." This is an opportunistic performance improvement, where, if the incoming I/O meets certain requirements, vSAN can significantly reduce the I/O amplification of writes that occurs when updating data fragments across a stripe. It reduces the serialization of I/O requests, which makes I/O updates complete faster.
Recommendation: Revisit the use of RAID-5/6 erasure coding if you have recently moved from a much older version of vSAN to vSAN 7 U3 or newer.
Testing
One or both space efficiency techniques can be tested with no interruption in uptime. Each possesses a different level of burden on the system to change.
- DD&C or "Compression-only" can be toggled off, but to do so, vSAN must perform a rolling evacuation of data from each disk group on a host to reformat the disk group. This is generally why it is recommended to decide on the use of DD&C or "Compression-only" prior to the deployment of a cluster.
- RAID-5/6 erasure coding can be changed to a RAID-1 mirror by assigning a new policy to the VMs using the erasure coding scheme. This will, however, create resynchronization traffic for the changed systems and use free space to achieve the change. See the topic “Storage Policy Practices to Improve Resynchronization Management in vSAN” for more information.
Recommendation: Any time you go from using space efficiency techniques to not using them, make sure there is sufficient free space in the cluster. Avoiding full-cluster scenarios is an important part of vSAN management.
Summary
The cluster-level feature of DD&C, as well as RAID-5/6 erasure coding applied using storage policies, offers flexibility for the administrator. For clusters running the OSA, the decision process for determining the use of DD&C or "Compression-only" should be done per cluster, and the decision process for RAID-5/6 should be done per VM or VMDK.
Using Number of Disk Stripes Per Object on vSAN-Powered Workloads
The number of disk stripes per object storage policy rule aims to improve the performance of vSAN by distributing object data across more capacity devices. Sometimes referred to as “stripe width,” it breaks the object components into smaller chunks on the capacity devices so I/O can occur at a higher level of parallelism. When it should be used, and to what degree it improves performance, depend on a number of factors.
Note that the number of disk stripes per object storage policy rule is largely unnecessary when using the ESA. For more information, see the post: "Stripe Width Storage Policy Rule in the vSAN ESA."
FIGURE 5-9: The “number of disk stripes per object” policy rule with an object using a RAID-1 mirror, and how it will impact object component placement
How many devices the object is spread across depends on the value given in the policy rule. A valid number is between 1 and 12. When an object component uses a stripe width of 1, it resides on at least one capacity device. When an object component uses a stripe width of 2, it is split into two components, residing on at least two devices. When using stripe width with RAID-5/6 erasure codes, the behavior will depend on the version of vSAN used.
- In versions prior to vSAN 7 U1, the stripe itself did not contribute to the stripe width. A RAID-5 object with a stripe width of 1 would have a total of 4 components spread across 4 hosts. A RAID-5 object with a stripe width of 2 would have a total of 8 components spread across 4 or more hosts. A RAID-5 object with a stripe width of 4 would have a total of 16 components spread across 4 or more hosts.
- In vSAN 7 U1 and later, the stripe itself contributes to the stripe width count. A RAID-5 object with a stripe width of 4 will have 4 components spread across 4 hosts. A RAID-5 object with a stripe width of 8 would have a total of 8 components spread across 4 or more hosts. For more information, see the post "Stripe Width Storage Policy Improvements in vSAN 7 U1."
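The component counts in the two bullets above can be summarized with a simplified sketch. It reflects only the RAID-5 examples given here, and it ignores the additional splits that occur for components larger than 255GB or for rebalancing; the function names are illustrative.
# Simplified sketch of RAID-5 component counts for a given stripe width,
# based only on the examples above (ignores 255GB component splits, etc.).

RAID5_SEGMENTS = 4  # 3 data + 1 parity

def raid5_components_pre_7u1(stripe_width):
    # Prior to vSAN 7 U1 the erasure coding stripe did not count toward the
    # stripe width, so each of the 4 segments was striped again.
    return RAID5_SEGMENTS * stripe_width

def raid5_components_7u1_and_later(stripe_width):
    # From vSAN 7 U1, the erasure coding stripe counts toward the stripe width.
    return max(RAID5_SEGMENTS, stripe_width)

for sw in (1, 2, 4, 8):
    print(f"SW={sw}: pre-7U1 = {raid5_components_pre_7u1(sw)} components, "
          f"7U1 and later = {raid5_components_7u1_and_later(sw)} components")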
Up until vSAN 7 U1, components of the same stripe would strive to reside in the same disk group. From vSAN 7 U1 forward, components of the same stripe will strive to reside on different disk groups to improve performance.
The storage policy rule simply defines the minimum. vSAN may choose to split the object components even further if the object exceeds 255GB (the maximum size of a component), uses a RAID-5/6 erasure coding policy, or if vSAN needs to split the components for rebalancing objectives. New to vSAN 7 U1, there are also limitations on stripe width settings for objects beyond 2TB in size. The implemented maximum is 3 for the portion of an object greater than 2TB, meaning that the first 2TB will be subject to the stripe width defined, with the rest of the object using a stripe width of 3.
Setting the stripe width can improve reads and writes but in different ways. Performance improves if the following conditions exist:
- Writes: Writes committed to the write buffer are constrained by the physical capabilities of the devices to receive data from the buffer quickly enough. The slow rate of destaging eventually leads to full write buffers, increasing latency seen by the guest VMs.
- Reads: Read requests of uncached data (“cache misses”) coming directly from capacity devices—whether spinning disk on a hybrid system or flash devices on an all-flash vSAN cluster. All-flash clusters read data directly from the capacity devices unless the data still resides in the buffer. Hybrid systems have a dedicated allocation on the cache device for reads, but fetches data directly from the capacity devices if uncached. This increases latency seen by the guest VMs.
The degree of improvement associated with the stripe width value depends heavily on the underlying infrastructure, the application in use, and the type of workflow. To improve the performance of writes, vSAN hosts that use disk groups with a large performance delta (e.g., NVMe buffer and SATA SSD for capacity, or SAS buffer and spinning disk for capacity) see the most potential for improvement, while systems such as vSAN clusters running NVMe at both the buffer and the capacity tier would likely not see any improvement.
Depending on the constraints of the environment, the most improvement may come from increasing the stripe width from 1 to 2–4. Stripe width values beyond that generally offer diminishing returns and increase data placement challenges for vSAN. Note that stripe width increase improves performance only if it addresses the constraining element of the storage system.
Recommendation: Keep all storage policies to a default number of disk stripes per object of 1. To experiment with stripe width settings, create a new policy and apply it to the discrete workload to evaluate the results, increasing the stripe width incrementally by 1 unit, then view the results. Note that changing the stripe width will rebuild the components, causing resynchronization traffic.
The impact on placement flexibility from increasing stripe width
Increasing the stripe width may make object component placement decisions more challenging. Storage policies define the levels of FTT, which can be set from 0 to 3. When FTT is greater than 0, the redundant data must be placed on different hosts to maintain redundancy in the event of a failure—a type of anti-affinity rule. Increasing the stripe width means the object components are forced onto another device in the same disk group, another device in a different disk group on the same host, or on another host. Spreading the data onto additional hosts can make it more challenging for vSAN to honor both the stripe width rule and the FTT. Keep the stripe width set to 1 to maximize flexibility in vSAN’s data placement options.
When to leave the stripe width setting to a default of 1
- Initially for all cluster types and all conditions.
- Clusters with DD&C enabled. Object data placed on disk groups using DD&C is spread across the capacity devices in the disk group. This is effectively an implied stripe width (even though the UI does not indicate it), so setting the stripe width value has no bearing on how the data is placed on the capacity devices.
- Smaller clusters.
- Hosts using fewer capacity devices.
- All-flash vSAN clusters with no large performance delta between the write buffer and capacity devices that are meeting performance expectations as seen by the guest VM.
- Environments where you want to maximize the data placement options for vSAN.
When to explore the use of increasing stripe width
- Hybrid-based vSAN clusters not meeting performance objectives.
- Clusters with DD&C disabled and not meeting performance objectives.
- Larger clusters not meeting performance objectives.
- Clusters with a larger number of capacity devices that are not meeting performance objectives.
- All-flash vSAN clusters not meeting performance objectives (for reasons such as sustained or streaming writes) and with a large performance delta between the write buffer and capacity devices.
Recommendation: The proper stripe width should not be based on a calculation of variables such as the number of hosts, capacity devices, and disk groups. While those are factors to consider, any stripe width beyond 1 should always be the result of testing against a discrete workload and understanding the tradeoffs in data placement flexibility.
Summary
VMware recommends leaving the “Number of disk stripes per object” storage policy rule to the default value of 1, and not using it at all for clusters using the ESA. While increasing the stripe width may improve performance in very specific conditions, it should only be implemented after testing against discrete workloads.
Operational Impacts of Using Different Types of Storage Policies
Using different types of storage policies across a vSAN cluster is a great example of a simplified but tailored management approach to meet VM requirements, and is highly encouraged. Understanding the operational impacts that different types of storage policies have on other VMs and on the requirements of the cluster is important, and is described in more detail below.
What to look for
VMs using one policy may impact the performance of VMs using another policy (OSA). For example, imagine a 6-node cluster using 10GbE networking and powering 400 VMs, 390 of them using a RAID-1 mirroring policy and 10 of them running a RAID-5 policy. This will only be a consideration in OSA clusters. ESA-based clusters will not have any negative impacts when using RAID-5/6 erasure coding.
At some point, an administrator decides to change 300 of those VMs to the more network-reliant RAID-5 policy. With 310 of the 400 VMs using the more network-intensive RAID-5 policy, the remaining 90 VMs running the less network-intensive RAID-1 policy are more likely to be impacted, as they may run into higher contention on that 10GbE connection. These symptoms and remediation steps are described in detail in the “Troubleshooting vSAN Performance” document under “Adjust storage policy settings on non-targeted VM(s) to reduce I/O amplification across cluster” on core.vmware.com.
The attributes of a storage policy can change the requirements of a cluster. For example, in the same 6-node cluster described above, an evaluation of business requirements has determined that 399 of the 400 VMs can be protected with a level of FTT of 1, and the remaining VM needs an FTT of 3. The cluster host count can easily comply with the minimum host count requirements associated with policies using an FTT=1, but 6 hosts are not sufficient for a policy with an FTT=3. In this case, the cluster would have to be increased to 7 hosts to meet the absolute minimum requirement, or 8 hosts to meet the preferable minimum requirement for VMs using storage policies with an FTT=3. It only takes one object assigned to such a policy to change the requirements of the cluster.
FIGURE 5-10: Minimum host requirements of storage policies (not including N+x sizing)
The above illustration only applies to clusters running the OSA. The vSAN ESA has different requirements for storage policies, and uses a different, adaptive RAID-5 erasure coding technique.
Other less frequently used storage policy rules can also impact data placement options for vSAN. Stripe width is one of those storage policy rules. Using a policy with an assigned stripe width rule of greater than 1 can make object component placement decisions more challenging, especially if it was designed to be used in another vSAN cluster with different physical characteristics. See the topic “Using Number of Disk Stripes Per Object on vSAN-Powered Workloads” for more information.
The attributes of a storage policy can change the effective performance and capacity of the VMs running in the cluster. Changes in the data placement scheme (RAID-1, RAID-5, RAID-6) or the level of Failures to Tolerate (FTT=1, 2, or 3) can have a substantial impact on the effective performance and capacity utilization of the VM. For example, in a stretched cluster, changing a policy that simply protected data across sites to one that introduces a secondary level of resilience at the host level can amplify the number of write operations substantially. See the post: "Performance with vSAN Stretched Clusters" for more details. For 2-node topologies, vSAN 7 U3 introduced the ability to offer a secondary level of resilience (at the disk group level), which has similar considerations to the secondary levels of resilience in stretched clusters. See the post: "Sizing Considerations with 2-Node vSAN Clusters running vSAN 7 U3" for more information.
Summary
The flexibility of SPBM should be exploited in any vSAN cluster. Accommodating for the behaviors described above helps reduce any unexpected operational behaviors and streamlines management of the cluster and the VMs it powers.
Using Storage Policies in Environments with More Than One vSAN Cluster
The flexibility of SPBM allows administrators to easily manage their data center in an outcome-oriented manner. The administrator determines the various storage requirements for the VM and assigns them as rules inside a policy. vSAN takes care of the rest: ensuring compliance with the policy.
While storage policies can be applied at a VM or even VMDK level, they are a construct of a vCenter server. vSAN storage policies are created and saved in vCenter, and a policy can be applied to any vSAN-powered VM or VMDK managed by that vCenter server. Since existing storage policies can be easily changed, the concern is that an administrator may be unaware of the potential impact of changing an existing policy used by VMs across multiple clusters.
Recommendation: If an administrator wants to change the policy rules assigned to a VM or group of VMs, it is best to assign those VMs to an existing storage policy, or create a new policy if necessary, as sketched below. Changing the rules of an existing policy could have unintended consequences across one or more clusters, causing a large amount of resynchronization traffic as well as storage capacity concerns.
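As a hedged illustration of that recommendation, the PowerCLI sketch below reassigns a group of VMs to a different, existing policy rather than editing a policy that other clusters may also consume. The cluster, VM, and policy names are placeholders.

```powershell
# Minimal sketch: move selected VMs (and their disks) to another policy,
# leaving the original shared policy untouched so other clusters are unaffected.
$newPolicy = Get-SpbmStoragePolicy -Name 'ClusterA-R5-FTT1'   # placeholder policy name
$vms = Get-Cluster -Name 'ClusterA' | Get-VM -Name 'App*'     # placeholder names

$vms | Get-SpbmEntityConfiguration |
    Set-SpbmEntityConfiguration -StoragePolicy $newPolicy
$vms | Get-HardDisk | Get-SpbmEntityConfiguration |
    Set-SpbmEntityConfiguration -StoragePolicy $newPolicy
```

Expect resynchronization traffic proportional to the data affected, just as with any other policy change.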
Improving operational practices with storage policies
In many cases, categorizing storage policies into one of three types is an effective way to manage VMs across larger environments (primarily for OSA-based clusters):
- Storage policies intended for all vSAN clusters. These might include simple, generic policies that could be used as an interim policy for the initial deployment of a VM.
- Storage policies intended for a specific group of vSAN clusters. Storage policies related to VDI clusters, for example, or perhaps numerous branch offices that have very similar needs. Other clusters may have distinctly different intentions and should use their own storage policies.
- Storage policies intended for a single cluster. These policies might be specially crafted for specific applications within a cluster—or tailored to the design and configuration of a specific cluster. This approach aligns well with the guidance found in the topic “Using Storage Policies in Environments with Both Stretched and Non-Stretched Clusters.” Since a stretched cluster is a cluster-based configuration, storage policies intended for standard vSAN clusters may not work with vSAN stretched clusters.
A blend of these offers the most flexibility while minimizing the number of storage policies created, simplifying ongoing operations.
This approach places additional emphasis on storage policy naming conventions. Applying some form of taxonomy to the storage policy names helps reduce potential issues where operational changes were made without an administrator being aware of the impact. Beginning the policy name with an identifying prefix is one way to address this issue.
Recommendation: When naming storage policies, find the best balance of descriptive, self-documenting storage policies, while not becoming too verbose or complex. This may take a little experimentation to determine what works best for your organization. See the topic, “Managing a Large Number of Storage Policies” for more information.
An example of using storage policies more effectively in a multi-cluster environment can be found in the illustration below. FIGURE 5-11 shows a mix of storage policies that fall under the three categories described earlier.
FIGURE 5-11: A mixture of shared and dedicated storage policies managed by a single vCenter server
These are only examples to demonstrate how storage policies can be applied across a single cluster, or several clusters, in a vSAN-powered environment. The topology and business requirements determine what approach makes the most sense for an organization.
The vSAN ESA no longer has negative performance impacts for many of its storage policy rules, and thus can allow for a simpler storage policy management strategy. The ESA in vSAN 8 U1 introduces an "Auto-Policy Management" feature that creates and manages a cluster-specific storage policy tuned specifically for the traits of the cluster. For more information, see the post: "Auto-Policy Management Capabilities with the ESA in vSAN 8 U1."
Summary
For vSAN-powered environments consisting of more than one cluster, using a blend of storage policies that apply to all clusters as well as specific clusters provides the most flexibility for your environment while improving operational simplicity.
Section 6: Host and EMM Operations
When to Use Each of the Three Potential Options for EMM
All hosts in a vSAN cluster contribute to a single shared vSAN datastore for that specific cluster. If a host goes offline due to any planned or unplanned process, the overall storage capacity for the cluster is reduced. From the perspective of storage capacity, placing the host in maintenance mode is equivalent to its being offline. During the decommissioning period, the storage devices of the host in maintenance mode won’t be part of the vSAN cluster capacity.
Maintenance mode is mainly used when performing upgrades, patching, hardware maintenance such as replacing a drive, adding or replacing memory, or updating firmware. For network maintenance that has a significant level of disruption in connectivity to the vSAN cluster and other parts of the infrastructure, a cluster shutdown procedure may be most appropriate. Rebooting a host is another reason to use maintenance mode. For even a simple host restart, it is recommended to place the host in maintenance mode.
Placing a given host in maintenance mode impacts the overall storage capacity of the vSAN cluster. Here are some prerequisites that should be considered before placing a host in maintenance (decommission) mode:
- It is always better to decommission one host at a time.
- Maintain sufficient free space for operations such as VM snapshots, component rebuilds, and maintenance mode.
- Verify the vSAN health condition of each host.
- View information about the number of objects that are currently being synchronized in the cluster, the estimated time to finish the resynchronization, the time remaining for the storage objects to fully comply with the assigned storage policy, and so on.
- Think about changing the settings of the vSAN repair timer if the maintenance is going to take longer than 60 minutes (see the sketch following this list).
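For the repair timer item above, the PowerCLI sketch below shows one way to review and temporarily raise the delay before vSAN begins rebuilding absent components. It assumes the classic per-host advanced setting VSAN.ClomRepairDelay; newer versions of vSAN also expose an "Object repair timer" at the cluster level in the UI, which is the preferred place to change it. The cluster name and values are illustrative.

```powershell
# Minimal sketch: raise the repair delay from the default 60 minutes to 120 minutes
# for a long maintenance window, then set it back to 60 afterward.
$cluster = Get-Cluster -Name 'vSAN-Cluster01'   # placeholder name

foreach ($esx in (Get-VMHost -Location $cluster)) {
    Get-AdvancedSetting -Entity $esx -Name 'VSAN.ClomRepairDelay' |
        Set-AdvancedSetting -Value 120 -Confirm:$false
    # Note: some older vSAN releases require a clomd service restart on the host
    # for the new value to take effect.
}
```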
A pre-check simulation is performed on the data that resides on the host so that vSAN can communicate to the user the type of impact the EMM will have, all without moving any data. A host-level pre-check simulation was introduced in vSAN 6.7 U1, and a disk group-level pre-check simulation was introduced in vSAN 7 U1. The latter aims to provide the same level of intelligence for decommissioning a disk group as for decommissioning a host.
If the pre-check results show that a host can be seamlessly placed in maintenance mode, decide on the type of data migration. Take into account the storage policies that have been applied within the cluster. Some migration options might result in a reduced level of availability for some objects. Let’s look at the three potential options for data migration:
Full data migration—Evacuate all components to other hosts in the cluster.
This option maintains compliance with the FTT number but requires more time as all data is migrated from the host going into maintenance mode. It usually takes longer for a host to enter maintenance mode with Full data migration versus Ensure accessibility. Though this option assures the absolute availability of the objects within the cluster, it causes a heavy load of data transfer. This might cause additional latency if the environment is already busy. When it is recommended to use Full data migration:
- If maintenance is going to take longer than the rebuild timer value.
- If the host is going to be permanently decommissioned.
- If you want to maintain the FTT method during the maintenance.
Ensure accessibility (default option)—Instructs vSAN to migrate just enough data to ensure every object is accessible after the host goes into maintenance mode.
vSAN searches only for components of objects using RAID-0 (FTT=0) and moves or regenerates them on a host other than the one entering maintenance mode. All other objects, using RAID-1 or higher, should already have at least one copy residing on a different host within the cluster. Once the host returns to operation, the data components left on the host in maintenance mode are updated with the changes that were applied to the components on the hosts that remained available. Keep in mind the level of availability might be reduced for objects that have components on the host in maintenance mode.
FIGURE 6-1: Understanding the “Ensure accessibility” option when entering a host into maintenance mode
When it is recommended to use Ensure accessibility:
- This maintenance mode is intended to be used for software upgrades or node reboots. Ensure accessibility provides the opportunity to avoid a needless Full data migration, since the host will return to operation within a short time frame. It is the most versatile of all EMM options.
No data migration—No data is migrated when this option is selected.
A host will typically enter maintenance mode quickly with this option, but there is a risk if any of the objects have a storage policy assigned with PFTT=0. As seen in FIGURE 6-2, both components will be inaccessible. When it is recommended to use No data migration:
- This option is appropriate when certain network changes need to be made. In that specific case, all nodes in the cluster should be placed in maintenance mode with the “No data migration” option selected.
- This option is best for short amounts of planned downtime where all objects are assigned a policy with PFTT=1 or higher, or where downtime of objects with PFTT=0 is acceptable.
Our recommendation is to always build a cluster with at least one host more than the minimum number required (N+1). This configuration allows vSAN to self-heal in the event of a host failure or a host entering maintenance mode.
FIGURE 6-2: Required and recommended hosts in a cluster (OSA) when selecting the desired level of failure to tolerate
There is no need to keep a host in maintenance mode in perpetuity to achieve an N+1 or hot-spare objective. vSAN's distributed architecture already achieves this. To ensure a cluster is properly sized for N+1 or greater failures, use the vSAN Sizer, and follow the recommendations in the vSAN Design Guide. Two of the most helpful recommendations that help achieve this result include:
- Enabling the "Operations Reserve" and "Host Rebuild Reserve" options in the vSAN Capacity Management UI to ensure that sufficient free capacity is available in the event of a sustained host failure.
- Ensure that the host count of the cluster is one more than the minimum required by the storage policy used in the cluster that requires the most hosts.
Summary
Placing a host in maintenance mode is a best practice when there is a need to perform upgrades, patching, hardware maintenance such as replacing a drive, adding or replacing memory, firmware updates, or network maintenance. There are a few pre-checks to be made before placing a host in maintenance mode, because the storage capacity within the vSAN cluster will be reduced once the host is out of operation. The type of data migration should be selected with the storage policies applied within the cluster in mind, to assure data resilience.
Enter a Host into Maintenance Mode in a Standard vSAN Cluster
Since each vSAN host in a cluster contributes to the cluster storage capacity, entering a host into maintenance mode takes on an additional set of tasks when compared to a traditional architecture. For this reason, vSAN administrators are presented with three host maintenance mode options (a PowerCLI sketch follows the list and figure below):
- Full data migration—Evacuate all of the components to other hosts in the cluster.
- Ensure accessibility—Evacuate enough components to ensure that VMs can continue to run, though they may become noncompliant with their respective storage policies.
- No data migration—Evacuate no components from this host.
FIGURE 6-3: The vSAN data migration options when entering a host into maintenance mode
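These options map directly to the -VsanDataMigrationMode parameter in PowerCLI. The sketch below is a minimal example of entering and exiting maintenance mode with an explicit migration mode; the host name is a placeholder.

```powershell
# Minimal sketch: enter maintenance mode using the "Ensure accessibility" behavior,
# which is appropriate for short maintenance such as patching or a reboot.
$esx = Get-VMHost -Name 'esx01.example.com'   # placeholder host name

Set-VMHost -VMHost $esx -State Maintenance -VsanDataMigrationMode EnsureAccessibility
# Other accepted values are Full and NoDataMigration, matching the options above.

# When the work is complete, return the host to service.
Set-VMHost -VMHost $esx -State Connected
```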
EMM pre-check simulation
vSAN maintenance mode performs a full simulation of data movement to determine whether the enter maintenance mode (EMM) action will succeed or fail before it even starts. This will prevent unnecessary data movement, and provide a result more quickly to the administrator.
Canceling maintenance mode
The latest version of vSAN improves the ability to cancel all operations related to a previous EMM event. In previous versions of vSAN, customers who started an EMM, then canceled it and started it again on another host, could introduce unnecessary resynchronization traffic. Previous vSAN versions would stop the management task, but not necessarily stop the queued resynchronization activities. Now, when the cancel operation is initiated, active resynchronizations will likely continue, but all resynchronizations related to that event that are pending in the queue will be canceled.
Note that standard vSAN clusters using the explicit Fault Domains feature may require different operational practices for EMM. In particular, clusters without an additional (N+1) fault domain beyond what the storage policies require, and fault domains using a shallow host count (such as 2 hosts per fault domain), fall into this category. These design decisions may increase the operational complexity of EMM practices when cluster capacity utilization is high, which is why they are not recommended. More information can be found in the blog post “Design and Operation Considerations When Using vSAN Fault Domains.”
Summary
When taking a host in a vSAN cluster offline, there are several things to consider, such as how long the host will be offline, and the storage policy rules assigned to the VMs that reside on the host. When entering a host into maintenance mode, the “Ensure accessibility” option should be viewed as the most flexible way to accommodate host updates and restarts, while ensuring that data will remain available—albeit at a potentially reduced level of redundancy.
Enter a Host into Maintenance Mode in a 2-Node Cluster
VMs deployed on vSAN 2-node clusters typically have mirrored data protection, with one copy of data on node 1, a second copy of the data on node 2, and the Witness component placed on the vSAN Witness Host. The vSAN Witness Host can be either a physical ESXi host or a vSAN Witness Appliance.
FIGURE 6-4: A typical topology of vSAN 2-node architecture
In a vSAN 2-node cluster, if a host must enter maintenance mode, there are no other hosts to evacuate data to. As a result, guest VMs are out of compliance and are exposed to potential failure or inaccessibility should an additional failure occur.
Maintenance mode on a data node
- Full data migration. Not available for 2-node vSAN clusters using the default storage policy, as policy compliance requires two hosts for the data components and one for the Witness component.
- Ensure accessibility. The preferred option for two-host or three-host vSAN clusters using the default storage policy. Ensure accessibility guarantees enough components of the vSAN object are available for the object to remain available. Though still accessible, vSAN objects on two- or three-host clusters are no longer policy compliant. When the host is no longer in maintenance mode, objects are rebuilt to ensure policy compliance. During this time, however, vSAN objects are at risk because they become inaccessible if another failure occurs. Any objects that have a non-standard, single-copy storage policy (FTT=0) are moved to an available host in the cluster. If there is insufficient capacity on any alternate hosts in the cluster, the host will not enter maintenance mode.
- No data migration. This is not a recommended option for vSAN clusters. vSAN objects that use the default vSAN storage policy may continue to be accessible, but vSAN does not ensure their accessibility. Any objects that have a non-standard single-copy storage policy (FTT=0) become inaccessible until the host exits maintenance mode.
Maintenance mode on the vSAN Witness Host
Maintenance mode on the vSAN Witness Host is typically an infrequent event. Different considerations should be taken into account, depending on the type of vSAN Witness Host used.
- vSAN Witness Appliance (recommended). No VM workloads may run here. The only purpose of this appliance is to provide quorum for the 2-node vSAN cluster. As of vSAN 7 U1, this witness host may be responsible for more than one 2-node vSAN cluster. Maintenance mode should be brief, typically associated with updates and patching.
- Physical ESXi host. While not typical, a physical host may be used as a vSAN Witness Host. This configuration supports VM workloads. Keep in mind that a vSAN Witness Host may not be a member of any vSphere cluster, and as a result, VMs have to be manually moved to an alternate host for them to continue to be available.
When maintenance mode is performed on the vSAN Witness Host, the Witness components cannot be moved to either site. When the Witness Host is put in maintenance mode, it behaves as the No data migration option would on site hosts. It is recommended to check that all VMs are in compliance and that there is no ongoing failure before performing maintenance on the Witness Host.
Note that prior to vLCM (introduced in vSphere 7), VUM required 2-node clusters to have HA disabled before a cluster remediation and re-enabled after the upgrade. vLCM does not require this step.
Recommendation: Before deploying a vSAN 2-node cluster, be sure to read the vSAN 2-Node Guide on core.vmware.com.
Summary
With a vSAN 2-node cluster, in the event of a node or device failure, a full copy of the VM data is still available on the alternate node. Because the alternate replica and Witness component are still available, the VM remains accessible on the vSAN datastore. If a host must enter maintenance mode, vSAN cannot evacuate data from the host to maintain policy compliance. While the host is in maintenance mode, data is exposed to a potential failure or inaccessibility should an additional failure occur.
Restarting a Host in Maintenance Mode
For typical host restarts with ESXi, most administrators get a feel for roughly how long a host takes to restart, and simply wait for the host to reappear as “connected” in vCenter. This may be one of the many reasons why out-of-band host management isn’t configured, available, or a part of operational practices. However, hosts in a vSAN cluster can take longer to reboot than non-vSAN hosts because they have additional actions to perform during the host reboot process. Many of these additional tasks simply ensure the safety and integrity of data. Incorporating out-of-band console visibility into your operational practices can play an important role for administering a vSAN environment.
Note that restarting hosts in clusters configured for the vSAN ESA will not take as long as hosts residing in a vSAN cluster configured for the OSA. While host restarts are much faster, it is still recommended to have an out-of-band management interface to monitor the boot status of a host.
Looking at the Direct Console User Interface (DCUI) during a host restart reveals a few vSAN-related activities. The most prominent message, and perhaps the one that may take the most time, is “vSAN: Initializing SSD… Please wait…” similar to what is shown in FIGURE 6-5.
FIGURE 6-5: DCUI showing the “Initializing SSD” status
During this step, vSAN is processing data and digesting the log entries in the buffer to generate all required metadata tables (OSA only). More detail on a variety of vSAN initialization activities can be exposed by hitting ALT + F11 or ALT + F12 in the DCUI. For detailed information, read the blog post on monitoring vSAN restarts using DCUI.
Recommendation: Use out-of-band management to view vSphere DCUI during host restarts.
Significant improvements in host restart times were introduced in vSAN 7 U1. See the post "Performance Improvements in vSAN 7 U1" for more information about this enhancement.
Summary
When entering a host into maintenance mode, there are several things to consider, like how long the host will be in maintenance mode and the data placement scheme assigned by the respective storage policies. View the “Ensure accessibility” option as a flexible way to accommodate host updates and restarts. Planned events (such as maintenance mode activities) and unplanned events (such as host outages) may make the effective storage policy condition different than the assigned policy. vSAN constantly monitors this, and when resources become available to fulfill the rules of the policy, it adjusts the data accordingly. Lastly, incorporate DCUI accessibility via remote management into defined maintenance workflows such as host restarts.
Cluster Shutdown and Power-Up
Occasionally a graceful shutdown of a vSAN cluster may need to occur, whether for server relocation or for a sustained power outage where backup power cannot sustain the cluster indefinitely. Since vSAN is a distributed storage system, care must be taken to ensure that the cluster is shut down properly. The guidance offered here depends on the version of vSAN used.
The recommendations below assume that guest VMs in the cluster are shut down gracefully before beginning this process. The order that guest VMs are powered down is dependent on the applications and requirements of a given customer environment and is ultimately the responsibility of the administrator.
vSAN 7 U3 and newer
With vSAN 7 U3 and newer, a guided workflow built right into vCenter Server makes the cluster power-down and power-up process easy, predictable, and repeatable. This feature is a management task of the cluster. It is available in the vCenter Server UI by highlighting a given vSAN cluster and selecting vSAN > Shutdown Cluster.
FIGURE 6-6: The logic of the new cluster shutdown workflow in vSAN 7 U3
Note that in vSAN 8, the Shutdown Cluster workflow was enhanced to provide improved robustness under a variety of conditions, and in vSAN 8 U1, the Shutdown Cluster workflow can be executed using PowerCLI.
The workflow accommodates vSAN clusters that are powering the vCenter Server. The process elects an orchestration host that assists in the cluster shutdown and startup process once the vCenter Server VM is powered off. The selection of the orchestration host is arbitrary, but if the cluster powers a vCenter Server, it will typically elect the host that the vCenter Server VM is associated with.
Powering down the cluster will be orchestrated by this new built-in workflow. A high-level overview of the steps includes:
- Pre-validation health checks (e.g., is HA disabled, are all VMs powered off, etc.). The workflow will be halted if any check does not pass.
- Each host sets a new flag so vSAN's object manager will pause all change control processes (CCPs).
- If vCenter Server resides in the same cluster, vCenter will shut down and management of subsequent tasks will be delegated to the orchestration host.
- All hosts enter maintenance mode using the “no action” option to prevent unnecessary data migration.
- Hosts are shut down.
Powering up the cluster will also be orchestrated by the new built-in workflow. A high-level overview of the steps includes:
- Administrator powers on ALL hosts in the cluster (using OOB management like IPMI, iDRAC, ILO, etc.)
- The orchestration host will set the flag in vSAN's object manager back to its original state to accept CCPs.
- If vCenter Server is within the cluster that was shut down, it will be automatically powered on.
- vCenter Server will perform a health check to verify the power state and alert of any issues.
- The administrator can power on VMs.
The workflow also supports stretched cluster and 2-node topologies, but will not power down the witness host appliance, as this is an entity that resides outside of the cluster, and may also be responsible for duties with other clusters. The feature will also be available when the ESXi host lockdown mode is enabled on the hosts in the cluster.
Some system VMs, such as vCLS, may be automatically managed during the shutdown process, while others may not. Examples of other system-related VMs that will need to be managed manually include:
- File Services. No prechecks or automation workflows are included at this time.
- WCP/Pod VMs. These must be manually shut down.
- NSX management VMs. These must be manually shut down.
Older editions of vSAN
Editions previous to vSAN 7 U3 required a series of steps to gracefully shutdown a vSAN cluster. The specific steps are dependent on the version of vSAN used.
- vSAN 7 through vSAN 7 U2 - Manually Shut Down and Restart a vSAN Cluster
- vSAN 6.7 - Shutting Down and Restarting a vSAN Cluster
- vSAN 6.5 - Shutting Down a vSAN Cluster
The steps described in the links above are very specific and can take time to perform accurately. Updating to vSAN 7 U3 or newer will help simplify this effort.
Recommendation: Regardless of the version of vSAN used, become familiar with the shutdown cluster process by testing it in a lab environment. This will help ensure that your operational procedures are well understood for these scenarios.
Powering up a vSAN cluster
A commonly overlooked step in powering up a vSAN cluster is to ensure all hosts in the cluster are powered on and fully initialized prior to powering on guest VMs. This is different from a vSphere cluster using a traditional three-tier architecture, where a host that was powered on and initialized would not necessarily need to wait for other hosts before VMs could be started. Since vSAN provides its storage resources in a distributed manner, a VM hosted on one host may have its data reside on other hosts, thus the need to ensure that all hosts are ready prior to powering on guest VMs.
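A minimal PowerCLI sketch of this verification step is shown below, assuming a cluster named vSAN-Cluster01; the VM power-on order remains the administrator's responsibility.

```powershell
# Minimal sketch: confirm every host is connected (and out of maintenance mode)
# before powering on any guest VMs.
$cluster = Get-Cluster -Name 'vSAN-Cluster01'        # placeholder name
$hosts   = Get-VMHost -Location $cluster

$notReady = $hosts | Where-Object { $_.ConnectionState -ne 'Connected' }
if ($notReady) {
    Write-Warning "Hosts not yet ready: $($notReady.Name -join ', ')"
} else {
    # Optionally verify vSAN health first (via the health UI, or Test-VsanClusterHealth
    # if available in your PowerCLI version), then power on VMs in the order your
    # applications require.
    Get-VM -Location $cluster |
        Where-Object { $_.PowerState -eq 'PoweredOff' } |
        Start-VM
}
```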
Summary
Powering down and powering up a vSAN cluster is different from doing so with a vSphere cluster using a traditional three-tier architecture. The guidance provided above will help ensure that the power-down and power-up process is reliable and consistent.
Section 7: Guest VM Operations
Configuring and Using TRIM/UNMAP in vSAN
vSAN supports thin provisioning, which lets you use just as much storage capacity as currently needed in the beginning and then add the required amount of storage space at a later time. Using the vSAN thin provisioning feature, you can create virtual disks in a thin format. For a thin virtual disk, ESXi commits only as much storage space as the disk needs for its initial operations. To use vSAN thin provisioning, set the SPBM policy for Object Space Reservation (OSR) to its default of 0.
One challenge with thin provisioning is that VMDKs, once grown, will not shrink when files within the guest OS are deleted. This problem is amplified by the fact that many file systems always direct new writes into free space. A steady set of writes to the same block of a single small file can eventually consume significantly more space at the VMDK level. Previous solutions to this required manual intervention and migration with Storage vMotion to external storage, or powering off a VM. To solve this problem, automated TRIM/UNMAP space reclamation was introduced in vSAN 6.7 U1.
Additional information can be found on the “UNMAP/TRIM space reclamation on vSAN” technote. The post: "The Importance of Space Reclamation for Data Usage Reporting in vSAN" will also be of use in better understanding TRIM/UNMAP functionality.
Planning the process
If implementing this change on a cluster with existing VMs, identify the steps to clean previously non-reclaimed space. In Linux, this can include scheduling fstrim to run on a timer; in Windows, it can include running the disk optimization tools or the Optimize-Volume PowerShell command. Identify any operating systems in use that may not natively support TRIM/UNMAP.
UNMAP commands do not process through the mirror driver. This means that snapshot consolidation will not commit reclamation to the base disk, and commands will not process when a VM is being migrated with VMware vSphere Storage vMotion. To compensate for this, run asynchronous reclamation after the snapshot or migration to reclaim these unused blocks. This may commonly be seen if using VADP-based backup tools that open a snapshot and coordinate log truncation prior to closing the snapshot. One method to clean up before a snapshot is to use the pre-freeze script.
Identify any VMs for which you do not wish to reclaim space. For these VMs, you can set the VMX flag disk.scsiUnmapAllowed to FALSE, as sketched below.
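A minimal PowerCLI sketch of that per-VM opt-out follows; the VM name is a placeholder, and the setting is typically read at VM power-on, so apply it while the VM is powered off or power cycle it afterward.

```powershell
# Minimal sketch: prevent guest space reclamation for one VM by setting the VMX flag.
$vm = Get-VM -Name 'NoReclaimVM01'   # placeholder name
New-AdvancedSetting -Entity $vm -Name 'disk.scsiUnmapAllowed' -Value 'FALSE' -Confirm:$false
```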
Implementation
#StorageMinute: vSAN Space Reclamation
FIGURE 7-1: Viewing the size of a virtual disk within the vSANDatastore view of vCenter
Validation
After making the change, reboot a VM and manually trigger space reclamation. Monitor the backend UNMAP throughput and verify that the total free capacity in the cluster increases.
FIGURE 7-2: Viewing TRIM/UNMAP throughput on the host-level vSAN performance metrics
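To manually trigger reclamation as described above, a command run inside the guest OS is all that is needed. The sketch below assumes a Windows guest; Linux guests can use fstrim for the same purpose.

```powershell
# Minimal sketch (run inside the Windows guest, not via PowerCLI):
# retrim the volume so freed blocks are reported back to vSAN.
Optimize-Volume -DriveLetter C -ReTrim -Verbose
```

Afterward, the backend UNMAP throughput shown in FIGURE 7-2 should rise briefly, and the cluster's free capacity should increase.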
Potential Tuning of Workloads After Migration to vSAN
In production environments, it is not uncommon to tune VMs to improve the efficiency or performance of the guest OS or applications running in the VM. Tuning generally comes in two forms:
- VM tuning—Achieved by adjusting the VM’s virtual hardware settings (OSA and ESA), or vSAN storage policies (OSA).
- OS/application tuning—Achieved by adjusting OS or application-specific settings inside the guest VM.
The following provides details on the tuning options available, and general recommendations in how and when to make adjustments.
VM Tuning
VM tuning is common in traditional three-tier architectures as well as vSAN. Ensuring sufficient but properly sized virtual resources of compute, memory, and storage has always been important. Additionally, vSAN provides the ability to tune storage performance and availability settings per VM or VMDK through the use of storage policies. VM tuning that is non-vSAN-specific includes, but is not limited to:
- Virtual CPU
- Amount of virtual memory
- Virtual disks
- Type and number of virtual SCSI controllers
- Type and number of virtual NICs
FIGURE 7-3: Virtual hardware settings of a VM
Determining the optimal allocation of resources involves monitoring the VM’s performance metrics in vCenter Server, or augmenting this practice with other tools such as VMware Aria Operations to determine if there are any identified optimizations for VMs.
Recommendation: For VMs using more than one VMDK, use multiple virtual SCSI adapters. This provides improved parallelism and can achieve better performance. It also allows one to easily use the much more efficient and better performing Paravirtual SCSI controllers on these additional VMDKs assigned to a VM. See Page 38 of the “Troubleshooting vSAN Performance” document for more information.
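A hedged PowerCLI sketch of that recommendation is below: it adds a new VMDK and places it on its own ParaVirtual SCSI controller. The VM name and disk size are placeholders, and the guest OS needs the PVSCSI driver (included with VMware Tools) before it can use the new controller.

```powershell
# Minimal sketch: add a 200GB disk to the VM and attach it to a new PVSCSI controller.
$vm   = Get-VM -Name 'SQLVM01'                 # placeholder name
$disk = New-HardDisk -VM $vm -CapacityGB 200   # placeholder size

New-ScsiController -HardDisk $disk -Type ParaVirtual
```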
VM tuning through the use of storage policies that are specific to vSAN performance and availability would include:
- Level of Failures to Tolerate (FTT)
- Data placement scheme used (RAID-1 or RAID-5/6). (only applicable in the OSA)
- Number of disk stripes per object (AKA “stripe width”) (only applicable in the OSA)
- IOPS limit for object
FIGURE 7-4: Defining the FTT level in a vSAN storage policy
Note that with the ESA, RAID-5/6 erasure coding is as fast as, if not faster than, RAID-1. For more information, see the post: "RAID-5/6 with the Performance of RAID-1 using the vSAN Express Storage Architecture."
Since all data lives on vSAN as objects, and all objects must have an assigned storage policy, vSAN provides a “vSAN Default Storage Policy” on a vSAN cluster as a way to automatically associate data with a set of rules defining availability and space efficiency. This is set to a default level of FTT=1, using RAID-1 mirroring, and offers the most basic form of resilience for all types of cluster sizes. In practice, an environment should use several storage policies that define different levels of outcomes, and apply them to the VMs as the requirements dictate. This is most important for clusters running the OSA. For ESA clusters, one can enable the optional Auto-Policy Management capability to simplify the process of providing an optimal storage policy setting for the cluster. See “Operational Approaches of Using SPBM in an Environment ” for more details.
Determining the appropriate level of resilience and space efficiency needed for a given workload is important, as these factors can affect results. Setting a higher level of resilience or a more space-efficient data placement method may reduce the level of performance the environment delivers to the VM. This trade-off is from the effects of I/O amplification and other overhead, described more in “Troubleshooting vSAN Performance.”
The recommendations for storage policy settings may be different based on your environment. For an example, let us compare a vSAN cluster in a private cloud versus vSAN running on VMC on AWS.
- Private cloud—The standard hardware specification for your hosts and network in a private cloud may be modest. The specification may not be able to meet performance expectations if one were to use the more space efficient but more taxing RAID-5/6 erasure coding. In those cases, it would be best to use a RAID-1 mirror as the default, and look for opportunities to apply RAID-5/6 case by case.
- VMC on AWS—The standard hardware specification for this environment is high, but consuming large amounts of capacity can be cost-prohibitive. It may make more sense to always start by using VM storage policies that use the more space-efficient RAID-5/6 erasure coding over RAID-1. Then, apply RAID-1 to discrete systems where RAID-5/6 is not meeting the performance requirements. As our cloud providers transition to the ESA, they will also be adjusting their operational behaviors, which would no longer follow the description above.
Other storage policy rules that can impact performance and resilience settings are “Number of disk stripes per object” (otherwise known as “stripe width”) and “IOPS limit for object.” More details on these storage policy rules can be found in the “Storage Policy Operations” section of this document.
OS/application tuning
OS/application tuning is generally performed to help the OS or application optimize its behavior to the existing environment and applications. Often you may find this tuning in deployment guides by an application manufacturer, or in a reference architecture. Note: Sometimes, if the recommendations come from a manufacturer, they may not take a virtualized OS or application into account and may have wildly optimistic recommendations.
For high-performing applications such as SQL Server, ensure the guest VM volumes use proper disk/partition alignment. Some applications such as SQL Server demand a highly efficient storage system to ensure that serialized, transactional updates can be delivered in a fast and efficient manner. Sometimes, due to how a guest OS volume or partition is created, I/O requests will be unaligned, causing unnecessary Read, Modify, Write (RMW) events, increasing I/O activity unnecessarily, and impacting performance. See the post "Enhancing Microsoft SQL Server Performance on vSAN (and VMC on AWS) with SQL Server Trace Flag 1800" for information on how to determine if there is I/O misalignment on your SQL Server VM, and how to correct it. In some circumstances, correcting it can have a dramatic impact on performance. While the link above showcases the issue and benefit on Microsoft SQL Server running on Windows Server, it can occur with other applications.
Recommendation: Avoid over-ambitious OS/application tuning unless explicitly defined by a software manufacturer, or as outlined in a specific reference architecture. Making OS and application adjustments in a non-prescriptive way may add unnecessary complexity and result in undesirable results. If there are optimizations in the OS and application, make the adjustments one at a time and with care. Once the optimizations are made, document their settings for future reference.
Summary
VM tuning, as well as OS/application tuning can sometimes stem from identified bottlenecks. The “Troubleshooting vSAN Performance” document on core.vmware.com provides details on how to isolate the largest contributors to an identified performance issue, and the recommended approach for remediation. This section details specific VM related optimizations that may be suitable for your environment.
Section 8: Data Services
Deduplication and Compression (OSA): Enabling on a New Cluster
When designing a vSAN cluster, it is worth considering from the beginning if you will be using DD&C on the cluster. Enabling DD&C retrospectively is costly from an I/O perspective, as each bit needs to be read from disk, compressed, deduplicated, and written to disk again.
This content only applies to the OSA. The ESA offers compression that is on by default, as a storage policy rule setting.
There are also a few design considerations that come into play when dealing with DD&C in a cluster running the OSA. As an example, DD&C is only supported on all-flash configurations. More details are available here: “Deduplication and Compression Design Considerations.” Additionally, all objects provisioned to vSAN (VMs, disks, etc.) need to be thin provisioned, with the OSR rule of their SPBM policy set to 0%.
vSAN aligns its dedupe domain with a disk group. What this means is duplicate data needs to reside within the same disk group to be deduplicated. There are a number of reasons for this, but of utmost importance to vSAN is data integrity and redundancy. If for some reason the deduplication hash table becomes corrupted, it only affects a single copy of the data. (By default, vSAN data is mirrored across different disk groups, each viewed as its own failure domain.) So, deduplicating data this way means no data loss from hash table corruption. Enabling DD&C is easy and is documented in “Enable Deduplication and Compression on a New vSAN Cluster” as well as in FIGURE 8-1 below. Note that this process has slightly different considerations on an existing cluster:
FIGURE 8-1: Using the Configure section to enable DD&C prior to enabling vSAN
Data is deduplicated and compressed on destage from the cache to the capacity tier. This avoids spending CPU cycles on data that may be transient, short-lived, or otherwise inefficient to dedupe if it were to be deleted in short order. As such, DD&C savings are not immediate, but rather climb over time as data gets destaged to the capacity disks. More information on using DD&C with vSAN can be found here: “Using Deduplication and Compression.”
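For clusters managed with PowerCLI, DD&C can also be enabled at configuration time. The sketch below is a minimal example under the assumption that the -SpaceEfficiencyEnabled parameter of Set-VsanClusterConfiguration is available in your PowerCLI version; verify with Get-Help before use. The cluster name is a placeholder.

```powershell
# Minimal sketch: enable deduplication and compression on an all-flash OSA cluster.
$cluster = Get-Cluster -Name 'vSAN-Cluster01'   # placeholder name

Get-VsanClusterConfiguration -Cluster $cluster |
    Set-VsanClusterConfiguration -SpaceEfficiencyEnabled $true
```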
Introduced in vSAN 7 U1 is the new "Compression only" space efficiency feature. This may be a more suitable fit for your environment and is a good starting point if you wish to employ some levels of space efficiency, but are not sure of the effectiveness with deduplication in your environment. See the post: "Space Efficiency using the New "Compression only" Option in vSAN 7 U1" for more details.
Deduplication and Compression (OSA): Enabling on an Existing Cluster
Generally, it is recommended to enable DD&C on a vSAN cluster from the beginning, before workloads are placed on it. It is possible, however, to enable DD&C retroactively with live workloads on the cluster.
This content only applies to the OSA. The ESA offers compression that is on by default, as a storage policy rule setting.
The reason to enable DD&C from the start is that, on a cluster with live workloads, every data bit has to be read from the capacity tier, compressed, deduplicated against all other bits on the disk group, and then re-written to disk. This causes a lot of storage I/O on the backend that wouldn’t exist if enabled from the start. (Though it is mitigated to a large extent by our I/O scheduler.) If you do decide to enable DD&C on an existing cluster, the process is much the same as for a new cluster. See here for details: “Enable Deduplication and Compression on Existing vSAN Cluster.”
Enabling or turning off data services such as DD&C, Compression Only, and Data-at-Rest Encryption will initiate a rolling reformat, or disk format conversion (DFC), to prepare the disks and disk groups for the respective data service. The formatting accommodates hash tables and other elements related to the respective data service. While it is a transparent process, with live workloads remaining unaffected, it can take some time depending on the specifications of the hosts, network, cluster, and capacity utilized. The primary emphasis should be on monitoring whether the cluster is able to serve the needs of the VMs sufficiently. Ultimately it is best to make these configuration choices up front, prior to deploying the cluster into production.
FIGURE 8-2: Enabling the Deduplication and Compression service in a vSAN cluster
It is also important to note that data is deduplicated and compressed upon destage from the cache to the capacity tier. This avoids spending CPU cycles on data that may be transient, short-lived, or otherwise inefficient to dedupe (if it were to just be deleted in short order, for example). As a result, DD&C savings are not immediate, but climb over time as data gets destaged to the capacity disks.
Recommendation: Unless your set of workloads can take advantage of deduplication, consider using the "Compression only" space efficiency option introduced in vSAN 7 U1 instead of DD&C. The compression-only option is a more efficient and thus higher-performing space efficiency option.
Deduplication and Compression (OSA): Disabling on an Existing Cluster
Disabling DD&C on an existing vSAN cluster requires careful consideration and planning. Disabling this space-saving technique will increase the total capacity used on your cluster by the amount shown in the Capacity UI. Ensure you have adequate space on the vSAN cluster to account for this space increase to avoid full-cluster scenarios. For example, if the vSAN Capacity view tells you the “Capacity needed if disabled” is 7.5TB, at least that amount needs to be available on your cluster. You also want to account for free space to allow for evacuations, resynchronizations, and some room for data growth on the cluster.
This content only applies to the OSA. The ESA offers compression that is on by default, as a storage policy rule setting.
When disabling DD&C on a cluster, another thing to be aware of is the backend storage I/O generated by the operation. Each disk group is, in turn, destroyed and recreated; data is read, rehydrated (i.e., decompressed and reduplicated), and written to the new disk group.
All VMs remain up during this operation. Because of the large amount of storage I/O that results (the higher your DD&C ratio, the more data needs to be written back to the disks), it is advised that this operation is performed during off-hours. Information on disabling DD&C can be found here: “Disable Deduplication and Compression.” For a visual representation of disabling DD&C on an existing cluster, see FIGURE 8-3.
FIGURE 8-3: Enabling or disabling a vSAN service
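Before starting the operation, a quick scripted look at free capacity can confirm there is room for the rehydrated data. The sketch below is a minimal example; the cluster name is a placeholder, and the exact property names returned by Get-VsanSpaceUsage may vary slightly by PowerCLI version.

```powershell
# Minimal sketch: compare current free space against the "Capacity needed if disabled"
# value shown in the vSAN Capacity UI before disabling DD&C.
Get-VsanSpaceUsage -Cluster 'vSAN-Cluster01' |
    Select-Object FreeSpaceGB, CapacityGB
```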
Deduplication and Compression (OSA): Review History of Cluster DD&C Savings
You can review the historical DD&C rates, as well as actual storage consumed and saved, in two ways: one is with Aria Operations, and the other is right within the product itself. In Aria Operations, you will find the bundled “vSAN operations” dashboard. This lists a number of metrics and trends, some of those being DD&C-related. You can view DD&C ratio, DD&C savings, and actual storage consumed as well as trends to predict time until cluster full and other useful metrics. See FIGURE 8-4 for an example.
FIGURE 8-4: Using Aria Operations to view DD&C ratios over a long period of time
If you don’t have Aria Operations, or you simply want to view some of this information in vCenter, navigate to the vSAN cluster in question and click Monitor → Capacity, then click the Capacity History tab. This brings you to a view where the time range and displayed data can be adjusted. The default is to look at the previous day’s capacity and DD&C usage and trends, but this can be customized. You have two options to view the historical DD&C ratios and space savings here. One is the default: use the current day as a reference and view the previous X days, where you define the number of days in the UI. The other option is to click the drop-down and choose Custom.
From here you can choose the reference date and the time period. For example, if you want to view the 30 days from 31 March, you would simply choose 31 March as your reference date and insert 30 as the number of days of history you want to view. A full example of using the capacity history tab on a vSAN cluster can be seen in FIGURE 8-5.
FIGURE 8-5: Viewing the capacity utilization history of a vSAN cluster
While the vSAN ESA offers compression on a storage policy rule basis, the effective compression ratio can also be viewed in the cluster capacity view.
Data-at-Rest Encryption (OSA): Enabling on a New Cluster
Enabling data at rest encryption on a vSAN cluster running the OSA is relatively easy. When vSAN encryption is enabled, any new data written to the cluster is encrypted. Enabling vSAN encryption performs a rolling reformat that copies data to available capacity on another node, removes the now-empty disk group, and then encrypts each device in a newly recreated disk group.
While this process is relatively easy to accomplish, some requirements and considerations must be taken into account.
Note that with the ESA, a much more efficient data encryption offering is available. It can, however, only be enabled at the time of cluster creation. For more information, see the post: "Cluster Level Encryption with the vSAN Express Storage Architecture" and the "vSAN Encryption Services" document.
Requirements to use vSAN encryption include:
- Licensing—When intending to use vSAN encryption, use vSAN Enterprise Edition. vSAN Enterprise is available in configurations based on per-CPU, per-virtual desktop, or per-ROBO (25-pack) licensing.
- Key management—When used with vSAN Encryption, any KMIP 1.1-compliant key manager will work and be supported by VMware GSS. There are several key management servers (KMSs) that have been validated by VMware along with their respective vendors. These validated solutions have additional testing and workflows to help with the setup and troubleshooting process that non-validated solutions do not have. A list of currently supported KMSs can be found on the VMware Compatibility Guide. A key encryption key (KEK) and a host key are provided to vCenter and each vSAN node. The KEK is used to wrap and unwrap the data encryption key (DEK) on each storage device. The host key is used to encrypt logs on each ESXi host.
- Connectivity—It is important to understand connectivity when using vSAN encryption. While a KMS “profile” is created in vCenter, each vSAN host must have its own connectivity to the KMS because hosts connect directly to the KMS using the client configuration created in vCenter. vSAN encryption was designed to work this way to ensure that hosts can boot and provide data access in the event that vCenter is not available.
Enabling vSAN encryption has some settings to be familiar with. It is important to understand where each of these contributes to the enabling process.
- Erase disks before use—When vSAN encryption is configured, new data is encrypted. Residual data is not encrypted. The rolling reformat mentioned above moves (copies) data to available capacity on another node’s disk group(s). While data is no longer referenced on the now-evacuated disk group, it is not overwritten. This checkbox ensures that data is overwritten, preventing the possibility of using disk tools or other forensic devices to potentially recover unencrypted data. The process of enabling vSAN encryption is significantly longer when selecting this checkbox due to each device being written to.
- KMS cluster—Selecting the KMS cluster is only possible if the KMS profile has already been created in vCenter. The process for adding a KMS profile can be found on core.vmware.com.
- Allow reduced redundancy—Consider the rolling reformat process for vSAN encryption (as well as DD&C). Data is moved (copied) from a disk group, the disk group is removed and recreated, and then data may or may not be moved back, depending on parameters such as storage policy rules and availability or capacity.
FIGURE 8-6: Configuring Data-at-Rest Encryption on a vSAN cluster
vSAN attempts to keep items in compliance with their storage policy. For example, when vSAN mirrors components, those mirrored components must be on separate nodes. In a 2- or 3-node vSAN cluster, where components are already stored in three different locations, the rolling reformat process of enabling or disabling vSAN encryption has nowhere to put data when a disk group is removed and recreated. This setting allows for vSAN to violate the storage policy compliance to perform the rolling reformat. It is important to consider that data redundancy reduces until the process is complete, and all data has been resynchronized.
Recommendations: Use “Erase disks before use” on clusters that have pre-existing data. This takes significantly longer but ensures no residual data. Use “Allow Reduced Redundancy” in general. This is a required setting for 2- or 3-node clusters, and it allows the process to complete when storage policies may prevent completion. And finally, enable vSAN encryption at the same time as enabling DD&C. This is to prevent having to perform the rolling disk group reformat multiple times.
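A hedged PowerCLI sketch of that combined change is shown below. It assumes the KMS cluster is already registered in vCenter; the cluster and KMS names are placeholders, and the parameter names of Set-VsanClusterConfiguration shown here should be verified with Get-Help in your PowerCLI version.

```powershell
# Minimal sketch: enable Data-at-Rest Encryption and DD&C in one operation so the
# rolling disk group reformat only happens once.
$cluster = Get-Cluster -Name 'vSAN-Cluster01'     # placeholder name
$kms     = Get-KmsCluster -Name 'KMS-Cluster01'   # placeholder name

Get-VsanClusterConfiguration -Cluster $cluster |
    Set-VsanClusterConfiguration -EncryptionEnabled $true `
                                 -KmsCluster $kms `
                                 -EraseDisksBeforeUse $true `
                                 -AllowReducedRedundancy $true `
                                 -SpaceEfficiencyEnabled $true
```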
Understanding the impact on resources and performance when enabling Data-at-Rest Encryption
Data services such as Data-at-Rest Encryption and Data-in-Transit Encryption often raise the question of how much of an impact these services will have in an environment. There are two ways to define the impact: 1.) performance degradation of the VM(s), and 2.) additional overhead (host CPU, memory, etc.). The impact on the VMs will be highly dependent on the given workloads and environment. For example, testing the impact using synthetic benchmarks can lead to results not reflective of real workloads, because synthetic workloads aim to commit every CPU cycle for the purpose of generating I/O. With real workloads, only a small percentage of CPU cycles are used for the purpose of committing I/O. The best way to understand these impacts is to enable the service in a cluster running real workloads, and observe the differences in performance (differences in average guest VM latencies, and CPU overhead changes). For more information on this topic, see: Performance when using vSAN Encryption Services. Note that the performance impacts of encryption in the ESA are far less than in the OSA.
Summary
Data-at-Rest Encryption gives tremendous flexibility to encrypt all data in a vSAN cluster. Thanks to the architecture of vSAN, this decision can be made on a per-cluster basis. Administrators can tailor this capability to best align with the requirements of the organization.
Data-at-Rest Encryption: Performing a Shallow Rekey
Key rotation is a strategy often used to prevent long-term use of the same encryption keys. When encryption keys are not rotated on a defined interval, it can be difficult to determine their trustworthiness. Consider the following situation:
- A contractor sets up and configures an encrypted cluster.
- The contractor backs up the encryption keys.
- At a later date, the contractor replaces a potentially failed storage device.
If the encryption keys have not been changed (or rotated), the contractor could possibly decrypt and recover data from the suspected failed storage device. VMware recommends a key rotation strategy that aligns with the organization’s typical security practices.
The two keys used in vSAN encryption include the KEK and the DEK. The KEK is used to encrypt the DEK. Rotating the KEK is quick and easy, without any requirement for data movement. This is referred to as a shallow rekey.
FIGURE 8-7: Performing a shallow rekey for a vSAN Cluster
Clicking “Generate New Encryption Keys” in the vSAN configuration UI, followed by clicking “Generate,” performs a shallow rekey. A request for a new KEK will be generated and sent to the KMS that the cluster is using. Each DEK will be rewrapped using the new KEK.
Shallow rekey operations can also be scripted, and possibly automated using an API call or PowerCLI script. VMware {code} has an example of a PowerCLI script.
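For reference, a shallow rekey could be initiated from PowerCLI through the vSAN Management API. The sketch below is only an outline under stated assumptions: Get-VsanView is the cmdlet that exposes the vSAN API, but the rekey method name and argument list shown are placeholders and should be replaced with the call documented in the VMware {code} sample and the vSAN Management API reference for your release:
# Outline of scripting a shallow rekey. The rekey method name and its arguments
# are placeholders (assumptions); consult the VMware {code} PowerCLI sample for
# the exact call used by your vSAN version.
Connect-VIServer -Server "vcenter.example.com"

$cluster      = Get-Cluster -Name "vSAN-Cluster"
$configSystem = Get-VsanView -Id "VsanVcClusterConfigSystem-vsan-cluster-config-system"

# Hypothetical invocation: a shallow rekey requests a new KEK only (no data movement).
$configSystem.VsanEncryptedClusterRekey_Task($cluster.ExtensionData.MoRef, $false, $null)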
Find more information about vSAN encryption on core.vmware.com.
Recommendation: Implement a KEK rotation strategy that aligns with organizational security and compliance requirements.
Data-at-Rest Encryption: Performing a Deep Rekey
Key rotation is a strategy often used to prevent long-term use of the same encryption keys. When encryption keys are not rotated on a defined interval, it can be difficult to determine their trustworthiness. Consider the following situation:
- A contractor has physical access to an encrypted cluster.
- The contractor removes a physical device in an attempt to get data.
- A deep rekey is performed before the contractor replaces the drive.
- The drive with the incorrect encryption sequence throws an error.
If the encryption keys have not been changed (or rotated), the contractor could return the drive without its being detected. A deep rekey updates the KEK and DEK, ensuring both keys used to encrypt and secure data have been changed. vSAN encryption also assigns a DEK generation ID, which ensures that all encrypted storage devices in the cluster have been rekeyed at the same time.
A deep rekey is currently only supported on clusters running encryption that use the OSA. ESA clusters using encryption only support a shallow rekey at this time.
VMware recommends a key rotation strategy that aligns with the organization’s typical security practices. Rotating the DEK is initiated just as easily as rotating the KEK, but it is a more time-consuming process, as data is moved off of devices as they receive a new DEK. This is referred to as a deep rekey.
FIGURE 8-8: Performing a deep rekey for a vSAN Cluster
Clicking “Generate New Encryption Keys” in the vSAN configuration UI and selecting “Also re-encrypt all data on the storage using the new keys,” followed by clicking “Generate,” performs a deep rekey. Just as with the process of enabling or disabling vSAN encryption, the “Allow Reduced Redundancy” option should be used for 2- or 3-node clusters.
A request for a new KEK is generated and sent to the KMS the cluster is using. Each disk group on a vSAN node evacuates to an alternate storage location (unless using reduced redundancy). When no data resides on the disk group, it will be removed and recreated using the new KEK, with each device receiving a new DEK. As this process cycles through the cluster, some data may be returned to the newly recreated disk group(s).
The UI provides a warning that performance will be decreased. This is a result of the resynchronizations that must occur when evacuating a disk group. The performance impact may not be significant, depending on the cluster configuration, the amount of data, and the workload type. Deep rekey operations can also be scripted, and possibly automated using an API call or PowerCLI script. VMware {code} has an example of a PowerCLI script.
Find more information about vSAN encryption on core.vmware.com.
Recommendations: Implement a DEK rotation strategy that aligns with organizational security and compliance requirements. Be sure to take into account that a deep rekey process requires a rolling reformat. Finally, use “Allow Reduced Redundancy” in general. This is required for 2- or 3-node clusters, and it allows the process to complete when storage policies may prevent completion.
Data-at-Rest Encryption: Using Data-at-Rest Encryption and VM Encryption Together
vSphere provides FIPS 140-2 validated data-at-rest encryption when using per-VM encryption or vSAN datastore level encryption. These features are software-based, with the task of encryption being performed by the CPU. Most server CPUs released in the last decade include the AES-NI instruction set to minimize the additional overhead. More information on AES-NI can be found on Wikipedia.
These two features provide encryption at different points in the stack, and each has different pros and cons. Detailed differences and similarities can be found in the Encryption FAQ. With VM encryption occurring at the VM level, and vSAN encryption occurring at the datastore level, enabling both results in encrypting and decrypting a VM twice. Having encryption performed multiple times is typically not desirable.
Skyline Health for vSAN reports when a VM has an encryption policy (for VM encryption) and also resides on an encrypted vSAN cluster. This alert is only cautionary, and both may be used if so desired.
FIGURE 8-9: The vSAN health check reporting the use of multiple encryption types used together
“Understanding vSAN Datastore Encryption vs. VMcrypt Encryption” provides additional detail, including performance and space efficiency considerations, as well as recommendations specific to which is the most desirable per use case. The above alert is common when migrating a VM encrypted by VM encryption to a vSAN datastore. It is typically recommended to disable VM encryption for the VM if it is to reside on an encrypted vSAN cluster. The VM must be powered off to remove VM encryption. Customers wishing to ensure the VM is never left unencrypted will likely choose to remove VM encryption only after it has been moved to an encrypted vSAN datastore.
Recommendation: With encryption being performed at multiple levels, only enable VM encryption on VMs residing on an encrypted vSAN cluster when there is an explicit requirement for it, such as while migrating an encrypted VM to a vSAN cluster, or before moving an otherwise unencrypted VM off of an encrypted vSAN cluster.
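To identify VMs that may be encrypted twice, a quick check is possible from PowerCLI. The sketch below assumes a hypothetical cluster name and lists the VMs in that cluster whose configuration carries a VM encryption key; if the cluster's vSAN datastore is also encrypted, these are the VMs the health check above will flag:
# Minimal sketch: list VMs in a cluster that have VM encryption applied.
# Config.KeyId is populated only when a VM encryption storage policy is in use.
$cluster = Get-Cluster -Name "vSAN-Cluster"

Get-VM -Location $cluster |
    Where-Object { $_.ExtensionData.Config.KeyId } |
    Select-Object Name, PowerState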
For clusters running the ESA, all vSAN traffic will inherently be encrypted in flight in addition to at rest. However, to ensure the highest levels of security, Data-in-Transit Encryption remains an available toggle in the cluster data services section. If this is enabled with Data-at-Rest Encryption in the ESA, it will encrypt each network packet uniquely so that identical data is not transmitted over the network. For more information, see the blog post: "Cluster Level Encryption with the vSAN Express Storage Architecture."
Data-at-Rest Encryption (OSA): Disabling on an Existing Cluster
vSAN Data-at-Rest Encryption (D@RE) is a cluster-based data service that is similar to deduplication and compression in that it is either enabled or disabled for the entire vSAN cluster. Disabling vSAN encryption on a vSAN cluster is as easy as enabling it. When vSAN encryption is disabled, a rolling reformat process occurs again, copying data to available capacity on another node, removing the now-empty disk group, and recreating the disk group without encryption. There is no compromise to the data during the process, as all evacuation that occurs on a disk group occurs within the cluster. In other words, there is no need to use swing storage for this process.
Note that for the ESA, once Data-at-Rest Encryption is enabled at the time of a cluster build-up, it cannot be disabled.
Recommendation: Evaluate your overall business requirements when making the decision to enable or disable a cluster-based service like D@RE. This will help reduce unnecessary cluster conversions.
Disabling vSAN Encryption (OSA)
The process of disabling vSAN Encryption is the opposite of enabling vSAN Encryption, as shown in Figure 8-10.
FIGURE 8-10: Disabling vSAN data at rest encryption
As the disabling of the service occurs, data is moved (copied) from a disk group to another destination. The disk group is removed and recreated, and is ready to accept data in an unencrypted format. vSAN attempts to keep items in compliance with their storage policy. The nature of any type of rolling evacuation means that there is a significant amount of data that will need to move in order to enable or disable the service. Be mindful of this in any operational planning.
Recommendation: If a change to Deduplication & Compression is also planned, perform it at the same time as disabling encryption. This prevents having to perform the rolling disk group reformat multiple times.
Some cluster configurations may have limited abilities to move data elsewhere while still maintaining full compliance with the storage policy. This is where the "Allow Reduced Redundancy" option can be helpful. It is an optional checkbox that appears when enabling or disabling any cluster-level data service that requires a rolling reformat. A good example of using this feature is a 2- or 3-node cluster, where there are inherently insufficient hosts to maintain full policy resilience during the transition. vSAN will ensure that the data maintains full availability, but at a reduced level of redundancy until the process is complete. Once complete, the data will regain the full resilience prescribed by the storage policy.
Recommendation: Use "Allow Reduced Redundancy" in general. While this is a required setting for 2 or 3 node clusters, it will allow for the process to complete when storage policies may prevent completion.
Summary
Disabling data services in vSAN is as easy and as transparent as enabling them. In the case of data at rest encryption (and deduplication & compression), vSAN will need to perform a rolling reformat of the devices, a task that is automated, but does require significant levels of data movement.
Data-in-Transit Encryption: Enable in-flight encryption to a vSAN Cluster
Introduced in vSAN 7 U1, Data-in-Transit Encryption allows vSAN traffic to be securely transmitted from host to host in a vSAN cluster. Data-in-Transit Encryption provides a complete over-the-wire encryption solution that addresses host/member authentication, data integrity/confidentiality, and embedded key management. It can be used on its own, or in conjunction with vSAN Data-at-Rest Encryption to provide an end-to-end encryption solution. Both capabilities use the same vSphere-based FIPS 140-2 validated cryptographic modules.
Just like vSAN Data-at-Rest Encryption, Data-in-Transit Encryption is enabled and disabled at the cluster level. Unlike Data-at-Rest Encryption, it does not use an external key management server (KMS), which makes it extremely simple to operationalize. If a cluster uses both encryption features, each feature will be independently responsible for its own key management. Data-at-Rest Encryption will use an external KMS, while Data-in-Transit Encryption will manage its own host keys.
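Data-in-Transit Encryption can also be toggled from PowerCLI. Treat the sketch below as an assumption: the -DataInTransitEncryptionEnabled parameter name may differ between PowerCLI releases and should be confirmed against the Set-VsanClusterConfiguration reference before use:
# Sketch only: enable Data-in-Transit Encryption on a cluster. The parameter name
# is an assumption; verify it against the Set-VsanClusterConfiguration
# documentation for the PowerCLI version in use. No KMS profile is required.
$cluster = Get-Cluster -Name "vSAN-Cluster"

Set-VsanClusterConfiguration -Configuration $cluster -DataInTransitEncryptionEnabled $true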
Health and Management of Data-in-Transit Encryption
The vSAN Skyline Health Services will periodically check the configuration state of the hosts that comprise the vSAN cluster. The Skyline Health Service will be the first place to check if there are difficulties with enabling Data-in-Transit Encryption.
Host BIOS settings and AES-NI Offloading
The vSphere cryptographic modules used for both methods of encryption can take advantage of AES-NI offloading to minimize the CPU consumption of the hosts. Modern CPUs are much more efficient with this offloading than older CPUs, so the impact of this offloading will depend on the generation of CPUs in the hosts.
Recommendation: Prior to deployment, check the BIOS to ensure that AES offloading is enabled.
The Potential Impact on Performance
Data-in-Transit Encryption is an additional data service that, as one might expect, demands additional resources. Performance considerations and expectations should be adjusted when considering these types of security features. The degree of impact will be dependent on the workloads and hardware specifications of the environment. This is addressed in more detail in the Data-at-Rest Encryption: Enabling on a New Cluster section. Data-in-Transit Encryption does have the potential to impact guest VM latencies, since all over-the-wire communication to synchronously replicate the data must be encrypted and decrypted in flight. For more information on this topic, see: Performance when using vSAN Encryption Services.
Key Management Server options for vSAN
Key Management Server (KMS) solutions are used when some form of encryption is enabled in an environment. For vSphere and vSAN, a dedicated KMS is necessary for features such as vSphere VM Encryption, vSphere Encrypted vMotion and vSAN Data-at-Rest Encryption.
As a part of VMware's focus on building in a level of intrinsic security to its products, vSphere 7 U2 introduced the ability to provide basic key management services for vSphere hosts through the new vSphere Native Key Provider (NKP). This feature, not enabled by default, can simplify key management for vSphere environments using various forms of encryption. For more information, see the blog post: Introducing the vSphere Native Key Provider as well as the vSphere Native Key Provider Overview documentation.
Figure 8-11: Key Management Services provided through the vSphere NKP, or an external KMS solution.
Key Management Services for vSAN 7 U2 and later can be provided in one of two ways:
- Using a 3rd party, External Key Management Server (KMS) solution
- Using the integrated vSphere NKP
Either one of these two will provide the key management services necessary for vSAN Data-at-Rest Encryption. The vSAN Data-in-Transit Encryption feature transparently manages its own keys across the hosts in a vSAN cluster and therefore does not need or use any other key management provider.
The use of the vSphere NKP versus an external, full-featured KMS comes down to the requirements of an organization. The vSphere NKP is ideal for customers who have simple security requirements and need basic key management for vSphere and/or vSAN only; it can only provide key management for vSphere-related products. It may be ideal for edge or small vSAN environments that may not have a full-featured KMS solution at their disposal.
Full-featured external KMS solutions may offer capabilities needed by an environment that cannot be met by the vSphere NKP. Clusters using external KMS solutions can easily co-exist with other clusters using the vSphere NKP for key management. The vSphere NKP can also serve as an introductory key provider for an organization that may be interested in a full-featured external KMS solution. Transitioning from the vSphere NKP to an external KMS (and vice versa) is a simple matter.
vSAN 7 U3 introduced the ability to persist keys distributed to hosts through the use of a Trusted Platform Module (TPM) chip. This applies to environments running either the vSphere Native Key Provider or an external KMS cluster. Should there ever be an issue with communication to the key provider, the distributed keys remain persistently cached on the TPM chip of the host. This cryptographically secure device stores the key so that any subsequent reboots of the host will allow it to retrieve its assigned key without relying on communication to the KMS.
Figure 8-12: Using a TPM with the vSphere NKP, or an external KMS in vSAN 7 U3.
Recommendation: ALWAYS include a Trusted Platform Module (TPM) for each and every server purchased. This small, affordable device is one of the best ways to improve the robustness of your encrypted vSphere and vSAN environment.
The principles around operationalization of key management for secured environments will be similar regardless of the method chosen. When considering the role of a key provider, it is important to ensure operational procedures are well understood to help accommodate for planned and unplanned events. This would include, but is not limited to tasks such as:
- Configuring a vSphere Native Key Provider
- Backing up a vSphere Native Key Provider
- Recovering a vSphere Native Key Provider
OEM vendors of full featured KMS solutions will have their own guidance on how to operationalize their solution in an environment.
Recommendation: Test out the functionality of the vSphere NKP in a virtual or physical lab environment prior to introducing it into a production environment. This can help streamline the process of introducing the NKP into production environments.
Be sure to visit the vSAN FAQs on Security to learn more about protecting data through encryption with vSAN.
Summary
Whatever method of key management is used for a vSAN environment, ensuring that operational procedures are in place to account for planned and unplanned events will help prevent any unforeseen issues. The vSphere NKP introduced in vSphere 7 U2 is fully compatible with vSAN 7 U2 and later, and helps make encryption easier to implement when full-featured KMS solutions are not available or necessary for an environment.
iSCSI: Identification and Management of iSCSI Objects in vSAN Cluster
vSAN iSCSI objects can appear different than other vSAN objects in vSAN reports—usually reporting as “unassociated” because they aren’t mounted directly into a VM as a VMDK—but rather via a VM’s guest OS iSCSI initiator, into which vSAN has no visibility. If you use the vCenter RVC to query vSAN for certain operations, be aware that iSCSI LUN objects as well as vSAN performance service objects (both of which are not directly mounted into VMs) will be listed as “unassociated”—this does NOT mean they are unused or safe to be deleted.
So, how can you tell if objects are in use by the performance service or iSCSI? After logging in to the vCenter server, iSCSI objects or performance management objects could be listed and shown as unassociated when querying with RVC command vsan.obj_status_report.
These objects are not associated with a VM, but they may be valid vSAN iSCSI objects on the vSAN datastore and should not be deleted. If the intention is to delete some other unassociated objects and save space, please contact the VMware GSS team for assistance. The following shows how to identify unassociated objects as vSAN iSCSI objects and verify from the vSphere web client.
Log in to vCenter via SSH and launch RVC, then navigate to the cluster:
root@sc-rdops-vm03-dhcp-93-66 [ ~ ]# rvc administrator@vsphere.local@localhost
Welcome to RVC. Try the ‘help’ command.
0 /
1 localhost/
> cd localhost/
Run the vSAN object status report in RVC:
/localhost> vsan.obj_status_report -t /localhost/VSAN-DC/computers/VSAN-Cluster/…
Histogram of component health for non-orphaned objects
+-------------------------------------+------------------------------+
| Num Healthy Comps / Total Num Comps | Num objects with such status |
+-------------------------------------+------------------------------+
| 3/3 (OK) | 10 |
+-------------------------------------+------------------------------+
Total non-orphans: 10
Histogram of component health for possibly orphaned objects
+-------------------------------------+------------------------------+
| Num Healthy Comps / Total Num Comps | Num objects with such status |
+-------------------------------------+------------------------------+
+-------------------------------------+------------------------------+
Total orphans: 0
Total v9 objects: 10
+-----------------------------------------+---------+---------------------------+
| VM/Object | objects | num healthy / total comps |
+-----------------------------------------+---------+---------------------------+
| Unassociated objects | | |
| a29bad5c-1679-117e-6bee-02004504a3e7 | | 3/3 |
| ce9fad5c-f7ff-9927-9f58-02004583eb69 | | 3/3 |
| a39cad5c-008a-7b61-a630-02004583eb69 | | 3/3 |
| d49fad5c-bace-8ba3-9c7a-02004583eb69 | | 3/3 |
| d09fad5c-1650-1caa-d0f1-02004583eb69 | | 3/3 |
| 66bcad5c-a7b5-1ef9-0999-02004504a3e7 | | 3/3 |
| 169cad5c-6676-063b-f29e-020045bf20e0 | | 3/3 |
| f39bad5c-5546-ff8d-14e1-020045bf20e0 | | 3/3 |
| 199cad5c-e22d-32d7-aede-020045bf20e0 | | 3/3 |
| 1d9cad5c-7202-90f4-0fbf-020045bf20e0 | | 3/3 |
+-----------------------------------------+---------+---------------------------+
Cross-reference the UUIDs in the “Unassociated objects” list with the vSAN iSCSI objects, the “iSCSI home object,” and the “performance management object” in the vSphere web client under vSAN Cluster → Monitor → vSAN → Virtual Objects. Compare the UUIDs under the “vSAN UUID” column with those in the “Unassociated objects” report from RVC. If UUIDs appear in both lists, they are NOT safe to remove.
FIGURE 8-13: Enumerated objects, related storage policies, and vSAN UUIDs
Again, if in any doubt, please contact the VMware GSS team for assistance.
File Services: Introducing into an Existing Environment
vSAN File Services allows for vSAN administrators to easily deliver file services on a per cluster basis using any vSAN cluster. Providing both NFS and SMB file services in a manner that is native to the hypervisor allows for a level of flexibility and ease of administration that is otherwise difficult or costly to achieve with stand-alone solutions.
Enabling vSAN File Services in an environment introduces several operational considerations. vSAN File Services can be unique in that it may require additional considerations with the infrastructure that may or may not be related to the hypervisor. Some of those considerations include:
- Supported topology types
- Authentication options through Active Directory for SMB and Kerberos for NFS
- Supported protocol versions and how to connect clients to the shares
Note that in the VMware documentation and in the product UI, the term “share” may be used interchangeably for both SMB and NFS. The term "share" is used to simplify the language when discussing multiple protocols. Windows-based SMB clients have historically referred to them as “shares,” while Unix- and Linux-based systems typically refer to an “NFS export” that the NFS client will mount.
Recommendations on Introducing vSAN File Services into your environment
Since vSAN File Services is a relatively new feature, successfully introducing it into an environment can be achieved with preparation and familiarity.
- Run the latest version of vSAN. vSAN File Services was introduced with vSAN 7 and improved significantly in vSAN 7 U1 and vSAN 7 U2. If you are interested in this feature, update the cluster to the latest version prior to enabling vSAN File Services.
- Understand the limits. vSAN File Services does not allow ESXi hosts to connect directly to File Services via NFS for the purpose of presenting storage for VMs. A share served by vSAN File Services can only be used for SMB or NFS, not both concurrently. The vSAN File Services FAQ and VMware Docs outline the common limits that you should be aware of prior to deployment.
- Become familiar with the prerequisites required for the setup of vSAN File Services. Enabling and configuring vSAN File Services will require additional IP addresses for the respective protocol services containers (up to a maximum of 64 for clusters of up to 64 hosts running vSAN 7 U2), with forward and reverse DNS records.
- Decide on the approach for the port group used by vSAN File Services. The port group used for vSAN File Services will automatically enable promiscuous mode and forged transmits if those settings are not already enabled. If NSX-based networks are being used, ensure that similar settings are configured for the provided network entity from the NSX admin console, and that all hosts and File Services nodes are connected to the desired NSX-T network. An administrator may decide to place the IP addresses of the protocol service containers created by vSAN File Services on their own dedicated port group, or use another existing port group that previously did not have these settings enabled. Internal requirements (e.g., security) or other constraints may dictate the decision. Both approaches are supported (a PowerCLI sketch for reviewing these port group settings follows this list).
- Build out a test cluster to become familiar with the deployment process and configuration settings. This will allow for easy experimentation to become familiar with the feature and the configuration. It can also serve as a way to test out the upgrade process, as well as review future editions of vSAN File Services.
- Use the test cluster to ensure the proper configuration of Active Directory. Configuration of Active Directory and Kerberos settings for vSAN File Services will be highly dependent on your organization's Active Directory Configuration. The deployment wizard also has guidance with this, including the requirements of a dedicated OU in Active Directory for use by vSAN File Services.
- Set quotas. vSAN file services can provide as much capacity for file shares as provided by the cluster. vSAN provides share warning thresholds as well as a hard quota to protect against the consumption of storage capacity beyond what is intended. This short video shows how to set the limits, and it is also covered in VMware Docs.
- Become familiar with creating shares and their associated connection strings. A connection string is the string of text an NFS or SMB client will use to establish a connection to the share. This connection string will be different for SMB, NFS v3, and NFS v4.1. Learn where to find these strings, and how to connect the clients.
- Learn how to monitor. vSAN provides the ability to monitor the activities of vSAN File Services. The share can be selected in the vSAN Performance Service to look at the demand on the share over a period of time. The Skyline Health checks also continuously check for various aspects of the cluster related to vSAN File Service health. The vCenter Server UI even allows you to see which objects make up the given file share. This can be found in the "Virtual Objects" view followed by clicking the "File Shares" icon to filter the object listing.
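For the port group consideration noted earlier in this list, the security settings can be reviewed or applied ahead of time with PowerCLI. The sketch below assumes a vSphere Distributed Switch port group with a hypothetical name; NSX-backed networks would instead be configured from the NSX admin console:
# Minimal sketch: confirm or set the security policy on the distributed port group
# intended for vSAN File Services. The port group name is a hypothetical placeholder.
$pg = Get-VDPortgroup -Name "vsan-file-services-pg"

# Review the current policy
$pg | Get-VDSecurityPolicy | Select-Object AllowPromiscuous, ForgedTransmits, MacChanges

# Enable promiscuous mode and forged transmits if they are not already enabled
$pg | Get-VDSecurityPolicy | Set-VDSecurityPolicy -AllowPromiscuous $true -ForgedTransmits $true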
Recommendation: Do not use vSAN File Services as a location for important host logging, core dumps or scratch locations of the hosts that comprise the same cluster providing the file services. This could create a circular dependency and prevent the logging and temporary data from being available during an unexpected condition that requires further diagnostics.
Summary
vSAN File Services offers all-new levels of flexibility in delivering common file services to systems and users in your organization. Familiarity and testing in your own environment will help ensure that the deployment, operation, and optimization goes as planned.
Section 9: Stretched Clusters
Convert Standard Cluster to Stretched Cluster
vSAN stretched clusters can be initially configured as stretched, or they can be converted from an existing standard vSAN cluster.
Pre-conversion tasks
- vSAN Witness Host. A vSAN Witness Host must be added to vCenter before attempting to convert a vSAN cluster to a vSAN stretched cluster. This can be a physical host or a vSAN Witness Appliance.
- Configure networking. Networking requirements for a vSAN stretched cluster must be in place as well. This could include creating static routes as well as configuring Witness Traffic Separation (WTS).
- Host and policy considerations. When converting to a vSAN stretched cluster, the primary protection of data is across sites using mirroring. vSAN can also protect data within a site using a secondary protection rule, which requires each site to have enough hosts to satisfy the desired rule. When choosing host counts for each fault domain, consider that vSAN stretched clusters most often mirror data across sites; a full copy of data in each site is one of the primary benefits of a vSAN stretched cluster. When converting a standard vSAN cluster to a vSAN stretched cluster, understand that more hosts may need to be added in order to continue using the same storage policies used in the standard vSAN cluster. When the standard vSAN cluster does not have enough hosts to provide the same per-site policy, different storage policies may need to be chosen.
Example: Consider the desire to convert an all-flash 6-node vSAN cluster using an erasure coding storage policy. A regular 6-node all-flash vSAN cluster can support either RAID-5 or RAID-6 storage policies. By splitting this cluster in two when converting it to a vSAN stretched cluster, the (now stretched) cluster cannot satisfy the requirements of a RAID-5 or RAID-6 storage policy. This is because each fault domain only has three member hosts. Mirroring is the only data placement scheme that can be satisfied. Adding an additional host to each fault domain would allow for a RAID-5 secondary rule and erasure coding. Adding three additional hosts to each site would meet the minimum requirements for using RAID-6 as a secondary level of protection in each site. While it isn’t necessary for vSAN stretched clusters to have the same number of hosts in each fault domain, or site, they typically do. Situations where a site locality rule is used could alter the typical symmetrical vSAN stretched cluster configuration.
Conversion
When hosts have been designated for each fault domain, the vSAN Witness Host has been configured, and networking has been validated, the vSAN cluster can be converted to a vSAN stretched cluster. The process for converting a vSAN cluster to a vSAN stretched cluster can be found on core.vmware.com. After converting the vSAN cluster to a vSAN stretched cluster, configure HA, DRS, and VM/host groups accordingly.
- HA settings
- DRS settings
- VM/host groups
Recommendations: Be sure to run through the pre-conversion tasks, as well as deploying a vSAN Witness Host beforehand. Ensure the network is properly configured for vSAN stretched clusters. Determine the storage policy capabilities of the new stretched configuration.
Convert Stretched Cluster to Standard Cluster
Moving a vSAN stretched cluster back to a traditional vSAN cluster is essentially the reverse of converting a standard vSAN cluster to a vSAN stretched cluster. Before performing the conversion process, the most important items to consider are:
- Hosts only reside in two fault domains, but these could be spread across geographically separated locations.
- A vSAN Witness Host is participating in the vSAN cluster.
- Per-site DRS affinity rules are likely in place to keep workloads on one fault domain or the other.
- vSAN stretched cluster-centric networking is in place.
Workloads should not be running during this conversion process.
If the vSAN stretched cluster is located in a single facility—such as each fault domain in a different room or across racks—this process can be completed easily. If the vSAN stretched cluster fault domains are located in geographically separate locations, hosts in the secondary fault domain need to be relocated to the same location as the hosts in the preferred fault domain, which makes the process a bit more complex.
The basic process can be found on the VMware Docs site: “Convert a Stretched Cluster to a Standard vSAN Cluster.” The basic process addresses the removal of the vSAN Witness Host and fault domains. VM/host affinity rules should be removed from the vSphere cluster UI. Any static routing configured to communicate with the vSAN Witness Host should be removed.
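The static route cleanup can be performed per host with PowerCLI. The following is a minimal sketch; the witness network address is a hypothetical example and should be replaced with the network used in your environment:
# Minimal sketch: remove static routes that pointed to the (now removed) vSAN
# Witness Host network on every host in the cluster. The destination network
# shown is a hypothetical example.
$witnessNetwork = "192.168.110.0"

Get-Cluster -Name "vSAN-Cluster" | Get-VMHost | ForEach-Object {
    Get-VMHostRoute -VMHost $_ |
        Where-Object { $_.Destination.ToString() -eq $witnessNetwork } |
        Remove-VMHostRoute -Confirm:$false
}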
Recommendations: Use the vSAN health UI to repair any vSAN objects. Also be sure to remove any static routing configured to communicate with the vSAN Witness Host.
Replacing Witness Host Appliance on Stretched Cluster
vSAN storage policies with FTT=1 using mirroring require three nodes/fault domains to be available to meet storage policy compliance. This means that 2-node vSAN clusters require a third host to contribute to the 2-node cluster. A vSAN Witness Host is used to provide quorum.
The vSAN Witness Host stores Witness components for each vSAN object. These Witness components are used as a tiebreaker to assist with ensuring accessibility to a vSAN object, such as a VMDK, VSWP, or namespace, in the event of a failure or network partition. One node in the preferred site of a stretched vSAN cluster is designated as the primary node. A second node, residing in the non-preferred site, is designated as the backup node. When all the nodes in one site or the other cannot communicate with the hosts in the opposite site, that site is “partitioned” from the other.
Preferred and non-preferred sites isolated from each other: If both the preferred and non-preferred sites can communicate with the vSAN Witness Host, the cluster will continue to operate VMs that have cross-site storage policies in the preferred site. vSAN objects that use only a site affinity policy will continue to operate in their assigned site. This is because site affinity policies have no components residing in the opposite site or on the vSAN Witness Host.
Preferred site failure or complete isolation: If the preferred site fails or is completely isolated from the non-preferred site and the vSAN Witness Host, the backup node in the non-preferred site becomes the primary node, forming quorum with the vSAN Witness Host. Note that either site in a vSAN stretched cluster may be designated as “preferred,” and this may be changed when required.
The vSAN Witness Host can be a physical ESXi host or a VMware-provided virtual appliance, which is referred to as a vSAN Witness Appliance. Replacing a vSAN Witness Host is not a typical or regular task, but one that could be required in some of the following scenarios:
- Physical host is being decommissioned
- Due to ESXi CPU requirements (e.g., upgrading from vSAN 6.6 to 6.7+)
- Host is at the end of a lease lifecycle
- vSAN Witness Appliance
- Has been deleted
- Has been corrupted
- Does not meet the vSAN Witness Component count requirement in cases where a cluster has grown in component count.
Replacing a vSAN Witness Host is relatively simple when a suitable vSAN Witness Host has been configured. The fault domains section in the vSAN configuration UI provides a quick and easy mechanism to swap the currently configured vSAN Witness Host with an alternate host for this purpose.
FIGURE 9-1: Viewing the fault domains created as a result of a configured vSAN stretched cluster
As of vSAN 7, witness components will be recreated immediately after replacing a witness host appliance. For versions prior to vSAN 7, witness components can be manually recreated in the vSAN health UI using the “Repair objects immediately” operation.
As of vSAN 7 U3, a witness host appliance for a stretched cluster can now be updated using vLCM. See the post: "vSphere Lifecycle Manager in vSAN 7 U3" for more details.
More detailed information can be found in the vSAN Stretched Cluster Guide here: “Replacing a Failed Witness Host.”
Configure a Stretched Cluster
When deploying a vSAN stretched cluster, it is important to understand that it is a bit different than a traditional vSAN cluster.
Overview of stretched clusters
A stretched cluster architecture includes one or more nodes in two separate fault domains for availability and a third node, called the vSAN Witness Host, to assist in the event the fault domains are isolated from each other.
Traditional vSAN clusters are typically in a single location. Stretched cluster resources are typically located in two distinct locations, with the tiebreaker node in a third location. Because hosts are in different sites, a few items require additional consideration when compared to traditional vSAN clusters:
- The vSAN Witness Host
- Network bandwidth and topology
- VM deployment and placement
- Storage policy requirements
As of vSAN 7 U2, the maximum size of a stretched cluster is 40 hosts total: 20 hosts on each site, and one witness host appliance. Previous to vSAN 7 U2, the maximum size was 30 hosts.
vSAN Witness Host
The vSAN Witness Host can be physical or the VMware-provided vSAN Witness Appliance. This “host” must have proper connectivity to vCenter and to each host in the vSAN cluster. Communication with the vSAN Witness Host is always Unicast and requires additional ports between itself and the vSAN cluster. vSAN Witness Host deployment and configuration can be found on core.vmware.com: “Using a vSAN Witness Appliance.”
Backup and restore, clones, snapshots, vMotion, and replication of a vSAN Witness Host are not supported. A new witness host virtual appliance should be deployed using the 'Change witness host' option in the vSphere Client when there is an issue with the existing witness host.
Have you ever wondered why a witness host is necessary for a stretched cluster or 2-node environment? See the post "Why do I need independent connections to the Witness Appliance in a VMware vSAN stretched cluster?" for a great explanation.
Network Topology and Bandwidth
Inter-site bandwidth must be sized to allow proper bandwidth for the expected workload. By default, data is only written across fault domains in a vSAN stretched cluster. The write I/O profile should determine the required inter-site bandwidth. VMware recommends the inter-site bandwidth to be 1.75x the write bandwidth. This takes data, metadata, and resynchronization bandwidth requirements into account. More detail can be found on core.vmware.com as it relates to inter-site bandwidth: “Bandwidth Calculation.”
Note that with the vSAN ESA, the official minimum for the ISL remains the same as the OSA, but with the higher performance potential it may be best to re-evaluate your ISL capabilities. For more information, see the post: "Using the ESA in a Stretched Cluster Topology."
Topology
While Layer 2 networking is often used for inter-node vSAN communication, Layer 3 is often used to address the vSAN Witness Host that typically resides in a different location. Traditional vSAN stretched clusters required additional, often complex routing to allow the vSAN nodes and vSAN Witness Host to communicate. Each vSAN interface needed the same MTU setting as every other vSAN interface. Metadata, such as vSAN component placement and fault domain participation, are shared to and from the vSAN Witness Host and vSAN data nodes.
vSAN also features a Witness Traffic Separation (WTS) capability. This directs the communication with the vSAN Witness Host to a different VMkernel port on a vSAN data node, simplifying overall network configuration. vSAN also supports different MTU settings for the data and metadata networks. This benefits customers wishing to deploy a vSAN Witness Host over a slower network while still using jumbo frames for inter-node communication.
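When WTS is used, witness traffic is tagged on a chosen VMkernel interface of each data node. This is typically done through the esxcli vsan network namespace, which can be driven from PowerCLI via Get-EsxCli. The sketch below uses a hypothetical interface name, and the argument keys should be confirmed against the esxcli reference for the ESXi version in use:
# Minimal sketch: tag vmk1 for witness traffic (WTS) on each data node.
# The VMkernel interface name is a hypothetical example.
Get-Cluster -Name "vSAN-Stretched-Cluster" | Get-VMHost | ForEach-Object {
    $esxcli = Get-EsxCli -VMHost $_ -V2
    $esxcli.vsan.network.ip.add.Invoke(@{
        interfacename = "vmk1"       # interface carrying witness traffic
        traffictype   = "witness"    # tags this VMkernel interface for WTS
    })
}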
VM deployment and placement
Because a vSAN stretched cluster appears to vCenter as a single cluster, no specific intelligence will deploy or keep a VM in one site or another. After deploying a VM on a vSAN stretched cluster, it is important to associate the VM with the site it will normally run in. VM/host groups can be created to have VMs run in one site or the other by using VM/host rules.
FIGURE 9-2: Creating a VM/host rule for use with DRS site affinity settings
Storage policy requirements
In addition to the site-level protection on a per-VM or per-VMDK basis offered by a stretched cluster configuration, vSAN provides the ability, via an assigned storage policy, to prescribe an additional level of local protection. For this additional level of local protection, each site would have its own minimum host requirements to meet the desired level of redundancy. This is detailed here: “Per Site Policies.”
Recommendations: Deploy a vSAN Witness Host/Appliance before attempting to configure the vSAN stretched cluster. Do not configure management and vSAN traffic interfaces on the same network. The exception here is that the management interface can be tagged for vSAN traffic instead of the secondary interface. Choose the proper profile for the number of components expected when deploying a vSAN Witness Appliance. Ensure MTU sizes are uniform across all vSAN interfaces. If using WTS, be certain to configure VMkernel interfaces for Witness traffic to ensure connectivity to the vSAN Witness Host. And finally, use vmkping to test connectivity when static routes are required.
Setting DRS Affinity Settings to Match Site Affinity Settings of Storage Policies
vSAN provides the ability for site affinity to allow for data to reside in a single site. The site disaster tolerance rule in the vSphere Client will allow data to reside only on one fault domain or the other. The following applies to stretched clusters running the OSA and ESA.
FIGURE 9-3: Configuring a storage policy rule to be used with site affinity
Using this storage policy configuration will pin only data to the preferred or non-preferred site; it will not pin the VM to either site. VM/host groups as well as VM/host rules must be configured to ensure that the VM only runs on hosts in the same site as the data.
FIGURE 9-4: Configuring a VM/Host rule for use with DRS site affinity settings
If rules are not configured to prevent the VM from moving to the alternate site, the VM could possibly vMotion to the other site, or restart on the other site as a result of an HA event. If the VM runs in the opposite site from its data, it must traverse the inter-site link for all reads and writes, which is not ideal.
vSAN 7 U2 introduced improved DRS handling with stretched clusters. After a recovered site failure or partition condition, DRS will keep the VM state at the same site until data is fully resynchronized, which will ensure that all read operations do not traverse the inter-site link (ISL). It will do this on a per VM basis, and ensure that the data path is optimal.
Recommendation: Create VM/host groups and VM/host rules with the “Must run on hosts in group” setting.
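A minimal PowerCLI sketch of creating the host group, VM group, and a "must run" VM/Host rule is shown below; the group names, host name pattern, and VM name pattern are hypothetical placeholders:
# Minimal sketch: create site-affinity DRS groups and a "must run" VM/Host rule.
$cluster = Get-Cluster -Name "vSAN-Stretched-Cluster"

# Host group containing the hosts of the site holding the site-affinity data
$siteAHosts = Get-VMHost -Location $cluster -Name "esx-siteA-*"
$hostGroup  = New-DrsClusterGroup -Name "SiteA-Hosts" -Cluster $cluster -VMHost $siteAHosts

# VM group containing the VMs pinned to that site
$siteAVMs = Get-VM -Location $cluster -Name "app-siteA-*"
$vmGroup  = New-DrsClusterGroup -Name "SiteA-VMs" -Cluster $cluster -VM $siteAVMs

# Rule keeping those VMs on that site's hosts
New-DrsVMHostRule -Name "SiteA-VMs-to-SiteA-Hosts" -Cluster $cluster -VMGroup $vmGroup -VMHostGroup $hostGroup -Type "MustRunOn"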
For the very latest recommended cluster settings for DRS in a vSAN stretched cluster, refer to Cluster Settings - DRS in the vSAN Stretched Cluster Guide on core.vmware.com.
HA Settings for Stretched Clusters
vSphere HA will restart VMs on alternate hosts in a vSphere cluster when a host is isolated or has failed. HA is used in conjunction with vSAN to ensure that VMs running on partitioned hosts are restarted on hosts actively participating in the cluster.
When configuring HA failures and responses, hosts should be monitored, with VMs being restarted upon a host failure and VMs powered off and restarted when a host is isolated. These settings ensure that VMs are restarted in the event a VM fails or a host is isolated.
FIGURE 9-5: Recommended HA settings for vSAN stretched clusters
Admission control takes available resources into account should one or more hosts fail. In a vSAN stretched cluster, host failover capacity should be set at 50%. The total CPU and memory use for workloads on a vSAN stretched cluster should never be more than a single site can accommodate. Adhering to this guideline ensures sufficient resources are available should a site become isolated.
FIGURE 9-6: Configuring admission control for a vSAN stretched cluster
Heartbeat datastores are not necessary in vSAN and should be disabled.
FIGURE 9-7: Disabling heartbeat datastores with a vSAN stretched cluster
In the advanced options, disable the use of the default isolation address, and configure an isolation address for each fault domain.
FIGURE 9-8: Configuring site based isolation addresses for a vSAN stretched cluster
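The isolation address settings shown above can also be applied with PowerCLI. The sketch below uses hypothetical isolation addresses; each address should be on the vSAN network and reachable only from its respective site:
# Minimal sketch: disable the default isolation address and define one isolation
# address per site (fault domain). The IP addresses are hypothetical examples.
$cluster = Get-Cluster -Name "vSAN-Stretched-Cluster"

New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.usedefaultisolationaddress" -Value $false -Confirm:$false
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.isolationaddress0" -Value "172.16.10.1" -Confirm:$false
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.isolationaddress1" -Value "172.16.20.1" -Confirm:$false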
Recommendations: Host failure responses should be set to "Restart VMs," with the response for host isolation set to "Power off and restart VMs."
For the very latest recommended cluster settings for HA in a vSAN stretched cluster, refer to Cluster Settings - vSphere HA in the vSAN Stretched Cluster Guide on core.vmware.com.
Creating a Static Route for vSAN in Stretched Cluster Environments
Nodes in a vSAN cluster communicate with each other over a TCP/IP network. vSAN communication uses the default ESXi TCP/IP stack. A default gateway is the address of a designated Layer 3 device used to send TCP/IP traffic to an address outside a Layer 2 network. Standard vSAN cluster configurations often use a Layer 2 network configuration to communicate between nodes. This configuration has no requirement to communicate outside the Layer 2 network. In some situations, where it is required (or desired) to use a Layer 3 network for vSAN communication, static routes are required. The use cases include:
- Stretched cluster vSAN
- 2-node vSAN
- vSAN nodes configured in different Layer 3 networks
In versions prior to vSAN 7 U1, static routes are required to ensure the vSAN-tagged VMkernel interface can properly route Layer 3 traffic. Without static routing in place, the vSAN-tagged VMkernel interface will attempt to use the default gateway for the management VMkernel interface. An example of using static routing can be seen in FIGURE 9-9.
FIGURE 9-9: Defining static routes in a vSAN stretched cluster
The vSAN-tagged VMkernel interfaces must communicate with the vSAN Witness Host, which is only accessible over Layer 3. Static routes ensure data can properly flow from the vSAN “backend” to the vSAN Witness Host. In vSAN 7 U1 and later, static routes are no longer necessary, as a default gateway override can be set for both stretched cluster and 2-node configurations. This removes the manual, static routing configuration steps performed at the command line through ESXCLI or PowerCLI, and provides an improved level of visibility to the current host setting.
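Where static routes are still used (for example, on releases prior to vSAN 7 U1), they can be added per host from PowerCLI rather than ESXCLI. The network addresses below are hypothetical examples; in a stretched cluster, each site typically uses its own local gateway, so the command would be run per site with the appropriate values:
# Minimal sketch: add a static route on each data node so the vSAN-tagged VMkernel
# network can reach the witness network over Layer 3. Addresses are hypothetical.
$witnessNetwork = "192.168.110.0"    # network where the vSAN Witness Host resides
$prefixLength   = 24
$vsanGateway    = "172.16.10.253"    # Layer 3 gateway on this site's vSAN network

Get-Cluster -Name "vSAN-Stretched-Cluster" | Get-VMHost | ForEach-Object {
    New-VMHostRoute -VMHost $_ -Destination $witnessNetwork -Gateway $vsanGateway -PrefixLength $prefixLength -Confirm:$false
}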
Recommendation: Upgrade to vSAN 7 U1 or later to simplify the initial configuration and ongoing operation of stretched cluster environments.
More information specific to vSAN stretched cluster network design considerations can be found in the Stretched Cluster Guide on core.vmware.com.
Non-Disruptive Maintenance of One Site in a Stretched Cluster Environment
A vSAN stretched cluster uses a topology that provides an active-active environment courtesy of two site locations. It uses the vSAN data network to allow VM data to be stored synchronously across two sites, and vMotion to provide the ability for the VM instance to move from one site to the other without disruption. Maintenance on vSAN hosts in a stretched cluster can occur in the same manner as a standard vSAN cluster: a discrete host is entered into maintenance mode, activities are performed, and the host is then returned to production after exiting maintenance mode. In a stretched cluster, there may be times where maintenance to an entire site is necessary: perhaps new switchgear is being introduced, or some other maintenance to the inter-site link is required. If there are sufficient compute and storage capacity resources to accommodate all workloads, a stretched cluster environment can allow for site maintenance to occur on one site without interruption.
The guidance below is applicable to both OSA and ESA-based stretched clusters.
Note that when a planned or unplanned site outage in a vSAN stretched cluster occurs, this places additional importance on the uptime of the virtual witness host appliance. Under conditions of a site outage, it is the witness host appliance that maintains quorum to provide data availability for the site remaining active. Any subsequent planned or unplanned outage of the witness host appliance after the planned or unplanned outage of a data site would result in data unavailability. vSAN 7 U3 improves the uptime for stretched clusters (and 2-node topologies) by using a technique known as Adaptive Quorum Control. In cases where there is a planned or unplanned outage at a data site, followed by a planned or unplanned outage of the witness host appliance a short time thereafter, Adaptive Quorum Control will maintain data availability - assuming there was sufficient time to recalculate the votes for each object. If the unplanned or planned outages occur nearly simultaneously, this is considered a double failure, and the data would not be available until one of the entities was brought back online. For more information on Adaptive Quorum Control, see the post: Improved Uptime for Stretched Cluster and 2 Node clusters.
Site Maintenance procedure
The maintenance of a single site of a stretched cluster can be performed in a non-disruptive manner. All virtual workloads maintain uptime during the maintenance activity, assuming sufficient resources to do so.
Recommendation: Temporarily shutting down non-critical workloads can provide additional relief by minimizing traffic across any participating inter-site links.
The guidance for site maintenance of a stretched cluster assumes the following conditions:
- Sufficient resources to run all workloads in just one site. If this is a concern, some VMs could be temporarily powered down to reduce resource utilization.
- The vSAN stretched cluster configuration is enabled and functioning without any error conditions.
- All VMs use a storage policy that protects across sites. For example, the storage policy uses the "site disaster tolerance" option of "Dual site mirroring (stretched cluster)" with FTT=1.
- vMotion between sites is operating correctly.
- The appropriate DRS host groups for each site are configured and contain the correct hosts in each site.
- The appropriate DRS VM groups for each site are configured and contain the correct VMs to be balanced across sites.
- DRS site affinity "should run" rules are in place to define the preference of location that the VMs should run in a non-failure scenario.
Assuming the conditions described above, the general guidance for providing site maintenance is as follows. Note that any differences in the assumptions described may alter the steps accordingly.
- Document existing host groups, VM groups, and VM/Host rules in DRS.
- Update the DRS site affinity rules for all VM groups defined in DRS. Ensure that a.) they are changed from "should run" to "must run," and b.) the respective host group in each VM/Host rule is set to the site that will remain up during the maintenance. DRS migrations will begin shortly after these changes are made.
- Allow the DRS migrations of the workloads to complete, and confirm that all of the workloads are running in the correct site and are still compliant with their intended storage policy.
- In the site selected for maintenance, begin placing those hosts into maintenance mode (a PowerCLI sketch of this step follows this list). Either the "Ensure Accessibility" or "No data migration" option can be chosen, as in this scenario they will behave the same if all data is in fact mirrored across sites.
- Perform the desired maintenance at the site.
- Once complete, take all hosts out of maintenance mode.
- Wait for resynchronizations across sites to begin and complete before proceeding to the next step. This helps minimize inefficient data paths caused by VMs moving back to their respective sites before their data is fully resynchronized.
- Once resynchronizations are complete, change the settings that were modified in step #2 back to their original settings. VMs will migrate back to their respective sites based on DRS timing schedules. Allow for DRS migrations to complete and verify the effective result matches the intentions.
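The maintenance-mode step referenced above can be scripted with PowerCLI. A minimal sketch follows; the host naming pattern is a hypothetical placeholder, and the "Ensure Accessibility" option is used as described in the steps:
# Minimal sketch: place every host of the site undergoing maintenance into
# maintenance mode using the "Ensure Accessibility" vSAN data migration mode.
# Double-check that only hosts from the intended site match the name pattern.
$cluster    = Get-Cluster -Name "vSAN-Stretched-Cluster"
$siteBHosts = Get-VMHost -Location $cluster -Name "esx-siteB-*"

$siteBHosts | Set-VMHost -State Maintenance -VsanDataMigrationMode EnsureAccessibility

# After maintenance is complete, return the hosts to production:
# $siteBHosts | Set-VMHost -State Connected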
Recommendation: Double-check which hosts participate in each site of the stretched cluster. It can be easy to accidentally select the wrong hosts when entering an entire site into maintenance mode.
Another optional step to control the timing of DRS activities would be through temporarily changing the setting from "fully automated" to "partially automated." This would be a temporary step that would need to be returned to its original value after maintenance is complete.
If your stretched cluster is using site mirroring storage policies, and the organization is uncomfortable with reducing the level of resilience during this maintenance period, you may wish to consider introducing storage policies that use secondary levels of protection, e.g., dual site mirroring with an additional level of FTT applied at the host level in each site. Site-level resilience during site maintenance would still be reduced, but resilience at the host level would remain during the maintenance period. If this is of interest, it is recommended that these storage policies be adjusted well prior to any planned site maintenance activities so that vSAN has the opportunity to apply the storage policies that use a secondary level of protection to all objects in the cluster.
Summary
vSAN stretched clusters allow for administrators to perform site maintenance while providing full availability of data. The exact procedures may vary depending on environmental conditions, but the guidance provided here can serve as the foundation for site maintenance activities for a vSAN stretched cluster.
Decommission a vSAN Stretched Cluster
Decommissioning a vSAN stretched cluster is not unlike decommissioning a standard or 2-node vSAN cluster. The decommissioning process most often occurs when business changes require an adjustment to the underlying topology. Perhaps a few smaller vSAN clusters are being merged into a larger vSAN cluster, or maybe the cluster is simply being decommissioned because a replacement vSAN cluster has already been deployed into production.
Note that this task should not be confused with converting a vSAN stretched cluster to a standard vSAN cluster. That guidance is documented already in section 9 of this Operations Guide.
Key Considerations
Since this task involves the permanent shutdown of the hosts that make up a vSAN cluster, an assumption is that all VMs and data have been migrated off of this cluster at some point. It will be up to the administrator to verify that this prerequisite has been met.
Once the hosts are no longer housing any VM data, the vSAN performance service should also be disabled. The vSAN performance service houses its performance data much like a VM does, thus, it must be disabled in order to properly decommission the respective disk groups in the hosts that comprise the cluster.
Hosts can be decommissioned from the cluster by first entering them into maintenance mode. From there, disk groups can be deleted, which will clear the vSAN metadata on all of the capacity devices in the disk group. Disk groups can also be deleted using the vSphere host client and PowerCLI. Examples of such cases can be found in the PowerCLI Cookbook for vSAN at: https://vmware.com/go/powercli4vsan
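A minimal PowerCLI sketch of the disk group removal described above is shown below. It applies to the OSA (where disk groups exist), the cluster name is a hypothetical placeholder, and the cmdlets and parameters should be verified against the PowerCLI Cookbook for vSAN referenced above for your release:
# Minimal sketch: after all data has been evacuated, place each host into
# maintenance mode and delete its vSAN disk groups. This is destructive to any
# data still residing on those disk groups.
$cluster = Get-Cluster -Name "vSAN-Cluster"

Get-VMHost -Location $cluster | ForEach-Object {
    $_ | Set-VMHost -State Maintenance -VsanDataMigrationMode NoDataMigration
    Get-VsanDiskGroup -VMHost $_ | Remove-VsanDiskGroup -Confirm:$false
}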
Recommendation: Ensure that all workloads and respective data has been fully evacuated from the vSAN stretched cluster prior to decommissioning. With vSAN 7, this could include many other data types beyond VMs, including first-class disks for cloud native storage, NFS shares through vSAN file services, and iSCSI LUNs through the vSAN iSCSI service. The decommissioning process will be an inherently destructive process to any data remaining on the disk groups in the hosts that are being decommissioned.
Summary
The exact steps for decommissioning a vSAN cluster remain similar across all types of vSAN topologies. The area of emphasis should be ensuring that all of the VMs and data housed on a vSAN cluster have been fully evacuated before any decommissioning begins.
Section 10: 2-Node
Replacing vSAN Witness Host in 2-Node Cluster
vSAN storage policies with FTT=1 using mirroring require 3 nodes or fault domains to be available in order to meet storage policy compliance. This means that 2-node vSAN clusters require a third host, and a vSAN Witness Host is used to provide this quorum. The vSAN Witness Host stores witness components for each vSAN object. These witness components are used as a tiebreaker to assist with ensuring accessibility to a vSAN object, such as a VMDK, VSWP, or namespace, in the event of a failure or network partition.
One of the two nodes in the cluster is designated as the preferred node. When the two nodes in the cluster cannot communicate with each other, they are considered partitioned from each other.
- Data nodes isolated from each other: If both the preferred and non-preferred nodes can communicate with the vSAN Witness Host, the cluster will continue to operate with the preferred node and the vSAN Witness Host while the non-preferred node is isolated.
- Preferred node failure or complete isolation: If the preferred node fails or is completely isolated from the non-preferred node and the vSAN Witness Host, the non-preferred node becomes the primary with the vSAN Witness Host. Note that either node in a 2-node cluster may be designated to be the preferred and may be changed when required.
This third host can be a physical ESXi host or a VMware-provided virtual appliance, which is referred to as a vSAN Witness Appliance.
Replacing a vSAN Witness Host is not a typical or regular task, but one that could be required in some of the following scenarios:
- Physical host is being decommissioned
- Due to ESXi CPU requirements (e.g., upgrading from vSAN 6.6 to 6.7+)
- Host is at the end of a lease lifecycle
- vSAN Witness Appliance
- Has been deleted
- Has been corrupted
- Does not meet the vSAN Witness Component count requirement in cases where a cluster has grown in component count
The process of replacing a vSAN Witness Host is relatively simple when a suitable vSAN Witness Host has been configured. The fault domains section in the vSAN configuration UI provides a quick and easy mechanism to swap the currently configured vSAN Witness Host with an alternate host for this purpose.
FIGURE 10-1: Using the “Change Witness Host” option to replace the currently configured witness host in stretched cluster
As of vSAN 7, witness components will be recreated immediately after replacing a witness host appliance. For versions prior to vSAN 7, witness components can be manually recreated in the vSAN health UI using the “Repair objects immediately” operation.
As of vSAN 7 U3, a witness host appliance for a 2-node cluster can be updated using vLCM. This is limited to 2-node topologies that use a dedicated witness host appliance, and do not share a witness host appliance across multiple 2-node clusters. See the post: "vSphere Lifecycle Manager in vSAN 7 U3" for more details.
More detailed information can be found in the vSAN 2 Node Guide here: “Replacing a Failed vSAN Witness Host.”
Configure a 2-Node Cluster
2-node vSAN clusters do share some similarities with vSAN stretched clusters, but with a few design and operational differences. Note that design and operational differences evolve as new versions of vSAN are released. See "New Design and Operation Considerations for vSAN 2-Node Topologies" for an example.
Similarities to stretched clusters
2-node vSAN clusters inherit the same architecture as a vSAN stretched cluster. The stretched cluster architecture includes nodes in two separate fault domains for availability and a third node, called the vSAN Witness Host, to assist in the event they are isolated from each other. Traditional vSAN clusters are typically located in a single location. Stretched cluster resources are typically located in two distinct locations, with the tiebreaker node in a third location. 2-node vSAN cluster resources are often in a single location, but the vSAN Witness Host often resides in a distinct alternate location.
vSAN Witness Host
The vSAN Witness Host can be physical or the VMware-provided vSAN Witness Appliance. This “host” must have proper connectivity to vCenter and to each host in the vSAN cluster. Communication with the vSAN Witness Host is always Unicast and requires additional ports open between itself and the vSAN cluster.
Unlike vSAN stretched clusters, 2-node vSAN topologies can "share" a witness host appliance across multiple 2-node environments, meaning that a single witness host appliance can provide the witness host functionality for up to a maximum of 64 2-node clusters. This capability was introduced in vSAN 7 U1, and improves resource efficiency by minimizing the number of virtual witness host appliances running at the primary data center. Due diligence should be taken to determine whether, and to what degree, consolidation of witness host appliances should occur for an environment. The witness host appliance provides a very important role in determining site quorum and availability under a fault condition. Consolidating this responsibility to a single witness host appliance increases the "dependency domain" for these sites.
Recommendation: Balance the desire to consolidate witness host appliances with the implications of increasing the dependency domain for the 2-node clusters using a shared witness. While you can share a single witness with up to 64 2-node clusters, a smaller consolidation ratio, which would reduce the size of the dependency domain, may be more appropriate for your particular business requirements and risk tolerance.
vSAN Witness Host networking
Connectivity to the vSAN Witness Host from the vSAN 2-node cluster is typically over Layer 3 networking. In vSAN 6.5, the Witness Traffic Separation (WTS) feature was introduced for 2-node clusters to allow for cluster and witness communication when it is desirable for the data nodes to be directly connected to each other. The vSAN Witness Host itself will only ever have a VMkernel interface tagged for vSAN traffic, and vSAN uses the same TCP/IP stack as the management VMkernel interface. When using Layer 3 to communicate with a vSAN cluster, static routing is required. It is always best to isolate management traffic from workload traffic, but organizations may choose to have vSAN Witness Host communication with vSAN clusters performed over the same network. This is supported by VMware but should align with an organization's security and risk guidance. If the management VMkernel interface is tagged for vSAN traffic, static routing is not required.
Additionally, when configuring a vSAN Witness Host to communicate with a vSAN cluster, if communication with the vSAN cluster is performed using a separate VMkernel interface, that interface cannot be on the same network as the management interface. A multi-homing issue occurs, causing vSAN traffic to use the untagged management interface. This is not a vSAN-centric issue and is detailed in-depth in “Multi-homing on ESXi/ESX.”
In vSAN 7 U1 and later, static routes are no longer necessary, as a default gateway override can be set for both stretched cluster and 2-node configurations. This removes the manual, static routing configuration steps performed at the command line through ESXCLI or PowerCLI, and provides an improved level of visibility to the current host setting.
Recommendation: Upgrade to vSAN 7 U1 to simplify the initial configuration and ongoing operation of stretched cluster and 2-node environments.
vSAN Witness Host sizing
The vSAN Witness Appliance can be deployed using one of three profiles: tiny, normal, and large. Each of these is sized according to the number of vSAN components it can support. The tiny profile is typically used with 2-node vSAN clusters, as they seldom have more than 750 components.
Networking configuration settings
Before configuring a 2-node vSAN cluster, it is important to determine what type of networking will be used. The vSAN Witness Host must communicate with each vSAN cluster node. Just as the vSAN Witness Host has a requirement for static routing, vSAN data nodes also require static routing to communicate with the vSAN Witness Host.
WTS
Configurations using WTS should have Witness traffic configured before using the vSAN wizard or Cluster Quickstart in the vSphere UI.
vSAN VMkernel interfaces are required to have the same MTU configuration. In use cases where the vSAN Witness Host is in a different location, technologies such as an IPSEC VPN may be used. Overhead, such as the additional headers required by an IPSEC VPN, could reduce the MTU value across all nodes, as they are required to have the same MTU value. vSAN 6.7 Update 1 introduced support for mixed MTU sizes when using WTS. Witness-tagged VMkernel interface MTUs must match the MTU of the vSAN-tagged VMkernel interface on the vSAN Witness Host. Having the vSAN Witness Host and networking in place before creating a 2-node vSAN cluster will help ensure the configuration succeeds.
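As a quick consistency check, PowerCLI can list the MTU of every VMkernel interface on the hosts in a cluster. This is a minimal sketch, assuming a cluster named "2NodeCluster01" and an existing vCenter connection; adjust the cluster name for your environment.
# List each VMkernel adapter, its MTU, and whether it is tagged for vSAN traffic
Get-Cluster "2NodeCluster01" | Get-VMHost | Get-VMHostNetworkAdapter -VMKernel |
Select-Object VMHost, Name, Mtu, VsanTrafficEnabled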
Recommendations: Deploy a vSAN Witness Host/Appliance before attempting to configure the 2-node cluster, and do not configure management and vSAN traffic interfaces on the same network. The exception here is that the management interface can be tagged for vSAN traffic instead of the secondary interface.
Decommission a vSAN 2-Node Cluster
Decommissioning a 2-node vSAN cluster is not unlike decommissioning a standard or stretched vSAN cluster. The decommissioning process most often occurs when business changes require an adjustment to the underlying topology. Perhaps a few smaller vSAN clusters are being merged into a larger vSAN cluster, or maybe the cluster is simply being decommissioned because a replacement vSAN cluster has already been deployed into production. Note that this task should not be confused with converting a 2-node cluster to a standard vSAN cluster.
Key Considerations
Since this task involves permanent shutdown of the hosts that make up a vSAN cluster, an assumption is that all VMs and data have been migrated off of this cluster at some point. It will be up to the administrator to verify that this prerequisite has been performed. Once the hosts are no longer housing any VM data, the vSAN performance service should also be disabled. The vSAN performance service houses its performance data on the vSAN datastore much like a VM does; thus, it must be disabled in order to properly decommission the respective disk groups in the hosts that comprise the cluster.
Hosts can be decommissioned from the cluster by first entering them into maintenance mode. From there, disk groups can be deleted, which will clear the vSAN metadata on all of the capacity devices in the disk group. Disk groups can also be deleted using the vSphere host client and PowerCLI. Examples of such cases can be found in the PowerCLI Cookbook for vSAN at: https://vmware.com/go/powercli4vsan
Once the 2-node cluster is fully decommissioned, the witness host appliance that was responsible for providing quorum for the 2-node cluster can also be decommissioned. Note that if the witness host appliance is being shared across multiple 2-node topologies, DO NOT decommission the witness host appliance.
Recommendation: Ensure that all workloads and their respective data have been fully evacuated from the vSAN 2-node cluster prior to decommissioning. With vSAN 7, this could include many other data types beyond VMs, including first-class disks for cloud native storage, NFS shares through vSAN file services, and iSCSI LUNs through the vSAN iSCSI service. The decommissioning process will be inherently destructive to any data remaining on the disk groups in the hosts that are being decommissioned.
Summary
The exact steps for decommissioning a vSAN cluster remain similar across all types of vSAN topologies. The area of emphasis should be ensuring that all of the VMs and data housed on a vSAN cluster have been fully evacuated before any decommissioning begins.
Creating a Static Route for vSAN in a 2-Node Cluster
Standard vSAN cluster configurations most often use a single layer 2 network to communicate between hosts. Thus, all hosts are able to communicate with each other without any layer 3 routing. In other vSAN cluster configurations, such as stretched clusters and 2-node clusters, vSAN hosts must be able to communicate with other hosts in the same cluster that reside on a different layer 3 network. vSAN uses a dedicated VMkernel interface that does not have its own default gateway. Therefore, in these topologies, static routes must be set to ensure that vSAN traffic can properly reach all hosts in the vSAN cluster, regardless of which layer 3 network is used.
In vSAN 7 U1 and later, static routes are no longer necessary, as a default gateway override can be set for both stretched cluster and 2-node configurations. This removes the manual, static routing configuration steps performed at the command line through ESXCLI or PowerCLI, and provides an improved level of visibility to the current host setting.
Recommendation: Upgrade to vSAN 7 U1 to simplify the initial configuration and ongoing operation of stretched cluster and 2-node environments.
In a 2-node arrangement, these static routes ensure that vSAN tagged VMkernel traffic can properly route to its intended destination. Without the static route, the vSAN tagged traffic will attempt to use the default gateway of the management VMkernel interface, which may result in no communication, or reduced performance as a result of traversing a non-optimal path.
Configuring static routes
Static routing must be configured on each host in the participating vSAN cluster. It can be configured using the esxcfg-route command line utility or using PowerCLI with the Set-VMHostRoute cmdlet.
FIGURE 10-2: Using the esxcfg-route command line utility
After setting a static route, ensure the VMkernel interface can use vmkping to communicate with a destination address that requires the static route.
FIGURE 10-3: Verifying with vmkping
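For example, a static route can also be added with PowerCLI rather than at the command line. The sketch below is a minimal, hypothetical example: it assumes a host named "esx01.lab.local", a remote witness network of 192.168.15.0/24 reachable through a local gateway of 192.168.110.1, and an existing vCenter connection. Substitute the addresses used in your environment.
# Add a static route on the host so vSAN-tagged traffic can reach the witness network
$VMHost = Get-VMHost "esx01.lab.local"
New-VMHostRoute -VMHost $VMHost -Destination 192.168.15.0 -PrefixLength 24 -Gateway 192.168.110.1 -Confirm:$false
# Review the routing table on the host to confirm the new entry
Get-VMHostRoute -VMHost $VMHost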
Recommendation: If you are running into network partition health check alerts, or any other issues discovered in the vSAN health checks, review and verify all static route settings for all hosts in the cluster. A large majority of issues seen by VMware Global Support Services for stretched cluster and 2-node configurations are a result of improper or missing static routes on one or more hosts in the cluster.
Summary
The flexibility of vSAN allows for it to provide storage services using many different topologies. When a topology (such as stretched clusters, or 2-node configurations) is using more than one layer 3 network, the use of static routes on the ESXi hosts in the vSAN cluster is a critical step to ensure proper communication between hosts.
HA Settings for 2-Node vSAN Clusters
vSphere High Availability (HA) is what provides the failure handling of VMs running on a given host in a vSphere cluster. It will restart VMs on other hosts in a vSphere cluster when a host has entered some form of failure or isolation condition.
vSAN and vSphere HA work together to ensure that VMs running on a partitioned host or hosts are restarted on the hosts that have quorum. vSAN 2-Node configurations use custom HA settings because of their unique topology. While this is typically an "initial configuration" item for a cluster, it is not uncommon to see this set incorrectly, and it should be a part of typical operating procedures and checks for a 2-Node cluster.
Recommendation: Use vLCM in 2-Node configurations to simplify operations. vLCM will automatically disable and re-enable HA during the upgrade process, which VUM is unable to do.
vSphere HA Settings
For 2-Node vSAN environments, the "Host Monitoring" toggle should be enabled, VMs should be restarted upon a host failure, and VMs should be powered off and restarted when a host is isolated. These settings ensure that VMs are restarted in the event a host fails or is isolated.
FIGURE 10-4: Cluster HA settings
Admission Control accounts for the availability of resources should one or more hosts fail. In a 2-Node vSAN cluster, since only a single host can fail while still maintaining resources, the host failover capacity should be set to 50%. The total utilization of CPU and memory for workloads on a 2-Node vSAN cluster should never be more than a single node can accommodate. Adhering to this guideline will ensure sufficient resources are available should a node fail.
FIGURE 10-5: Cluster HA Admission Control settings
When attempting to place a host in Maintenance Mode in a 2-Node vSAN cluster, the above Admission Control settings will prevent VMs from moving to the alternate host.
FIGURE 10-6: Admission Control warnings
To allow a host to go into Maintenance Mode, it is necessary to either disable HA completely, or disable Admission Control temporarily while maintenance is being performed.
FIGURE 10-7: Temporarily disabling Admission Control
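If this is a recurring task, the Admission Control toggle can also be scripted with PowerCLI. A minimal sketch, assuming a cluster named "2NodeCluster01" and an existing vCenter connection:
# Temporarily disable HA Admission Control so the remaining host can accept the VMs
Set-Cluster -Cluster "2NodeCluster01" -HAAdmissionControlEnabled $false -Confirm:$false
# ...place the host into maintenance mode and perform the maintenance...
# Re-enable Admission Control once both hosts are back in service
Set-Cluster -Cluster "2NodeCluster01" -HAAdmissionControlEnabled $true -Confirm:$false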
Recommendation: Do not attempt to set an isolation address for a 2-Node vSAN cluster. Setting of an isolation address is not applicable to this topology type.
Summary
2-Node vSAN topologies use HA configuration settings that may be unique in comparison to traditional and stretched cluster environments. In order to ensure that a 2-Node environment behaves as expected under a host failure condition, be sure that these settings are configured correctly.
Advanced Options for 2-Node Clusters
vSAN stretched clusters are designed to service reads from the site in which the VM resides in order to deliver low read latency and reduce traffic over an inter-site link. 2-Node vSAN clusters are built on the same logic as a stretched cluster. One can think of each host as a single site, but residing in the same physical location. In these 2-Node configurations, servicing the reads from just a single host would not reflect the capabilities of the topology (2 nodes, directly connected). Fortunately, there are settings available to ensure that reads are serviced optimally in this topology.
Advanced Options
When configuring 2-Node clusters, disabling the "Site Read Locality" setting will allow reads to be serviced by both hosts. Disabling "Site Read Locality" is the preferred setting for 2-Node configurations, whereas enabling "Site Read Locality" is the preferred setting for stretched cluster configurations.
FIGURE 10-8: The Site Read Locality cluster setting
Both all-flash and hybrid based 2-Node vSAN clusters will benefit from disabling "Site Read Locality." All-flash vSAN clusters use 100% of the cache devices as a write buffer, but read requests will also check the write buffer before the capacity devices. Allowing reads to be serviced from both hosts means there is a much higher likelihood that read requests could be served by recently written data in the buffer across both hosts, thereby improving performance.
Recommendation: For environments with dozens or even hundreds of 2-Node deployments, use PowerCLI to periodically check and validate recommended settings such as "Site Read Locality" and vSphere HA settings.
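A simple PowerCLI check can report the current state of this setting across hosts. The sketch below is a minimal example; it assumes a cluster named "2NodeCluster01", an existing vCenter connection, and that read locality is exposed as the host advanced setting VSAN.DOMOwnerForceWarmCache (where a value of 1 corresponds to disabled read locality). Verify the setting name and expected value against VMware documentation for your vSAN version before relying on this check.
# Report the read locality advanced setting for each host in the 2-node cluster
Foreach ($VMHost in Get-Cluster "2NodeCluster01" | Get-VMHost) {
# Assumed setting name; confirm it for your vSAN version
Get-AdvancedSetting -Entity $VMHost -Name "VSAN.DOMOwnerForceWarmCache" |
Select-Object @{N="Host";E={$VMHost.Name}}, Name, Value
}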
Summary
vSAN 2-Node topologies borrow many of the same concepts used for vSAN stretched clusters. There are subtle differences in a 2-Node topology that necessitate the adjustment of the "Site Read Locality" setting for optimal performance.
Section 11: vCenter Server Maintenance and Event Handling
Upgrade Strategies for vCenter Server Powering One or More vSAN Clusters
It is common for vCenter to host multiple vSAN clusters, which could be at different ESXi versions from each other and, as such, different vSAN versions. This is a fully supported configuration, but it is a good idea to ensure that your vCenter and ESXi versions are compatible with one another.
The software compatibility matrix easily shows which versions of ESXi are compatible with your target vCenter. (As an example, ESXi 5.5 and vSAN 5.5 are not compatible with vCenter 6.7 U2, so those hosts would need to be upgraded to at least 6.0 first.)
FIGURE 11-1: Interoperability checks courtesy of VMware’s software compatibility matrix
Always update your vCenter before your ESXi hosts. If you intend to migrate to vSAN 8, and your current hosts are on a mixture of 7 U1 and 7 U2, you would upgrade your vCenter to 8 (or whatever is the latest version) before upgrading your ESXi hosts.
The recommendation to always update vCenter prior to updating vSAN hosts to the next major version still applies. The reason for this is that, as versions increase and API versions and compatibilities change, the only way to guarantee that vCenter can still make sense of its communications with the ESXi hosts after an upgrade is if they are on the same software version.
An example that would break compatibility: if you had a cluster at 7 U2 and vCenter at 7 U2, and you upgraded the hosts to ESXi 8 U1 while the vCenter was still at 7 U2, then vCenter would have no way of knowing how to talk to the upgraded hosts if API calls changed between versions, which is likely. Good upgrade hygiene for VMware solutions is threefold. First, verify that your vSAN hosts' components are supported on the target version of ESXi and vSAN by checking the vSAN Hardware Compatibility List (HCL), and if any firmware upgrades are required, do those first.
Next, make sure that all your integrated components (such as Aria Automation, NSX, Aria Operations, and vCloud Director) are all compatible with the target version of vCenter and ESXi by checking the product interoperability matrix. Finally, once you have established the product versions for each solution you have in your environment, you can follow the upgrade workflow/guide to ensure all components are upgraded in the right order.
Replacing vCenter Server on Existing vSAN Cluster
There may be instances in which you need to replace the vCenter server that hosts some vSAN clusters. While vCenter acts as the interaction point for vSAN and is used to set it up and manage it, it is not the only source of truth and is not required for steady-state operations of the cluster. If you replace the vCenter server, your workloads will continue to run without it in place.
FIGURE 11-2: Understanding the role of vCenter in a vSAN cluster
Replacing the vCenter server associated with a vSAN cluster can be done, but is not without its challenges or requisite planning. There is an excellent blog covering the operations behind this, and there was a Knowledge Base article recently released with the process listed in detail. Below are the broad strokes of a vSAN cluster migration to a new vCenter server (taken from the Knowledge Base).
Scenario: An all-flash vSAN cluster on 7 U2 with deduplication and compression (DD&C) enabled needs to be migrated to a new vCenter server. Given the version of vSAN noted, this cluster is using the Original Storage Architecture (OSA).
- Ensure the target vCenter server has the same vSphere version as the ESXi hosts, or higher (same is preferable).
- Create a new cluster on the new vCenter server with the same settings as the source cluster (vSAN enabled, DD&C, encryption, HA, DRS) and ensure the Disk Claim Mode is set to Manual.
- If you are using a Distributed Switch on the source vCenter server, export the vDS and import it into the new vCenter server, ensure “Preserve original distributed switch port group identifiers” is NOT checked upon import.
- Recreate all SPBM policies on the target vCenter server to match the source vCenter server.
- Disconnect all hosts from the source vCenter server.
- Remove hosts from the source vCenter server inventory.
- Add hosts into the new vCenter server.
- Drag the hosts into the new cluster.
- Verify hosts and VMs are contactable.
- Run esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates on all hosts.
- Configure hosts to use the imported vDS one by one, ensuring connectivity is maintained.
- Reconfigure a VM with the same policy as the source, ensuring no resynchronization occurs when the VM is reconfigured.
- For each SPBM policy, reconfigure one of each VM as a test to ensure no resynchronization is performed.
- Once verified, reconfigure all VMs in batches with their respective SPBM policies.
Recommendation: If you are not completely comfortable with the above procedure and doing this in a live environment, please open a ticket with GSS and have them guide you through the procedure.
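As an illustration of the esxcfg-advcfg step in the list above, the same advanced setting can be applied to every host with PowerCLI rather than logging in to each host individually. This is a minimal sketch that assumes the setting is exposed to PowerCLI as VSAN.IgnoreClusterMemberListUpdates, that the hosts have already been added to a cluster named "NewCluster01", and that an existing vCenter connection is in place; confirm the value the Knowledge Base procedure calls for in your scenario before running it.
# Set the advanced option on every host in the newly created cluster
Foreach ($VMHost in Get-Cluster "NewCluster01" | Get-VMHost) {
# Assumed setting name; value should match the documented procedure
Get-AdvancedSetting -Entity $VMHost -Name "VSAN.IgnoreClusterMemberListUpdates" |
Set-AdvancedSetting -Value 0 -Confirm:$false
}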
Summary
Replacing a vCenter server for an existing vSAN cluster is an alternate method for restoring vCenter should a backup of vCenter not be available. With a little preparation, a vCenter server can be replaced with a clean installation and the vSAN management plane will continue to operate as expected.
Protecting vSAN Storage Policies
Storage policy based management (SPBM) is a key component of vSAN. All data stored on a vSAN cluster (VM data, file services shares, first-class disks, and iSCSI LUNs) is stored in the form of objects. Each object stored in a vSAN cluster is assigned a single storage policy that helps define the outcome governing how the data is placed.
Storage policies are a construct of a vCenter server. Similar to vSphere Distributed Switches (vDS), they are defined and stored on a vCenter server, and can be applied to any supporting cluster that the vCenter server is managing. Therefore, when replacing a vCenter server in an already existing cluster (as described in "Replacing vCenter Server on Existing vSAN Cluster" in this section of the operations guide), the storage policies will either need to be recreated, or imported from a previous time in which they were exported from vCenter.
Protecting storage policies in bulk form will simplify the restoration process, and will help prevent unnecessary resynchronization from occurring due to some unknown difference in the storage policy definition.
Procedures
The option of exporting and importing storage policies is not available in the UI, but a simple PowerCLI script will be able to achieve the desired result. Full details and additional options for importing and exporting policies using PowerCLI can be found in the PowerCLI Cookbook for vSAN.
Back up all storage policies managed by a vCenter server:
# Back up all storage policies
# Get all of the Storage Policies
$StoragePolicies = Get-SpbmStoragePolicy
# Loop through all of the Storage Policies
Foreach ($Policy in $StoragePolicies) {
# Create a path for the current Policy
$FilePath = "/Users/Admin/SPBM/"+$Policy.Name+".xml"
# Remove any spaces from the path
$FilePath = $FilePath -Replace (' ')
# Export (backup) the policy
Export-SpbmStoragePolicy -StoragePolicy $Policy -FilePath $FilePath
}
Importing or restoring all storage policy XML files that reside in a single directory:
# Recover the policies stored in /Users/Admin/SPBM/
$PolicyFiles = Get-ChildItem "/Users/Admin/SPBM/" -Filter *.xml
# Enumerate each policy file found
Foreach ($PolicyFile in $PolicyFiles) {
# Get the policy XML file path
$PolicyFilePath = $PolicyFile.FullName
# Read the contents of the policy file to set variables
$PolicyFileContents = [xml](Get-Content $PolicyFilePath)
# Get the policy's name and description from the XML
$PolicyName = $PolicyFileContents.PbmCapabilityProfile.Name.'#text'
$PolicyDesc = $PolicyFileContents.PbmCapabilityProfile.Description.'#text'
# Import (restore) the policy
Import-SpbmStoragePolicy -Name $PolicyName -Description $PolicyDesc -FilePath $PolicyFilePath
}
When restoring a collection of storage policies to a newly built vCenter server, it will make the most sense to restore them at the earliest convenience so that vSAN has the ability to associate the objects with the respective storage policies.
Recommendation: Introduce scripts to automate the process of exporting your storage policies to a safe location on a regular basis. This practice is highly recommended for the vSphere Distributed Switches managed by vCenter, and should be applied to storage policies as well.
Summary
Exporting storage policies is an optional safeguard that makes the process of introducing a new vCenter server to an existing vSAN cluster easier. The effort to streamline this protection up front will make the steps for replacing a vCenter server more predictable, and easier to document in internal runbooks for such events.
Protecting vSphere Distributed Switches Powering vSAN
Virtual switches and the physical uplinks that are associated with them are the basis for connectivity in a vSAN powered cluster. Connectivity between hosts is essential for vSAN clusters since the network is the primary storage fabric, as opposed to three-tier architectures that may have a dedicated storage fabric.
VMware recommends the use of vSphere Distributed Switches (VDS) for vSAN. Not only do they provide additional capabilities to the hosts, they also provide a level of consistency, as the definition of the vSwitch and the associated port groups are applied to all hosts in the cluster. Since a vDS is a management construct of vCenter, it is recommended to ensure these are protected properly, in the event of unknown configuration changes, or if vCenter server is being recreated and introduced to an existing vSAN cluster.
Procedures
The specific procedures for exporting, importing, and restoring VDS configurations can be found at "Backing Up and Restoring a vSphere Distributed Switch Configuration" at docs.vmware.com. The process for each respective task is quite simple, but it is advised to become familiar with the process, and perhaps experiment with a simple export and restore in a lab environment to become more familiar with the task. This will help minimize potential confusion for when it is needed most. Inspecting the data.xml file included in the zip file of the backup can also provide a simple way to review the settings of the vDS.
Recommendation: The VDS export option provides the ability to export just the vDS, or the vDS and all port groups. You may find it helpful to perform the export twice, using both options. This will allow for maximum flexibility in any potential restoration activities.
Similar to the protection of vSAN storage policies, automating this task would be ideal. A sample import and export script can be found in the samples section of code.vmware.com at: https://code.vmware.com/samples/1120/import-export-vdswitch?h=vds
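If you prefer to keep this in your own tooling, PowerCLI also exposes export functionality for the vDS. The sketch below is a minimal example, assuming a distributed switch named "vDS-vSAN", an existing vCenter connection, and that your PowerCLI release includes the Export-VDSwitch cmdlet; the backup file path is illustrative only.
# Export the vDS configuration to a backup file for safekeeping
Get-VDSwitch -Name "vDS-vSAN" | Export-VDSwitch -Destination "/Users/Admin/VDS/vDS-vSAN-backup.zip"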
A vDS can apply to more than one cluster. In any scenario in which the vDS is restored from a backup, it will be important for the administrator to understand which clusters and respective hosts it may impact. Understanding this clearly will help minimize the potential for unintended consequences, and may also influence the naming/taxonomy of the vDS used by an organization as the number of clusters managed by vCenter continues to grow.
Summary
Much like VMware's guidance for protecting storage policies contained in vCenter, VMware recommends all vSphere Distributed Switches are protected in some form or another. Ideally, this should occur in an automated fashion at regular intervals to ensure the backups are up to date.
Section 12: Upgrade Operations
Upgrading and Patching vSAN Hosts
Upgrading and patching vSAN hosts is very similar to the process for vSphere hosts using traditional storage. The unique role that vSAN plays means there are additional considerations to be included in operational practices to ensure predictable results.
vSAN is a cluster-based storage solution. Each ESXi host in the vSAN cluster participates in a manner that provides a cohesive, single storage endpoint for the VMs running in the cluster. Since it is built directly into the hypervisor, ESXi depends heavily on the expected interaction between hosts to provide a unified storage system. This can be dependent on consistency of the following:
- The version of ESXi installed on the host.
- Firmware versions of key components, such as storage controllers, NICs, and BIOS.
- Driver versions (VMware-provided “inbox” or vendor-provided “async”) for the respective devices.
Inconsistencies between any or all of these may change the expected behavior between hosts in a cluster. Therefore, avoid mixing vSAN/ESXi versions in the same cluster for general operation. Limit inconsistency of versions on hosts to the cluster upgrade process, where updates are applied one host at a time until complete. The Knowledge Base contains additional recommendations on vSAN upgrade best practices.
Recommendation: Subscribe to the VCG Notification service in order to stay informed of changes with compatibility and support of your specified hardware and the associated firmware, drivers, and versions of vSAN. For more information, see the blog post "vSAN VCG Notification Service - Protecting vSAN Hardware Investments."
Method of updates
In versions prior to vSphere 7, the vSphere Update Manager (VUM) was the primary delivery method for updating vSphere and vSAN clusters. VUM centralizes the updating process, and vSAN's integration with VUM allows for updates to be coordinated with the HCL and the vSphere Release Catalog so that it only applies the latest version of vSAN that is compatible with the hardware. VUM also handles updates of firmware and drivers for a limited set of devices. For more information on the steps that VUM takes to update these devices, see the section "Using VUM to Update Firmware on Selected Storage Controllers." Note that NIC drivers and firmware as well as BIOS firmware are not updated by VUM, nor monitored by the vSAN health UI, but they play a critical role in the functionality of vSAN.
Note that VUM is being deprecated, with the intention of the much more powerful vSphere Lifecycle Manager acting as its replacement. See the vCenter Server 7.0 U3 Release Notes for more details.
While there have been recent additions to a very limited set of NIC driver checks (as described in "Health Checks—Delivering a More Intelligent vSAN and vSphere"), updating firmware and drivers using VUM is a largely manual process. Ensure that the correct firmware and drivers are installed, remain current to the recommended version, and are a part of the cluster lifecycle management process.
vSphere 7 introduced VMware vSphere Lifecycle Manager (vLCM), an entirely new solution for unified software and firmware management that is native to vSphere. vLCM is the next-generation replacement for vSphere Update Manager (VUM). It is built on a desired-state, or declarative, model and provides lifecycle management for the hypervisor and the full stack of drivers and firmware for the servers powering your data center. vLCM is a powerful new approach to simplified, consistent server lifecycle management at scale, and was built with the needs of server vendors in mind. VUM and vLCM coexist in vSphere 7 and vSphere 7 U1, with the intention of vLCM eventually being the only method of software lifecycle management for vSphere and vSAN.
vLCM has the capability of delivering updates beyond the scope of VUM, including NIC drivers. As of vSphere 7 U1, however, these additional components are not yet a part of the validation and health check process.
Recommendation: Focus on efficient delivery of services during cluster updates as opposed to speed of update. vSAN restricts parallel host remediation. A well-designed and -operating cluster will seamlessly roll through updating all hosts in the cluster without interfering with expected service levels.
Viewing vSphere/vSAN version levels and consistency
Hypervisor versions and patch levels can be found in a number of different ways.
vCenter—Version information can be found by clicking on the respective hosts within the vSAN cluster and viewing the “Summary” or “Updates” tab
FIGURE 12-1: Viewing a hypervisor version of a host in vCenter
PowerCLI—PowerCLI can be used to fetch a wide variety of host state information, as well as provide a vehicle for host patch remediation and verification. The "PowerCLI Cookbook for vSAN" offers practical examples (on page 50) of patch management with VUM using PowerCLI.
A PowerCLI script called VCESXivSANBuildVersion.ps1 will provide version level information per cluster.
FIGURE 12-2: Viewing a hypervisor version of multiple hosts using PowerCLI
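If you do not have that script handy, a quick ad hoc check can be done with standard PowerCLI cmdlets. A minimal sketch, assuming an existing vCenter connection made with Connect-VIServer:
# List the ESXi version and build number for every host, grouped by cluster
Get-Cluster | Get-VMHost |
Select-Object @{N="Cluster";E={$_.Parent.Name}}, Name, Version, Build |
Sort-Object Cluster, Name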
Aria Operations—Aria Operations includes a “Troubleshoot vSAN” dashboard which enumerates the hosts participating in a selected vSAN cluster, and provides some basic version level information.
FIGURE 12-3: Viewing host details of a vSAN cluster in Aria Operations
Upgrades and host restart times
The host upgrade process may consist of one or more host reboots, especially as firmware and driver updates become more common in the upgrade workflow. A host participating in a vSAN cluster typically takes longer to restart than a non-vSAN host, as vSAN digests log entries in the buffer to generate all required metadata tables; this restart time has been improved significantly in vSAN 7 U1. This activity is visible in the DCUI. The default vSAN data-migration option when placing a host into maintenance mode manually or using VUM is "Ensure accessibility." This minimizes data movement during the maintenance process to ensure data remains accessible but less resilient, and is typically the most appropriate option for most maintenance scenarios.
Host restart times can be accelerated through the ESXi Suspend-to-memory feature found in the Quick Boot functionality for vLCM. This feature, introduced in vSAN 7 U2, is only available for clusters using vLCM. While it is not applicable for all conditions, it can improve cluster update times dramatically. For more information, see Introducing vLCM into an Existing vSAN Environment.
Recommendation: Upgrade to vSAN 7 U2. This version features "durability components" that help ensure the durability of data during planned and unplanned events. An ancillary benefit of this capability is that resynchronizations complete much more quickly after a host exits maintenance mode.
Long host restart times are largely a byproduct of the Original Storage Architecture (OSA) in vSAN. The Express Storage Architecture (ESA) is built differently, and does not demonstrate this same behavior.
vSAN will hold off on rebuilding any data to meet storage policy compliance for 60 minutes. Depending on the updates being performed, and the host characteristics, temporarily increasing the “Object Repair Timer” may reduce resynchronization activity and make the host update process more efficient.
FIGURE 12-4: Setting the Object Repair Timer at the vSAN cluster level in vCenter
It is recommended that the Object Repair Timer remain at the default of 60 minutes in most cases, but it can be changed to best meet the needs of the organization.
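If a temporary change is warranted, the timer can be adjusted with PowerCLI as well as in the UI. This is a minimal sketch that assumes a cluster named "Cluster01", an existing vCenter connection, and that your PowerCLI release exposes the ObjectRepairTimerMinutes parameter on Set-VsanClusterConfiguration; remember to return the value to 60 after the maintenance window.
# Temporarily extend the object repair timer ahead of a lengthy host update
Set-VsanClusterConfiguration -Configuration "Cluster01" -ObjectRepairTimerMinutes 120
# ...perform the host updates...
# Restore the default value once maintenance is complete
Set-VsanClusterConfiguration -Configuration "Cluster01" -ObjectRepairTimerMinutes 60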
Health Findings
The vSAN health service provides a number of health findings (renamed from "health checks" in vSphere 8 U1) to ensure the consistency of the hypervisor across the cluster. This is an additional way to alert for potential inconsistencies. The vSAN health findings may also show alerts of storage controller firmware and driver inconsistencies. Notable health checks relating to updates include:
- Customer Experience Improvement Program (necessary to send critical updates regarding drivers and firmware)
- vCenter server up-to-date
- vSAN build recommendation engine health
- vSAN build recommendation
- vSAN release catalog
- vCenter state is authoritative
- vSAN software version compatibility
- vSAN HCL DB up-to-date
- SCSI controller is VMware certified
- Controller is VMware certified for ESXi release
- Controller driver is VMware certified
- Controller firmware is VMware certified
- vSAN firmware provider health
- vSAN firmware version recommendation
Recommendation: Keep the vCenter server running the very latest version, regardless of the version of the vSAN clusters it is managing. vSphere and vSAN hosts require a version of vCenter that is equal to or newer than the hosts managed. Updating the hosts without updating the vCenter server may lead to unexpected operational behaviors. Running the latest version of vCenter will also allow environments running multiple vSAN clusters to phase in the latest version per cluster, in a manner and time frame that works best for the organization.
Host updates may take a while to complete the full host restart and update process. Using the DCUI during host restarts can help provide better visibility during this stage. See the section “Restarting a Host in Maintenance Mode” for more details.
Disk Format Conversion
Some vSAN features require a minimum vSAN version and a minimum disk format version. You might be required to upgrade the disk format before using a new vSAN feature. This is known as a disk format conversion (DFC). Older versions of vSAN sometimes require the evacuation of data from a drive before the DFC is completed. The evacuation of data can take a significant amount of time, which means the DFC can be a lengthy process. This is less common with recent versions of vSAN meaning the process is less disruptive and much faster. For more detail on On-Disk and Object format upgrades in vSAN, see the post: "Upgrading On-Disk and Object Formats in vSAN."
Recommendation: Review this VMware Knowledge Base article prior to performing a vSphere/vSAN upgrade: vSAN upgrade requirements
Supplemental VMware Knowledge Base articles:
- Understanding vSAN on-disk format versions and compatibility
- Build numbers and versions of VMware vSAN
- VMware vSAN Upgrade Best Practices
Summary
A vSAN cluster, not the individual hosts, should be viewed as the unit of management. In fact, this is how the new vLCM treats the upgrade process. During normal operations, all vSAN hosts that compose a cluster should have matching hypervisor, driver, and firmware versions. Host upgrades and patches should be performed per cluster to provide consistency of hypervisor, firmware, and driver versions across the hosts in the cluster.
Upgrade considerations for 2-Node and Stretched Cluster topologies
When upgrading a cluster, vSAN 2-node and stretched cluster topologies behave in a very similar way to standard vSAN clusters. The vCenter server is updated, followed by a rolling upgrade through the respective hosts in the vSAN cluster. One of the most significant differences between these 2-node and stretched clusters is the use of a virtual witness host appliance. A standard vSAN cluster does not use this type of an arrangement.
For both 2-node and stretched clusters, as of vSAN 7 U1, the witness host appliance should be upgraded prior to upgrading the hosts in the cluster that uses the witness host. This helps maintain backward compatibility and is an operational change from past versions of vSAN.
Another special consideration to keep in mind when it comes to 2-node clusters with a shared witness host is the on-disk format upgrade. The on-disk format upgrade in a vSAN environment with multiple 2-node clusters and a shared witness should only be performed once all the hosts sharing the same witness are running the same vSphere/vSAN version. See Upgrading vSAN 2-node Clusters with a Shared Witness from 7U1 to 7U2 for more details.
In vSAN 7 U3, vLCM supports the lifecycle management of the witness host appliance, and will upgrade it in the correct order with respect to the rest of the hosts in the cluster.
Upgrade considerations when using vSAN HCI with Datastore sharing
In versions prior to vSAN 7 U1, updating hosts in a vSAN cluster typically had an impact domain of just the VMs powered by the given cluster. In vSAN 7 U1 and later, vSAN HCI with datastore sharing (previously known in the product as "HCI Mesh") was introduced, which allows a VM using compute resources on one vSAN cluster to use the storage resources of another cluster. The Skyline health checks include several checks to ensure that a cluster satisfies all prerequisites for vSAN HCI with datastore sharing. One of the relevant checks is referred to as "vSAN format version supports remote vSAN."
Recommendation: While not required, it may be best to factor in clusters using vSAN HCI with datastore sharing, and coordinate the updates of clusters that have relationships to one another. Determining the timing and order of these cluster upgrades may improve the update experience.
The upgrade process may also be a good time to ensure that your HA settings for the cluster are correctly configured. vSAN HCI with datastore sharing has special requirements to ensure APD events are handled as expected.
Multi-Cluster Upgrading Strategies
While VMware continues to introduce new levels of performance, capability, robustness, and ease of use to vSAN, the respective vSAN clusters must be updated to benefit from these improvements. While the upgrade process continues to be streamlined, environments running multiple vSAN clusters can benefit from specific practices that will deliver a more efficient upgrade experience.
vCenter server compatibility
In a multi-cluster environment, vCenter server must be running a version equal to or greater than the version to be installed on any of the hosts in the clusters it manages. Ensuring that vCenter server is always running the very latest edition will guarantee compatibility among all potential host versions running in a multi-cluster arrangement, and introduce enhancements to vCenter that are independent of the clusters it is managing.
Recommendation: Periodically check that vCenter is running the latest edition. The vCenter Server Appliance Management Interface (VAMI) can be accessed using https://vCenterFQDN:5480.
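The current vCenter version and build can also be checked from a PowerCLI session, which is convenient when validating several vCenter servers. A minimal sketch, assuming an existing connection made with Connect-VIServer:
# Report the version and build of the currently connected vCenter server
$global:DefaultVIServer | Select-Object Name, Version, Build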
Phasing in new versions of vSAN
As noted in the “Upgrading and Patching vSAN Hosts” section, vSAN is a cluster-based solution. Therefore, upgrades should be approached per cluster, not per host. With multi-cluster environments, IT teams can phase in a new version of vSAN per cluster to meet any of their own vetting, documentation, and change control practices. Similar to common practices in application maintenance, upgrades can be phased in on less critical clusters for testing and validation prior to rolling out the upgrade into more critical clusters.
FIGURE 12-5: Phasing in new versions of vSAN per cluster
Cluster update procedures are not just limited to hypervisor upgrades, but should also include firmware and drivers for NICs, storage controllers, and BIOS versions. See "Upgrading Firmware and Drivers for NICs and Storage Controllers" and "Using VUM to Update Firmware on Selected Storage Controllers" for more detail.
Recommendation: Update to the very latest version available. If a cluster is several versions behind, there is no need to update the versions one at a time. The latest edition has more testing and typically brings a greater level of intelligence to the product and the conditions it runs in.
Parallel upgrades
While vSAN limits the upgrade process to one host at a time within a vSAN cluster, cluster upgrades can be performed concurrently if desired. In fact, as of vSAN 7 U1, vLCM supports up to 64 concurrent cluster update activities. This can speed up host updates across larger data centers. Whether to update one cluster or multiple clusters at a time is at your discretion based on understanding tradeoffs and your procedural limitations.
Updating more hosts simultaneously should be factored into the vSAN cluster sizing strategy. More clusters with fewer hosts allows for more parallel remediation than fewer clusters with more hosts. For example, an environment with 280 hosts could cut remediation time in half if the design was 20 clusters of 14 hosts each, as opposed to 10 clusters of 28 hosts each.
Since a vSAN cluster is its own discrete storage system, administrators may find greater agility in operations and troubleshooting. “vSAN Cluster Design—Large Clusters Versus Small Clusters” discusses the decision process of host counts and cluster sizing in great detail.
Larger environments with multiple vSAN clusters may have different generations of hardware. Since drivers and firmware can cause issues during an update process, concurrent cluster upgrades may introduce operational challenges to those managing and troubleshooting updates. Depending on the age and type of hardware, a new version of vSAN could be deployed as a pilot effort to a few clusters individually, then could be introduced to a larger number of clusters simultaneously. Determine what level of simultaneous updates is considered acceptable for your own organization.
Recommendation: Focus on efficient delivery of services during cluster updates, as opposed to speed of update. vSAN restricts parallel host remediation. A well-designed and -operating cluster will seamlessly roll through updating all hosts in the cluster without interfering with expected service levels.
Why are vSAN clusters restricted to updating one host at a time? Limiting to a single host per cluster helps reduce the complexity of subtracting not only compute resources but storage capacity and performance. Factoring in available capacity in addition to compute resources is unique to an HCI architecture. Total available host count can also become important for some data placement policies such as an FTT=3 using mirroring, or an FTT=2 using RAID-6 erasure coding. Limiting the update process to one host at a time per cluster also helps avoid this complexity, while reducing the potential need for data movement due to resynchronization.
See the "VMware KB 2146381 - VMware vSAN Upgrade Best Practices" for more information on how to successfully upgrade a vSAN cluster.
Summary
For data centers, the availability of services generally takes precedent over everything else. Environments consisting of multiple vSAN clusters can take advantage of its unique, modular topology by phasing in upgrades per cluster to the hypervisor, as well as any dependent hardware updates including storage controllers, NICs, and BIOS versions.
Upgrading Large vSAN Clusters
Standard vSAN clusters can range from 3 to 64 hosts. Since vSAN provides storage services per cluster, a large cluster is treated in the same way as a small cluster: as a single unit of services and management. Maintenance should occur per cluster and is sometimes referred to as a “maintenance domain.”
Upgrading vSAN clusters with a larger quantity of hosts is no different than upgrading vSAN clusters with a smaller quantity of hosts. In addition to those described in "Upgrading and Patching vSAN Hosts," there are a few additional host upgrade considerations to be mindful of during these update procedures.
FIGURE 12-6: Visualizing the “maintenance domain” of a single large vSAN cluster
vLCM and VUM are limited to updating one host at a time in a vSAN cluster. The length of time for the cluster to complete an update is proportional to the number of hosts in a cluster. To upgrade more than one host at a time, reduce the size of the maintenance domain by creating more clusters comprising fewer hosts. This smaller maintenance domain will allow for more hosts (one per cluster) to perform parallel upgrades.
Designing an environment that has a modest maintenance domain is one of the most effective ways to improve operations and maintenance of a vSAN-powered environment. For more information on this approach, see the topic “Multi-Cluster Upgrading Strategies.”
While no more than one host per vSAN cluster can be upgraded at a time, there are some steps that can be taken to potentially improve the upgrade speed.
- Use hosts that support the new Quick Boot feature. This can help host restart times. Since hosts in a vSAN cluster are updated one after the other, reducing host restart times can significantly improve the completion time of the larger clusters.
- Update to vSAN 7 U1. The host restart times in vSAN 7 U1 have improved dramatically over previous versions.
- If a large cluster has relatively few resources used, an administrator may be able to place multiple hosts into maintenance mode safely without running short of storage and capacity resources. Updates will still occur one host at a time, but this may save some time placing the respective hosts into maintenance mode. This would only be possible in large clusters that are underused, and actual time savings may be negligible.
Recommendation: Focus on efficient delivery of services during cluster updates, as opposed to speed of update. vSAN restricts parallel host remediation. A well-designed and -operating cluster will seamlessly roll through updating all hosts in the cluster without interfering with expected service levels.
Larger vSAN clusters may better absorb reduced resources as a host enters maintenance mode for the update process. Proportionally, each host contributes a smaller percentage of resources to a cluster. Large clusters may also see slightly less data movement than much smaller clusters to comply with the "Ensure accessibility" data migration option when a host is entered into maintenance mode. For more information on the tradeoffs between larger and smaller vSAN clusters, see "vSAN Cluster Design—Large Clusters Versus Small Clusters" on core.vmware.com.
See the "VMware KB 2146381 - VMware vSAN Upgrade Best Practices" for more information on how to successfully upgrade a vSAN cluster.
Note that the ESA may be better at handling upgrades of large cluster sizes. This is due to a collection of enhancements that allow vSAN to complete the update process on each host faster on average than the hosts running the OSA.
Summary
Upgrading a vSAN cluster with a larger quantity of hosts is no different than updating a vSAN cluster with a smaller quantity of hosts. Considering that the update process restricts updates to occur one host at a time within a cluster, an organization may want to revisit their current practices to cluster sizing, and how hosts themselves can be optimized by using new features such as Quick Boot while also running vSAN 7 U1 or newer.
Upgrading Firmware and Drivers for NICs and Storage Controllers
Outdated or mismatched firmware and drivers for NICs and Storage Controllers can impact VM and or vSAN I/O handling. While VUM handles updates of firmware and drivers for a limited set of devices, firmware and driver updates remain a largely manual process. Whether installed directly on an ESXi server from the command line or deployed using VUM, ensure the correct firmware and drivers are installed, remain current to the version recommended, and are a part of the cluster lifecycle management process.
vLCM strives to simplify the coordination of firmware and driver updates for select hardware. It has a framework to coordinate the fetching of this software from the respective vendors for the purpose of building a single desired-state image for the hosts. As of vSAN 7 U1, three vendors provide the Hardware Support Manager (HSM) plugin that helps coordinate this activity. Those server vendors are Dell, HPE, and Lenovo. VUM was unable to do any type of coordination like this.
For environments still running VUM, it is recommended to verify that vSAN supports the software and hardware components, drivers, firmware, and storage I/O controllers that you plan on using. Supported items are listed on the VMware Compatibility Guide.
Recommendation: Subscribe to the VCG Notification service in order to stay informed of changes with compatibility and support of your specified hardware and the associated firmware, drivers, and versions of vSAN. For more information, see the blog post "vSAN VCG Notification Service - Protecting vSAN Hardware Investments."
Summary
When it comes to vSAN firmware and drivers, consistency and supportability are critical. Always reference the vSAN Compatibility Guide for guidance on specific devices. Also, whether updating via command line or using VUM, be sure to maintain version consistency across the cluster. vLCM will make this operational procedure easier for eligible servers that support this new framework.
Introducing vLCM into an Existing Environment
The VMware vSphere Lifecycle Manager (vLCM) is VMware's all new framework that provides unified software and firmware management for vSphere hosts. It helps remove the complexity of managing updates to the hypervisor, server component firmware, and server component drivers, and does so in a coordinated and consistent manner. It is the next-generation replacement to vSphere Update Manager, commonly known as VUM.
Due to the prerequisites for vLCM, transitioning clusters managed by VUM over to vLCM should be done with care and consideration. Some of those considerations include:
- vLCM and VUM can coexist in the same vCenter Server, but they are mutually exclusive per cluster. One or the other is chosen for a given cluster, not both.
- Once a cluster is transitioned to use vLCM, it cannot be transitioned back to use VUM.
- vLCM clusters assume much more homogeneous hardware specifications. vCenter Server can house multiple vLCM desired-state images, but only one desired-state image can be defined and used per cluster. This desired image must be able to apply to all hosts in the cluster, hence the need for similar hardware within a cluster.
- vLCM can update the hypervisor in clusters using any hardware on the HCL, but it only has hardware-level coordination if using Hardware Support Manager (HSM): the manufacturer plugin that allows for integration into the vendor repositories. As of vSAN 8 U1 there are plugins for Cisco, Dell, Fujitsu, Hitachi, HPE, and Lenovo.
Recommendations in Introducing vLCM into your environment
Introducing a new method of lifecycle management as sophisticated as vLCM lends itself well to a slow, methodical approach. The recommendations listed below will improve the transition to vLCM in your environment.
- Plan on running the latest edition of vSphere and vSAN. vSphere 7 U2 and vSphere 7 U3 introduced several new features to help streamline the use of vLCM during the initial phase of a cluster build, including the initial bootstrapping step through vSAN Easy Install, and new cluster deployment and expansion through Cluster Quick Start. Ensuring that the hosts that comprise a cluster are fully updated with the appropriate version of firmware, drivers, and hypervisor version ensures smooth operation once the cluster is in production. vLCM in vSphere 7 U3 also introduced the ability to identify and patch storage device firmware in addition to the storage controllers, as well as managing the lifecycle of vSAN witness host appliances in certain configurations.
- Download and install the latest vendor plugin(s). This is the first step in establishing the intelligence between vLCM, the respective vendor repositories, VMware, and the hosts in your cluster. Depending on the server vendor, the configuration and operation of the plugin may vary slightly. VMware built this flexibility with the vendors in mind to best accommodate their repositories and capabilities.
- Start with a cluster using the newest hardware. Newer ReadyNodes are more likely to be capable of supporting end-to-end updates than older servers. A single vCenter Server can manage several clusters, with some using VUM and others using vLCM. This will allow for a slow, methodical adoption in your environment.
- Start with a cluster using servers by a manufacturer who provides a vLCM plugin / HSM. As of vSAN 8 U1 there are plugins for Cisco, Dell, Fujitsu, Hitachi, HPE, and Lenovo. This will allow you to experience and experiment with more complete lifecycle management of the host.
- Test in a small cluster to become familiar with the process. This will allow you to learn the requirements and the workflow for easy operationalization.
- Enable only in clusters with hosts from the same server manufacturer. vLCM requires a more homogeneous collection of hosts than VUM. This is the primary reason to strive for homogeneous clusters in your vSAN cluster design.
- Use the VCG notification service. The VMware vSAN VCG Notification Service can play an important role in keeping up to date on server hardware, and the compatibility with the version of the hypervisor, firmware, and drivers. To ease the process of determining vLCM compatible hosts, look for an additional attribute recently added to the VCG that denotes if a ReadyNode is “vLCM capable.”
- Build your desired image with multiple models (from the same vendor) in mind. While vLCM is limited to a single desired-state image per cluster, and this desired image can only be created by using one server manufacturer, the image can have drivers and firmware for different models from that server manufacturer. This approach will help ensure that all hosts in the cluster can be updated even if the cluster consists of hosts with some variations in device types, or even an updated specification of a ReadyNode.
- Ensure your operational run-books for server updates are updated to reflect vLCM. With VUM it was not unusual to manually install firmware, along with perhaps installing VIBs at a command line after a major upgrade to the hypervisor. Since a vLCM-enabled cluster manages this differently, applying those same practices may inadvertently introduce "drift" (movement away from the desired state) into a cluster.
Note that while vLCM provides the ability to update the full host stack (including BIOS, NICs, etc.), the only HCL validation that occurs is for storage controllers.
Wider-Scale Adoption
As your organization becomes more comfortable with vLCM, other operations should be evaluated.
- Using vLCM on other supported topologies. vSAN 7 U2 supports the use of vLCM on standard vSAN clusters, clusters using explicit fault domains, 2-Node clusters, and stretched clusters. Note that as of vSAN 7 U3, updates to the virtual witness host appliance are supported in vLCM when the witness host appliance is used in a stretched cluster topology, as well as in a 2-node topology with a dedicated witness host appliance. For 2-node topologies with a shared witness host appliance, the updating of the appliance will need to occur independently of the lifecycle management offered by vLCM.
- Look into parallel remediation of clusters. vSAN 7 U1 and later can support up to 64 concurrent cluster remediations. This can be extremely helpful in large environments, or in environments with a large quantity of 2-Node clusters.
- Test and review operationalization of vLCM with older hardware or servers by vendors that do not have a vLCM plugin. While vLCM may have limitations in the end-to-end lifecycle management of the hosts, it will still be capable of updating vSphere. Depending on the circumstances of those older clusters, one may choose to simply use VUM for a time, and review the decision again at a later date.
- Consider whether the new ESXi Suspend-to-memory option within Quick Boot is appropriate for the workloads on a cluster. The Suspend-to-memory option introduced in vSAN 7 U2 can, when conditions allow for its use, dramatically improve the rate at which hosts are updated through vLCM. VMs running on the host are suspended to memory and resumed after the host completes the initialization of the hypervisor. Suspending VMs to memory may lead to VM/application interruption depending upon the length of the suspend and resume, and is not appropriate for all workload types.
The benefits vLCM brings show up most when scale, consistency, and frequency of updates are top-of-mind. Clusters are managed by a single desired-state image which helps ensure consistency and reduces the obstacles typically associated with the lifecycle management process.
Recommendation: In some cases where there may be a variety of server vendors in the same environment, HCI Mesh can be used to accommodate the transition of some clusters to vLCM. For example, in environments using VUM, a host from any vendor could be decommissioned from one cluster and moved over to another cluster using another vendor. With the need for homogeneous configurations with vLCM, one could instead keep the clusters the same, but use HCI Mesh to borrow storage resources.
More Information
See the vSAN FAQs for common questions related to vLCM, and see "Lifecycle Management with vLCM in vSAN 7 U1" for what is new with vLCM in vSphere and vSAN 7 U1. For a video demonstration of vLCM when using Dell servers, see this video. For a video demonstration of vLCM when using HPE servers, see this video. For step-by-step instructions for various aspects of vLCM, be sure to visit the vLCM content on docs.vmware.com.
Summary
While vLCM is a powerful new solution to help administrators with lifecycle management of their vSphere hosts, a methodical approach in the transition to its use will minimize disruption in operational procedures, and ensure a simplified and predictable result.
Section 13: vSAN Capacity Management
Observing Storage Capacity Consumption Over Time
An increase in capacity consumption is a typical trend that most data centers see, regardless of the underlying storage system used. vSAN offers a number of different ways to observe changes in capacity. This can help with understanding the day-to-day behavior of the storage system, and can also help with capacity forecasting and planning.
Options for visibility
Observing capacity changes over a period of time can be achieved with two tools: vCenter and VMware Aria Operations.
FIGURE 13-1: Displaying capacity history for a vSAN cluster in vCenter and in VMware Aria Operations
Both provide the ability to see capacity usage statistics over time and the ability to zoom into a specific time window. Both methods were designed for slightly different intentions, and have different characteristics.
Capacity history in vCenter:
- Natively built into the vCenter UI and easily accessible.
- Can show a maximum of a 30-day window.
- Data in the performance service is retained for up to 90 days, and this retention period is not guaranteed.
- Data will not persist if the vSAN performance service is turned off and then back on.
Capacity history in Aria Operations:
- Much longer capacity history retention periods, per configuration of Aria Operations.
- While Aria Operations requires the vSAN performance service to run for data collection, the capacity history will persist if the vSAN performance service is turned off then back on.
- Able to correlate with other relevant cluster capacity metrics, such as CPU and memory capacity.
- Can view aggregate vSAN cluster capacity statistics.
- Breakdowns of capacity usage with and without DD&C.
- Requires Aria Operations Advanced licensing or above.
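Beyond these two interfaces, a point-in-time view of cluster capacity can also be pulled with PowerCLI, which can be handy for ad-hoc reports. Below is a minimal sketch, assuming an existing connection to vCenter via Connect-VIServer; the cluster name "Cluster-01" is a placeholder, and the exact properties returned may vary slightly between PowerCLI versions.

# Retrieve the current vSAN space usage summary for a cluster
$cluster = Get-Cluster -Name "Cluster-01"
Get-VsanSpaceUsage -Cluster $cluster | Format-List

This only returns the capacity figures at the moment the cmdlet runs; for history over time, use the vCenter capacity history or Aria Operations as described above.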
The vSAN capacity history in vCenter renders the DD&C ratio using a slightly different unit of measurement than found in the vSAN capacity summary in vCenter and in Aria Operations. The capacity summary in vCenter and Aria Operations displays the savings as a ratio (e.g., 1.96x), whereas the vSAN capacity history renders it as a percentage (e.g., 196%). Both are accurate.
Also note that vCenter’s UI simply states, “Deduplication Ratio.” The number presented actually represents the combined savings from DD&C.
Recommendation: Look at the overall capacity consumed after a storage policy change, rather than simply a DD&C ratio. Space efficiency techniques like erasure codes may result in a lower DD&C ratio, but actually increase the available free space. For more information on this topic, see “Analyzing Capacity Utilization with Aria Operations” in the operations guidance for Aria Operations and Aria Operations for Logs in vSAN Environments on core.vmware.com.
Summary
Capacity usage and history can be easily found in both vCenter Server and Aria Operations. An administrator can use one or both tools to gain the necessary insight for day-to-day operations, as well as capacity planning and forecasting.
Observing Capacity Changes as a Result of Storage Policy Adjustments or EMM Activities
Some storage policy definitions will affect the amount of storage capacity consumed by the objects (VMs, VMDKs, etc.) that are assigned the policy. Let’s explore why this happens, and how to understand how storage capacity has changed due to a change of an existing policy, or assignment of VMs to a new policy.
Policies and their impact on consumed capacity
vSAN is unique when compared to other traditional storage systems: it allows configuring levels of resilience (e.g., FTT) and the data placement scheme (RAID-1 mirroring or RAID-5/6 erasure coding) used for space efficiency. These configurations are defined in a storage policy and assigned to a group of VMs, a single VM, or even a single VMDK.
FIGURE 13-2: Understanding how a change in storage policy will affect storage capacity
Changes in capacity as a result of storage policy adjustments can be temporary or permanent.
- Temporary space is consumed when a policy changes from one data placement approach to another. vSAN builds a new copy of that data (a process known as resynchronization) to replace the old copy and comply with the newly assigned policy. (For example, changing a VM from a RAID-1 mirror to a RAID-5 erasure code would result in space used to create a new copy of the data using a RAID-5 scheme.) Once complete, the copy of the data under the RAID-1 mirror is deleted, reclaiming the temporary space used for the change. See the topic “Storage Policy Practices to Improve Resynchronization Management in vSAN” for a complete list of storage policies that impact data placement.
- Permanent space is consumed when applying a storage policy using a higher FTT level (e.g., FTT=1 to FTT=2), or when moving from an erasure code to a mirror (e.g., RAID-5 to RAID-1). The additional effective capacity is consumed once the change in policy has completed (using temporary storage capacity during the transition), and remains consumed for as long as that object is assigned to the given storage policy. The amount of temporary and permanent space consumed for a storage policy change is a reflection of how many objects are changed at the same time, and the respective capacity used for those objects. The temporary space needed is the result of resynchronizations. See the topic “Storage Policy Practices to Improve Resynchronization Management in vSAN” for more information.
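A storage policy change of this kind is typically applied through SPBM, either in the vCenter UI or via PowerCLI. The following is a minimal, hedged sketch; the policy name "FTT1-R5" and the VM name "VM02" are hypothetical placeholders.

# Retrieve the target storage policy and the VM to be reassigned
$policy = Get-SpbmStoragePolicy -Name "FTT1-R5"
$vm = Get-VM -Name "VM02"

# Apply the policy to the VM home object and all of its hard disks;
# this triggers a resynchronization to build the new object layout
$vm | Get-SpbmEntityConfiguration | Set-SpbmEntityConfiguration -StoragePolicy $policy
Get-HardDisk -VM $vm | Get-SpbmEntityConfiguration | Set-SpbmEntityConfiguration -StoragePolicy $policy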
Due to the prescriptive nature of storage policies, vSAN presents the raw capacity provided by the datastore, as observed in vCenter, Aria Operations, and PowerCLI.
Estimating usage
The vSAN performance service provides an easy-to-use tool to help estimate free usable capacity given the selection of a desired policy. Simply select the desired storage policy, and it will estimate the amount of free usable capacity under that given policy. It does not account for the free space needed for transient operations as recommended by VMware.
FIGURE 13-3: The free capacity with policy calculator in the vSAN UI found in vCenter
Observing changed usage as a result of storage policy changes
There are multiple options for providing visibility into storage capacity changes. See the topic “Observing Storage Capacity Consumption Over Time” for more information. The following illustrates how capacity changes resulting from storage policy changes can be observed using vCenter Server and Aria Operations.
In this example, a group of VMs assigned a storage policy using FTT=1 via a RAID-1 mirror was changed to another storage policy using FTT=1 via a RAID-5 erasure coding scheme. In vCenter, highlighting a vSAN cluster and selecting Monitor → vSAN → Performance → Backend will reveal the resynchronization activity that has occurred as a result of the policy change, as shown below.
FIGURE 13-4: Observing resynchronization I/O activity as a result of a change in storage policies
When looking at the capacity history in vCenter, the policy change created a temporary use of more space to build the new RAID-5 based objects. Once the resynchronization is complete, the old object data is removed. DD&C begins to take effect, and free capacity is reclaimed. FIGURE 13-5 below shows how this is presented in vCenter.
FIGURE 13-5: Using vCenter to observe cluster capacity use as a result of a resynchronization event
The Cluster Utilization widget in the vSAN capacity overview dashboard found in Aria Operations shows the same results. Aria Operations will offer additional details via context-sensitive “sparklines” that will give precise breakdowns of DD&C savings and storage use with and without DD&C. FIGURE 13-6 below shows how this is presented in Aria Operations.
FIGURE 13-6: Using Aria Operations to observe cluster capacity use as a result of a resynchronization event
Note that different views may express the same data differently for three reasons:
- Limits on the window presented on the X axis
- Different values on the Y axis
- Different scaling for X and Y values
This is the reason why the same data may visually look different, even though the metrics are consistent across the various applications and interfaces.
Recommendation: Look at the overall capacity consumed after a storage policy change, rather than simply a DD&C ratio. Space efficiency techniques like erasure codes may result in a lower DD&C ratio, but may actually improve space efficiency by reducing consumed space.
Summary
Storage policies allow an administrator to establish various levels of protection and space efficiency across a selection of VMs, a single VM, or even a single VMDK. Assigning different storage policies to objects impacts the amount of effective space they consume across a vSAN datastore. Both vCenter and VMware Aria Operations provide methods to help the administrator better understand storage capacity consumption across the vSAN cluster.
Estimating Approximate “Effective” Free/Usable Space in vSAN Cluster
With the ability to prescriptively assign levels of protection and space efficiency through storage policies, the amount of capacity a given VM consumes in a vSAN cluster is subject to the attributes of the assigned policy. While this offers an impressive level of specificity for a VM, it can make estimating the free or usable capacity of the cluster more challenging. Recent editions of vSAN offer a built-in tool to assist with this effort.
The vSAN performance service provides an easy-to-use tool to help estimate available free usable capacity given the selection of a desired policy. Simply select the desired storage policy, and it will estimate the free amount of usable capacity with that given policy.
FIGURE 13-7: The free capacity with policy calculator in the vSAN UI found in vCenter
The tool provides a calculation for only the free raw capacity remaining. Capacity already consumed is not accounted for in this estimating tool. The estimator looks at the raw capacity remaining, and then applies the traits of the selected policy to determine the effective amount of free space available. Note that it does not account for the free space needed for slack space as recommended by VMware.
For more information, see the topic “Observing Capacity Changes as a Result of Storage Policy Adjustments.”
Recommendation: If you are trying to estimate the free usable space for a cluster knowing that multiple policies will be used, select the least space-efficient policy that will be used. For example, if an environment will run a mix of FTT=1 protected VMs, where some use policies with RAID-1 while others use policies with the more space-efficient RAID-5, select the RAID-1 policy in the estimator to provide a more conservative number.
vSAN 7 U2 introduces the ability to see the "oversubscription" ratio, as shown in Figure 13-7. This helps the administrator easily understand the estimated capacity necessary for the fully allocated capacity of thin-provisioned objects. It factors in the storage policy used, and optionally, deduplication and compression. This can be extremely helpful for organizations that like to maintain a specific oversubscription ratio for their storage capacity.
As capacity utilization grows, administrators need to be properly notified when critical thresholds are met. vSAN 7 U2 introduced new customizable alerts to generate these notifications. Clicking "Learn More" within the UI will direct you to an "About Reserved Capacity" page, which describes the actions taken as a result of reaching these thresholds. These are considered “soft thresholds,” meaning that they will enforce some operational changes but allow critical activities to continue. When reservations are met, health alerts will be triggered to indicate the condition. Provisioning of new VMs, virtual disks, FCDs, linked and full clones, iSCSI targets, snapshots, file shares, etc. will not be allowed once the threshold has been exceeded. (Note that a thick-provisioned disk will fail at the time of creation if it exceeds the threshold, whereas thin-provisioned disks may be prevented from expanding.) I/O activity for existing VMs will continue as the threshold is exceeded.
FIGURE 13-8: Alarm thresholds in vSAN
Capacity consumption is usually associated with bytes of data stored versus bytes of data remaining available. There are other capacity limits that may inhibit the full utilization of available storage capacity. vSAN has a soft limit of 200 VMs per host, and a hard limit of 9,000 object components per host. Certain topology and workload combinations, such as servers with high levels of compute and storage capacity that run low-capacity VMs, may run into these other capacity limits. Sizing for these capacity considerations should be a part of a design and sizing exercise. See the section topic "Monitoring and Management of vSAN Object Components" for more details.
The new paradigm of Reserved Capacity in more recent editions of vSAN has been described extensively in the post: "Understanding Reserved Capacity Concepts in vSAN." Other helpful links on this topic include the posts: "Demystifying Capacity Reporting in vSAN" and "The Importance of Space Reclamation for Data Usage Reporting in vSAN."
Summary
Due to the use of storage policies and the architecture of vSAN, understanding free usable capacity is different than a traditional architecture. The estimator provided in the vSAN capacity page helps provide clarity on the effective amount of capacity available under a given storage policy.
Demystifying Capacity Reporting in vSAN
Through the power of storage policies, vSAN provides the administrator the ability to granularly assign unique levels of resilience and space efficiency for data on a vSAN datastore. It is an extraordinarily flexible capability for enterprise storage, but it can sometimes cause confusion with interpreting storage capacity usage.
Let's explore how vCenter Server renders the provisioning and usage of storage capacity for better capacity management. This guidance will apply to vSAN powered environments on-premises, and in the cloud using vSAN 7 or newer. For the sake of clarity, all references will be in Gigabytes (GB) and Terabytes (TB), as opposed to GiB and TiB, and may be rounded up or down to simplify the discussion.
Terms in vCenter when using Storage Arrays versus vSAN
When discussing storage capacity and consumption, common terms can mean different things when applied to different storage systems. For this reason, we will clarify what these terms within vCenter Server mean for both vSphere using traditional storage arrays, and vSAN. The two terms below can be found in vCenter Server, at the cluster level, in the VMs tab.
- Provisioned Space. This refers to how much storage capacity is being assigned to the VMs and/or virtual disk(s) by the hypervisor via the virtual settings of the VM. For example, a VM with 100GB VMDK would render itself as 100GB provisioned. A thin-provisioned storage system may not be provisioning that entire amount, but rather, only the space that is or has been used in the VMDK.
- Used space. This refers to how much of the "Provisioned Space" is being consumed with real data. As more data is being written to a volume this used space will increase. It will only decrease if the data is deleted and TRIM/UNMAP reclamation commands are issued inside of the guest VM using the VMDK.
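Both values can also be listed quickly for all VMs with PowerCLI, which can be convenient for spot checks or exports. A minimal sketch, assuming an existing connection to vCenter; the cluster name "Cluster-01" is a placeholder.

# List provisioned and used space (in GB) for every VM in a cluster,
# sorted by used space
Get-Cluster -Name "Cluster-01" | Get-VM |
    Select-Object Name, ProvisionedSpaceGB, UsedSpaceGB |
    Sort-Object UsedSpaceGB -Descending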
Storage arrays will often advertise their capacity to the hypervisor after a global level of resilience is applied to the array. The extra storage capacity used for resilience and system overhead is masked from the administrator since all the data stored on the array is protected in the same manner and space efficiency level.
vSAN is different in how it advertises the overall cluster capacity and the values it presents in the "Provisioned Space" and "Used Space" categories in the vCenter Server. Let's look at why this is the case.
Presentation of Capacity at the vSAN Cluster Level
vSAN presents all storage capacity available in the cluster in a raw form. The total capacity advertised by a cluster is the aggregate total of all of the capacity devices located in the disk groups of the hosts that comprise a vSAN cluster. Different levels of resilience can be assigned per VM, or even per VMDK thanks to storage policies. As a result, the cluster capacity advertised as available does not reflect the capacity available for data in a resilient manner, as it will be dependent on the assigned policy.
To give a better idea of the effective free capacity in a cluster, vSAN provides a "What if analysis" to show how much effective free space would be available for a new workload using the desired storage policy. As shown in Figure 13-9, while the overall free space on the disks is reported as 4.21TB, the "What if analysis" shows the effective free space of 3.16TB if the VM(s) were protected with FTT=1 using a RAID-5 erasure code. While it only accounts for one storage policy type at a time, this can be a helpful tool in understanding the effective capacity available for use.
Figure 13-9. Capacity Overview and What-if Analysis for a vSAN cluster.
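The estimate shown in the What-if analysis roughly corresponds to the raw free capacity multiplied by the usable fraction of the chosen data placement scheme. A simple sketch of that arithmetic, assuming the 3+1 RAID-5 scheme of the OSA (other schemes, such as the 2+1 or 4+1 RAID-5 options in the ESA, use different fractions):

# Approximate effective free capacity under a given placement scheme
$rawFreeTB      = 4.21        # raw free capacity reported by the cluster
$usableFraction = 3 / 4       # RAID-5 (3+1); RAID-1 would be 1/2, RAID-6 (4+2) would be 2/3
[math]::Round($rawFreeTB * $usableFraction, 2)   # returns ~3.16 TB, matching the What-if analysis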
Just as with some storage arrays, vSAN thin provisions the storage capacity requested for a volume. The provisioned space of a VMDK will not be consumed unless it is, or has been, used by the guest OS, and will only be reclaimed after data is deleted and TRIM/UNMAP reclamation commands are issued inside of the guest VM using the VMDK.
In the same cluster Capacity view, you will see a “Usage breakdown.” Under VM > VMDK, it is broken down into primary data and replica usage. The replica usage will reflect both replica objects used with RAID-1 mirroring and the portion of object data that is parity data for RAID-5/6 objects. The primary data and replica usage will reflect the data written in a thin-provisioned form. If Object Space Reservations (discussed later) are used, the reserved capacity will also be reflected here.
Figure 13-10. Usage breakdown.
Recommendation: Do not use the datastore view when attempting to understand cluster capacity usage. Use the cluster capacity view under Monitor > vSAN.
Now let's look at how thin provisioning and granular resilience settings appear at the VM level.
Presentation of Capacity at the VM Level
The "VMs" view within vCenter Server is a great place to understand how vSAN displays storage usage. For the following examples, several VMs have been created on a vSAN cluster, each with a single VMDK that is 100GB in size. The volume consists of the guest OS (roughly 4GB) and an additional 10GB of data created. This results in about 14GB of the 100GB volume used. Each VM described below will have a unique storage policy assigned to it to demonstrate how the values in the "Provisioned Space" and "Used Space" will change.
VM01. This uses a storage policy with a Level of Failure to Tolerate of 0, or FTT=0, meaning that there is no resilience of this data. We discourage the use of FTT=0 for production workloads, but for this exercise, it will be helpful to learn how storage policies impact the capacity consumed. In Figure 13-11 we can see the Provisioned space of 101.15GB, and the used space of 13.97GB is what we would expect based on the example described above.
Figure 13-11. Provisioned and used space for VM01 with FTT=0.
VM02. This uses a storage policy that specifies an FTT=1 using a RAID-1 mirror. Because this is a mirror, it is making a full copy of the data elsewhere (known as a replica object), so you will see below that both the provisioned space and the used space double when compared to VM01 using FTT=0.
Figure 13-12. Provisioned and used space for VM02 with FTT=1 using a RAID-1 mirror.
VM03. This uses a storage policy that specifies an FTT=1 using a RAID-5 erasure code on a single object to achieve resilience. This is much more space efficient than a mirror, so you will see below that both the provisioned space and the used space are 1.33x the capacity when compared to VM01 using FTT=0.
Figure 13-13. Provisioned and used space for VM03 with FTT=1 using a RAID-5 erasure code.
VM04. This uses a storage policy that specifies an FTT=2 using RAID-6 erasure code on a single object to achieve high resilience. As shown below, the provisioned space and used space will be 1.5x the capacity when compared to VM01 using FTT=0.
Figure 13-14. Provisioned and used space for VM04 with FTT=2 using a RAID-6 erasure code.
VM05. This uses a storage policy that specifies an FTT=1 through a RAID-1 mirror (similar to VM02), but the storage policy has an additional rule known as Object Space Reservation set to "100." You will see below that even though the provisioned space is guaranteed on the cluster, the used space remains the same as VM02, which also uses an FTT=1 through a RAID-1 mirror.
Figure 13-15. Provisioned and used space for VM05 with FTT=1 using a RAID-1 mirror with an OSR=100.
The examples above demonstrate the following:
- The "Provisioned Space" and "Used Space" columns will change values when adjusting storage policy settings. This is different than traditional storage arrays that use a global resilience and efficiency setting.
- The value of provisioned space in this VM view does not reflect any storage-based thin provisioning occurring. This is consistent with storage arrays.
- vSAN Object space reservations do not change the values in the provisioned or used space categories.
Note that “Provisioned Space” may show different values when powered off, versus when powered on. This provisioned space is a result of the capacity of all VMDKs in addition to the VM memory object, VM Namespace object, etc.
Object Space Reservations
Storage policies can use an optional rule known as Object Space Reservations (OSR). It serves as a capacity management mechanism where, when used, it reserves storage so that the provisioned space assigned to the given VMDK is guaranteed to be fully available. A value of 0 means the object will be thin provisioned, while a value of 100 means that it will be the equivalent of thick provisioned. It is a reservation in the sense that it proactively reserves the capacity in the cluster, but does not alter the VMDK or the file system inside of the VMDK. Traditional thick provisioning such as Lazy Zeroed Thick (LZT) and Eager Zeroed Thick (EZT) are actions that occur inside of a guest VM against the file system on the VMDK.
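For reference, an OSR rule can be added to a policy through the vCenter UI or via PowerCLI. The sketch below is a hedged example: the policy name is a placeholder, and the vSAN capability identifiers shown (e.g., VSAN.proportionalCapacity for the object space reservation) should be verified with Get-SpbmCapability in your own environment.

# List the vSAN-specific capabilities available for policy rules
Get-SpbmCapability | Where-Object { $_.Name -like "VSAN.*" }

# Create a policy with FTT=1 and an Object Space Reservation of 100
$ruleSet = New-SpbmRuleSet -AllOfRules @(
    (New-SpbmRule -Capability (Get-SpbmCapability -Name "VSAN.hostFailuresToTolerate") -Value 1),
    (New-SpbmRule -Capability (Get-SpbmCapability -Name "VSAN.proportionalCapacity") -Value 100)
)
New-SpbmStoragePolicy -Name "FTT1-OSR100" -AnyOfRuleSets $ruleSet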
While the Provisioned Space and Used Space values in the enumerated list of VMs do not change when using a storage policy that sets an OSR of 100, the object space reservation will be represented in the cluster capacity view shown in Figure 13-16, in a light green color. While OSR settings will be honored when using Deduplication & Compression, or the Compression-Only service, the OSR values as shown on the Capacity Overview below will no longer be visible.
Figure 13-16. Objects that use an OSR will have their reservations noted at the cluster level.
Setting an OSR in a storage policy can be useful for certain use cases, but this type of reserved allocation should not be used unless it is necessary, as it will have the following impacts on your cluster:
- Negates the benefit of thin provisioning.
- Eliminates the benefit of reclamation techniques such as TRIM/UNMAP on guest VMs.
- Reduces space efficiency when using deduplication & compression.
The OSR settings do have a significant benefit over a thick-provisioned VMDK, which is that the reservation of capacity can be easily removed through a single change in a storage policy.
Recommendation: Unless an ISV specifically requires it, DO NOT set VMDKs or use VM templates with VMDKs that are thick provisioned. Its usefulness for performance is questionable, as the assumption that first writes take more effort than second writes is not always a given with modern storage systems. If you must guarantee capacity, simply set the OSR to 100 for the desired objects. This avoids any thick provisioning inside of a guest VMDK and gives you the flexibility of changing it at a later time.
Summary
vSAN takes a different approach in presenting storage capacity because it offers granular settings that are typically not available with a traditional storage array. Understanding how vSAN presents capacity usage will help in your day-to-day operations and strategic planning of resources across your environments.
Resize Custom Namespace Objects
The ability to resize custom namespace objects such as ISO directories and content libraries was introduced in vSAN 8 U1, supporting both the ESA and OSA. This capability is not available in the UI, but is available via the API, and through PowerCLI using multiple cmdlets. The example below demonstrates how to view and resize the desired namespace object.
After connecting to the VI server:

# Get the DatastoreNamespaceManager from the service instance
PS C:\Users\User1> $services = Get-View 'ServiceInstance'
PS C:\Users\User1> $datastoreMgr = Get-View $services.Content.DatastoreNamespaceManager
PS C:\Users\User1> $datacenter = Get-Datacenter

# Remove an existing namespace directory (identified by its vSAN UUID path)
PS C:\Users\User1> $datastoreMgr.DeleteDirectory($datacenter.ExtensionData.MoRef,"/vmfs/volumes/vsan:526916282a8ec9e1-95c4972ba093a2ec/6771b663-4b89-170b-f416-0200368ec988")

# Create a new namespace directory named "CodyTest2" with a 16MB (16777216 byte) size; the new path is returned
PS C:\Users\User1> $datastore = Get-Datastore
PS C:\Users\User1> $datastoreMgr.CreateDirectory($datastore.ExtensionData.MoRef,"CodyTest2",$null,16777216)
/vmfs/volumes/vsan:526916282a8ec9e1-95c4972ba093a2ec/fa77b663-8748-3a1e-e9a6-0200368ec988

# View the capacity and usage of the new namespace directory
PS C:\Users\User1> $datastoreMgr.QueryDirectoryInfo($datacenter.ExtensionData.MoRef,"/vmfs/volumes/vsan:526916282a8ec9e1-95c4972ba093a2ec/fa77b663-8748-3a1e-e9a6-0200368ec988")

Capacity     Used
--------     ----
16777216     2223

# Increase the size of the namespace directory to 32MB (33554432 bytes)
PS C:\Users\User1> $datastoreMgr.IncreaseDirectorySize($datacenter.ExtensionData.MoRef, "/vmfs/volumes/vsan:526916282a8ec9e1-95c4972ba093a2ec/fa77b663-8748-3a1e-e9a6-0200368ec988", 33554432)
Section 14: Monitoring vSAN Health
Remediating vSAN Health Alerts
The vSAN Skyline health UI provides an end-to-end approach to monitoring and managing the environment. Health finding alerts are indicative of an unmet condition or deviation from expected behavior.
The alerts typically stem from:
- Configuration inconsistency
- Exceeding software/hardware limits
- Hardware incompatibility
- Failure conditions
The ideal methodology to resolve a Skyline health alert is to correct the underlying situation. An administrator can choose to suppress the alert in certain situations.
For instance, a build recommendation engine health check validates whether the build versions are the latest for the given hardware (as per the VMware Compatibility Guide). Some environments are designed to stay with the penultimate release as a standard. You can suppress the alert in this case. In general, you should determine the root cause and fix the issue for all transient conditions. The health check alerts that flag anomalies for intended conditions can be suppressed.
Each health finding mainly has the following two sections:
- Current state: Result of the health check validation against the current state of the environment
- Info: Information about the health check and what it validates
FIGURE 14-1: The details available within a health check alert
The “Info” section explains the unmet condition and the ideal state. Clicking on the “Ask VMware” button triggers a workflow to a Knowledge Base article that describes the specific health check in greater detail, the probable cause, troubleshooting, and remediation steps.
Recommendation: Focus remediation efforts on addressing the root cause. Ensure sustained network connectivity for up-to-date health checks.
vSAN 7 U2 introduced the ability to view the health history of most health checks listed in the Skyline Health UI. It offers a simple, time-based view of the highlighted health check to determine the change in status over the course of time. This can be extremely helpful in providing necessary context to transient error conditions, or error conditions that may be symptomatic of another health alert triggered at the same time. The "View Health History" option can easily be toggled on or off on an as-needed basis.
vSAN 7 U2 also introduced a new global notification alert. The new alert is labeled “vSAN Health Service Alarm for Overall Health Summary" in the UI and is disabled by default. This allows an administrator to subscribe to a single alert (via email notifications, SNMP, or script) so that any vSAN-related alert will be sent to the administrator. This helps minimize the need to configure alerting for each discrete alarm, an approach that would not catch new alarm definitions added with product updates.
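Enabling this alarm and attaching a notification action can also be scripted with PowerCLI. A minimal sketch, assuming the vCenter mail settings are already configured; the recipient address is a placeholder.

# Locate the overall vSAN health summary alarm and enable it
$alarm = Get-AlarmDefinition -Name "vSAN Health Service Alarm for Overall Health Summary"
$alarm | Set-AlarmDefinition -Enabled $true

# Add an email notification action to the alarm
$alarm | New-AlarmAction -Email -To "vsan-admins@example.com" -Subject "vSAN overall health alert"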
vSAN 7 U3 introduced a new health check correlation feature. Often, a single condition in a cluster can trigger multiple health check alerts. The health check correlation feature helps you determine the potential root cause of the issue for more effective troubleshooting. Addressing the root cause of the issue will likely clear up all of the other related health checks that were triggered.
In vSAN 8 U1, Skyline Health has been completely revamped to incorporate better diagnostics and remediation, and a quick at-a-glance cluster health score mechanism to determine the state of the cluster. For more information, see the post: "Skyline Health Scoring, Diagnostics and Remediation in vSAN 8 U1."
Recommendation: Focus on remediation of the root cause, or "Primary Issue" as noted in the Skyline Health UI for vSAN. This will help avoid unnecessary time spent on other health check conditions that will otherwise be corrected once the primary issue is addressed.
Summary
vSAN health helps ensure optimal configuration and operation of your HCI environment to provide the highest levels of availability and performance.
Checking Object Status and Health When There Is a Failure in a Cluster
An object is a fundamental unit in vSAN around which availability and performance are defined. This is done by abstracting the storage services and features of vSAN and applying them at an object level through SPBM. For a primer on vSAN objects and components, see the post: "vSAN Objects and Components Revisited."
At a high level, an object’s compliance with the assigned storage policy is enough to validate its health. In certain scenarios, it may be necessary to inspect the specific state of the object, such as in a failure.
In the event of a failure, ensure all objects are in a healthy state or recovering to a healthy state. The vSAN object health check provides a cluster-wide overview of object health and the respective object states. This health check can be accessed by clicking on the vSAN cluster and viewing the Monitor tab. The data section comprises information specific to the object health check.
FIGURE 14-2: Viewing object health with the vSAN health checks
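The same health checks can also be triggered programmatically, which can be useful when scripting post-failure validation. A hedged PowerCLI sketch is shown below; "Cluster-01" is a placeholder, and the structure of the returned summary may differ between versions.

# Run the vSAN health checks against a cluster and review the summary
$cluster = Get-Cluster -Name "Cluster-01"
Test-VsanClusterHealth -Cluster $cluster | Format-List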
On failure detection, vSAN natively initiates corrective action to restore a healthy state. This, in turn, reinstates the object’s compliance with the assigned policy. The health check helps quickly assess the impact and validates that restoration is in progress. In certain cases, based on the nature of failure and the estimated restoration time, an administrator may choose to override or expedite the restoration. More information is available on failure handling in vSAN.
Recommendation: SPBM governs how, where, and when an object is to be rebuilt. It is generally not required or recommended to override this unless warranted.
Summary
It is not uncommon for components such as disks, network cards, or server hardware to fail. vSAN has a robust and highly resilient architecture to tolerate such failures by distributing the objects across a cluster. The vSAN object health check helps validate that objects remain healthy, or are recovering to a healthy state, after such a failure.
Monitoring and Management of vSAN Object Components
VMware vSAN uses a data placement approach that is most analogous to an object store. VMs that live on vSAN storage are comprised of several storage objects - which can be thought of as a unit of data. VMDKs, VM home namespace, VM swap areas, snapshot delta disks, durability data, and snapshot memory maps are all examples of storage objects in vSAN. Object data is placed across hosts in the cluster in a manner that ensures data resilience. Resilience, space efficiency, security, and other settings related to a vSAN object are easily managed by the Administrator through the use of storage policies. For a primer on vSAN objects and components, see the post: "vSAN Objects and Components Revisited."
A vSAN object is comprised of one or more "components." Depending on the object size, applied storage policy, and other environmental conditions, an object may consist of more than one component. This sharded data is simply an implementation detail of vSAN and not a manageable entity.
The vSAN OSA has a maximum of 9,000 components per vSAN host. A small 4-host vSAN cluster would have a limit of 36,000 components per cluster whereas a larger 32-host vSAN cluster would have a limit of 288,000 components per cluster. The Skyline Health Service for vSAN in vCenter Server includes a health check called "Component" that monitors component count at a cluster level and per-host level and alerts if the hosts are nearing their threshold. A yellow health alert warning will be triggered at 80% of the host component limit, while a red health alert error will be triggered at 90% of the host component limit.
Note that the vSAN ESA has a maximum component limit of 27,000 per host. This 3x increase is to help ensure that the additional components used in the ESA do not hit the previous limit of 9,000. At this time, the maximum supported number of VMs per host for vSAN (ESA and OSA) remains 200.
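The warning and error thresholds can be derived directly from the per-host limit and the cluster size. A simple sketch of that arithmetic (the host count shown is a placeholder):

# Component alert thresholds for a cluster
$hostCount    = 8
$perHostLimit = 9000                       # 27000 for clusters running the ESA
$clusterLimit = $hostCount * $perHostLimit
"Warning at 80%: {0} components" -f ($clusterLimit * 0.8)   # yellow health alert
"Error at 90%: {0} components" -f ($clusterLimit * 0.9)     # red health alert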
A component limit per host exists primarily to keep host resource consumption to reasonable levels. A distributed scale-out storage system like vSAN must create and manage data about the data. This metadata is what allows for the seamless scalability and adaptability of a vSAN cluster. As data is sharded into more components, additional resources may be consumed to manage the data. The component limit helps keep memory and CPU requirements of vSAN to reasonable levels while still maintaining resources for guest VM consumption. In most cases, the hard limit of 9,000 components per host should be sufficient for the soft limit of 200 VMs per vSAN host.
The component and VM limits add another dimension to capacity management considerations that will impact both the design of a vSAN cluster, as well as the operation of a vSAN cluster. See the section topic "Estimating Approximate 'Effective' Free/Usable Space in vSAN Cluster" for more details.
Recommendations to Mitigate
There can be some circumstances where component counts in a vSAN cluster are approaching their limit. In those relatively rare cases, the following recommendations can help mitigate these issues.
- Minimize use of storage policies using the stripe width rule. The stripe width rule aims to improve the performance of vSAN under very specific conditions. In many scenarios, it will unnecessarily increase the component count while having little to no effect on performance. See: Using Number of Disk Stripe Per Object on vSAN-Powered Workloads in the vSAN Operations Guide for more detail.
- Choose your data placement scheme (RAID-1, RAID-5 or RAID-6) wisely. For smaller VMs, RAID-1 data placement schemes will typically produce the fewest number of components, while VMs consuming more capacity may use fewer components per object with a RAID-5 or RAID-6 erasure code. This is because a component has a maximum size of 255 GB. If a VMDK is 700GB in size, under an FTT=1, RAID-1 data placement scheme, it would create a total of 7 components across three hosts: 3 components for each object replica, and 1 component for the witness. This same 700GB object using a RAID-5 erasure code would create just 4 components across four hosts, as illustrated in the sketch following this list.
- Upgrade to the latest version of vSAN. Recent updates to vSAN reduce the number of components created for objects using RAID-5/6 erasure coding, especially for objects larger than 2TB. See the post: Stripe Width Improvements in vSAN 7 U1 for more details.
- Ensure sufficient free capacity in the cluster. Capacity constrained clusters may split components up into smaller chunks to fit the component on an available host which can increase component count.
- Maintain good snapshot hygiene. Snapshots create additional objects (applicable to the OSA only), which adds to the overall component count, often doubling the component count of the VM with each and every snapshot. Make sure that snapshots are used only for temporary purposes, such as with your VADP-based backup solution or an API-driven solution such as VMware Horizon using clones. If you are using snapshots as part of an existing workflow, ensure that they are temporary and that multiple snapshots are not retained for a given object.
- Add another host to the vSAN cluster. Adding another host can quickly relieve the pressure of clusters approaching a per-host maximum. Let's use a 7-host vSAN cluster with a theoretical component limit of 63,000 as an example. A component count of 57,000 would trigger a red health alert error, as it exceeds 90% utilization. Adding a host to the cluster would increase the theoretical component limit to 72,000. The component count of 57,000 would not even trigger a health alert warning, as it falls below the 80% threshold. Once the host was added, the Automatic Rebalancing in a vSAN cluster would take steps to evenly distribute the data across the hosts.
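To illustrate the component math referenced in the placement scheme guidance above, the sketch below estimates the minimum data component count for a single VMDK under RAID-1 (FTT=1) and RAID-5 (3+1, OSA), assuming the 255GB maximum component size and ignoring additional splits caused by capacity constraints or stripe width rules.

# Minimum component counts for a 700GB VMDK (OSA)
$vmdkGB         = 700
$maxComponentGB = 255

# RAID-1, FTT=1: two full replicas plus one witness component
$raid1 = 2 * [math]::Ceiling($vmdkGB / $maxComponentGB) + 1      # = 7

# RAID-5 (3+1): data plus parity spread across four components
$raid5 = 4 * [math]::Ceiling(($vmdkGB / 3) / $maxComponentGB)    # = 4

"RAID-1 components: $raid1"
"RAID-5 components: $raid5"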
Summary
vSAN components are an implementation detail of vSAN's object-based architecture. While there is no need to manage the underlying components that make up the vSAN objects, the recommendations above should help minimize the conditions in which hosts within a vSAN cluster are near their component limit.
Viewing vSAN Cluster Partitions in the Health Service UI
vSAN inherently employs a highly resilient and distributed architecture. The network plays an important role in accommodating this distributed architecture.
Each host in the vSAN cluster is configured with a VMkernel port tagged with vSAN traffic and should be able to communicate with other hosts in the cluster. If one or more hosts are isolated, or not reachable over the network, the objects in the cluster may become inaccessible. To restore, resolve the underlying network issue.
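Verifying which VMkernel ports are tagged for vSAN traffic across all hosts can be done quickly with PowerCLI. A minimal sketch, assuming an existing connection to vCenter; the cluster name is a placeholder.

# List VMkernel adapters in a cluster and whether vSAN traffic is enabled on them
Get-Cluster -Name "Cluster-01" | Get-VMHost | Get-VMHostNetworkAdapter -VMKernel |
    Select-Object VMHost, Name, IP, VsanTrafficEnabled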
There are multiple network-related validations embedded as part of the health service to detect and notify when there is an anomaly. These alerts ought to be treated with the highest priority, specifically the vSAN cluster partition. The health service UI can provide key diagnostic information to help ascertain the cause.
Recommendation: Focus on discovering the root cause of the cluster partition issue. A triggered cluster partition health check is often a symptomatic alert resulting from some other issue that was the cause. If the cause is from an issue captured by another Skyline health check, it will show up in the new health check correlation feature as the "Primary issue" of the cluster partition.
Accessing the health service UI
The vSAN Skyline Health service UI provides a snapshot of the health of the vSAN cluster and highlights areas needing attention. Each health check validates whether a certain condition is met. It also provides guidance on remediation when there is a deviation from expected behavior. The UI can be accessed by clicking on the vSAN cluster and viewing the Monitor tab. The specific “vSAN cluster partition” health check is a good starting point to determine the cluster state. A partition ID represents the cluster as a single unit. In an ideal state, all hosts reflect the same partition ID. Multiple subgroups within the cluster indicate a network partition requiring further investigation. At a micro level, this plausibly translates to an object not having access to all of its components.
FIGURE 14-3: Identifying unhealthy network partitions in a vSAN cluster
The network section in the health service UI has a plethora of network tests that cover some basic yet critical diagnostics, such as ping, MTU Check, Unicast connectivity, and host connectivity with vCenter. Each health check can systematically confirm or eliminate a layer in the network as the cause.
Recommendation: As with any network troubleshooting, a layered methodology is strongly recommended (top-down or bottom-up).
Summary
With vSAN-backed HCI, data availability is directly dependent on the network, unlike traditional storage. Built-in network-related health checks aid in early detection and diagnosis of network-related issues.
Verifying CEIP Activation
Customer Experience Improvement Program (CEIP) is a phone-home system that aids in collecting telemetry data, shipping it over to VMware’s Analytics Cloud (VAC) at regular intervals. The feature is enabled by default in recent releases. The benefits of joining CEIP are described here: “vSAN Support Insight.” There are a few validation steps required to ensure that the telemetry data is shipped and available in VAC.
The verification process is twofold:
- Ensuring the feature is enabled in vCenter in the environment
- Ensuring the external network allows communication from vCenter to VAC
The first step is fairly straightforward and within the purview of a vSphere admin, who can log in and check from vCenter. The second step has a dependency on the external network and security setup.
Validation in vCenter
The new HTML5 client has an improved UI categorization of the tabs. CEIP is categorized under Monitor → vSAN → Support. The following screenshot depicts how it appears when enabled.
FIGURE 14-4: Checking the status of CEIP
Alternatively, this can also be verified by traversing to Menu → Deployment → Customer Experience Improvement Program.
External network and security validation
For CEIP to function as designed, the vCenter server needs to be able to reach VMware portals: vcsa.vmware.com and vmware.com. The network, proxy server (if applicable), and firewall should allow outbound traffic from vCenter to the portals above. The network validation is made easy with an embedded health check, “Online health connectivity,” which validates internet connectivity from vCenter to the VMware portal. Alternatively, this can also be verified manually through a secure shell from vCenter.
Sample command and output (truncated for readability):
vcsa671 [ ~ ]$ curl -v https://vcsa.vmware.com:443
* Rebuilt URL to: https://vcsa.vmware.com:443
* Connected to vcsa.vmware.com (10.113.62.242) port 443 (#0)
Recommendation: Ensure CEIP is enabled to benefit from early detection of issues, alignment with best practices, and faster resolution times.
Summary
CEIP aids in relaying critical information between the environment and VMware Analytics Cloud that can help improve the product experience. It is enabled by default, and an embedded health check can be used to periodically monitor connectivity between vCenter and the VMware portals.
Monitoring and Management of Isolated vSAN Environments
Some environments require full isolation of any management access of a vSAN cluster from the Internet. While this is quite easy to do, it can pose additional operational challenges with asynchronous health check updates, troubleshooting, and incident support with VMware Global Support.
vSAN 7 U2 introduces support for the VMware Skyline Health Diagnostics Tool (SHD). The Skyline Health Diagnostics tool is a self-service tool that brings some of the benefits of Skyline health directly to an isolated environment. The tool is run by an administrator at a frequency they desire. It will scan critical log bundles to detect issues, and give notifications and recommendations to important issues and their related KB articles. The goal for our customers is a faster time to resolution for issues, and for isolated environments, this is the tool to help with that.
For further information, you can read about the Skyline Health Diagnostics tool at: Introducing the VMware Skyline Health Diagnostics Tool
Recommendation: For isolated environments, use the SHD Tool and the VMware vSAN VCG Notification Service together to help improve the management of your isolated vSAN clusters. The VCG notification service can provide hardware-based compatibility updates without any connectivity of the infrastructure to the internet. Administrators or others can view and subscribe to change notifications of compatibility in hardware against the VCG, as it relates to firmware, hypervisor versions, and drivers. More information at: vSAN VCG Notification Service - Protecting vSAN Hardware Investments.
Summary
The Skyline Health Diagnostics tool introduces new levels of visibility and flexibility for isolated vSAN environments, and should be implemented in any environment where external connectivity is limited.
Cluster Health Scoring in vSAN
vSAN 8 U1 (both the ESA and OSA) introduces a completely new way to consume and render the intelligence from the Skyline Health for vSAN engine. The result is an impressive new way to quickly determine the state of the cluster, and remediate any outstanding issues. For more information, see the post: "Skyline Health Scoring, Diagnostics and Remediation in vSAN 8 U1."
Section 15: Monitoring vSAN Performance
Navigating Across the Different Levels of Performance Metrics
The vSAN performance service provides storage-centric visibility to a vSAN cluster. It is responsible for collecting vSAN performance metrics and presents them in vCenter. A user can set the selectable time window from 1 to 24 hours, and the data presented uses a 5-minute sampling rate. The data may be retained for up to 90 days, although the actual time retained may be shorter based on environmental conditions.
vSAN 8 U1 introduces new high-resolution performance metrics. This allows the administrator to monitor critical performance metrics using 30-second intervals, which are much more representative of the actual workload than the longer 5-minute intervals. For more information, see the post: "High Resolution Performance Monitoring in vSAN 8 U1." This capability is available in both the ESA and the OSA.
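The metrics collected by the performance service can also be retrieved programmatically for reporting. The sketch below is hedged: Get-VsanStat is the relevant PowerCLI cmdlet, but the exact metric names (e.g., "Performance.ReadIops") and their availability depend on the PowerCLI and vSAN versions in use, and the cluster name is a placeholder.

# Retrieve cluster-level vSAN read IOPS for the last hour
$cluster = Get-Cluster -Name "Cluster-01"
Get-VsanStat -Entity $cluster -Name "Performance.ReadIops" -StartTime (Get-Date).AddHours(-1) -EndTime (Get-Date)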
Levels of navigation
The vSAN performance service presents metrics at multiple locations in the stack. As shown in FIGURE 15-1, vSAN-related data can be viewed at the VM level, the host level, the disk and disk group level, and the cluster level. Some metrics such as IOPS, throughput, and latency are common at all locations in the stack, while more specific metrics may only exist at a specific location, such as a host. The performance metrics can be viewed at each location simply by highlighting the entity (VM, host, or cluster) and clicking on Monitor → vSAN → Performance.
FIGURE 15-1: Collects and renders performance data at multiple levels
The metrics are typically broken up into a series of categories, or tabs at each level. Below is a summary of the tabs that can be found at each level.
- VM Level
- VM: This tab presents metrics for the frontend VM traffic (I/Os to and from the VM) for the selected VM.
- Virtual disk: This presents metrics for the VM, broken down by individual VMDK, which is especially helpful for VMs with multiple VMDKs.
- Host Level
- VM: This tab presents metrics for the frontend VM traffic (I/Os to and from the VM) for all VMs on the selected host.
- Backend: This tab presents metrics for all backend traffic, as a result of replica traffic and resynchronization data.
- Disks: This tab presents performance metrics for the selected disk group, or the individual devices that compose the disk group(s) on a host.
- Physical adapters: This tab presents metrics for the physical uplink for the selected host.
- Host network: This tab presents metrics for the specific or aggregate VMkernel ports used on a host.
- iSCSI: This tab presents metrics for objects containing data served up by the vSAN iSCSI service.
- Cluster Level
- VM: This tab presents metrics for the frontend VM traffic (I/Os to and from the VM) for all VMs in the selected cluster.
- Backend: This tab presents metrics for all backend traffic as a result of replica traffic and resynchronization data.
- iSCSI: This tab presents metrics for objects containing data served up by the vSAN iSCSI service.
vSAN 7 U1 introduced a new VM consolidated performance view. This solves some of the difficulties when attempting to compare performance metrics of more than one VM, side by side, and can be extremely helpful in doing comparisons and correlations. vSAN 7 U2 introduced a vSAN Top Contributors view at the cluster level. This helps administrators quickly see the VMs and disk groups that contribute the most demand on resources provided by the vSAN cluster, and pairs nicely with the VM consolidated performance view.
Typically, the cluster level is an aggregate of a limited set of metrics, and the VM level is a subset of metrics that pertain to only the selected VM. The host level is the location at which there will be the most metrics, especially as it pertains to the troubleshooting process. A visual mapping of each category can be found in FIGURE 15-2.
FIGURE 15-2: Provides vSAN-specific metrics and other vSphere-/ESXi-related metrics
Note that the performance service can only aggregate performance data up to the cluster level. It cannot provide aggregate statistics across multiple vSAN clusters; Aria Operations can achieve that result. Which metrics are most important? They all relate to each other in some form or another, and the conditions of the environment and the root cause of a performance issue will dictate which metrics are more significant than others. For more general information on troubleshooting vSAN, see the topic “Troubleshooting vSAN Performance” in this document. For a more detailed understanding of troubleshooting performance, as well as definitions of specific metrics found in the vSAN performance service, see “Troubleshooting vSAN Performance” on core.vmware.com.
Recommendation: If you need longer periods of storage performance retention, use Aria Operations. The performance data collected by the performance service does not persist after the service has been turned off and back on. Aria Operations fetches the performance data directly from the vSAN performance service, so the data remains consistent and stays intact even if the performance service needs to be disabled and re-enabled.
The information provided by the vSAN performance service (rendered in the vCenter Server UI) is the preferred starting point for most performance data collection and analysis scenarios. Depending on the circumstances, there may be a need for additional tooling that exposes different types of data, such as vSAN IOInsight, introduced in vSAN 7 U1. Another helpful tool, introduced with vSAN 7 U3, is the VM I/O Trip Analyzer, which aims to help administrators more easily identify the primary points of contention (bottlenecks). Common tools used for performance diagnostics are listed in Appendix B of the Troubleshooting vSAN Performance document.
Summary
The vSAN performance service is an extremely powerful feature that, in an HCI architecture, takes the place of the storage array metrics typically found in a three-tier architecture. Since vSAN is integrated directly into the hypervisor, the performance service offers metrics at multiple levels in the stack and can provide outstanding levels of visibility for troubleshooting and further analysis.
Troubleshooting vSAN Performance
Troubleshooting performance issues is a common challenge for many administrators, regardless of the underlying infrastructure and topology. A distributed storage platform like vSAN also introduces other elements that can influence performance, and the practices for troubleshooting should accommodate those. Use the metrics in the vSAN performance service to isolate the sources of the performance issue.
While originally developed prior to the debut of the ESA, the framework described is generally applicable to ESA clusters as well.
The performance troubleshooting workflow
The basic framework for troubleshooting performance in a vSAN environment is outlined in FIGURE 15-3. Each of the five steps is critical to identifying the root cause properly and mitigating it systematically.
FIGURE 15-3: The troubleshooting framework
“Troubleshooting vSAN Performance” on core.vmware.com provides a more complete treatment of the performance troubleshooting process.
The order of review for metrics
Once steps 1–3 have been completed, begin using the performance metrics. The order in which the metrics are viewed can help decipher what level of contention may be occurring. FIGURE 15-4 shows the order of review that helps better understand and isolate the issue; it is the same order used in “Appendix C: Troubleshooting Example” in “Troubleshooting vSAN Performance.”
FIGURE 15-4: Viewing order of performance metrics
Here is a bit more context to each step:
- View metrics at the VM level to confirm unusually high storage-related latency. Verify that there is in fact storage latency as seen by the guest VM.
- View metrics at the cluster level to provide context and look for other anomalies. This helps identify potential “noise” coming from somewhere else in the cluster.
- View metrics on the host to isolate the type of storage I/O associated with the latency.
- View metrics on the host, looking at the disk group level to determine the type and source of latency.
- View metrics on the host, looking at the host network and VMkernel metrics to determine if the issue is network related.
Steps 3–5 assume that one has identified the hosts where the VM’s objects reside. Host-level metrics should be viewed only on the hosts where the objects for the particular VM in question reside. For further information on the different levels of performance metrics in vSAN, see the topic “Navigating Across the Different Levels of Performance Metrics.”
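To complement step 1, below is a minimal PowerCLI sketch for confirming the storage latency observed by the guest VM using the standard vSphere virtual disk counters, independent of the vSAN performance service. The VM name “VMname” is a placeholder, the realtime interval assumes the host running the VM is connected, and the virtualDisk counters report latency in milliseconds per virtual disk instance.
# Pull the last 60 realtime samples of virtual disk read/write latency for the VM
Get-Stat -Entity (Get-VM -Name "VMname") -Realtime -MaxSamples 60 -Stat "virtualdisk.totalreadlatency.average","virtualdisk.totalwritelatency.average" | Sort-Object Timestamp | Select-Object Timestamp, Instance, MetricId, Value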
Viewing metrics at the disk group level can provide some of the most significant insight of all the metrics. However, they shouldn’t be viewed in complete isolation, as there will be influencing factors that affect these metrics.
Recommendation: Be diligent and deliberate when changing your environment to improve performance. Changing multiple settings at once, overlooking a simple configuration issue, or not measuring the changes in performance can often make the situation worse, and more complex to resolve.
Summary
While tracking down the primary contributors to performance issues can be complex, there are practices to help simplify this process and improve the time to resolution. This information, paired with the “Troubleshooting vSAN Performance” guide on core.vmware.com, is a great start to better understanding how to diagnose and address performance issues in your own vSAN environment.
Monitoring Resynchronization Activity
Resynchronizations are a common activity in a vSAN environment. They are simply the process of replicating data across the vSAN cluster so it adheres to the conditions of the assigned storage policy, which determines levels of resilience, space efficiency, and performance. Resynchronizations occur automatically and are the result of policy changes to an object, host or disk group evacuations, rebalancing of data across a cluster, and object repairs should vSAN detect a failure condition.
Methods of visibility
Resynchronization visibility is available in multiple ways: through vCenter, Aria Operations, and PowerCLI. The best method depends on what you are attempting to view and your familiarity with the available tools.
Viewing resynchronizations in vCenter
Resynchronization activity can be found in vCenter in two different ways:
- At the cluster level as an enumerated list of objects currently being resynchronized
- At the host level as time-based resynchronization metrics for IOPS, throughput, and latency
Find the list of objects resynchronizing in the cluster by highlighting the cluster and clicking on Monitor → vSAN → Resyncing Objects, as shown in FIGURE 15-5.
FIGURE 15-5: Viewing the status of resynchronization activity at the cluster level
Find time-based resynchronization metrics by highlighting the desired host and clicking on Monitor → vSAN → Performance → Backend, as shown in FIGURE 15-6.
FIGURE 15-6: A breakdown of resynchronization types found in the host-level view of the vSAN performance metrics
Recommendation: Discrete I/O types can be “unticked” in these time-based graphs. This can provide additional clarity when deciphering the type of I/O activity occurring at a host level.
Viewing resynchronizations in Aria Operations
Aria Operations 7.0 and later offer all-new levels of visibility into resynchronizations in a vSAN cluster. It can be used to augment the information found in vCenter, as the resynchronization intelligence found in Aria Operations is not readily available within the vSAN performance metrics found in vCenter.
Aria Operations can provide an easy-to-read resynchronization status indicator for all vSAN clusters managed by the vCenter Server. FIGURE 15-7 displays an enumerated list of all vSAN clusters managed by the vCenter Server, and the current resynchronization status of each.
FIGURE 15-7: Resynchronization status of multiple vSAN clusters
Aria Operations provides burn down rates for resynchronization activity over time. Measuring a burn down rate provides context in a way that simple resynchronization throughput statistics cannot. A burn down graph for resynchronization activity provides an understanding of the extent of data queued for resynchronization, how far along the process is, and the trajectory toward completion. Most importantly, it measures this at the cluster level, eliminating the need to gather this data per host to determine the activity across the entire cluster.
Aria Operations renders resynchronization activity in one of two ways:
- Total objects left to resynchronize
- Total bytes left to resynchronize
A good example of this is illustrated in a simple dashboard shown in FIGURE 15-8, where several VMs had their storage policy changed from using RAID-1 mirroring to RAID-5 erasure coding.
FIGURE 15-8: Resynchronization burn down rates for objects, and bytes remaining
When paired, the “objects remaining” and “bytes left” views help us understand the correlation between the number of objects to be resynchronized and the rate at which the data is being synchronized. Observing rates of completion using these burn down graphs helps one better understand how Adaptive Resync in vSAN dynamically manages resynchronization rates during periods of contention with VM traffic. These charts are easily combined with VM latency graphs to see how vSAN prioritizes different types of traffic during these periods of contention.
Burn down graphs can provide insight when comparing resynchronization activities at other times, or in other clusters. For example, FIGURE 15-9 shows burn down activity over a larger time window. We can see that the amount of activity was very different during the periods that resynchronizations occurred.
FIGURE 15-9: Comparing resynchronization activity—viewing burn down rates across a larger time window
The two events highlighted in FIGURE 15-9 represent a different quantity of VMs that had their policies changed. This is the reason for the overall difference in the amount of data synchronized.
Note that, as of Aria Operations 7.5, the visibility of resynchronization activity is not a part of any built-in dashboard. But you can easily add this information by adding widgets to a new custom dashboard or an existing dashboard.
Viewing resynchronizations in PowerCLI
Resynchronization information can be gathered at the cluster level using the following PowerCLI command:
Get-VsanResyncingComponent -Cluster (Get-Cluster -Name "Clustername")
Additional information will be shown with the following:
Get-VsanResyncingComponent -Cluster (Get-Cluster -Name "Clustername") | Format-List
See the “PowerCLI Cookbook for vSAN” for more PowerCLI commands and how to expose resynchronization data.
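Building on the commands above, below is a minimal sketch (assuming a single connected vCenter Server, and that the VsanClusterConfiguration output of the PowerCLI release in use exposes the VsanEnabled property) that summarizes how many components are currently resynchronizing in each vSAN-enabled cluster, similar in spirit to the multi-cluster status view in Aria Operations.
# Count resynchronizing components per vSAN-enabled cluster
Get-Cluster | Where-Object { (Get-VsanClusterConfiguration -Cluster $_).VsanEnabled } | ForEach-Object {
    $resync = @(Get-VsanResyncingComponent -Cluster $_)
    [pscustomobject]@{ Cluster = $_.Name; ResyncingComponents = $resync.Count }
}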
Summary
Resynchronizations ensure that data stored in a vSAN cluster meets all resilience, space efficiency, and performance requirements as prescribed by the assigned storage policy. They are a normal part of any properly functioning vSAN environment and can be easily viewed using multiple methods.
Network Monitoring of vSAN-Powered Clusters
Understanding the health and performance of the network is an important part of ensuring a hyperconverged platform like vSAN is running at its very best. A distributed storage system like vSAN depends heavily on the network that connects the hosts, as it is the hosts in the cluster that make up the storage system. Network interruptions can generate packet loss that, even at relatively low levels, can degrade the effective throughput of the communication required for storage traffic.
vSAN 7 U2 introduces several new metrics to monitor network connectivity. These new metrics, with customizable thresholds, complement the other network metrics found in editions prior to vSAN 7 U2. In addition to these new and existing network metrics, there are additional metrics found in the “Performance for Support” section of the vCenter Server UI. These metrics are typically intended for GSS cases but are still visible to the administrator.
Recommendation: Use tooling that provides visibility into the operation of the network switches. Network switch configurations and performance are outside of the management domain of vSphere but should be monitored with equal importance. Something as simple as an open-source solution such as Cacti may be suitable.
vSAN 7 U3 introduces new metrics and health checks to provide better visibility into the switch fabric that connects the vSAN hosts and to ensure higher levels of consistency across a cluster. Duplicate IP detection is now part of the health checks, as is detection of LACP synchronization issues that can occur with LAG configurations. vSAN 7 U3 also adds a check of the LRO/TSO configuration status for the participating network interface cards. Ensuring consistency of LRO/TSO settings helps detect issues that may result from inconsistent configurations.
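These health checks can also be triggered from PowerCLI. Below is a minimal sketch, with “Clustername” as a placeholder; since the properties of the returned health summary vary by PowerCLI release, the output is simply expanded with Format-List.
# Run the vSAN health checks for the cluster and expand the returned summary
$health = Test-VsanClusterHealth -Cluster (Get-Cluster -Name "Clustername")
$health | Format-List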
Mitigating Network Connectivity Issues in a vSAN cluster
As with any type of distributed storage system, VMware vSAN is highly dependent on the network to provide reliable and consistent communication between hosts in a cluster. When network communication suffers, impacts may be seen not only in the expected performance of the VMs in the cluster but also in vSAN's automated mechanisms that ensure data remains available and resilient in a timely manner.
In this type of topology, issues unrelated to vSAN can lead to a systemic issue across the cluster because of vSAN's dependence on the network. Examples include improper firmware or drivers for the network cards used on the hosts throughout the cluster, or configuration changes in the switchgear that are not ideal. Leading indicators of such issues include:
- Much higher storage latency than previously experienced. This would generally be viewed at the cluster level by highlighting the cluster, clicking Monitor → vSAN → Performance, and observing the latency.
- Noticeably high levels of network packet loss. Degradations in storage performance may be related to increased levels of packet loss on the network used by vSAN. Recent editions of vSAN have enhanced levels of network monitoring, which can be viewed by highlighting a host, clicking on Monitor → vSAN → Performance → Physical adapters, and looking at the relevant packet loss and drop rates. A PowerCLI sketch for collecting these counters across hosts follows this list.
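As referenced above, below is a minimal PowerCLI sketch (with “Clustername” as a placeholder) that pulls the physical NIC statistics from each host in the cluster through esxcli; receive and transmit drop counters appear among the returned fields. Property names on the returned objects may vary slightly by ESXi release.
foreach ($vmhost in Get-Cluster -Name "Clustername" | Get-VMHost) {
    $esxcli = Get-EsxCli -VMHost $vmhost -V2
    foreach ($nic in $esxcli.network.nic.list.Invoke()) {
        # Equivalent to running "esxcli network nic stats get -n <vmnic>" on the host
        "{0} - {1}" -f $vmhost.Name, $nic.Name
        $esxcli.network.nic.stats.get.Invoke(@{ nicname = $nic.Name })
    }
}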
Remediation of such issues may require care to minimize potential disruption and expedite the correction. When the above conditions are observed, VMware recommends holding off on corrective actions such as host restarts, and reaching out to VMware Global Support Services (GSS) for further assistance.
Summary
The network providing the fabric of connectivity for vSAN hosts plays a critical role in the overall level of performance, availability, and consistency. vSAN 7 U2 introduces new metrics to help improve visibility into the underlying network.
About the Authors
This documentation was a collaboration across the vSAN Technical Marketing team. The guidance provided is the result of extensive product knowledge and interaction with vSAN; discussions with vSAN product, engineering, and support teams; as well as scenarios commonly found in customer environments.