vSAN Top 10 Operational Tips
vSAN provides powerful capabilities that can be customized to your requirements. Before getting started, verify that your configuration and operational processes follow VMware recommendations. These tips will help you achieve the best performance, highest availability, and ease of operations.
Setup and Configuration
- Tip #1: Use Multiple Storage Policies
Storage Policy Based Management (SPBM) enables precise, per-VM management of resilience and capacity consumption. The use of multiple vSAN storage policies is recommended. This helps avoid changes that impact a large number of virtual machines, which could cause a temporary spike in capacity utilization if the way data is stored must be reconfigured. An example of this is changing from RAID-1 mirroring to RAID-5/6 erasure coding. Storage policy changes should be made to smaller groups of virtual machines, sized according to the amount of free space on the vSAN datastore. For more details, see the following blog article: vSAN Operations: Maintain Slack Space for Storage Policy Changes
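The batching idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a VMware tool: the function name `plan_policy_change_batches`, the `reserve_gb` parameter, and the VM sizes are hypothetical. The overhead multipliers are the commonly cited raw-capacity factors (2x for RAID-1 with one failure to tolerate, about 1.33x for RAID-5, 1.5x for RAID-6). The sketch conservatively assumes the new layout of every VM in a batch is fully written before the old layout is released, and that batches run one at a time.

```python
# Commonly cited raw-capacity multipliers per protection scheme (assumption:
# RAID-1 and RAID-5 tolerate one failure, RAID-6 tolerates two).
OVERHEAD = {"RAID-1": 2.0, "RAID-5": 4 / 3, "RAID-6": 1.5}


def plan_policy_change_batches(vm_sizes_gb, new_scheme, free_gb, reserve_gb):
    """Group VMs into batches whose transient capacity need fits the budget.

    vm_sizes_gb: dict mapping VM name to logical (pre-overhead) size in GB.
    new_scheme:  key of OVERHEAD describing the target policy.
    free_gb:     free raw capacity on the datastore right now.
    reserve_gb:  free capacity that must stay untouched (slack space).
    """
    budget = free_gb - reserve_gb
    batches, current, current_cost = [], [], 0.0
    for name, size_gb in vm_sizes_gb.items():
        # Worst case: the complete new layout coexists with the old one.
        cost = size_gb * OVERHEAD[new_scheme]
        if cost > budget:
            raise ValueError(f"{name} alone needs {cost:.0f} GB, budget is {budget:.0f} GB")
        if current and current_cost + cost > budget:
            batches.append(current)
            current, current_cost = [], 0.0
        current.append(name)
        current_cost += cost
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    vms = {"app01": 100, "app02": 100, "db01": 100}  # hypothetical sizes in GB
    print(plan_policy_change_batches(vms, "RAID-6", free_gb=500, reserve_gb=150))
```

The estimate is deliberately pessimistic. It ignores the space reclaimed after each batch completes, so real headroom usually improves as the change progresses.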
- Tip #2: Use Cluster QuickStart and VMware vSphere Update Manager to Configure, Expand, and Upgrade vSAN Clusters
Cluster QuickStart provides a streamlined workflow for configuring a new cluster according to recommendations based on VMware Validated Designs. This feature minimizes the amount of time needed to set up a new cluster and helps enforce consistency across the cluster. Consistent configuration based on best practices leads to the best experience in terms of reliability, performance, and ease of management. Cluster QuickStart also improves the process of adding new hosts to an existing cluster. More information about this feature can be found in the following tech note: Cluster QuickStart
vSphere Update Manager (VUM) makes it easy to apply patches and perform upgrades to vSphere and vSAN. VUM baselines are used to determine if an environment should be updated. Updates can be small patches or full version upgrades. If the Customer Experience Improvement Program (CEIP) is enabled, VUM uses the VMware Content Catalog, the VMware Compatibility Guide (VCG), and the specifics of the cluster’s hardware configuration to recommend vSphere and vSAN versions. VUM can also perform firmware and driver updates for select storage controllers.
- Tip #3: Select the Right Tool for HCI Performance Testing
Running a single virtual machine with a synthetic workload does not provide an accurate indication of full vSAN cluster performance. It is very common to have multiple workloads running on a cluster in a production environment. Performance testing should incorporate a similar approach.
HCIBench is recommended for performance testing vSphere-based HCI clusters. It is essentially an automation wrapper around the popular and proven Vdbench open-source benchmark tool. HCIBench automates the end-to-end process of deploying test virtual machines, coordinating workload runs, aggregating test results, and, if needed, collecting data for troubleshooting purposes.
- Tip #4: Maintain Adequate Slack Space
Similar to nearly all other storage solutions, “slack space” or free space is required for various operations in a vSAN environment. Free capacity is needed for tasks and scenarios such as virtual machine snapshots, disk utilization rebalancing, hardware failures, and fault tolerance method reconfiguration. Keeping 25-30% of the vSAN datastore capacity free is a good starting point. You will see warnings in the vSphere Client when vSAN datastore capacity utilization increases above 80%, i.e., less than 20% free space remains.
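As a rough illustration of the arithmetic, the following Python sketch classifies a datastore against the two thresholds mentioned above: the 25% lower bound of the slack-space starting point and the 80% utilization level at which the vSphere Client warns. The function name `slack_space_status` and its parameters are hypothetical, and the capacity figures would have to come from your own monitoring. The sketch does not query vSAN.

```python
def slack_space_status(capacity_tb, used_tb, target_free_pct=25.0, warn_used_pct=80.0):
    """Classify datastore utilization against slack-space thresholds.

    Returns a dict with the used and free percentages and a status of
    "ok", "below-target" (less free space than the target), or
    "warning" (utilization above the warning level).
    """
    if capacity_tb <= 0:
        raise ValueError("capacity_tb must be positive")
    if not 0 <= used_tb <= capacity_tb:
        raise ValueError("used_tb must be between 0 and capacity_tb")

    used_pct = 100.0 * used_tb / capacity_tb
    free_pct = 100.0 - used_pct

    if used_pct > warn_used_pct:
        status = "warning"
    elif free_pct < target_free_pct:
        status = "below-target"
    else:
        status = "ok"
    return {"used_pct": used_pct, "free_pct": free_pct, "status": status}


if __name__ == "__main__":
    # Hypothetical datastores: (raw capacity in TB, used capacity in TB).
    for capacity, used in [(100, 70), (100, 78), (100, 85)]:
        print(capacity, used, slack_space_status(capacity, used)["status"])
```

The middle state is useful in practice because it flags a datastore that has dropped under the recommended slack but has not yet triggered a warning in the vSphere Client.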
- Tip #5: Reboot vSAN Hosts with Caution and Patience
HCI environments distribute data across local drives in each host that are part of the cluster. An offline host means a reduction in the amount of compute and storage resources in an HCI cluster. Therefore, rebooting a host should not be the first step in troubleshooting an issue. Use maintenance mode for planned downtime such as patching and hardware replacement.
Rebooting a vSAN host takes longer than rebooting a non-vSAN host. When a host reboots, vSAN parses logs on the host's vSAN cache drives. This helps ensure data integrity before the host returns to service in the vSAN cluster, and the additional step naturally requires more time. You can monitor the progress of this activity on the ESXi console by pressing ALT+F11. It is best to remain patient during this process – avoid rebooting a host (again) while it is going through the boot process. See the blog article titled vSAN Operations: Use Out-of-Band Management to view vSphere DCUI During Host Restarts for more information.
- Tip #6: Design Disk Groups for Resilience and Performance
This tip is more of an initial design recommendation, but it also applies to operations such as cluster expansion.
A vSAN disk group consists of one device for the cache tier and one to seven devices for the capacity tier. A host can have up to five disk groups. While one disk group per host is supported, two or more disk groups per host are recommended. This design approach has the following benefits:
- Multiple disk groups mean multiple cache devices. More cache devices can lead to better performance.
- If the cache device belonging to a vSAN disk group fails, the entire disk group is taken offline. A host with multiple disk groups can continue to provide storage to the vSAN cluster even though one of the disk groups is offline.
This blog article provides some guidance on disk group design and sizing: Designing vSAN Disk groups – All Flash Cache Ratio Update.
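The disk group limits described in this tip can be expressed as a small design check. The Python sketch below is illustrative only: the `DiskGroup` class and `check_host_disk_groups` function are hypothetical names, and the check covers nothing beyond the rules stated above (one cache device and one to seven capacity devices per disk group, one to five disk groups per host, two or more recommended).

```python
from dataclasses import dataclass


@dataclass
class DiskGroup:
    cache_devices: int      # devices in the cache tier
    capacity_devices: int   # devices in the capacity tier


def check_host_disk_groups(groups):
    """Return a list of findings for one host's planned disk groups.

    Findings starting with "error" violate a stated limit; findings
    starting with "warning" are supported but not recommended.
    """
    findings = []
    if not 1 <= len(groups) <= 5:
        findings.append(f"error: host has {len(groups)} disk groups, expected 1 to 5")
    for index, group in enumerate(groups, start=1):
        if group.cache_devices != 1:
            findings.append(f"error: disk group {index} needs exactly one cache device")
        if not 1 <= group.capacity_devices <= 7:
            findings.append(f"error: disk group {index} needs 1 to 7 capacity devices")
    if len(groups) == 1:
        # A single cache device failure would take all of this host's storage offline.
        findings.append("warning: single disk group, two or more are recommended")
    return findings


if __name__ == "__main__":
    planned = [DiskGroup(cache_devices=1, capacity_devices=4),
               DiskGroup(cache_devices=1, capacity_devices=4)]
    print(check_host_disk_groups(planned) or "design is within the stated limits")
```

Running a check like this against a planned expansion is a cheap way to catch a layout that is supported but leaves a host dependent on a single cache device.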
- Tip #7: Implement a vSphere Distributed Switch
vSphere Distributed Switch (VDS) is included with vSAN licenses. The Network I/O Control (NIOC) feature of VDS enables traffic shaping and limits to prevent traffic types such as vMotion from consuming all of the available bandwidth in a shared networking environment. Lack of sufficient bandwidth for vSAN traffic can impact performance. If vSAN does not have dedicated physical interfaces on each host, NIOC should be used to help ensure vSAN has adequate bandwidth.
Turning on Link Layer Discovery Protocol (LLDP), another feature of VDS, is also recommended. With LLDP, vSphere administrators can determine which physical switch port connects to a given vSphere distributed switch. You can view properties of the physical switch such as chassis ID, system name and description, and device capabilities from the vSphere Client when LLDP is enabled.
- Tip #8: Use Dedicated and Isolated vmkernel Ports
Isolated storage traffic with dedicated, redundant connections can improve performance, uptime, and security. Separate VLANs or subnets should be considered where multiple clusters with different security and compliance requirements are running in the same network environment. vSAN Encryption can be used to protect data-at-rest, but this feature does not encrypt vSAN network traffic. The following blog article provides more information on this topic: Designing vSAN Networks – Using Multiple Interfaces?
Monitoring and Alerting
- Tip #9: Enable CEIP and Pay Attention to vSAN Health
More than 50 health checks are included in vSAN Health. They span a variety of configuration items and runtime metrics, including host hardware, network conditions, capacity utilization, firmware and driver versions, configuration consistency, build recommendations, and vSAN object health.
Warnings and critical alerts are easily seen in the vSphere Client. Administrators should respond to these items in a timely manner to help ensure the best performance and highest levels of uptime. vSAN Health assists with remediation efforts by providing specific information about an issue and a link to the VMware Knowledge Base article that provides additional details and resolution steps.
Customers enrolled in CEIP receive dynamic updates to vSAN Health as new issues and recommendations are discovered. CEIP also provides detailed yet secure visibility of customer environments to VMware Global Support Services (GSS). This feature is known as vSAN Support Insight. It can significantly improve the support experience and help reduce the time to issue resolution.
- Tip #10: Deploy vRealize Log Insight
vRealize Log Insight delivers heterogeneous and highly scalable log management with intuitive, actionable dashboards and sophisticated analytics for vSAN environments. Aggregating vSphere, vSAN, network, and other logs provides better overall visibility when an issue arises. This can lead to faster root cause discovery and problem resolution.
This short document provides 10 operational tips for managing an HCI environment powered by vSAN. As with most solutions, you will get the best results when following recommendations provided by the solution vendor. Many of these recommendations are built into the vSAN UI and proactively provided to administrators through features such as CEIP, vSAN Support Insight, and vSAN Health.