vSAN Max Design and Operational Guidance
Introduction
VMware vSAN Max™ is a new storage offering that provides petabyte-scale disaggregated storage for vSphere clusters, powered by the vSAN Express Storage Architecture™, or ESA. It gives customers the ability to deploy a highly scalable storage cluster to be used as primary storage for vSphere clusters*, or to augment storage for traditional vSAN HCI clusters. The new vSAN Max offering will be licensed separately from existing vSAN editions.
vSAN Max is powered by the ESA, which provides tremendous flexibility in meeting performance, capacity, and resilience requirements for all types of environments. Since the ESA is designed to power traditional vSAN HCI clusters and disaggregated vSAN Max clusters, the flexibility of ESA may invite non-optimal configurations when deployed as a vSAN Max cluster. The following is a collection of recommendations in the design, operation, and optimization of vSAN Max clusters.
This document is arranged in the order of Planning and Sizing, Day-0 Initial Deployment and Configuration, and Day-2 Operations to help step you through the guidance in an orderly manner.
* vSAN Max will be offered as a subscription and is planned to be licensed on a per-tebibyte metric. For the most up-to-date vSAN Max pricing and packaging, please view the vSAN Licensing Guide.
Planning and Sizing
vSAN ReadyNode Host Specifications and Sizing
Use vSAN ReadyNodes certified for vSAN Max.
vSAN Max should be deployed with vSAN ReadyNodes that are certified for vSAN Max. These special vSAN ReadyNodes share some similarities with the vSAN-ESA-AF-8 ReadyNode profile, but may have unique specifications specifically suited for vSAN Max.
Know what can and cannot be changed in a ReadyNode certified for use with vSAN Max.
The vSAN ESA ReadyNode program for hosts configured in an aggregated, vSAN HCI configuration offers a lot of flexibility in customizing ReadyNode configurations, as noted in the document “What you Can (and Cannot) Change in a vSAN ESA ReadyNode.”
vSAN ESA ReadyNodes designed for vSAN Max may have stricter specifications in order to meet their intended use. Additional resources may become available that will elaborate on allowable modifications to vSAN ReadyNodes certified for use with vSAN Max.
Ensure any adjustments to a ReadyNode certified for vSAN Max do not compromise its design objective.
vSAN ReadyNodes are designed to meet specific performance and capacity objectives. For example, reducing a resource type, such as the quantity of storage devices per host, to fewer than 8 devices may inhibit the desired vSAN Max performance capabilities.
Understand how a desired raw cluster capacity can be achieved differently, and how this may affect resource utilization.
vSAN Max clusters can achieve desired capacity goals in many ways. For a given amount of raw capacity serving the same number of VMs, one can use fewer hosts with a higher density of resources per host, or use more hosts with a lower density of resources per host. Each has its advantages and disadvantages. The vSAN ReadyNode Sizer will provide what it determines to be an ideal configuration based on your design inputs, and in most cases this will be sufficient. You may wish to adjust the specifications to meet your other design preferences.
vSAN Max clusters using fewer hosts with higher density of resources per host.
Advantages | Disadvantages |
---|---|
Lower hardware costs. | Larger percentage of resource impact upon a failure of a host. |
Able to meet capacity objectives and stay within recommended maximum host count for a vSAN Max cluster (24). | Potentially more strain on any given network uplink supporting a host. |
Less rack space* used, with fewer network ports used on ToR switches. | Higher likelihood of running into per-host component limits (27,000) for vSAN ESA. |
Higher number of hosts in client cluster(s) that can mount the vSAN Max datastore while staying under the total host limit (128). | May not meet recommended cluster minimums for desired topology and resilience levels (ex: Stretched clusters using secondary levels of resilience through RAID-6). |
 | Lower aggregate performance across the cluster because of a higher concentration of workloads (and their working sets) on any given host. |

*Comparison when using servers of the same physical form factor, such as 1U, 2U, or 4U.
vSAN Max clusters using more hosts with lower density of resources per host.
Advantages | Disadvantages |
---|---|
Lower percentage of resource impact upon a failure of a host. | Higher hardware costs. |
Potentially less strain on any given network uplink supporting a host. | May not be able to meet capacity objectives while staying within recommended maximum host count for a vSAN Max cluster (24). |
Potentially faster resynchronizations because of less resource contention and distributing effort across more hosts. | More rack space* used, with more network ports used on ToR switches. |
Lower likelihood of running into per-host component limits (27,000) for vSAN ESA. | Fewer hosts in client cluster(s) that can mount the vSAN Max datastore while staying under the total host limit (128). |
May be able to meet recommended cluster minimums for desired topology and resilience levels (ex: Stretched clusters using secondary levels of resilience through RAID-6). | |
Higher aggregate performance across the cluster because of a lower concentration of workloads (and their working sets) on any given host. | |

*Comparison when using servers of the same physical form factor, such as 1U, 2U, or 4U.
“Capacity density” simply represents the number of storage devices in a storage node multiplied by the amount of capacity for each device. While “hosts with higher density of resources per host” and “hosts with lower density of resources per host” are not specifically defined, the comparison above will help provide some general understanding of advantages and disadvantages of the two approaches if one chooses to deviate from the ReadyNode Sizer results.
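The trade-off between the two approaches can be sketched with a small calculation. This is an illustrative sketch only: the `HostProfile` class, the `hosts_needed` helper, and the device counts and sizes are hypothetical values chosen for the example, not ReadyNode specifications or ReadyNode Sizer output.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class HostProfile:
    """Hypothetical host profile; values are illustrative, not ReadyNode specs."""
    devices_per_host: int
    device_capacity_tb: float

    @property
    def capacity_density_tb(self) -> float:
        # Capacity density = number of storage devices x capacity of each device.
        return self.devices_per_host * self.device_capacity_tb


def hosts_needed(raw_capacity_tb: float, profile: HostProfile) -> int:
    """Hosts required to reach a raw capacity target with a given profile."""
    return math.ceil(raw_capacity_tb / profile.capacity_density_tb)


# Two assumed profiles targeting the same raw capacity.
dense = HostProfile(devices_per_host=24, device_capacity_tb=16.0)
lean = HostProfile(devices_per_host=8, device_capacity_tb=8.0)

target_tb = 3072.0
print(hosts_needed(target_tb, dense))  # fewer, denser hosts
print(hosts_needed(target_tb, lean))   # more, leaner hosts
```

With these assumed numbers, the lower-density profile would need more hosts than the recommended maximum of 24 for a single vSAN Max cluster, which is the capacity disadvantage listed in the comparison above.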
Understand the difference in endurance for storage devices in vSAN Max ReadyNodes
If vSAN Max ReadyNodes provide options with storage devices that advertise multiple endurance ratings, selection of devices with the most appropriate endurance rating for your environment should be a part of your design. While vSAN 8 U2 and later will include a Skyline Health finding that will monitor endurance of these storage devices, your workloads and environment may be best suited for one endurance rating over the other. The vSAN ReadyNode sizer can help account for this design decision. See the post: “Expanded Hardware Compatibility for vSAN Express Storage Architecture” for more information.
Ensure all vSAN ReadyNodes used for vSAN Max include a Trusted Platform Module (TPM) device.
This will ensure that the keys issued to the hosts in a vSAN Max cluster using Data-at-Rest Encryption are cryptographically stored on a TPM. This will guarantee that the host will have the required keys during host restarts even if the key provider is unavailable. Even if you are not planning to use vSAN Encryption services, including TPMs in the hosts at the time of purchase is an affordable and prudent step toward preserving future configuration options. For more information, see the “Key Persistence” topic in the vSAN Encryption Services document.
Use the vSAN ReadyNode Sizer™ to meet capacity requirements.
The vSAN ReadyNode Sizer can help you to properly compose a storage cluster solution to meet all your requirements. It will produce the calculations necessary to account for capacity overheads for the ESA to make the sizing process easy and predictable.
Much like vSAN HCI cluster sizing, vSAN Max will provide and advertise its capacity in raw form, where the total capacity advertised by a cluster is the aggregate total of all the storage devices claimed by vSAN ESA. Since different levels of resilience can be applied using storage policies, this means that the amount of data consumed on the datastore will be based on the storage policy, and other overheads. For example, a 100GB virtual disk with an assigned storage policy of FTT=2 using RAID-6 will consume about 150GB of raw capacity in the cluster. Since we recommend FTT=2 using RAID-6 for all vSAN Max clusters, the resilience overheads should be easier to estimate since this aspect of the data overheads remains consistent. See the posts “Demystifying Capacity Reporting in vSAN” and “Capacity Overheads for the ESA in vSAN 8” for more information.
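The resilience overhead in the example above can be worked through as a short calculation. This is a minimal sketch that assumes RAID-6 is laid out as 4 data plus 2 parity segments, and it deliberately ignores the other overheads and any compression savings mentioned in the referenced posts; the function name is hypothetical.

```python
def raid6_raw_consumption_gb(logical_gb: float,
                             data_segments: int = 4,
                             parity_segments: int = 2) -> float:
    """Estimate raw capacity consumed by data protected with FTT=2 using RAID-6.

    Assumes a 4+2 stripe (an assumption of this sketch) and accounts only
    for the resilience overhead, not metadata or other overheads.
    """
    if logical_gb < 0:
        raise ValueError("logical_gb must not be negative")
    stripe_width = data_segments + parity_segments
    return logical_gb * stripe_width / data_segments


# A 100GB virtual disk consumes about 150GB of raw capacity.
print(raid6_raw_consumption_gb(100))
```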
Note that a vSAN Max cluster is intended for processing and storing data, not running guest VM workloads, which is enforced by the End User License Agreement (EULA). There may be system-instantiated VMs, such as vSphere Cluster Services (vCLS) VMs and agent VMs used to power protocol service containers for vSAN File Services, but the intention is for vSAN Max to serve exclusively as a storage cluster. If one would like to run guest VM workloads while providing storage to another vSAN cluster, one can use the vSAN HCI cross-cluster capacity sharing feature.
Design the vSAN Max cluster with the intention of resource symmetry across hosts in the vSAN Max cluster.
While vSAN HCI and vSAN Max clusters can accommodate asymmetric host configurations, the best cluster designs should strive for uniformity of all resources, including CPU, memory, network, and storage capacity, across each host that comprises the vSAN Max cluster. See the post “Asymmetrical vSAN Clusters – What is Allowed, and What is Smart” for more information.
Cluster Design and Sizing
A vSAN Max single site cluster should consist of a minimum of 7 hosts.
This will provide two benefits.
- Optimal Resilience. This will ensure that the cluster can support FTT=2 using RAID-6. While FTT=2 using RAID-6 only requires 6 hosts, a sustained failure of a host in a cluster consisting of just 6 hosts would leave an insufficient number of hosts for the cluster to regain its prescribed level of resilience. A cluster with 7 hosts would be able to automatically regain its prescribed level of resilience upon a sustained host failure. 7 hosts is also the minimum required host count for vSAN’s Auto-Policy Management feature to use RAID-6 when the cluster capacity management setting of Host Rebuild Reserve is enabled. For more information, see the post “Auto-Policy Management Capabilities with the ESA in vSAN 8 U1.”
- Reduce percentage of impact with a sustained host failure. As illustrated in the graph below, the percentage of impact of a host failure becomes much smaller as the cluster host count increases. A cluster with a minimum of 7 hosts would impact no more than about 14% of the storage resources (capacity and performance) in the event of a sustained host failure. Increasing the host count reduces the percentage of impact even more, to as low as about 4% for a 24 host vSAN Max cluster.
Figure. Recommended host count for a vSAN Max cluster.
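The percentages quoted above follow from simple division. The sketch below assumes symmetrical hosts, so that each host contributes an equal share of the cluster's capacity and performance; the function name is hypothetical.

```python
def host_failure_impact_pct(host_count: int) -> float:
    """Share of cluster storage resources lost when one host fails.

    Assumes every host contributes an equal share (a symmetrical cluster).
    """
    if host_count < 1:
        raise ValueError("host_count must be at least 1")
    return 100.0 / host_count


for hosts in (7, 12, 24):
    print(f"{hosts} hosts: {host_failure_impact_pct(hosts):.1f}% impact per host failure")
```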
A vSAN Max stretched cluster should consist of a minimum of 14 data hosts.
A stretched cluster with at least 7 hosts in each of the two data sites would ensure the cluster could support a storage policy with a secondary level of resilience of FTT=2 using RAID-6 erasure coding, and allow vSAN to regain its prescribed level of resilience in the event of a sustained host outage. An additional host beyond the minimum required allows vSAN to reconstruct the stripe and parity in the most efficient way, with the fewest performance implications. See the post “Using the vSAN ESA in a Stretched Cluster Topology” for more information.
For all cluster topology types, the recommended maximum cluster size for a vSAN Max cluster is 24 hosts.
With the initial release of vSAN Max, VMware recommends a maximum of 24 hosts in a cluster, but 32 hosts in a vSAN Max cluster is the enforced limit. The host count for the vSAN Max cluster and any vSAN compute clusters cannot exceed 128 hosts in total, as shown below.
Figure. The maximum number of hosts participating in a vSAN Max cluster and client clusters.
Limiting the vSAN Max cluster size to 24 hosts will allow for up to 104 hosts from vSAN compute clusters to mount the datastore. A vSAN Max cluster size of 32 hosts would allow for up to 96 hosts from vSAN compute clusters to mount the datastore, still offering a very good compute to storage ratio of 3:1.
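The host budget described above can be expressed as a short calculation. This sketch uses only the limits stated in this section (128 hosts in total, 32 hosts enforced per vSAN Max cluster); the constant and function names are hypothetical.

```python
TOTAL_HOST_LIMIT = 128           # vSAN Max cluster hosts plus all client cluster hosts
ENFORCED_MAX_CLUSTER_HOSTS = 32  # enforced limit for a vSAN Max cluster


def max_client_hosts(max_cluster_hosts: int) -> int:
    """Client cluster hosts that can still mount the datastore."""
    if not 1 <= max_cluster_hosts <= ENFORCED_MAX_CLUSTER_HOSTS:
        raise ValueError("vSAN Max cluster size must be between 1 and 32 hosts")
    return TOTAL_HOST_LIMIT - max_cluster_hosts


def compute_to_storage_ratio(max_cluster_hosts: int) -> float:
    """Ratio of client (compute) hosts to vSAN Max (storage) hosts."""
    return max_client_hosts(max_cluster_hosts) / max_cluster_hosts


print(max_client_hosts(24))          # 104
print(max_client_hosts(32))          # 96
print(compute_to_storage_ratio(32))  # 3.0
```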
Consider your overall capacity needs when determining initial cluster size.
vSAN Max provides tremendous flexibility in incremental scaling of capacity and performance through simply adding more hosts to an existing cluster. Consider your overall capacity requirements and forecasts when determining the ideal host count in a cluster.
For example, in a single site environment, if you anticipate needing 4 PB of raw capacity immediately, and 4 additional PB in the next 18 months, consider creating a vSAN Max cluster with 12 hosts to address the initial need, and a second vSAN Max cluster with 12 hosts for the additional expansion. This would allow each respective cluster to easily grow by adding hosts, because each is well under the recommended host count maximum for a vSAN Max cluster.
This approach would also allow more client clusters to mount the respective storage resources.
Client cluster compatibility considerations
vSphere clusters that mount a datastore provided by vSAN Max will have a thin layer of vSAN installed on the hosts to provide the connectivity. This step in the configuration of the client cluster is what makes a vSphere cluster a “vSAN compute cluster” as stated in the UI. Installation of the software is an automated one-time process that occurs on each host in a vSphere cluster.
vSAN HCI clusters can also act as a client cluster, connecting to a vSAN Max datastore. At this time, only vSAN HCI clusters using the ESA can mount a vSAN Max datastore. This is because vSAN Max is built using vSAN ESA.
Understand cluster licensing
The new vSAN Max offering will be licensed separately from existing vSAN editions, meaning that existing vSAN editions will not include vSAN Max. vSAN Max will be offered as a subscription and is planned to be licensed on a per-tebibyte metric. The vSAN Max license includes everything needed to run a vSAN Max cluster. No separate licenses for vSphere, etc. are needed.
vSAN Max requires vSAN ReadyNodes certified specifically for vSAN Max, and requires a minimum of 6 hosts. There will be no support for in-place upgrades/conversions from the other vSAN licensing types to vSAN Max.
For the most up to date vSAN Max pricing and packaging, please view the vSAN Licensing Guide.
Networking
Use fast networking from compute clusters to the vSAN Max cluster.
While 100Gb networking is required within the vSAN Max cluster, hosts from compute clusters may connect with as little as 10Gb networking. Unlike a cluster-capacity sharing model between two vSAN HCI clusters, the network uplinks of the compute cluster hosts may carry less of a burden, since they do not need to support any local vSAN HCI traffic.
While the support of 10Gb networking from vSAN compute clusters to vSAN Max clusters provides a lot of flexibility for extending the life of existing servers, ideally the vSAN compute clusters will use 25Gb or higher network connectivity. This will help reduce the chance of the network from the vSAN compute cluster to the vSAN Max cluster being the bottleneck.
While RDMA is supported in vSAN HCI clusters powered by the ESA, it is not supported in disaggregated environments at this time.
Understand how topologies can change traffic in a spine and leaf network.
With smaller vSAN HCI clusters, vSAN storage traffic typically stayed within the Top-of-Rack (ToR) leaf switches. With larger vSAN HCI clusters, clusters using vSAN’s Fault Domains feature, or vSAN HCI clusters using cluster-capacity sharing, this traffic will traverse across the network spine.
Depending on the size of the environment, vSAN Max may also affect how traffic traverses a network. For example, several other vSphere clusters residing in other racks will traverse the network spine to connect to the vSAN Max cluster providing storage resources. Ensuring sufficient resources at the network spine will improve performance and resilience.
Figure. How topologies may affect network traffic.
Ensure proper networking connectivity between vSAN compute clusters and vSAN Max clusters.
Since the communication between compute clusters and vSAN Max clusters is latency sensitive storage traffic, we recommend simplified network connectivity between clusters. This means avoiding Firewalls and IDS/IPS systems that may inadvertently block this mission critical storage I/O in a manner that could cause substantial disruption. Network overlays are supported, but if one runs vSAN traffic through a VDS that is managed by NSX-T, use VLAN-backed port groups to prevent a loss of access to the host or the availability of VMs. For more information, see the video “vSAN Quick Questions – Can I run vSAN traffic through a network overlay, Firewall, IDS or NSX?”
With any type of storage traffic, redundancy of connectivity from end to end is important to ensure I/O is transmitted in a timely and reliable manner even in the event of a single network connection failure. While teaming multiple NICs in a host is common practice for all vSphere and vSAN environments, ensuring redundancy from the hosts in the server cluster (vSAN Max cluster) to the client clusters (vSphere clusters) and the connecting switch fabric will provide a robust environment.
Day-0 Initial Deployment and Configuration
vSAN Max Cluster services configuration
Preparing the vSAN Max cluster for its initial configuration.
Even though a vSAN Max cluster does not host any user-created guest VMs, some vSphere configuration settings are necessary for proper functionality. Prior to initiating any vSAN Max configuration workflow, please ensure the following:
- Use vDS in cluster configuration. Ensure that virtual Distributed Switches (vDS) are used with all relevant VMkernel ports configured in the cluster. vDS functionality is available as a part of the vSAN Max license. Recommendations on network configuration choices such as NIC teaming generally align with guidance provided for vSAN clusters. The vSAN Design Guide has a “Network Design Considerations” section that provides an overview of recommendations, with more extensive information provided in the vSAN Network Design Guide.
- Ensure that DRS and HA are configured. These services are available as a part of the vSAN Max license.
- Ensure that vMotion is configured. The configuration of VMkernel ports with vMotion traffic tagged will help ensure mobility of management VMs.
Configure a new cluster for vSAN Max.
Configuring a new cluster to serve as a vSAN Max cluster is easy. Simply create a new cluster and name it as desired to complete the workflow. Do not enable vSAN in this initial workflow. Once the cluster is created, highlight the cluster, and click Configure > vSAN > Services. You will be presented with three options.
- vSAN HCI. This creates a traditional vSAN HCI cluster.
- vSAN Compute Cluster. This creates a vSphere cluster that can be used to connect to a vSAN Max cluster.
- vSAN Max. This creates a vSAN Max cluster.
Simply select “vSAN Max” and choose if you want it to be a single site vSAN Max cluster, or a stretched vSAN Max cluster, as shown below.
Figure. Initial configuration of a vSAN Max cluster.
Proceed with completing the workflow, which will present a few more options, such as Encryption services, and Auto-Policy management.
Enable the vSAN “Operations Reserve” and “Host Rebuild Reserve” toggles for single site vSAN Max clusters.
When enabled, these toggles help ensure there is sufficient free space in the cluster for internal operations and to rebuild data in the event of a sustained host failure. Note that when Host Rebuild Reserve is enabled, and paired with the Auto-Policy Management feature, it will require one additional host beyond the absolute minimum required by the storage policy. This is why we recommend 7 hosts at minimum for a single site vSAN Max cluster, where data can be stored using the highly resilient and space efficient FTT=2 using RAID-6, while still having a spare fault domain to regain prescribed levels of resilience in the event of a sustained host failure. See the post “Understanding ‘Reserved Capacity’ Concepts in vSAN” for more information.
The Operations Reserve and Host Rebuild Reserve toggles can be enabled by highlighting the cluster, clicking Configure > vSAN > Services > Reservations and Alerts as shown in the image below.
Figure. Enabling the Host Rebuild Reserve and Operations Reserve Toggles in vSAN.
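The host count reasoning behind these recommendations can be sketched as follows. This is an illustrative calculation based only on figures stated in this document (6 hosts required by FTT=2 using RAID-6, plus one spare host per data site); the function and parameter names are hypothetical, and the stretched cluster case simply mirrors the 7 hosts per data site recommended earlier.

```python
RAID6_POLICY_MIN_HOSTS = 6  # minimum host count for FTT=2 using RAID-6, per this guide


def recommended_min_hosts(policy_min_hosts: int = RAID6_POLICY_MIN_HOSTS,
                          spare_hosts: int = 1,
                          data_sites: int = 1) -> int:
    """Recommended minimum data hosts: policy minimum plus a spare, per data site."""
    if policy_min_hosts < 1 or spare_hosts < 0 or data_sites < 1:
        raise ValueError("invalid host or site count")
    return (policy_min_hosts + spare_hosts) * data_sites


print(recommended_min_hosts())              # single site
print(recommended_min_hosts(data_sites=2))  # stretched cluster, two data sites
```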
Enable the vSAN “Auto-Policy” management feature on all topology types when using vSAN Max.
This will ensure optimal levels of resilience and space efficiency for data stored on a vSAN Max cluster. A cluster-specific default storage policy will be created and tuned for the cluster based on the host count, topology type (ex: single site, stretched, vSAN Fault Domains), and if the Host Rebuild Reserve is enabled or not. For stretched clusters, Auto-Policy Management will also ensure that a secondary level of resilience is applied to the default storage policy, improving resilience. Be aware that Auto-Policy Management may suggest a new storage policy rule setting that may impact many VMs. Skyline Health for vSAN will provide a health finding that will assist in the change of this default storage policy, and vSAN will manage the rate that the VM objects will be changed to the new policy setting.
See the post “Auto-Policy Management Capabilities with the ESA in vSAN 8 U1” for more information.
The Auto-Policy Management feature can be enabled by highlighting the cluster, clicking Configure > vSAN > Services > Storage > EDIT as shown in the image below.
Figure. Enabling Auto-Policy management in vSAN Max.
Note that if vSAN Max clusters are used across multiple vCenter Server instances, where the client cluster is managed by a different vCenter Server than the vSAN Max cluster, the object’s storage policy assignment is controlled by the vCenter Server the object is managed from (e.g. the compute cluster). Therefore, the cluster-specific storage policy created by Auto-Policy Management will not be available for use in this circumstance. See the Storage Policies section for more information on this topic.
Enable the “Automatic Rebalance” cluster setting on all topology types when using vSAN Max.
This toggle will tell vSAN to rebalance data to reasonable levels of symmetry if capacity disparities across hosts or devices exceed its thresholds. A distributed storage system like vSAN Max will perform more consistently when resources are consumed in a balanced manner. See the post “Should Automatic Rebalancing be Enabled in a vSAN Cluster?” for more information.
The Automatic Rebalance feature can be enabled by highlighting the cluster, clicking Configure > vSAN > Services > Advanced Options > EDIT as shown in the image below.
Figure. Enabling Automatic rebalance in vSAN Max.
Ensure the Customer Experience Improvement Program (CEIP) is enabled.
The CEIP enables VMware to provide additional benefits to its customers through anonymized telemetry data. While it has been enabled by default for many versions of vSphere, double checking that this is enabled will help VMware provide the highest levels of product support. See the document “vSAN Support Insight” for more information.
The status of the CEIP can be viewed in the vSphere Client by clicking on Administration > Deployment > Customer Experience Improvement Program. Skyline Health for vSAN will also produce a health finding alert if it is not enabled on the vCenter Server instance, as shown below.
Figure. CEIP enabled verification using the vSAN Support Insight health check.
vSAN Compute Cluster
Creating a vSAN Compute cluster.
A vSAN compute cluster is simply a vSphere cluster that has a thin layer of vSAN installed for the purposes of mounting the remote vSAN Max datastore. Once a vSphere cluster is created, highlight the cluster, and click Configure > vSAN > Services. You will be presented with three options.
Select “vSAN Compute Cluster” and complete the workflow, as shown below.
Figure. Enabling connectivity from a vSAN Max cluster to a vSphere cluster.
The hosts in a vSphere cluster attempting to mount a vSAN Max datastore must be running vSphere 8 or later.
Mount a vSAN Max datastore to a vSAN Compute Cluster.
Once a vSphere cluster is configured as a vSAN compute cluster as shown above, one can easily mount the remote datastore provided by the vSAN Max cluster. One can highlight the vSphere cluster, click Configure > vSAN > Services > Mount Remote Datastores as shown below, or find the same ability to mount the remote datastores in Configure > vSAN > Services > Remote Datastores.
Figure. Mounting a datastore provided by vSAN Max.
It will then present the available vSAN Max datastore(s) that are eligible to mount. Select the desired remote datastore, and click “Next.” A compatibility check will be performed before the workflow completes.
Figure. Selecting the desired vSAN Max datastore for use with a vSphere cluster.
The datastore is now ready for use.
Be aware of the client cluster count limit for a vSAN Max datastore.
Design your vSAN Max cluster with the knowledge that the maximum number of client clusters is 10, as shown below. Client clusters can be vSphere clusters, also known in this context as “vSAN compute clusters,” and vSAN HCI clusters.
Figure. Maximum number of client clusters that can connect to a vSAN Max cluster.
The number of client clusters connecting to a vSAN Max datastore may not exceed 10. This limit is tightly coupled with the total host count participating in a vSAN Max cluster and any vSAN compute clusters. This limit is 128 hosts in total, as described earlier in this document.
A few reminders about cluster types:
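Because the client cluster limit and the total host limit interact, a proposed design can be checked against both at once. The following is a minimal sketch using only the limits stated in this document; `check_design` and its constants are hypothetical names, and this is not a VMware tool or API.

```python
from typing import List, Sequence

MAX_CLIENT_CLUSTERS = 10
TOTAL_HOST_LIMIT = 128
ENFORCED_MAX_CLUSTER_HOSTS = 32
RECOMMENDED_MAX_CLUSTER_HOSTS = 24


def check_design(max_cluster_hosts: int,
                 client_cluster_hosts: Sequence[int]) -> List[str]:
    """Return findings for a proposed design; an empty list means none were found."""
    findings: List[str] = []
    if max_cluster_hosts > ENFORCED_MAX_CLUSTER_HOSTS:
        findings.append("vSAN Max cluster exceeds the enforced limit of 32 hosts")
    elif max_cluster_hosts > RECOMMENDED_MAX_CLUSTER_HOSTS:
        findings.append("vSAN Max cluster exceeds the recommended maximum of 24 hosts")
    if len(client_cluster_hosts) > MAX_CLIENT_CLUSTERS:
        findings.append("more than 10 client clusters")
    total_hosts = max_cluster_hosts + sum(client_cluster_hosts)
    if total_hosts > TOTAL_HOST_LIMIT:
        findings.append(f"total host count {total_hosts} exceeds 128")
    return findings


# A 24 host vSAN Max cluster serving six client clusters of 16 hosts each.
print(check_design(24, [16] * 6))
# The same cluster with seven such client clusters exceeds the total host limit.
print(check_design(24, [16] * 7))
```

The second call shows that the total host limit can be reached well before the limit of 10 client clusters when the client clusters are large.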
- vSphere clusters. These provide a collection of cluster-specific resources, such as compute and memory for VMs. Storage resources are provided by an external shared storage solution.
- vSAN HCI clusters. These provide a collection of cluster-specific resources, such as compute, memory, and storage for VMs. When one is not using any type of cluster capacity sharing capability within vSAN, the storage is treated as an exclusive resource of the cluster.
- vSAN Max clusters. These provide a collection of storage resources for VMs residing in other clusters, acting as a shared storage solution for vSphere clusters, and even vSAN HCI clusters.
vSphere clusters can be comprised of between 1 and 96 hosts. Since vSAN Max disaggregates, or decouples, storage from compute resources, one can create vSphere clusters of a size that best meets the needs of the organization, and design the compute clusters to reflect the computational requirements of the applications, leaving the storage responsibilities up to vSAN Max.
Design of compute clusters connected to vSAN Max is no different than designing vSphere clusters using another external shared storage solution. While vSphere cluster design is a lengthy topic with considerations outside of the scope of this document, many of those same principles apply.
Ensure proper APD failure response in vSphere HA configuration.
Any vSphere cluster acting as a “client cluster” that mounts a vSAN Max datastore must have the proper response to an “All Paths Down” (APD) failure. When enabled and configured correctly on the client vSphere cluster, isolation events related to the connectivity between the client and the server cluster, or within the client cluster, will result in the VM on the client cluster being terminated and restarted by HA. The APD failure response can be set to “Power off and restart VMs - Aggressive restart policy” or “Power off and restart VMs - Conservative restart policy.” This HA cluster setting is not required for vSAN clusters that do not participate in vSAN’s disaggregation offerings.
Figure. Configuring HA for a vSphere cluster using a vSAN Max datastore.
While a vSAN Max cluster uses the vSAN network for HA heartbeats, the connecting compute clusters continue to use the vSphere management network for HA heartbeats, and not the vSAN network configured on the compute cluster.
Storage Policies
vSAN Max clusters should use storage policies that provide the highest levels of space-efficient resilience.
This translates to using a failures to tolerate (FTT) of 2, using space efficient erasure coding. The recommendations below show the storage policy to use for each topology type. See the “Cluster Design and Sizing” section in this document for more guidance on recommended cluster sizes for vSAN Max.
- Single site cluster or single sites using the vSAN Fault Domains feature. Use a storage policy with FTT=2 using RAID-6.
- Stretched clusters. Use a storage policy that provides site mirroring for site-level resilience, paired with FTT=2 using RAID-6 for a secondary level of resilience.
Enabling Auto-Policy Management in the cluster will ensure that the default storage policy is automatically configured using the highest-level space efficient resilience possible for the cluster. For more information, see the post “Auto-Policy Management Capabilities with the ESA in vSAN 8 U1.”
Do not use RAID-1 mirroring.
When using the vSAN ESA in vSAN HCI and vSAN Max clusters, RAID-6 erasure coding is faster than RAID-1 mirroring. RAID-1 mirroring will consume much more capacity than RAID-6. The only time RAID-1 is an acceptable option is with stretched clusters, where it is employed through a storage policy to mirror data across sites, while providing secondary levels using RAID-6.
Leave compression enabled in all vSAN Max topologies.
vSAN Max uses the ESA’s compression mechanism, which is controlled by storage policy. Leaving compression enabled will yield additional capacity efficiency while having little to no impact on storage performance. See the post “Using the vSAN ESA in a Stretched Cluster Topology” for a better understanding of why compression should remain enabled.
Understand storage policy behavior of VMs in vSAN compute clusters managed by a different vCenter Server than the vSAN Max cluster.
Storage policies are a construct of a vCenter Server instance. Currently there is not a way to provide storage policy management across multiple vCenter Server instances. When a VM on the client cluster managed by one vCenter Server is using the storage on a server cluster managed by a different vCenter Server, only one SPBM policy will take effect: The policy that is being used by the vCenter Server the object is being managed from. The SPBM engine on the other remote vCenter Servers will not see this VM so the policies on those vCenter servers will not impact the VM.
Figure. Storage policy usage in a cross-vCenter Server connection.
Auto-Policy Management enabled for a vSAN Max cluster will not apply this cluster-optimized default storage policy to any client cluster managed by another vCenter Server instance. One will need to select the appropriate RAID-6 storage policy on the vCenter Server instance managing the client cluster.
Day-2 Operations and Optimizations
Cluster updates and patching
Use “Ensure Accessibility” when entering a host in a vSAN Max cluster into maintenance mode.
vSAN Max will support the use of durability components, which not only improve the availability of the most recently written data during planned and unplanned host outages, but also dramatically reduce the time it takes to update data after the maintenance event completes. Given that all data in a vSAN Max cluster will be protected through storage policies with FTT=2 using RAID-6, data remains extremely resilient during maintenance events. Full evacuations of a host for maintenance purposes are largely unnecessary, and just consume valuable resources. Full evacuations may make sense if you are choosing to permanently remove a host from a vSAN Max cluster.
Become familiar with your server vendor’s offering for VMware’s vSphere Lifecycle Manager (vLCM).
Ensure that you have downloaded and installed your vendor’s plugin for vLCM known as a Hardware Support Manager (HSM), so that the proper desired state image can be created using the appropriate combination of hypervisor version and vendor drivers and firmware. This will make maintaining a vSAN Max cluster easy and predictable. If you are unfamiliar with vLCM, see the topic “Introducing vLCM into an Existing Environment” in the vSAN Operations Guide.
Scaling
Understand how to add resources to an existing vSAN Max cluster.
Resources in a vSAN Max cluster can be scaled easily, and incrementally. Adding more resources to a vSAN Max cluster will involve one of the two methods below.
- Scaling out. This simply means adding more hosts to a vSAN Max cluster. Adding hosts is an easy way to distribute the existing workloads across more storage resources and to increase the aggregate capacity and performance available to the data.
- Scaling up. This means adding more or higher density storage devices within the existing hosts that comprise a vSAN Max cluster. See the "Cluster Design and Sizing" section of this document to learn about strategies for growth by adding hosts to clusters.
Depending on your environment, existing hardware configurations, operational procedures, time, procurement processes and hardware availability, one option may be more suitable for you than another. As a vSAN Max cluster is scaled, we recommend striving for uniform levels of resources in both performance and capacity across all hosts in the vSAN Max cluster. See the post “Asymmetrical vSAN Clusters – What is Allowed, and What is Smart” for more information.
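The two scaling methods and the uniformity recommendation above can be compared with a short calculation. This is a sketch only: the host counts and per-host raw capacities are hypothetical, the helper names are made up for illustration, and the figures are raw capacity before any RAID-6 or reserve overhead.

```python
def cluster_raw_capacity_tb(host_capacities_tb: list[int]) -> int:
    """Sum the raw capacity contributed by every host in the cluster."""
    return sum(host_capacities_tb)


def is_uniform(host_capacities_tb: list[int]) -> bool:
    """True when every host contributes the same raw capacity."""
    return len(set(host_capacities_tb)) <= 1


baseline = [64] * 8                    # 8 hosts, 64 TB raw each (hypothetical)
scaled_out = baseline + [64] * 2       # scale out: add two identical hosts
scaled_up = [80] * 8                   # scale up: add devices to every host
partial_upgrade = [80] * 4 + [64] * 4  # scale up only half of the hosts

for name, hosts in [
    ("baseline", baseline),
    ("scaled out", scaled_out),
    ("scaled up", scaled_up),
    ("partial upgrade", partial_upgrade),
]:
    print(f"{name}: {cluster_raw_capacity_tb(hosts)} TB raw, uniform={is_uniform(hosts)}")
```

In this example both methods reach the same raw capacity, while the partial upgrade leaves the cluster asymmetrical, which is the situation the uniformity recommendation aims to avoid.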
Performance optimizations
If using multiple virtual disks in a VM, configure multiple paravirtual SCSI adapters in a VM's virtual hardware configuration.
This improves the guest operating system’s ability to queue additional I/O. The use of multiple VMDKs using multiple paravirtual SCSI adapters in a VM’s virtual hardware configuration has been a common recommendation by Independent Software Vendors (ISV) for running their applications optimally in VMs on most storage systems. This recommendation applies to vSAN Max as well. See the "Applications" section in the Troubleshooting vSAN Performance guide for more information.
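One way to plan such a layout is to spread a VM's virtual disks across the available adapters in round-robin order, as sketched below. This is an illustrative planning helper only, not a call into any vSphere API. The disk names are hypothetical, and the limit of four SCSI controllers per VM is assumed from the vSphere configuration maximums.

```python
MAX_SCSI_CONTROLLERS = 4  # assumed per-VM limit; check the configuration maximums


def assign_disks_to_controllers(
    disks: list[str], controllers: int = MAX_SCSI_CONTROLLERS
) -> dict[int, list[str]]:
    """Distribute virtual disks across paravirtual SCSI controllers round-robin."""
    if not 1 <= controllers <= MAX_SCSI_CONTROLLERS:
        raise ValueError(f"controllers must be between 1 and {MAX_SCSI_CONTROLLERS}")
    layout: dict[int, list[str]] = {c: [] for c in range(controllers)}
    for index, disk in enumerate(disks):
        layout[index % controllers].append(disk)
    return layout


layout = assign_disks_to_controllers(["os", "data1", "data2", "log1", "log2"])
for controller, attached in layout.items():
    print(f"SCSI controller {controller}: {attached}")
```

Each controller brings its own queue, so spreading busy VMDKs across controllers avoids funneling all guest I/O through a single adapter.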
Monitoring and Event Handling
Learn how to view remote datastores connected to vSAN Max.
For remote datastore connections between client clusters and vSAN Max managed by the same vCenter Server, one can view them by highlighting the vSAN Max cluster and clicking Configure > vSAN > Remote Datastores, as shown below.
Figure. Viewing remote datastores connected to vSAN Max.
Remote datastores can also be viewed by highlighting the vCenter Server instance, clicking on Configure > vSAN > Remote Datastores, as shown below. This view is helpful for connections between vSAN compute clusters and vSAN Max clusters managed by different vCenter Server instances.
Figure. Viewing remote datastores at the vCenter Server instance level.
Monitor Capacity usage to ensure sufficient capacity.
For single site vSAN Max clusters, enabling the vSAN “Operations Reserve” and “Host Rebuild Reserve” toggles dramatically improves the ability to maintain sufficient free space for transient vSAN Max operations, host failures, and incremental growth. One may wish to customize the capacity warning and error thresholds to suit the needs of the environment.
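The effect of customized warning and error thresholds can be reasoned about with a small check like the one below. This is a sketch under stated assumptions: the 70% and 90% thresholds are placeholder values chosen for illustration, not documented vSAN defaults, and the function does not model how the reserve toggles affect the calculation.

```python
def capacity_status(
    used_tb: float,
    total_tb: float,
    warning_pct: float = 70.0,  # hypothetical warning threshold
    error_pct: float = 90.0,    # hypothetical error threshold
) -> str:
    """Classify datastore usage as 'ok', 'warning', or 'error'."""
    if total_tb <= 0:
        raise ValueError("total_tb must be positive")
    if not 0 <= warning_pct <= error_pct <= 100:
        raise ValueError("thresholds must satisfy 0 <= warning <= error <= 100")
    used_pct = 100 * used_tb / total_tb
    if used_pct >= error_pct:
        return "error"
    if used_pct >= warning_pct:
        return "warning"
    return "ok"


for used in (300, 400, 475):
    print(f"{used} TB of 500 TB used: {capacity_status(used, 500)}")
```

Lowering the warning threshold gives more lead time for procurement in environments where adding hosts or devices takes weeks.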
Host Failures and Remediation.
Much like vSAN HCI clusters, vSAN Max clusters have mechanisms in place to ensure that the data stored maintains availability, and durability as prescribed by the applied storage policies.
Upon a host failure or isolation event, vSAN Max will wait before rebuilding the data elsewhere to regain the prescribed levels of compliance of the data. By default, it waits for 60 minutes to determine if it is a transient event that corrects itself, or a sustained event that requires a rebuild elsewhere. In most cases, leaving this timer at its default is advised. However, two considerations may influence a desire to adjust this setting.
- Resource density of the host configuration. Hosts configured with higher capacities may take longer to resynchronize data elsewhere upon a failure. Using higher storage capacities may change the desired time you wish for it to wait prior to initiating a rebuild.
- Failed host replacement workflows. Some environments use operational runbooks to adhere to service level agreements (SLA). To adhere to these clearly defined SLAs, these organizations have procedures in place that will replace the entire host regardless of the failure type, such as a fan failure. If an environment uses this type of approach, the object repair timer may need to be adjusted to fit this workflow and the operating SLAs.
The object repair timer can be adjusted by highlighting the cluster, clicking on Configure > vSAN > Services > Advanced Options > Object repair timer.
Note that resynchronizations apply only to the data stored, not the data capacity provided. For example, if a host is storing 10TB of data but it can provide 100TB of capacity, the resynchronization will only be for the 10TB of data.
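When weighing a change to the object repair timer, it can help to estimate how long data would remain out of compliance after a sustained host failure. The sketch below is a rough estimate only. The resynchronization rate is a hypothetical input that must be measured in your own environment, and the function name is made up for illustration. The 60-minute default matches the timer described above.

```python
def hours_to_regain_compliance(
    stored_tb: float,
    resync_tb_per_hour: float,
    repair_delay_minutes: float = 60.0,  # default object repair timer
) -> float:
    """Estimate hours from a host failure until rebuilt data is compliant.

    Only the data actually stored on the failed host is resynchronized,
    so the host's total capacity is deliberately not an input.
    """
    if stored_tb < 0 or repair_delay_minutes < 0:
        raise ValueError("stored_tb and repair_delay_minutes must be non-negative")
    if resync_tb_per_hour <= 0:
        raise ValueError("resync_tb_per_hour must be positive")
    return repair_delay_minutes / 60 + stored_tb / resync_tb_per_hour


# A host storing 10 TB, with an assumed resynchronization rate of 2 TB per hour.
print(hours_to_regain_compliance(stored_tb=10, resync_tb_per_hour=2))  # 6.0
```

Denser hosts store more data per host, so the second term grows with density while the timer stays fixed, which is one reason to revisit the timer for high-capacity configurations.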
Storage Device Failure and Remediation.
The impact of a storage device failure in vSAN Max will be limited to the failed device. vSAN Max will reconstruct, or resynchronize this data elsewhere in the cluster to regain its prescribed level of resilience. For this type of failure, since the boundary of failure is limited to just the storage device in question, a relatively small amount of resynchronization will be performed. See the post: “The Impact of a Storage Device Failure in vSAN ESA versus OSA” for more information.
When replacing the device, it will be important to understand how to identify the physical location of the failed device so that the correct device is replaced. vCenter Server allows you to highlight a desired storage device and turn on the device locator LED to correctly identify a specific device within a server. For any type of vSAN cluster, this can be found by highlighting the cluster, clicking Configure > vSAN > Disk Management, highlighting the host, clicking View Disks, highlighting the desired disk, and clicking on “Turn on LED” as shown below.
Figure. Managing storage devices in a vSAN Max cluster.
This functionality may be dependent on the capabilities of the server, the storage device, and prerequisite software from the server manufacturer to work. It is recommended to test this functionality in a server prior to entering it into a production environment.
Summary
VMware believes that aggregated vSAN HCI clusters and disaggregated vSAN Max clusters serving vSphere clusters can provide a powerful combination for your enterprise needs. The recommendations above will help customers achieve the highest levels of performance, resilience, and operational simplicity for their environments powered by vSAN Max.
Additional Resources
The following is a collection of useful links that relate to vSAN Max and vSAN ESA.
vSAN ESA Quick Link. This is a central repository of ESA-related content. It includes blog posts, FAQs, technical deep dives, interactive infographics, podcasts, and videos on vSAN as it relates to the Express Storage Architecture.
Performance Recommendations for vSAN ESA. This is a collection of recommendations to help achieve the highest levels of performance in a vSAN ESA cluster. Many of these same recommendations apply to vSAN Max.
vSAN Proof of Concept (PoC) Performance Testing. This is a collection of recommendations that will guide users to test the performance of a vSAN cluster. While it is currently written for the OSA, many of the testing methods used are also applicable to the ESA.
Design and Sizing for vSAN ESA clusters. This post offers some nice guidance on using the vSAN Sizer for the ESA that summarizes some key points that can be found in the VMware vSAN Design Guide.
vSAN Network Design Guide. This network design guide applies to environments running vSAN 8 and later.
About the Author
Pete Koehler is a Staff Technical Marketing Architect at VMware. With a primary focus on vSAN, Pete covers topics such as design and sizing, operations, performance, troubleshooting, and integration with other products and platforms.