The Compelling Case for HCI Mesh Compute Clusters

November 19, 2021

VMware's HCI Mesh feature in vSAN is a great example of how a single feature can immediately improve how resources across an environment are consumed.  It debuted in vSAN 7 U1 and allowed a vSAN cluster to borrow storage resources from other vSAN clusters.  vSAN 7 U2 extended this capability by enabling traditional vSphere clusters to use storage resources provided by a vSAN cluster.  The latter is known as an “HCI Mesh Compute Cluster” and is the focus of this post.

Before we look at the power of HCI Mesh compute clusters, let's revisit how HCI Mesh came to be, and why it works so well.

Why Disaggregation at the Cluster Level Makes the Most Sense

Before vSAN 7 U1, customers asked for the capability to add discrete compute-only or storage-only nodes to a cluster.  It is not unusual for a request to conflate the need with how that need should be met, as was the case here.  Distilling the request down to its fundamental problem statement revealed that customers wanted the option to separate, or disaggregate, storage and compute resources in an HCI environment for more flexibility and better utilization.  The next question was how to achieve it.  Unfortunately, disaggregation through storage-only or compute-only nodes leaves a lot to be desired.

Dedicated storage-only or compute-only nodes in a distributed system are fraught with challenges in both healthy and unhealthy conditions.  Figure 1 illustrates that in normal operating conditions, balancing compute, storage, and network resources can be difficult, as the asymmetry inevitably places more strain on some systems than others and potentially results in inconsistent performance.  In failure conditions, where one or more storage-only nodes in a cluster become unavailable, a large percentage of the cluster's capacity may go offline at once, making the rebuilding of data to other nodes in the cluster difficult because there may not be sufficient resources to do so.

The challenge of balancing resources inside of a cluster

Figure 1. The challenges of asymmetry in healthy and unhealthy conditions.

Resilient distributed systems and clustering are most effective and efficient when the resources are reasonably uniform across the nodes that comprise the system.  Compute-only or storage-only hosts within a distributed system undermine this principle.
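
To make the capacity argument concrete, here is a minimal Python sketch that compares a uniform cluster with an asymmetric one of the same total raw capacity.  The node sizes, the 84 TB of stored data, and the function names are illustrative assumptions and do not come from vSAN itself.  The rebuild check is deliberately rough: it only asks whether the surviving raw capacity could hold the data already stored, and ignores storage policy placement rules.

```python
def capacity_lost_fraction(node_capacities_tb, failed_nodes):
    """Fraction of raw cluster capacity that goes offline when the given nodes fail."""
    total = sum(node_capacities_tb)
    lost = sum(node_capacities_tb[i] for i in failed_nodes)
    return lost / total


def can_rebuild(node_capacities_tb, used_tb, failed_nodes):
    """Rough check: can the surviving nodes hold all of the data currently stored?"""
    surviving = sum(
        cap for i, cap in enumerate(node_capacities_tb) if i not in failed_nodes
    )
    return surviving >= used_tb


# Hypothetical six-node clusters, both with 120 TB of raw capacity.
uniform = [20, 20, 20, 20, 20, 20]      # every node contributes storage
asymmetric = [40, 40, 40, 0, 0, 0]      # three storage-only, three compute-only nodes

used_tb = 84                            # assumed 70% of raw capacity consumed
failed = {0}                            # a single node contributing storage fails

for name, cluster in (("uniform", uniform), ("asymmetric", asymmetric)):
    lost = capacity_lost_fraction(cluster, failed)
    ok = can_rebuild(cluster, used_tb, failed)
    print(f"{name}: {lost:.0%} of capacity lost, rebuild possible: {ok}")
```

Under these assumptions, a single failure removes one sixth of the uniform cluster's capacity but one third of the asymmetric cluster's, and only the uniform cluster has enough surviving capacity to hold the existing data.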

Since the storage resources in a vSAN cluster are presented as a single unit (a datastore), VMware believed that the better way to implement storage/compute disaggregation was at the cluster level.  With this approach, where VMs in a "client" cluster can consume storage from a "server" vSAN cluster, the administrator knows that the server cluster is already sufficiently designed.  It was already built to tolerate failures and can provide the necessary data services to VMs in other clusters just as it does for VMs in the vSAN cluster. 

Providing disaggregated storage at the cluster level also avoids some of the fragility issues that can occur with a storage-only or compute-only node approach, because the complexity of certain host failures versus other host failures does not need to be considered.  The result is simplicity and agility for the administrator managing resources and accommodating change.

Most importantly, HCI Mesh does not compromise the power and technical advantages of vSAN treating storage as a resource of the cluster.  Serving up storage as a cluster resource is extremely powerful, and that still stands as the default behavior of vSAN cluster deployments.  HCI Mesh simply complements that capability by offering a purposeful way to consume storage resources from another cluster.

Why HCI Mesh Works so Well

Much of the credit for the power and robustness of HCI Mesh goes to the architectural approach used by VMware to present vSAN storage to other clusters natively.  vSAN is composed of various layers with different responsibilities.  This stack was developed for the specific needs of a distributed storage system.  HCI Mesh uses the same componentry in the stack to maintain the native capabilities that one would see if VMs in a vSAN cluster were consuming its own storage resources.

Why is the native vSAN stack used instead of some other method of sharing storage?  Figure 2 helps illustrate the reason.  The left side of Figure 2 illustrates a purely theoretical approach of presenting vSAN storage to another cluster using NFS or iSCSI.  These approaches would have complicated the data path and inhibited performance, efficiency, scalability, and compatibility objectives.   Losing functionality and increasing complexity through one of these other theoretical approaches was antithetical to the design objectives. 

HCI Mesh Design approach

Figure 2.  Comparing a theoretical approach of presenting vSAN storage to a remote cluster, versus the native vSAN method used in HCI Mesh.

The right side of Figure 2 illustrates how HCI Mesh was implemented.  For an HCI Mesh compute cluster, a thin layer of vSAN is enabled on the hosts in the client cluster.  This layer takes on the same responsibilities as the top layers of the vSAN stack in a traditional vSAN cluster, which maintains both an efficient control path and an efficient data path.  This efficiency not only minimizes overhead but also gives the client full awareness of the data it is responsible for.  Communication build-up and tear-down, as well as per-object locking and quorums, behave in the same way as one would find in a traditional vSAN cluster.

The Power of Compute Clusters

HCI Mesh isn't just for environments with multiple vSAN clusters.  In vSAN 7 U2 and later, a traditional vSphere cluster can use a vSAN cluster as primary storage, or augment other external storage resources that a vSphere cluster uses.  Perhaps best of all, it doesn't require any additional licensing on the hosts in the vSphere clusters consuming the storage in the vSAN cluster. 

HCI Mesh Compute cluster topology

Figure 3.  vSphere clusters and vSAN clusters accessing storage from a “server” vSAN cluster.

When a vSphere cluster mounts a datastore provided by a vSAN cluster, the vSAN cluster in many ways behaves like a traditional storage array, but with enhanced levels of intelligence and flexibility found in a vSAN cluster.  Here are just a few benefits of this topology:

  • Mitigate capacity or bandwidth-constrained conditions on your storage array.  Relieve some of the pressure from your existing storage arrays by performing a storage vMotion over to a vSAN cluster that is remotely mounted by the vSphere cluster.  This will allow the vSAN cluster to take over all storage-related responsibilities of the migrated VMs.
  • Consume unused resources, lowering TCO.  Did you overbuild your storage capacity in your vSAN cluster or perhaps your CPU capacity in a vSphere cluster?  Use this stranded capacity by placing the VM instance and its VMDKs in the clusters that make the most sense.
  • Use as swing storage for upgrades and maintenance.  HCI Mesh can give you another datastore to use for upgrades, maintenance, and system replacement.
  • Examine your workloads on vSAN storage.  Are you interested in expanding your vSAN footprint (on-premises, in the cloud, or a hybrid of both) but want to ease into this transition?  Try some of your workloads on vSAN, where the VM instance remains in the vSphere cluster but its data resides on vSAN storage, to better understand how that transition could occur.
  • Use the latest hardware.  It is easy to build a vSAN cluster with the very latest hardware.  Take advantage of that all-flash vSAN cluster built with 100Gb networking, Intel Optane devices at the buffer tier, and NVMe devices at the capacity tier.
  • Take advantage of storage policies.  Use storage policies applied at the VM or even VMDK level to suit your resilience, efficiency, and performance requirements. 
  • Take advantage of cluster services.  Do you have a vSAN cluster using Data-at-Rest Encryption?  Simply storage vMotion a VM to the desired vSAN cluster, and now you have a VM powered by your vSphere cluster, but safely encrypted by the data services on the vSAN cluster.
  • Reduce application licensing.  Do you have a high-performance database that forces you to license every host in a cluster?  Easily create a 2-host vSphere cluster and mount a vSAN datastore to meet your storage needs while minimizing licensing costs.
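
The "consume unused resources" idea in the list above can be sketched in a few lines of Python.  This is a toy model, not a vSphere or vSAN API: the `Cluster` class, the cluster names, and the capacity figures are all hypothetical.  It also assumes that the chosen storage cluster's datastore is already remotely mounted by the chosen compute cluster.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    free_cpu_ghz: float
    free_storage_tb: float  # 0 for a vSphere cluster with no vSAN datastore of its own


def place_vm(clusters, cpu_ghz, storage_tb):
    """Choose compute and storage independently, each from the cluster with the most headroom."""
    compute_candidates = [c for c in clusters if c.free_cpu_ghz >= cpu_ghz]
    storage_candidates = [c for c in clusters if c.free_storage_tb >= storage_tb]
    if not compute_candidates or not storage_candidates:
        return None  # no cluster can satisfy one of the two requirements
    compute = max(compute_candidates, key=lambda c: c.free_cpu_ghz)
    storage = max(storage_candidates, key=lambda c: c.free_storage_tb)
    return compute.name, storage.name


# Hypothetical environment: one CPU-rich vSphere cluster, one capacity-rich vSAN cluster.
clusters = [
    Cluster("vsphere-a", free_cpu_ghz=200, free_storage_tb=0),
    Cluster("vsan-b", free_cpu_ghz=20, free_storage_tb=80),
]

print(place_vm(clusters, cpu_ghz=40, storage_tb=10))
```

In this example neither cluster could host the VM on its own: one lacks storage and the other lacks CPU headroom.  Choosing the two resources separately puts the VM instance on the vSphere cluster and its VMDKs on the vSAN cluster.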

VMware tackled the challenge of disaggregation in a smart way.  Using native vSAN protocols and presenting storage at the cluster level makes it behave exactly the way you'd expect it to work.  It gives an administrator a new tool to address a classic problem: accommodating change that wasn't expected.

 

Recommendation:  Make sure your spine and leaf network can handle the potential shifting of network traffic when using HCI Mesh.  For more information, see the post:  How vSAN Cluster Topologies Change Network Traffic.

Summary

With vSAN 7 U2 and later, a vSAN cluster can easily be the storage array you never knew you had.  HCI Mesh is VMware's answer to providing more flexibility in the use of storage resources already deployed in other vSAN clusters.  Give it a try!  For more details, see the FAQs on HCI Mesh and the HCI Mesh Tech Note on core.vmware.com.

@vmpete
