Running Tier-1 Apps with HCI Mesh Solution Overview
vSAN HCI Mesh Delivers Efficiency For Changing Application Requirements
VMware® vSAN™ is the industry-leading software powering HyperConverged Infrastructure (HCI) solutions. vSAN is optimized for VMware vSphere® virtual machines and natively integrated with vSphere. Since drives internal to the vSphere hosts are used to create a vSAN datastore, there is no dependency on expensive, difficult-to-manage external shared storage.
The scale-out architecture of VMware vSAN enables powerful non-disruptive expansion. You can non-disruptively expand capacity and performance by adding hosts to a cluster (scale-out) or grow capacity alone by adding disks to a host (scale-up). As application workloads grow organically, this enables performance to be right-sized at each expansion interval, and over time the ratio of storage to compute can be right-sized through vSAN. Even so, inorganic scale events can prove challenging to any architecture. Examples include:
• An application refactor requires significant storage being added for logs.
• A new line of business for analytics may consume excessive compute, potentially stranding storage assets.
• M&A may result in new business units bringing unforeseen storage requirements to a cluster.
• The deployment of network virtualization, powered by VMware NSX, enables the migration and consolidation of legacy, stranded cluster resources.
vSAN HCI Mesh Architecture
VMware vSAN HCI Mesh™ can be implemented with as few as two clusters: a local cluster that provides the storage, and a remote cluster that provides compute for the workload. All clusters are deployed within the same datacenter. The diagram below shows what the high-level architecture might look like for an organization with three clusters. In this example, one cluster provides storage to two remote clusters. It is also possible for a cluster to both receive storage from and share storage with another given cluster. For more technical information and requirements, see the vSAN HCI Mesh Tech Note.
Since the local cluster providing storage must handle external storage traffic as well as intra-cluster replication traffic, network connectivity is critical. 25Gbps networking is strongly recommended for clusters that will share storage with remote clusters.
Hosts at the remote clusters should be in the same datacenter and ideally be directly connected to the cluster hosting the storage. This reduces the opportunity for network contention, minimizes complexity, and improves reliability.
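The bandwidth guidance above can be reasoned about with simple arithmetic: a host in the storage-providing cluster carries front-end traffic from the remote cluster plus intra-cluster replication traffic. The sketch below is a back-of-the-envelope estimate only; the IOPS, block size, and RAID-1 write-doubling factor are illustrative assumptions, not measured values from this testing.

```python
# Back-of-the-envelope NIC load estimate for a host sharing vSAN storage.
# All workload numbers are illustrative assumptions, not measured values.

BYTES_PER_GBPS = 1e9 / 8  # bytes per second carried by one gigabit per second

def nic_load_gbps(read_iops, write_iops, block_bytes, write_copies=2):
    """Estimate NIC traffic (Gbps) for a host serving a remote cluster.

    Front-end traffic: every read and write crosses the inter-cluster link.
    Back-end traffic: with RAID-1 (write_copies=2), each write is also
    replicated to a second host, roughly doubling write bytes on the wire.
    """
    frontend = (read_iops + write_iops) * block_bytes
    backend = write_iops * block_bytes * (write_copies - 1)
    return (frontend + backend) / BYTES_PER_GBPS

# Example: an assumed 40,000 reads/s + 20,000 writes/s at 32 KiB blocks.
load = nic_load_gbps(40_000, 20_000, 32 * 1024)
print(f"Estimated NIC load: {load:.1f} Gbps")
```

Even this modest hypothetical workload lands around 21Gbps, comfortably beyond a 10Gbps link, which is consistent with the 25Gbps recommendation for clusters sharing storage.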
vSAN is built on the concept of fault domains. The use of HCI Mesh expands the fault domain for a virtual machine from its local cluster to potentially two clusters. HCI Mesh does, however, allow vSphere updates of the cluster running a virtual machine's compute to be performed separately from updates of the cluster powering its storage.
Configuring vSAN HCI mesh is simple. The vSphere Web Client is used to mount a remote datastore.
HCI Mesh Tier 1 Application Testing
Today, VMware vSAN is commonly used for traditional tier 1 applications. As part of validating VMware HCI Mesh, several tier 1 applications and workloads, including Kafka, SAP benchmarks, SQL, and synthetic tests, were run to better understand the scaling and performance characteristics of HCI Mesh.
The purpose of this testing was to determine whether existing business-critical applications that run on VMware vSAN today could also run on a remote compute-only cluster. The testing focused specifically on the compute and networking overhead of running these applications on a "local cluster" versus a "remote cluster" consisting of like-for-like compute-only nodes.
Here, "local cluster" refers to the cluster exporting storage, while "remote cluster" refers to the compute-only cluster that simply consumes the remote vSAN datastore. For this testing, RAID 1 was used with no additional data services in order to highlight the impact of the remote connections on performance and the compute load on the hosts.
Each host used Intel Xeon Platinum 8260 CPUs with 24 cores running at 2.4GHz, a popular CPU for highly demanding business-critical applications. 768GB of RAM was configured for each host, providing adequate memory bandwidth for the applications tested.
Like-for-like 100Gbps NICs were used, running at either 10Gbps or 40Gbps depending on the switch they were connected to. This removed any concern that unique NIC firmware/driver behavior might skew performance between the cluster runs.
Two disk groups were used, each with three capacity disks. The workloads were run with working set sizes that should fit within the cache devices, to focus the testing on VMware HCI Mesh's additive impact on performance and latency.
Local and remote testing performance was found to be "on par" when sufficient networking bandwidth was deployed.
Running the Kafka producer test resulted in a benefit for the local cluster of less than 5%, even in the heavy cluster test.
Running the Kafka consumer test resulted in effectively the same performance.
Running HCI Mesh over a constrained 10Gbps network resulted in a 10-18% net benefit for the local cluster.
SAP Benchmarks - Network Performance Is Key
Note: SAP HANA is not currently certified for use with HCI Mesh.
SAP benchmarks were run that focus on log read and write performance. Observations included:
- Large throughput reads did not benefit significantly from the client cache on the remote hosts.
- 10Gbps networking proved to be a bottleneck. The local cluster saw net benefits as high as 30-40%.
- 40Gbps networking proved to remove networking as a bottleneck. The local cluster saw net benefits as low as 3-10% on many tests.
- Net latency added for the remote cluster was ~80-100us, consistent with an additional TCP hop.
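The observations above can be put in perspective with a small calculation: the proportional cost of the extra network hop shrinks as baseline storage latency grows. The sketch below uses the midpoint of the observed 80-100us added latency; the baseline latency values are illustrative assumptions, not figures from this testing.

```python
# Relative impact of the extra network hop added by remote access.
# HOP_US is the midpoint of the observed 80-100 us added latency;
# the baseline latencies below are illustrative assumptions.

HOP_US = 90

def remote_overhead_pct(local_latency_us, hop_us=HOP_US):
    """Percentage latency increase seen by the remote cluster."""
    return 100.0 * hop_us / local_latency_us

for local_us in (500, 1000, 2000):  # assumed baseline I/O latencies
    print(f"{local_us} us local -> +{remote_overhead_pct(local_us):.1f}% remote")
```

For sub-millisecond I/O the added hop is a double-digit percentage, while for multi-millisecond I/O it fades into the noise, which matches the finding that well-networked remote clusters trail by only a few percent on many tests.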
Synthetic Testing Observations
Large Block Testing - Networking Is Key
10Gbps networking - The local cluster was found to be faster by 30-50%, due to saturation of the network.
40Gbps networking - The local cluster's lead was reduced to an 8-13% benefit for large block writes, with comparable performance for large block reads.
Storage-light workloads that rely on small block reads can scale effectively with 10Gbps. Write-heavy and large block read workloads can expose 10Gbps networking as a bottleneck for the clusters providing storage, since those hosts must carry both backend replication traffic and front-end storage traffic to the remote compute cluster.
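To see why large-block writes expose the storage network first, consider the bytes a storage-providing host must move per write: the front-end copy arriving from the remote cluster plus a RAID-1 replica sent to a peer host. The sketch below estimates the maximum front-end write throughput one link can sustain; the link speeds come from the testing above, while the traffic-doubling factor is a simplified RAID-1 model and is an assumption.

```python
# Maximum front-end large-block write rate one host NIC can sustain when
# the host carries both remote-cluster traffic and RAID-1 replication.
# traffic_factor=2.0 is a simplified RAID-1 assumption: each front-end
# write byte is matched by roughly one replication byte on the network.

BYTES_PER_GBPS = 1e9 / 8

def max_write_mbs(link_gbps, traffic_factor=2.0):
    """Usable front-end write throughput (MB/s) on one link."""
    return link_gbps * BYTES_PER_GBPS / traffic_factor / 1e6

for gbps in (10, 40):
    print(f"{gbps} Gbps link -> ~{max_write_mbs(gbps):.0f} MB/s of writes")
```

Under this simplified model a 10Gbps link tops out around 625 MB/s of front-end writes, which a handful of large-block streams can saturate, while 40Gbps raises the ceiling roughly fourfold, consistent with the narrowed local-cluster lead observed at 40Gbps.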
Small Block Testing - Could the remote cluster be "faster?"
In small block write tests, local and remote cluster performance was comparable. Synthetic testing showed a slight bias towards the remote cluster.
In testing small block read workloads, the remote cluster was found in some situations to be up to ~20% faster. This was observed in synthetic testing as well as SQL testing.
Why is this? In this case, it was a result of:
- A complete lack of contention with storage data services, on a workload that was otherwise configured and deployed to consume 100% of the host's CPU.
- A workload that was not taxing the networking, or exposing the cluster to cluster network connectivity as a bottleneck.
- In these cases, the local cluster hosting the storage was otherwise idle. In effect twice as much compute was being provided to the entire solution.
- Read workloads that can fit within the DRAM-based client cache will not need to be read from the local cluster providing storage.
Performance for applications running in the remote cluster was largely comparable to the local cluster when compute was the primary bottleneck. For workloads that fit within the storage network's ability to deliver, there is no significant advantage to placing the workload locally or remotely.
VMware Recommends - Deploy 25/40/100Gbps networking (or faster) for clusters needing to deliver high performance. Either LACP or larger interfaces are particularly important for the cluster providing storage to the remote clusters.
Application Testing Conclusions
In general, for workloads that do not put hosts under extreme CPU load or push the limits of the storage network, latency and transaction rates were found to be comparable (within 2-15%) between the local and remote clusters. Additional VMware CPU efficiency features (DRS) and workload tuning (designing around NUMA node placement, etc.) are likely to have a greater influence on application performance than the choice of local or remote placement for these non-outlier virtual machines. Simple decisions such as deploying higher-clock, more core-rich CPUs and 25/40/100Gbps NICs and switching will likely outweigh any overhead of this feature itself.
For best practices for Tier 1 applications check out the specific application VMware vSAN solution briefs and reference architectures.