Performance Improvements in vSAN 6.6
Executive Summary
VMware vSAN 6.6 delivers a new level of performance and consistency of VMs running in VMware's integrated HyperConverged Infrastructure (HCI) platform. Driving additional performance entirely through a software update, with no additional costs in hardware or software, demonstrates the power of a software defined data center. The test results shared in this document show the ability for vSAN 6.6 to deliver a 50%, or greater, performance improvement under a variety of workload conditions, using common data services such as deduplication, compression, and data integrity features such as software checksum.
vSAN 6.6 not only demonstrates improved IOPS and throughput with decreased latency as compared to vSAN 6.5, but it does so with more consistent results. Lower and more predictable latencies are what application owners expect in delivering solutions and addressing their business objectives. Faster, more efficient delivery of storage in vSAN allows organizations to run more workloads per host while still meeting performance expectations.
Introduction
vSAN 6.6 introduced several enhancements that aim to improve the efficiency and performance of vSAN. Some enhancements focused on data management. These optimizations improved the efficiency of maintaining the health and balance of data on vSAN, and are detailed in the "VMware vSAN 6.6 - Intelligent Rebuilds" Technical Note.
This document focuses on the performance improvements as a result of enhancements that allow vSAN to deliver improved storage I/O performance when using the very same underlying hardware running vSAN 6.5. These improvements are not limited to just I/O activity related to read and write requests from the VM, but also data management activity such as resynchronization and rebuilds.
vSAN Testing Configuration and Conditions
This section details the hardware and software settings used when running these tests in a controlled environment.
Hardware and vSAN Configurations
The objective of the testing described is to observe the comparative advantages of vSAN 6.6 over vSAN 6.5. The test methods used for evaluation consist of a series of micro-benchmarks that evaluate the performance improvements of data services (e.g. checksum), space efficiency features (Deduplication & Compression), and failure tolerance methods (FTM), both independently and together. The battery of tests was run against a large working set of data and a small working set of data. The combination of these conditions provides the cross section necessary to evaluate the improvements in performance in vSAN 6.6.
The test conditions for all comparisons between vSAN 6.5 and vSAN 6.6 remained the same, and are as follows:
Hardware
- Physical Hosts: SuperMicro Sys-2028u-TNRT+ with Intel Haswell-EP E5-2670 v3 processors
- Host count: Four hosts comprising one vSAN cluster
- Disk groups: Each host has two disk groups. Each disk group has a single 400GB Intel P3700 NVMe flash device and three 800GB Intel S3500 SATA SSD flash devices.
Software
- Hypervisor versions tested: vSphere 6.5 versus vSphere 6.5 EP2 (aka vSAN 6.6). In order to sufficiently accommodate the larger working sets, the vSAN 6.5 configuration in this test used adjustments similar to the manual settings described in “VMware vSAN 6.0 Performance: Scalability and Best Practices.”
- Worker VMs: Eight VMs were used on each host, with one virtual disk per VM
- Data placement: Working set of data is split across the virtual disks on each host
- Data uniqueness: In any testing involving deduplication and compression, the workload buffer is configured for 0% of the data to be deduped, or compressed.
Working sets tested
Two sizes were used for working sets on each test scenario. The large and small working set sizes attempt to simulate two distinct conditions commonly found with production workloads. Large working sets reflect a considerable amount of active, or hot data relative to the size of any caching or buffering tiers. Small working sets reflect a smaller amount of hot data relative to the size of any caching or buffering tiers.
- Large working set configuration: This configuration consists of a working set size of 1680GB per host, and occupies about 60% of the total storage capacity of the vSAN cluster. With two disk groups per host, the 400GB NVMe drive in each disk group does not allow for the working set to fit in the 800GB buffer space per host. This creates a large, sustained amount of destaging activity after the test is ramped up.
- Small working set configuration: This configuration consists of a working set size of 200GB per host for random I/Os, and 400GB per host for sequential I/Os. This size allows the working set to fit within the capacity of the two 400GB NVMe buffering devices on each host. Just as with the large working set, the test data collected is all based on steady state numbers.
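The relationship between working set size and buffer capacity in the two configurations can be checked with simple arithmetic. The sketch below uses only the figures stated above (two disk groups per host, one 400GB buffer device per disk group, and the per-host working set sizes). The function and variable names are illustrative, and the calculation assumes the full raw capacity of each buffer device is usable.

```python
# Figures taken from the test configuration described above.
DISK_GROUPS_PER_HOST = 2
BUFFER_DEVICE_GB = 400  # one NVMe buffer device per disk group


def buffer_capacity_gb(disk_groups=DISK_GROUPS_PER_HOST, device_gb=BUFFER_DEVICE_GB):
    """Raw buffer capacity per host, in GB (assumes all of it is usable)."""
    return disk_groups * device_gb


def fits_in_buffer(working_set_gb):
    """True when the per-host working set fits in the per-host buffer."""
    return working_set_gb <= buffer_capacity_gb()


# Per-host working set sizes from the two test configurations.
working_sets_gb = {
    "large": 1680,
    "small (random I/O)": 200,
    "small (sequential I/O)": 400,
}

for name, size_gb in working_sets_gb.items():
    ratio = size_gb / buffer_capacity_gb()
    print(f"{name}: {size_gb} GB, {ratio:.2f}x buffer, fits={fits_in_buffer(size_gb)}")
```

The large working set is roughly twice the per-host buffer space, which is why it produces sustained destaging, while both small working sets stay within it.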
The definitions of “large” and “small” working set sizes are for the purposes of these test scenarios only. In production environments, determining working set sizes and their respective cache coherency can be challenging. For this reason, the primary focus of the test results will be on the large working set tests, which will generate the most conservative test results.
Data Services and protection schemes tested
vSAN is an object based storage system. Unlike traditional storage systems, this allows data services such as software checksum, or a failure tolerance method (FTM) to be assigned as a storage policy on a per VM, or VMDK basis. Each test case was run under four distinct configurations. Table 1 shows the combination of testing used against vSAN 6.5 and vSAN 6.6. All testing used the same hardware, and the same test methods for each scenario.
Table 1. Conditions tested
Other combinations of data services and protection schemes exist, but this test was limited to the scenarios listed in Table 1.
Access patterns tested
Each test condition included the following I/O patterns.
- Random Read
- Random Write
- Mixed random Read/Write (70/30%)
Production workloads will often have a mix of these patterns throughout the course of a given time period. In most cases, the aggregation of multiple workloads across a collection of hosts will blend discrete patterns into a mix of random reads and random writes. This is the primary reason why mixed, random I/O is the most prevalent pattern in production, virtualized environments, and is the area of emphasis in the analysis of the test results.
Measurements
Performance testing measures data activity in the form of I/O commands successfully completed per second (IOPS). The IOPS reported is the sum of the IOPS generated by all worker VMs. Latency is the time needed to complete a read or a write operation, expressed in microseconds (us) or milliseconds (ms), whichever is more readable. Latency data reported is an average. Throughput, which is the amount of payload transmitted per second, is omitted for clarity. All random I/O tests used an I/O size of 4K. Test results provided in this document are reported as a percentage of change when comparing vSAN 6.5 to vSAN 6.6.
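As a minimal sketch of how such figures could be derived, the following sums per-VM IOPS into a cluster total and expresses the difference between two runs as a percentage of change. The function names and the per-VM sample values are hypothetical and are not taken from the test data.

```python
def aggregate_iops(per_vm_iops):
    """Total IOPS as the sum of the IOPS generated by each worker VM."""
    return sum(per_vm_iops)


def percent_change(baseline, new):
    """Change from baseline to new, as a percentage of the baseline."""
    return (new - baseline) / baseline * 100.0


# Hypothetical per-VM IOPS samples, for illustration only.
baseline_vms = [1000, 1100, 900, 1000]   # stands in for a vSAN 6.5 run
upgraded_vms = [1250, 1375, 1125, 1250]  # stands in for a vSAN 6.6 run

baseline_total = aggregate_iops(baseline_vms)
upgraded_total = aggregate_iops(upgraded_vms)

print(f"baseline total: {baseline_total} IOPS")
print(f"upgraded total: {upgraded_total} IOPS")
print(f"improvement: {percent_change(baseline_total, upgraded_total):.1f}%")
```

Reporting a percentage rather than absolute numbers keeps the comparison independent of the specific hardware used in the test.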
The test results in this document refer to data collected from an all-flash cluster. The performance improvements included in vSAN 6.6 will apply to both hybrid, and all-flash based vSAN clusters, but the degree of improvement will be different due to caching considerations, as well as the physical properties of spinning disks. Deduplication and Compression is a feature exclusive to all-flash based configurations and enterprise based vSAN licensing.
Performance Comparisons and Findings
This section provides the test results in a number of different ways to demonstrate the performance differences between vSAN 6.5 and vSAN 6.6.
Large Working Set
Tests using a large working set are designed to simulate workloads that have a large amount of data that is active, thereby flushing out any recently accessed data preserved through caching, or buffering to a faster tier of storage. Characteristics of a large working set typically involve large amounts of sustained writes, with read requests for data that is well outside of the capacity of any buffering and caching tier.
Large amounts of sustained writes, which contribute to generating a large working set, generally represent the lowest performance results for any storage system. Testing large working sets helps establish a minimum level of performance and consistency.
Overview of Improvement
The following test results summarize the changes in effective performance under the conditions outlined in the “vSAN Testing Configuration and Conditions” section in this document. Test results of IOPS and latency against multiple I/O patterns are measured as a percentage of improvement from vSAN 6.5 to vSAN 6.6. Smaller bars do not indicate less performance, but simply less improvement.
Figure 1 shows the percentage of performance improvement under a variety of different I/O patterns, using a large working set of data. It is easy to see that improvements are greater as more data services are used. The most significant improvements come with all tested data services (RAID-5 Erasure Coding, checksum, Dedup & Compression) enabled. All random patterned I/O (random read, random write, and mixed) demonstrated significant improvement in vSAN 6.6.
The mixed workload using all data services improved by 63.5%. The same workload, but using just checksum and RAID-5, improved by 56.3%.
Figure 1. Improvement in IOPS for vSAN 6.6 over vSAN 6.5 – Large working sets
The reductions in latency shown in Figure 2 generally correspond with IOPS improvements described in Figure 1. When using a large working set with a mixed workload using all data services, latency was reduced by 39.8%. The same test using just RAID-5 and checksum yields a latency reduction of 37.4%. Notable improvements also show up in the random write test, where a latency reduction of 20.2% occurs when using all data services, and a 35.1% reduction when using just RAID-5 and checksum.
Figure 2. Reduction in latency for vSAN 6.6 over vSAN 6.5 – Large working sets
The reductions in latency are the best measurement to understand how real production workloads can benefit from the performance enhancements.
Results of common I/O patterns
For large working sets, the data above shows that the mixed workloads saw the most improvement. Two I/O pattern types will be examined in more detail to better understand the results.
Figure 3 shows that for a 70/30 mixed workload using a large working set, increases in IOPS of between 28.5% and 63.5% were observed when using data services. This allowed vSAN 6.6 to drive more IOPS with all data services than vSAN 6.5 using just RAID-5 and checksum.
Figure 3. Improvement of IOPS for a 70R/30W mixed workload with a large working set
In Figure 4, the latency improvements for those corresponding IOPS improvements are equally impressive, ranging between a 22.5% and 39.8% reduction in latency. This allowed vSAN 6.6 to provide lower latency with all data services (Deduplication & Compression, RAID-5, and checksum) than vSAN 6.5 using just RAID-5 and checksum.
Figure 4. Reduction in latency for a 70R/30W mixed workload with a large working set
Long periods of random writes using a large working set are a challenging I/O pattern to improve upon. With the improvements shown in Figure 5, vSAN 6.6 was able to deliver more IOPS using RAID-5 and checksum than the same test run on vSAN 6.5 using just checksum.
Figure 5. Improvement in IOPS for a random write workload with a large working set
This type of I/O scenario is particularly difficult to improve upon because it is largely dependent on the constraints and characteristics of the physical hardware. The reduction in latency shown in Figure 6 allowed vSAN 6.6 to deliver lower effective latency using RAID-5 and checksum than the same test run on vSAN 6.5 using just checksum.
Figure 6. Reduction in latency for a random write workload with a large working set
While large amounts of random write operations sustained across a long period of time can be difficult for any storage system, it is important to recognize the effective improvements made in reducing latency, especially when multiple data services are used.
Figure 5 and Figure 6 show that for tests running only checksum, the improvement was less significant than the improvement with other data services. The result in this specific test case is due to RAID-1 (mirroring) being constrained by the limit of components the I/Os are written to. In the “checksum” test, RAID-1 with checksum is written to two mirrored components. Writing data to two components triggered congestion, which reduced the level of improvement. In the “checksum with RAID-5” test, the data was striped across four components (3 data + 1 parity). Writing the data across four components eliminated the component congestion, and demonstrated a significant increase in IOPS, and reduction in latency.
Summary of Analysis
Several observations can be taken from the data collected in these tests against large working sets.
- All profiles of random I/O generation demonstrated significant improvement in performance.
- In the case of mixed I/O patterns, vSAN 6.6 delivered over a 60% increase in IOPS and nearly a 40% reduction in latency when comparing the same test between vSAN 6.5 and vSAN 6.6.
- vSAN 6.6 is often able to drive higher performance (more I/O, less latency), using a broader set of data services than vSAN 6.5 using a reduced set of data services.
- vSAN 6.6 demonstrates a consistent reduction of impact on performance with the use of data services and protection policies.
Areas of benefit
Mixed workload patterns are the most common I/O profile found in virtualized environments, and are where vSAN 6.6 reports some of the most significant performance improvements. For applications that also have large working sets, the following are some examples of environments and applications that may see the most benefit.
- Large batch processes from databases, and structured data sets. ERP systems often have this type of activity.
- Transactional applications. These are applications that have a serialized I/O dialog in which the next write is waiting on the previous read. Latency is critical in these types of applications.
- Unstructured data. File servers, and data warehouse environments.
- High density of VMs per physical host. This randomizes I/O patterns, and has a greater potential for contention as a result of bursts from more VMs.
- Latency sensitive applications. These often have strict service level requirements defined in deployment guides.
Small Working Set
Tests using a small working set of data are designed to simulate workloads that have a set of data that fits within a given boundary of allocated caching and buffering of the storage system. Characteristics of this small working set typically involve a blend of recently accessed data, read or written, that remains in the caching tier for subsequent reads. Modest sized working sets are more representative of general-purpose workloads that have relatively frequent, repeating tasks over a period of time, or “duty cycle.”
Despite small working sets not being representative of every workload, including them in a battery of tests serves a very specific purpose when testing software optimizations like those made in vSAN 6.6. In many cases, test results from smaller working sets better represent optimizations made to a software stack, as it is less reliant on characteristics and constraints of hardware components used in a specific test. Testing small working sets in conjunction with large working sets also helps demonstrate effective performance when application workloads and workflows are able to take advantage of cached content. Caching is an important aspect of any data center, and exists in compute, network, and storage.
Overview of Improvement
The following test results summarize the changes in effective performance under the conditions outlined in the “vSAN Testing Configuration and Conditions” section in this document. Test results of IOPS and latency against multiple I/O patterns are measured as a percentage of improvement from vSAN 6.5 to vSAN 6.6. Smaller bars do not indicate less performance, but simply less improvement.
Figure 7 illustrates substantial improvements across a variety of I/O profiles when using a smaller working set size. The most significant improvements occur when data services such as checksum, RAID-5, and Dedup & Compression are enabled. Just as with the testing with a large working set, all random patterned I/O (random read, random write, and mixed) demonstrated significant improvement in vSAN 6.6.
The mixed workload using all data services improved by 52.3%. The same workload, but using just checksum and RAID-5, improved by 86%. Particularly interesting is the improvement in performance of random writes. Using all data services, there was a 48% improvement in performance. The same workload, but using just checksum and RAID-5, showed a 74.4% improvement.
Figure 7. Improvement in IOPS for vSAN 6.6 over vSAN 6.5 – Small working sets
The improvements in latency illustrated in Figure 8 show reductions that mirror the IOPS increases described in Figure 7. A mixed workload on a small working set with all data services saw a 35.6% reduction in latency. The same test using just RAID-5 and checksum saw a 47.7% reduction. Latencies on random writes also improved dramatically, with a latency reduction of 33.4% observed when using all data services, and a 44% reduction when using just RAID-5 and checksum.
Note that when no data services are used, there is less opportunity for improvement in data path optimization. An example of this is shown in Figure 8, but can also be found in other test results in this document. This specific test result shows a slight regression for random reads when no data services are used. The actual latency measurements for this specific test were 830us (microseconds) compared to 833us, which equates to a 0.36% increase in latency. These small variances (+/-) can be the result of fluctuating behaviors throughout the entire stack, including all hardware components, and fall well within the range of variability across multiple test runs, even under extremely controlled environments.
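The size of this variance can be verified directly from the two measurements. The sketch below assumes, as the text implies, that 830us is the vSAN 6.5 result and 833us is the vSAN 6.6 result. The function name and sign convention (positive means lower latency) are illustrative choices.

```python
def latency_reduction_pct(baseline_us, new_us):
    """Latency reduction as a percentage of the baseline.

    Positive values mean lower latency; negative values mean a regression.
    """
    return (baseline_us - new_us) / baseline_us * 100.0


# Measurements quoted above; which version produced which value is assumed.
vsan_65_latency_us = 830
vsan_66_latency_us = 833

reduction = latency_reduction_pct(vsan_65_latency_us, vsan_66_latency_us)
print(f"latency reduction: {reduction:.2f}%")  # -0.36%, i.e. a 0.36% increase
```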
Figure 8. Reduction in latency for vSAN 6.6 over vSAN 6.5 – Small working sets
Latency reductions using the smaller working set can be representative of the effective performance improvement that some workloads in a production environment would see.
Results of common I/O patterns
Much like the large working sets, the data above shows that when using smaller working sets, mixed workloads saw the most improvement. Two I/O pattern types will be examined in more detail to better understand the results.
For a 70/30 mixed workload using a smaller working set, we see between a 52.3% and 86% increase in IOPS when using data services. With the dramatic increases shown in Figure 9, this allowed vSAN 6.6 to drive almost as many IOPS when using all data services (Deduplication & Compression, RAID-5, and checksum) as vSAN 6.5 using just checksum.
Figure 9. Improvement in IOPS for a mixed workload with a small working set
The latency reductions for those corresponding IOPS improvements reinforce the performance gains shown in Figure 9. These latency reductions ranged from 35.6% to 47.7%. The amount of reduction in latency shown in Figure 10 allowed vSAN 6.6 to run all data services (Deduplication & Compression, RAID-5, and checksum) at about the same latency as vSAN 6.5 using just checksum.
Figure 10. Reduction in latency for a mixed workload with a small working set
Random writes using a smaller working set expose more of the performance gains built into vSAN 6.6 than the same test using a large working set. As shown in Figure 11, improvements ranged between 25.3% and 75.5% when using data services.
Figure 11. Improvement in IOPS for a random write workload with a small working set
Latency reductions for random writes were significant. When using data services, Figure 12 shows latency reductions ranging from 21.7% to 44%. Not only were the latencies lower in vSAN 6.6, but there is also less variance in latency across the respective data services.
Figure 12. Reduction in latency for a random write workload with a small working set
While large amounts of sustained random write operations can be a challenge for any storage system, the results showing the improvements when testing a small working set of data are most representative of production workloads that have short duty cycles with a modestly sized working set.
Summary of Analysis
Running the same battery of tests using a smaller working set provides a number of interesting observations.
- All profiles of random I/O generation demonstrated significant improvement in performance.
- In cases of mixed I/O patterns, there were increases in IOPS of up to 86%, and reductions of latency of up to 47.7% when comparing the same test between vSAN 6.5 and vSAN 6.6.
- vSAN 6.6 is consistently able to drive higher performance (more I/O, less latency), using a broader set of data services than vSAN 6.5 using a limited set of data services.
- Significant performance improvements on checksum were realized for a larger variety of reads and writes. This is even more visible in a smaller working set.
- vSAN 6.6 demonstrates a consistent reduction of impact on performance with the use of data services and protection policies.
- Testing smaller working set sizes showcases the potential benefits that can occur with workloads that have smaller working sets as a result of an overall smaller footprint of data, or shorter duty cycles.
Areas of benefit
Just as with workloads using large working sets, applications that have a mixed I/O pattern combined with a smaller working set will see significant improvements in vSAN 6.6. Applications and scenarios that would see benefit include, but are not limited to, the following.
- Multi-tier, and scale out applications. These types of application architectures tend to have smaller working sets of data dispersed across the application nodes.
- VDI environments. These environments have a number of different workload profiles depending on what task is being performed.
- Transactional applications. These are applications that have a serialized I/O dialog in which the next write is waiting on the previous read. Latency is critical in these types of applications.
- High density of VMs per physical host. This randomizes I/O patterns, and has a greater potential for contention as a result of bursts from more VMs. The reduction in latency will allow for VMs to deliver a lower, more consistent level of latency.
- Latency sensitive applications. These often have strict service level requirements defined in deployment guides.
Latency Improvements
Delivering minimal latency consistently over long periods of sustained activity is a challenge for any storage architecture. Latency of a VM is the best way to measure if the underlying infrastructure is able to deliver adequate performance to an application. Higher latency translates to a longer time a task takes to be completed. Predictable, low latency over a period of time is a desired result for application owners, but can be difficult for storage systems to deliver.
vSAN 6.6 makes significant improvements in this area. Tests that ran for a duration of 4 hours showed significant improvement in the predictability and consistency of latency in vSAN 6.6. When viewing the range of latencies (low mark to 95th percentile peak), vSAN 6.6 exhibited over an 80% reduction in latency deviation as compared to vSAN 6.5.
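The document does not state exactly how latency deviation was calculated. One plausible reading of "low mark to 95th percentile peak" is the spread between the minimum observed latency and the 95th percentile. The sketch below follows that assumption, using a nearest-rank percentile and hypothetical latency samples. None of the names or values come from the test data.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def latency_spread(samples):
    """Spread from the low mark (minimum) to the 95th percentile."""
    return percentile_nearest_rank(samples, 95) - min(samples)


def spread_reduction_pct(baseline_samples, new_samples):
    """Reduction in latency spread, as a percentage of the baseline spread."""
    base = latency_spread(baseline_samples)
    new = latency_spread(new_samples)
    return (base - new) / base * 100.0


# Hypothetical latency samples in milliseconds, for illustration only.
baseline_ms = [1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0]
upgraded_ms = [1.0, 1.1, 1.1, 1.2, 1.2, 1.3, 1.4, 1.5, 1.8, 2.5]

print(f"baseline spread: {latency_spread(baseline_ms):.1f} ms")
print(f"upgraded spread: {latency_spread(upgraded_ms):.1f} ms")
print(f"spread reduction: {spread_reduction_pct(baseline_ms, upgraded_ms):.1f}%")
```

Using a percentile rather than the absolute maximum keeps a single outlier over a multi-hour run from dominating the measure of consistency.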
Additional Performance Benefits
An increase in IOPS with a reduction in latency can also represent an additional benefit not explicitly measured in this testing. This benefit would show in the form of a reduction of host CPU utilization when comparing steady state, non-synthetic workloads, which are more representative of production environments. Synthetic testing is meant to stress resource usage, and thus, free CPU cycles will be used for processing additional I/O. With steady state workloads, I/Os would be processed more quickly, which reduces the length of time in which host CPU resources are being utilized. Reducing host resource consumption could increase the density of running VMs that could be achieved on a host.
Enhancements behind Performance Improvements
The test comparisons provided in this document showcase the improvements in performance as a result of four specific enhancements.
- Checksum optimizations. Checksum is a technique that provides an additional layer of data integrity for data in flight, or at rest. Checksum is a storage policy setting, applied per object, and is enabled unless explicitly turned off in a policy. Improvements with checksum come from several optimizations. The improvements result in gains in performance on both read and write operations.
- Write buffer destaging optimizations. Destaging is the act of moving recently written data from the write buffer to the capacity tier. Data coming into the buffer can be placed more efficiently, which helps reduce the fill rate and usage of the buffer. More proactive destaging of data helps reduce metadata build up that could impact guest I/O or resync operations. This helps scenarios with large numbers of deletes, which invoke metadata writes. More aggressive destaging can help in write intensive environments, especially when using RAID-5/6, reducing times in which vSAN must experience "Congestions," vSAN's built in mechanism of controlling contention in the storage system.
- Deduplication and Compression improvements. Deduplication and Compression is a cluster-wide feature in vSAN that improves space efficiency, and is applied as data is destaged from the write buffer to the capacity tier. vSAN 6.6 changes the approach in ordering of the data to be destaged. This offers more predictable performance, especially with sequential writes.
- Object management improvements. vSAN is an object based storage system, and manages its duties by the use of an object manager. vSAN 6.6 has tuned the use of memory by the object manager to help reduce the amount of CPU overhead. This optimization reduces heavy context switching of cache and CPU, making delivery of I/O more efficient.
The performance benefits outlined also extend to back-end vSAN management activities such as rebalancing, resyncing, and repairing of objects. These are activities that will happen in any environment, regardless of the workload type.
Summary
vSAN 6.6 provides a substantial improvement in performance over vSAN 6.5, all with the simple click of a software upgrade. Better performance will allow administrators to provide higher, more consistent levels of performance for the applications and services powered by vSAN, and gives environments the ability to absorb increases in performance and efficiency as more workloads are introduced to a vSAN powered environment.
About the Author
The content in this document was assembled using data collected from extensive testing efforts by the VMware Performance Engineering Team. You can read more from the Performance Engineering team on their blog at VMware VROOM!
Pete Koehler is a Sr. Technical Marketing Manager, working in the Storage and Availability Business Unit at VMware, Inc. He specializes in enterprise architectures, data center analytics, software-defined storage, and hyperconverged infrastructures. Pete provides more insight to challenges of the data center at vmpete.com, and VMware’s Virtual Blocks. He can also be found on Twitter at @vmpete.