DataStax Enterprise on VMware vSAN 6.6 All-Flash for Development
This section covers the business case, solution overview, and key results of DataStax Enterprise on VMware vSAN for development.
Modern cloud applications—that create and leverage real-time value and run at epic scale—require a change in data management with an unprecedented transformation to the decade-old way that databases have been designed and operated. Requirements from cloud applications have pushed beyond the boundaries of the Relational Database Management System (RDBMS) and have introduced new data management requirements to handle always-on, globally distributed, and real-time applications.
VMware vSAN™, the market leader in HyperConverged Infrastructure (HCI), enables low cost and high performance next-generation HCI solutions, converges traditional IT infrastructure silos onto industry-standard servers and virtualizes physical infrastructure to help customers easily evolve their infrastructure without risk, improve TCO over traditional resource silos, and scale to tomorrow with support for new hardware, applications, and cloud strategies. The natively integrated VMware infrastructure combines radically simple VMware vSAN storage management, the market-leading VMware vSphere ® Hypervisor, and the VMware vCenter Server ® unified management solution all on the broadest and deepest set of HCI deployment options.
HCI offers a common control plane across nodes that simplifies storage management for users running vSAN. However, the shared vSAN design is not optimized today to maximize the performance of the distributed DataStax application. To address this, we are teaming up with DataStax to design a future solution that combines the management benefits of HCI with the performance benefits of a shared-nothing storage subsystem.
DataStax Enterprise (DSE) is the always-on data management platform industry leader. DSE is the always-on data platform powered by the industry’s best distribution of Apache Cassandra™—and includes search, analytics, developer tooling, and operations management, all in a unified security model. DSE makes it easy to distribute your data across data centers or cloud regions, making your applications always-on, ready to scale, and able to create instant insights and experiences. Your applications are ready for anything—be it enormous growth, handling mixed workloads, or enduring catastrophic failure. With DSE’s unique, fully distributed, masterless architecture, your application scales reliably and effortlessly.
VMware and DataStax have jointly undertaken an extensive technical validation to demonstrate vSAN as a storage platform for globally distributed cloud applications. In the first phase of this effort, DataStax and VMware jointly advocate this solution for test and development environments while we are working towards a future vSAN offering.
This joint solution is a showcase of using VMware vSAN as a platform for deploying DSE in a vSphere environment. All storage management moves into a single software stack, thus taking advantage of the security, operational simplicity, and cost-effectiveness of vSAN in test and development environments.
Workloads can be easily migrated from bare-metal configurations to a modern, dynamic, and consolidated hyperconverged infrastructure based on vSAN.
vSAN is natively integrated with vSphere, vSphere and vSAN helps to provide smarter solutions to reduce the design and operational burden of a data center.
The reference architecture:
- Provides the solution architecture for deploying DSE in a vSAN cluster for development.
- Measures performance when running DSE in a vSAN cluster, to the extent of the testing and cluster size described.
- Evaluates the impact of different parameter settings in the performance testing.
- Identified steps required to ensure resiliency and availability against various failures.
- Provides best practice guides.
This section provides the purpose, scope, and audience of this document.
This reference architecture illustrates how DataStax Enterprise can be run in a vSAN cluster within development environments and provides baseline performance results based on the testing scenarios described within this paper.
The reference architecture covers the following testing scenarios:
- Baseline testing
- 90% write and 10% read performance testing
- 50% write and 50% read performance testing
- Resiliency and availability testing
This reference architecture is intended for DataStax Enterprise administrators and storage architects involved in planning, designing, or administering of DSE on vSAN for development purposes.
This section provides an overview of the technologies used in this solution: • VMware vSphere 6.5 • VMware vSAN 6.6 • DataStax Enterprise
VMware vSphere 6.5
VMware vSphere 6.5 is the next-generation infrastructure for next-generation applications. It provides a powerful, flexible, and secure foundation for business agility that accelerates the digital transformation to cloud computing and promotes success in the digital economy.
vSphere 6.5 supports both existing and next-generation applications through its:
- Simplified customer experience for automation and management at scale
- Comprehensive built-in security for protecting data, infrastructure, and access
- Universal application platform for running any application anywhere
With vSphere 6.5, customers can run, manage, connect, and secure their applications in a common operating environment, across clouds and devices.
VMware vSAN 6.6
VMware vSAN is the industry-leading software powering VMware’s software defined storage and HCI solution. vSAN helps customers evolve their data center without risk, control IT costs and scale to tomorrow’s business needs. vSAN, native to the market-leading hypervisor, delivers flash-optimized, secure storage for all of your critical vSphere workloads. vSAN is built on industry-standard x86 servers and components that help lower TCO in comparison to traditional storage. It delivers the agility to easily scale IT and offers the industry’s first native HCI encryption.
Figure 1. HCI Evolution
DataStax Enterprise, powered by the best distribution of Apache Cassandra, seamlessly integrates your code, allowing applications to utilize a breadth of techniques to produce a mobile app or online applications. DSE makes it easy to distribute your data across data centers or cloud regions, making your applications always-on, ready to scale, and able to create instant insight and experiences. DataStax Enterprise provides flexibility to deploy on any on-premise, cloud infrastructure, or hybrid cloud, plus the ability to use multiple operational workloads, such as analytics and search, without any operational performance degradation.
Check out more information about DataStax Enterprise at https://www.datastax.com/ .
The following terms and parameters are introduced in this reference architecture:
- DataStax OpsCenter is a visual management and monitoring solution for DataStax Enterprise.
A sorted string table (SSTable) is an immutable data file to which Cassandra writes memtables periodically. SSTables are stored on disk sequentially and maintained for each Cassandra table.
- Replication factor (RF)
The total number of replicas across the DSE cluster is referred to as the replication factor (RF). A replication factor of 1 means that there is only one copy of each row in the cluster. If the node containing the row goes down, the row cannot be retrieved. A replication factor of 3 means three copies of each row, where each copy is on a different DSE node. All replicas are equally important, so there is no primary.
- Consistency level (CL)
A setting that defines a successful write or read by the number of cluster replicas that acknowledge the write or respond to the read request, respectively.
- For consistency level LOCAL_ONE, a write must be sent to, and successfully acknowledged by, at least one replica node in the local data center. By default, a read repair runs in the background to make the other replicas consistent. It provides the highest availability of all the levels if application can tolerate a comparatively high probability of stale data being read. The replicas contacted for reads may not always have the most recent write.
- For consistency level QUORUM, a write must be written to the commit log and memtable on a quorum of replica nodes across all data centers. A read returns the record after a quorum of replicas from all data centers has responded.
The process of consolidating SSTables , discarding tombstones, and regenerating the SSTable index. Compaction does not modify the existing SSTables (SSTables are immutable) and only creates a new SSTable from the existing ones. When a new SSTable is created, the older ones are marked for deletion. Thus, the used space is temporarily higher during compaction. The amount of space overhead due to compaction depends on the compaction strategy used. This space overhead needs to be accounted for during the planning process.
This section introduces the resources and configurations for the solution including an architecture diagram, hardware and software resources, and other relevant configurations.
We created an 8-node vSphere and vSAN all-flash cluster and deployed a 16-node DSE cluster, and we deployed Dastastax OpsCenter and 8 DSE stress client VMs running Cassandra-stress on another hybrid vSAN cluster in the same VM network. Typically, one stress client can generate workload to saturate two DSE nodes.
Figure 2. Solution Setup
To ensure continued data protection and availability during planned or unplanned down time, DataStax recommends a minimum of four nodes for the vSAN cluster. For testing, an 8-node vSAN cluster was used with 16 DSE nodes to validate that the cluster will function as expected for typical workloads and scale of similarly sized test and development environments. Figure 3 shows we deployed two DSE VMs on each ESXi host.
In our solution validation, we used NVMe as a cache  tier and configured two disk groups per node. Each disk group had one cache NVMe and four capacity SSDs. vSAN storage policy PFTT (Primary Number of Failures to Tolerate) was set to 1 and software checksum was disabled. You can customize the storage policy for different DSE applications to satisfy performance, resource commitment, checksum protection, and quality of service requirements.
We state “Development Use” throughout this document. DataStax is designed to run best on a shared-nothing architecture. DataStax and VMware are working together on vSAN enhancements with a design focused on cloud applications that require data to be contextual, always on, real time, distributed, and scalable. This near-term future offering will be the recommended production approach. We jointly advocate that users get started now with the suggested architecture below and adopt the future architecture when released.
Table 1 shows the hardware resources used in this solution.
Each ESXi Server in the vSAN cluster has the following configuration.
Table 1. Hardware Resources per ESXi Server
|CPU and cores||2 sockets, 12 cores each of 2.3GHz with hyper-threading enabled|
|Network adapter||2 x 10Gb NIC|
|Storage adapter||SAS Controller Dell LSI PERC H730 Mini|
|Disks||Cache-layer SSD: 2 x 1.8TB Intel SSD DC P3700 NVMe (controller included)
Capacity-layer SSD: 8 x 800GB 2.5-inch Enterprise Performance SATA SSD S3710.
Table 2 shows the software resources used in this solution.
Table 2. Software Resources
|VMware vCenter Server and ESXi||6.5.0d
(vSAN 6.6 is included)
|ESXi Cluster to host virtual machines and provide vSAN Cluster. VMware vCenter Server provides a centralized platform for managing VMware vSphere environments.|
|VMware vSAN||6.6||Solution for hyperconverged infrastructure.|
|Ubuntu||14.04||Ubuntu 14.04 is used as the guest operating system of all the virtual machines.|
|DSE||5.1||DataStax Enterprise 5.1.|
|Cassandra-stress||3.10||The Cassandra-stress tool is a Java-based stress testing utility for basic benchmarking and load testing a Cassandra cluster.|
VM and Database Configuration
We used the virtual machine settings as the base configuration as shown in Table 3. We configured DSE data directories on the VM data disk and the commit log directory on the log disk.
The rule is that the aggregated CPU cores and memory should not exceed the physical resources to avoid contention. When calculating physical resources, we should count the physical cores before hyper-threading is taken into consideration.
Table 3. DSE VM and Database Configuration
|Disks||OS disk: 40GB
Data disk: 1,150GB
Log disk: 50GB
The configuration of stress client VMs is relatively small compared to that of the DSE VM due to much less workload on client VMs.
Table 4. Configuration of Stress Client VMs
|Disks||OS disk: 40GB|
- We used paravirtual SCSI controller and made each virtual disk use a separate controller for better performance.
- We configured a separate storage cluster on the hybrid management cluster to avoid any performance impact on the tested DSE cluster.
- We followed the recommended production setting for optimizing DSE installation on Linux.
Test Tool and Workload
Cassandra-stress: The Cassandra-stress tool is a Java-based stress testing utility for basic benchmarking and load testing a Cassandra cluster. The eight client nodes in our solution all ran Cassandra-stress. The four types of workload used were:
- 100% write
- 100% read
- 90% write and 10% read
- 50% write and 50% read
When started without a YAML file, Cassandra-stress creates a keyspace, keyspace1, tables, and standard1. These elements are automatically created the first time you run a stress test and are reused on subsequent runs.
It also supports a YAML-based profile for testing reads, writes, and mixed workloads, plus the ability to specify schemas with various compaction strategies, cache settings, and types.
Writes in Cassandra are durable. All writes to a replica node are recorded both in memory and in a commit log on disk before they are acknowledged as a success. If a crash or server failure occurs before the memtables are flushed to disk, the commit log is replayed on restart to recover any lost writes.
For testing with a preloaded data set as a starting point: when we prepare the data set, we set commitlog_sync to "periodic" where writes may be acknowledged immediately and the commit log is simply synced every 1,000 milliseconds (ms) for this stage only to ingest as fast as possible.
We configured commitlog_sync in batch mode in the cassandra.yaml file for testing. DSE will not acknowledge writes until the commit log has been fsynced to the disk, this is the requirement for most real applications.
In this section, we present the test methodologies and results used to validate this solution.
We conducted an extensive validation to demonstrate vSAN as a platform for globally distributed cloud applications for test and development environments.
Our goal was to select workloads that are typical of today’s modern applications, we tested standard mixed workload without YAML file in the baseline performance testing and tested a user-defined typical IoT workload with YAML file as an example for a user to conduct testing based on their own data modeling.
We used large data sets to simulate real case so the data set of each node should obviously exceed the RAM capacity, and we loaded base data set of at least 500GB per node. In addition, we used OpsCenter to monitor the vitals of the DSE cluster continuously.
To generate the initial data set, we ran 100% write on all the eight clients concurrently until all the running was completed and all the compaction processes were completed. After the data was loaded, we took snapshots.
We loaded the snapshot as the preloaded data set before each workload testing. This solution included the following tests:
- Baseline testing: to tune the VM and DSE configuration setting to better utilize the hardware capacity.
- Performance testing: to validate the cluster functions as expected for typical workloads for performance consistency and predictable latency in the test and development environments.
- Resiliency and availability testing: to verify vSAN’s storage-layer resiliency features combined with DSE’s peer-to-peer design enable this solution to meet the data availability requirements of even the most demanding applications under predictable failure scenarios and the impact on application performance is limited.
Result Collection and Aggregation
- Cassandra-stress provides its results in both structured text format and html file. The throughput in operations per second (ops/sec), operation specifics in latencies and other metrics are recorded for each testing on each client instance.
- The test results are aggregated as each client instance produces its own output. The total throughput ops/sec is the sum-up of ops/sec in all clients and latencies are the average values of all client instances.
- For determining request percentiles, we selected the 50 th , 95 th , and 99 th percentile latency. The 99th percentile is the most commonly used one. We lined these three percentile points to show the latency curves. The flatter the curve is, the better the result will be. All latency charts are displayed in the same way throughout the paper.
- If there are any abnormal results, we use vSAN Performance Service to monitor each level of performance in a top-down approach of the vSAN stack.
In the baseline testing, we used Cassandra-stress to test with default cassandra.yaml settings and then changed the parameters to tune the performance.
The baseline testing includes the following tests:
- Evaluate the impact of VM settings
- Tuning Java resources
- Tuning Cassandra.yaml parameters
Evaluate the Impact of VM Settings
In this test, write operations appeared to be CPU-bound. DSE is highly concurrent and uses as many CPU cores as available. The maximum of DSE VM CPU depends on the number of physical CPU cores.
We verified the more CPUs of the DSE nodes, the more throughputs for loading data. We set the DSE VM memory to 64GB and tested loading 1,000,000 records RF=3 CL= LOCAL_ONE, TH  =256 with 8vCPU, 12vCPU, and 24vCPU.
Figure 4. Load Testing with Different vCPU Performance
From the results shown in Figure 4, the IOPS increased when DSE VM had more vCPUs, the write IOPS of the 8vCPU configuration was 36,487; the write IOPS of the 12vCPU configuration increased to 50,520, which was about a 40% increase; the write IOPS increased to 60,497 of the 24vCPU configuration. Actually 24vCPU can hold more concurrency. The IOPS increased with the increase of threads. We used the same number of threads for simplicity.
From the latency curve in Figure 5, we observed that the curve of the 24vCPU configuration was the flattest one while the curve of 8vCPU configuration was the steepest.
Figure 5. Load Testing with Different vCPU Latency
Tuning Java Resources
DSE runs on a Java virtual machine (JVM). Insufficient garbage collection (GC) throughput can introduce GC pauses that can lead to a high latency. To achieve the best performance, it is important to select the right garbage collector and heap size settings.
In our solution, we used DSE version 5.1 that uses the G1 garbage collector by default, G1 performs better than CMS for larger heaps. G1 is easier to configure and is self-tuning and only needs to set MAX_HEAP_SIZE to determine the heap size of the DSE JVM, DSE also largely depends on the OS file cache for read performance. Hence, choosing an optimum JVM heap size and leaving enough memory for OS file cache is important. Heap size is usually between one quarter and half of the system memory. We followed the Set the heap size for optional Java garbage collection in DataStax Enterprise guide to determine the optimum heap size.
The HEAP_NEW_SIZE parameter is the size of the young generation in Java. The rule is to set this value at 100 MB per vCPU if using CMS (It is not recommended to change these settings for G1).
From the Cassandra-stress workload testing, we can easily observe the impact of various heap size. When the memory size of the DSE VM was 128GB, we compared the heap size of 48GB and 64GB (half of the system memory).
We ran a write workload on two stress client nodes to load data on a 4-node DSE cluster till we reached 500GB of uncompressed SSTable data. We loaded 3,000 million records with RF=3, and the data density was 533GB/node.
Figure 6. 100% Read with Different Heap Size Performance
We performed Cassandra-stress read testing with TH=96 and CL=QUORUM on the preloaded data set, read IOPS was 32,637 when the heap size was 64GB; read IOPS was 58,219 when the heap size was 48GB, which was nearly an 80% increase. And the mean read latency decreased from 2.9ms to 1.6ms.
Figure 7. 100% Read with Different Heap Size Latency
From the latency curve as shown in Figure 7, the curve of the heap size=48GB setting was flatter, which meant more consistent latencies for various requests.
Tuning Cassandra.yaml Parameters
We used the default value for all the parameters except the following ones:
- concurrent_reads (Default: 32): For workloads with data more than memory size, the bottleneck is reads fetching data from disk. Setting to (16 × number_of_drives ) allows operations to queue low enough in the stack so that the OS and drives can record them. You can increase the value for systems with a fast I/O storage. In our testing, we set our concurrent_reads to 64 to get better performance.
- concurrent_writes (Default: 32): Writes in DSE are rarely I/O bound, so the ideal number of concurrent writes depends on the number of CPU cores in your system. The general recommended value is 8 × number_of_cpu_cores .
- concurrent_compactors , defaults to the smaller of (number of disks, number of cores), with a minimum of 2 and a maximum of 8. If data directories are backed by SSD, the recommendation is to increase to the number of cores. In our testing, we set the it to 8.
- compaction_throughput_mb_per_sec (Default:16 ): In general, setting this to 16 to 32 times the rate of inserting data is more than sufficient to keep the SSTable count down. In our testing, we set it to 32. Concurrent_compactors and compaction_througput_mb can be dynamically changed by command in DSE running time.
Note: The key_cache_size_in_mb (key cache) will not prevent multiple SSTables from being accessed for cold data. Given that the stress workload is effectively random access, the key cache will not likely have more impact than buffer-cache, so we leave it as default value. We do not recommend that users tune the key cache, but when they do, it is highly workload specific.
In our solution validation, we also keep the row_cache_size_in_mb (row cache) as default value 0 since our testing is random access and the data set is much larger than memory.
 TH stands for the thread number, we use TH for short in this reference architecture.
Performance Testing Procedure
Write the Base Data Set
This set of stress tests is run on each of the 8 stress client VMs. This will parallelize the data generation across the cluster. The YAML file on each of the 8 stress client VMs should list an increasing range of sequence numbers for the partition key (below is an example).
Node1 columnspec: ‐ name: machine_id population: seq(1..100000000) Node2 columnspec: ‐ name: machine_id population: seq(100000000..200000000) Node3 columnspec: ‐ name: machine_id population: seq(200000000..300000000) … Node8 columnspec: ‐ name: machine_id population: seq(700000000..800000000)
After the 8 YAML files are configured and in place on each stress client, we ran Cassandra-stress write test on each stress client node pointing it to each of the 16 DSE VM’s. Then we ran nodetool status to see the amount of data per node. We repeated this process, continuing to increment the seq distributions on each stress client, until 500GB per node is reached (note this may take several iterations; for example, 3 iterations).
After each node holds over 500GB data, replace seq with uniform under the column_spec section of YAML file on all of the nodes. The purpose of this is that cassandra_stress randomly inserts and reads in subsequent tests rather than in sequence.
Node1 columnspec: ‐ name: machine_id population: uniform(1..300000000) Node2 columnspec: ‐ name: machine_id population: uniform(30000000..600000000) Node3 columnspec: ‐ name: machine_id population: uniform(600000000..900000000) ... Node8 columnspec: ‐ name: machine_id population: uniform(210000000..2400000000)
Get the Peak Performance
We will start by having the 8 stress client nodes with each running a 90% write and 10% read workload to get a measure of how we stress the cluster and to work towards maximizing the peak performance of the cluster, and then backing off from that to a workload we would advise customers to run in a typical scenario.
Each stress client node should run the stress test starting with 300 threads, then tune the number of threads up or down in increments of 50 until you see a 99% latency hover between 50ms and 100ms. If you get 450 threads without pushing the latency to 50-100ms, you will need to add stress client VMs to generate additional load.
Test at Expected Load
We ran the stress tests from the 8 stress client VMs and reduced the thread count on each one by 30%.
We conducted the performance testing with various parameters to evaluate the impacts:
- 90% write and 10% read workload: we chose this for a typical Internet of Things (IoT) workload.
- 50% write and 50% read workload: the mixed 1:1 write/read is a generic workload to evaluate the storage performance.
90% Write and 10% Read Performance Testing Results
In the performance testing, Cassandra-stress randomly inserts and reads, running a 90% write and 10% read workload on the preloaded data set (the average data density is 540GB/node).
In the testing, we adjusted the number of threads. When it was 200, the 99th percentile latency was 65.9ms, which was between 50ms and 100ms. Table 5 shows the performance data and it is considered to be the peak performance.
Table 5. 90% Write and 10% Read Peak Performance
99 th Percentile
Test at Expected Load
Then we ran the stress testing from the 8 stress client VMs, and reduced the thread count on each one by 30%, which was 140 in our testing.
Evaluate the Impact of Testing Duration
We ran the “expected load” testing for 1 hour twice and then ran it for 24 hours for comparison with RF=3 and TH=140 on each stress test client:
- The write IOPS was 124,528 in the 1-hour round1 testing, 120,229 in the round2 testing, and 124,725 in the 24-hour testing.
- The read IOPS was 13,836 in the 1-hour round1 testing, 13,359 in the round2 testing, and 13,858 in the 24-hour testing, the difference was less than 4%, which proved the consistency of throughput.
- The mean read latency was 2.16ms in the 1-hour round1 testing, 2.39ms in the 1-hour round2 testing, and 2.66ms in the 24-hour testing. The mean write latency was 8.73ms in the 1-hour round1 testing, 9.01ms in the round2 testing, and 8.65ms in the 24-hour testing. The latencies were within a reasonable variation, which validated a consistent performance.
Figure 8. 90% Write and 10% Read TH=140 Performance with Different Duration
Figure 9 illustrates the latency curve in 90% write and 10% read performance testing. Median read latency was less than 2ms while median write latency was less than 8ms. The latency number of the 95th percentile and the 99th percentile was reasonable. And from the curve, the results of 1-hour run and 24-hour run were consistent.
Figure 9. 90% Write and 10% Read TH=140 Latency
From the performance results, the performance in long duration testing was consistent with the performance of 1-hour run.
Evaluate the Impact of Base Data Set
In our solution validation, we simulated the real user scenario: data sets grow and the base data set is much larger than memory and read access is random so that it will hit the disk. The VM memory of each DSE node was 64GB. We ran the “expected load” test on 180GB data density and compared the results with 540GB density to evaluate the impacts of data set size. Cassandra-stress randomly inserts and reads, running a 90% write and 10% read workload on the preloaded data set (the average data density was 180GB/node).
Similar to the testing on 540GB data density, we adjusted the number of threads. When it was 250, the 99th percentile latency was 60.2ms, which was between 50ms and 100ms. Table 6 shows the performance data, and it was considered to be the peak performance. The total IOPS was 178,647, about 25% increase in throughput comparing to the total IOPS of 142,727 with 540GB density.
Table 6. 90% Write and 10% Read Peak Performance with 180GB Density Base Data Set
95 th Percentile
99 th Percentile
Then we reduced the threads by 30% to 175 and ran the expected load test with a density of 180GB. The write IOPS was 144,707, about a 16% increase comparing to the result of a density of 540GB and TH=140, read IOPS was 16,078 compared to 13,858 with a density 540GB and TH=140. We reduced the threads to 140, write IOPS was 129,716, which was a slight increase compared with 124,528 with density 540GB, and read IOPS increased slightly from 13,836 to 14,417. The mean write latency and mean read latency were within the reasonable variation.
Figure 10. 90% Write and 10% Read with Different Data Set Performance
From the latency curve as shown in Figure 11, the read latency median was less than 1.5ms, the write latency median was less than 8ms, and the latency values of the 95th percentile and the 99th percentile were acceptable.
Figure 11. 90% Write and 10% Read with Different Data Set Latency
From the above analysis, when the data set is much larger than the RAM size, smaller data set has better concurrency. It served more requests and IOPS was higher in some extent, but much less than the ratio of data set size. From the latency perspective, as long as we set the thread number as an expected load, the latency curve is similar, although the latency value will increase slightly.
Evaluate the Impact of the Consistency Level Setting
Consistency levels in DSE can be configured to manage availability versus data accuracy. Configure consistency for a session or per individual read or write operation.
The consistency level default value was LOCAL_ ONE for all write and read operations in our Cassandra-stress test since our testing keyspace is set to NetworkTopologyStrategy. And in our solution validation, we use one data center, which means LOCAL_ONE equals to ONE. It satisfies the needs of most users when consistency requirements are not stringent.
We changed CL to QUORUM and compared the IOPS and latency with the setting of CL=LOCAL_ONE workload testing. In our testing, we use RF=3, which means quorum is 2. We tested with TH=140 and it is reasonable that higher consistency level impacts the performance. From Figure 12, test with TH=140, CL=QUORUM, write IOPS dropped from 124,528 to 86,780 and read IOPS dropped from 13,836 to 9,649, which was about 30% decrease in throughput. And the mean read latency increased to 3.4ms while the mean write latency increased to 12.5ms.
Figure 12. 90% Write and 10% Read with Different Consistency Level Performance
Figure 13 shows the latency curve was steep, the 99th percentile was over 45ms, which indicated we were pushing the stress tests too much, so we adjusted the thread count to 100 in the quorum test and reran it. The results show similar write IOPS and read IOPS with TH= 140, while the mean write latency dropped to 8.96ms and read latency dropped to 2.39ms, so for CL=QUORUM, TH =100 is a better test setting.
From Figure 13, CL=QUORUM TH=100 and CL=LOCAL_ONE, TH=140 got a similar curve, while TH=140 latency curve was steeper compared with the other two, so we lowered the thread count to keep the latency consistent.
Figure 13. 90% Write and 10% Read with Different Consistency Level Latency
From the testing results, we got less concurrency with the setting of CL=QUORUM, the throughput was lower and the latency was higher than the value of the CL=LOCAL_ONE setting.
50% Write and 50% Read Performance Testing Results
Run the stress testing with various thread number. Figure 14 shows with the increase of thread count, the throughput increased accordingly. Write IOPS was 71,350 with TH=100, 85,104 with TH=140, about 20% increase, and 90,212 with TH=175, about 26% increase comparing with TH=100. Read IOPS results were close.
The mean latencies also increased as the thread count increased. The mean read latency for TH=100 was 2.43ms, it increased to 2.81ms with TH=140 and then increased to 3.49ms with TH=175. The mean write latency was 8.76ms with TH=100, and increased to 10.31ms with TH=140 and to 11.98ms with TH=175.
Figure 14. 50% Write and 50% Read with Different Thread Performance
From the latency curve in Figure 15, TH=175 read latency curve and write latency curve were much steeper. TH=140 latency curve was similar to that of TH =100. If users consider latency together with throughput, TH=140 is a better setting.
Figure 15. 50% Write and 50% Read with Different Thread Latency
Resiliency and Availability
vSAN’s storage-layer resiliency features combined with DSE’s peer-to-peer design enable this solution to meet the data availability requirements of even the most demanding applications. A set of failure scenarios are created to validate data availability. In the failure testing, we again ran Cassandra-stress against the preloaded data set with RF=3, we validated failures while running the performance testing workload. From the perspective of failure, we conducted three types of failure:
- A physical disk failure in a vSAN datastore, which will cause vSAN objects residing on this disk to enter a degraded state. With the storage policy set with PFTT=1, the object can still survive and serve I/O. Storage-layer resiliency handles this failure, thus from the DSE VMs’ perspective, there is no interruption of service.
- A DSE VM failure in the DSE cluster, which will cause loss of a DSE node, but application resiliency ensures the service is not interrupted and only performance is impacted since the cluster is smaller.
- A physical host failure will power off all the running VMs residing on it. In our validation, the DSE cluster loses two nodes but the service is not interrupted. And with VMware vSphere ® High Availability enabled, when a host fails, vSphere HA will restart the affected VMs on other hosts which will significantly speed up the failover process.
- The testing was conducted using the EXT4 filesystem , and as the manual describes, the default journal mode for ext4 is data=ordered, which means all data is forced directly out to the main file system prior to its metadata being committed to the journal.
- Users must also mount the data and log filesystem with data=journal, all data and metadata are written to the journal before being written to disk, you can always replay interrupted I/O jobs in case of a crash. Other filesystems were not tested but the journaling requirement would not change.
- In the solution configuration, there are two DSE nodes residing on a physical host, we must use rack-aware snitch to ensure that multiple data replicas are not stored on DSE nodes of the same ESXi host. With the rack name in the cassandra-rackdc.properties file aligned to a physical chassis name, replicas will split across chassis for a given token.
Disk Failure Testing
We simulated a physical disk failure by injecting a disk failure to a capacity SSD drive. The following procedures were used:
- We started expected performance workload testing, then chose a capacity disk with its NAA ID recorded and verified there were affected vSAN components on it. We can see from Figure 16 that the data disk of ‘ndse-11’ would be impacted if that physical disk fails. Actually the physical disk failure impacted multiple DSE node disks as shown in Figure 17.
- We used the disk fault injection script described in the Failure Testing document to inject hot unplug to the physical disk we chose; the same tests can be run by simply removing the disk from the host if physical access to the host is convenient. The components that resided on that disk in VM ndse-11 quickly showed up as absent.
Figure 16. Physical Disk Failure Caused Component Absent
Figure 17. Multiple Components Impacted due to Physical Disk Failure
3. After 20 minutes (before the 1-hour timeout, which means that no rebuilding activity will occur during this test), to put the disk drive back in the host, simply rescan the host for new disks to make sure NAA ID is back and then clear any hot unplug flags set previously with the -c option using the vsanDiskFaultInjection script.
We verified the results:
- From vSAN’s perspective, just one of the two vSAN replicas failed so the I/O should not be interrupted. The impacted components should be marked as absent. After the disk is back online, a ‘resyncing’ of the component should start immediately. See Figure 18 and in our validation, the component status was back to active in two minutes as Figure 19 shows.
Note: This is only required in development environments with the current vSAN offering. We are working on a future vSAN enhancement that will utilize FTT=0 for distributed applications.
- From DSE’s perspective, no DSE node should fail and Cassandra-stress clients should all report normal results and performance impact should be negligible. From Figure 20 and Figure 21, the IOPS and latency of the performance testing were consistent.
Figure 18. Component Became ‘Absent-resyncing’ when the Physical Disk was Back in One Hour
Figure 19. Component Became ‘Active’
Figure 20 shows the performance result from one test client. The throughput has been stable since the workload test started.
Figure 20. 90% Write and 10% Read Testing IOPS during Disk Failure
The median latency was also consistent during the testing and the fluctuation level was acceptable.
Figure 21. 90% Write and 10% Read Testing Latency Median during Disk Failure
VM Failure Testing
We simulated a DSE node failure by shutting down the VM. The following procedures were used:
- We started the expected performance workload testing.
- We shut down one DSE node “ndse-11” from the vSphere web client console as shown in Figure 22, then we powered on the VM after one hour.
Figure 22. DSE VM ndse-11 Failure
From the output of “nodetool status” in Figure 23, we verified ndse-11 (10.156.129.44) status was DN (down).
Figure 23. One DSE Node was Down
3. We verified the expected results: DSE application service was not interrupted and the performance was degraded as expected due to the loss of a DSE node.
From Figure 24, we can see IOPS was steady during the VM failure and Figure 25 shows the latency fluctuation was acceptable.
Figure 24. 90% Write and 10% Read Testing IOPS during VM Failure
Figure 25. 90% Write and 10% Read Testing Latency Median during VM Failure
Host Failure Testing without vSphere HA
In this scenario, we validated the DSE application-level availability worked as expected. We simulated a host failure by powering off a physical ESXi host. The procedures were:
- We chose a host ’10.159.16.34’ and checked the virtual machines running on it: ‘ndse-2’ and ‘ndse-10‘ were two DSE nodes.
- We powered off ’10.159.16.34’ from the host’s vSphere web client console.
- We verified that the two DSE nodes “ndse-2’ and ‘ndse-10’ were powered off and since host 10.159.16.34 was down, and the status of the affected virtual machine was set to disconnected as shown in Figure 26.
Figure 26. Host Failure Caused Two DSE VM Nodes Down
- From the DSE cluster, the service was not interrupted, so there was no error on the stress test clients and the test performance was degraded as expected because two nodes were down. Figure 27 shows the output from one test client results, when the failure occurred, there was a sharp decrease in throughput for less than 1 minute due to the failover of request to the affected two nodes.
Figure 27. 90% Write and 10% Read Testing IOPS during Host Failure
Note: The IOPS value displayed in the log was 568 (not zero) in the downtime.
- From the output screenshot in Figure 28, we took a deeper look at the details and verified the performance was degraded, IOPS dropped from 14,757 to 614 when the failure occurred and it took less than 1 minute to recover back to the normal throughput, which was 15,604 IOPS.
- When the failure occurred, latency median dropped in the failover process because the IOPS was much less, and it returned to normal as synchronized with IOPS.
Figure 28. 90% Write and 10% Read Testing Latency Median during Host Failure
4. We powered on host 10.159.16.34 and then powered on VM ‘ndse-2’ and ‘ndse-10’, the DSE cluster showed the two nodes were up again in a few minutes.
Host Failure Testing with vSphere HA
This section offers supplemental guidance for the test and development solution which is relevant to the current version of vSAN and not applicable to the future vSAN offering when used with DataStax Enterprise.
In this scenario, we validated that the DSE application-level availability together with vSphere HA speed up the failover process. We simulated a host failure by powering off a physical ESXi host. The procedures were:
1. We chose a host ’10.159.16.34’ and checked the virtual machines running on it: ‘ndse-2’ and ‘ndse-10’ were two DSE nodes.
Figure 29. Two DSE VMs Down when the Host Failure Occurred
2. We powered off ’10.159.16.34’ from the host’s vSphere web client console.
3. We verified that the two DSE nodes “ndse-2’ and ‘ndse-10’ were powered off and then they were restarted by vSphere HA automatically in one minute. Figure 30 illustrates ’ndse-10’ was restarted on host ’10.159.16.37’ and Figure 31 shows ‘ndes-2’ was restarted on host ’10.156.16.38, and from the DSE cluster, the DSE nodes became up again in two minutes. In the host failure without vSphere HA, the two nodes could only be back after the host was recovered, which took much longer time.
Figure 30. DSE VM ndes-10 was Restarted by vSphere HA on Host 10.159.16.37
Figure 31. DSE VM ndse-2 was Restarted by vSphere HA on Host 10.159.16.38
Since host ‘10.159.16.34’ was still down, the cluster was back to normal from DSE cluster’s perspective. However, from VM’s perspective, multiple components that reside on host ’10.159.16.34’ were absent as shown in Figure 32.
Figure 32. Components were ‘Absent’ due to a Host Failure
4. We powered on host “10.159.16.34’ after 20 minutes and the affected components resynced and became active again in a few minutes as Figure 33 demonstrates.
Figure 33. Components Became Healthy Again
Failure Testing Summary
The failure testing results were summarized in Table 7.
Table 7. Failure Testing Results
|FAILURE TYPE||TEST DESCRIPTION||RESULT|
|Disk failure||Fail one disk, and clear the failure after 20 minutes.||Performance impact is negligible, after clearing the failure, vSAN resyncs the data in less than 2 minutes.|
|VM failure||Power off a VM and power on after an hour.||Service is not interrupted, performance is degraded as expected due to the loss of a DSE node.|
|Host failure (without vSphere HA)||Power off a host and power on after 20 minutes.||Performance is degraded due to the loss of two DSE nodes as expected, but service is not interrupted. vSAN only resyncs the data without component rebuild since the host is back within one hour.|
|Host failure (with vSphere HA)||Power off a host and power on after 20 minutes.||Performance is degraded due to the loss of two DSE nodes as expected, but service is not interrupted, and HA will restart the affected two DSE nodes in less than 1 minute and the workload returns to normal in 2 minutes.|
This section provides the recommended best practices for this solution.
When configuring DSE in a vSAN cluster, consider the following best practices:
- Before the deployment, plan the database size and required performance level. This will ensure the hardware is properly deployed and software settings are best suited to serve the requests. Plan the hardware resource per VMware vSAN Design and Sizing Guide 6.6.
Specially for DSE database capacity planning, the database routinely requires disk capacity for compaction and repair operations during normal operations. For optimal performance and cluster health, DataStax recommends not filling disks to capacity, but running at 50% to 80% capacity depending on the compaction strategy and size of the compactions, so we need to consider the extra capacity needed.
- When configuring the virtual CPU and memory for DSE VMs, choose an appropriate CPU and memory number to best suit the users’ requirements. VM aggregated CPU cores and memory should not exceed the physical resources to avoid contention.
Choosing an optimum JVM heap size and leaving enough memory for OS file cache is important, follow the Set the heap size for optional Java garbage collection in DataStax Enterprise guide to determine the optimum heap size.
- Use different virtual disks for the DSE data directory and the log directory. If components of DSE VMs with one data disk are not fully distributed across vSAN datastore, customers can use multiple virtual disks for the data directory to make full use of physical disks.
- Set the type of virtual SCSI controller to paravirtual. The maximum number of SCSI controller is four, make each virtual disk use a separate controller if the VMDK number is equal to or less than four.
- Configure separate storage cluster on the hybrid management cluster to avoid performance impact on the tested DSE cluster.
- For test and development environments, after the base data set is loaded and the compaction is finished, take snapshots for reuse.
- Higher consistency level affects performance, choose an appropriate consistency level to meet application requirements.
- Tune client threads based on the specific workload. In general, generate the peak performance, and reduce thread count by 30%.
- Follow Datastax Recommended production settings
This section provides a summary of this reference architecture and validates DataStax Enterprise on vSAN 6.6 All-Flash for Development.
Overall, deploying, running, and managing DSE applications on VMware vSAN provides predictable performance and high availability. All storage management moves into a single software stack, thus taking advantage of the security, performance, scalability, operational simplicity, and cost-effectiveness of vSAN.
It is simple to expand using a scale-up or scale-out approach without incurring any downtime. With the joint efforts of VMware and DataStax, customers can deploy DSE clusters on vSAN for their modern cloud applications with ease and confidence in test and development environments. Check back for further developments around the future of this partnership including vSAN enhancements with a design focused on cloud applications that require data to be contextual, always on, real-time, distributed, and scalable.
Check out more vSAN details and customer stories.
See more vSAN details and customer stories:
For more information regarding DataStax Enterprise, see DataStax .
About the Author
This section provides a brief background on the authors of this solution guide.
Sophie Yin, Solutions Architect in the Product Enablement team of the Storage and Availability Business Unit wrote the original version of this paper.
Kathryn Erickson, Director of Strategic Partnerships at DataStax, worked with Sophie Yin as a secondary author on this paper.
Catherine Xu, Senior Technical Writer in the Product Enablement team of the Storage and Availability Business Unit edited this paper to ensure that the contents conform to the VMware writing style.