CockroachDB on VMware Cloud Foundation

Executive Summary

CockroachDB delivers enterprise-class scalability, resiliency, and geo-partitioning without complexity. Through its truly elastic architecture, users can effortlessly scale their databases with minimal administrative overhead. While all databases provide some level of resiliency and availability, CockroachDB pushes these notions one step further by providing a database that can survive machine, datacenter, and region failures. Finally, in today’s globalized landscape, CockroachDB provides the geo-partitioning required to meet stringent data sovereignty regulations and delivers low-latency transactions through its ability to enforce data locality.

These characteristics, coupled with VMware Cloud Foundation™, the ubiquitous hybrid cloud platform that delivers a full set of highly secure software-defined services for compute, storage, and networking with automated deployment and lifecycle management, provide the ideal solution to host business critical applications.

Introduction

VMware Cloud Foundation is an integrated software platform that automates the deployment and lifecycle management of a complete software-defined data center (SDDC) on a standardized hyperconverged architecture. It can be deployed on premises on a broad range of supported hardware or consumed as a service in the public cloud from the selected VMware Cloud Providers™ and VMware’s own VMware Cloud™ on AWS. With the integrated cloud management capabilities, the end result is a hybrid cloud platform that can span private and public environments, offering a consistent operational model based on well-known VMware vSphere® tools and processes, and the freedom to run apps anywhere without the complexity of app re-writing.

This document outlines best practices for deploying CockroachDB in an on-premises vSphere cluster running on VMware Cloud Foundation 4.

Overview

The solution technology components are listed below:

  VMware Cloud Foundation

  • VMware vSphere
  • VMware vSAN
  • VMware NSX Data Center

  CockroachDB

VMware vSphere

VMware vSphere is VMware's virtualization platform, which transforms data centers into aggregated computing infrastructures that include CPU, storage, and networking resources. vSphere manages these infrastructures as a unified operating environment and provides operators with the tools to administer the data centers that participate in that environment. The two core components of vSphere are ESXi™ and vCenter Server®. ESXi is the hypervisor platform used to create and run virtualized workloads. vCenter Server is the management plane for the hosts and workloads running on the ESXi hosts.

VMware vSAN

VMware vSAN is the industry-leading software powering VMware’s software defined storage and HCI solution. vSAN helps customers evolve their data center without risk, control IT costs and scale to tomorrow’s business needs. vSAN, native to the market-leading hypervisor, delivers flash-optimized, secure storage for all of your critical vSphere workloads, and is built on industry-standard x86 servers and components that help lower TCO in comparison to traditional storage. It delivers the agility to easily scale IT and offers the industry’s first native HCI encryption.

vSAN simplifies day-1 and day-2 operations, and customers can quickly deploy and extend cloud infrastructure and minimize maintenance disruptions. Stateful containers orchestrated by Kubernetes can leverage storage exposed by vSphere (vSAN, VMFS, NFS) while using standard Kubernetes volume, persistent volume, and dynamic provisioning primitives.

VMware NSX Data Center

VMware NSX Data Center is the network virtualization and security platform that enables the virtual cloud network, a software-defined approach to networking that extends across data centers, clouds, and application frameworks. With NSX Data Center, networking and security are brought closer to the application wherever it’s running, from virtual machines to containers to bare metal. Like the operational model of VMs, networks can be provisioned and managed independent of underlying hardware. NSX Data Center reproduces the entire network model in software, enabling any network topology—from simple to complex multitier networks—to be created and provisioned in seconds. Users can create multiple virtual networks with diverse requirements, leveraging a combination of the services offered via NSX or from a broad ecosystem of third-party integrations ranging from next-generation firewalls to performance management solutions to build inherently more agile and secure environments. These services can then be extended to a variety of endpoints within and across clouds.

CockroachDB

CockroachDB is a distributed SQL database built on a transactional and strongly consistent key-value store. It scales horizontally; survives disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention; supports strongly consistent ACID transactions; and provides a familiar SQL API for structuring, manipulating, and querying data.

CockroachDB is inspired by Google's Spanner and F1 technologies, and the source code is freely available.

Assumptions

When running on VMware Cloud Foundation, CockroachDB VMs will be subject to VMware vSphere vMotion® and leverage vSAN storage resiliency. In this design, we deploy a single CockroachDB cluster in a single VMware Cloud Foundation workload domain. Configuration of the workload domain, networking, virtual machines, and CockroachDB follow the best practices jointly established by VMware and Cockroach Labs.

vSphere vMotion

vSphere vMotion is a live migration feature that allows you to move an entire running virtual machine from one physical server to another with no downtime. The virtual machine retains its network identity and connections, ensuring a seamless migration process.

vMotion transfers the virtual machine’s active memory and precise execution state over a high-speed network, allowing the virtual machine to switch from running on the source vSphere host to running on the destination vSphere host.

vSAN Resilient Storage

vSAN allows the configuration of the number of failures to tolerate (FTT) as a virtual machine policy regulating the number of failures the underlying infrastructure in a cluster can sustain while the VM remains available. When a device failure occurs, vSAN automatically rebuilds the components on the failed device to restore storage resiliency. The FTT number, from 0 to 2, represents the number of simultaneous device failures vSAN can withstand from the time of failure until the rebuild completes.

In vSAN, a device is one of the following types:

  Capacity disk

  Cache disk

  ESXi Host
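The relationship between FTT and the resources a mirrored (RAID 1) policy consumes can be sketched as follows. This is a minimal illustration that assumes the commonly cited vSAN mirroring rule of FTT + 1 full data copies placed across at least 2 x FTT + 1 hosts; the function name and the 500GB example disk are illustrative, and erasure-coding policies follow different rules.

```python
def raid1_requirements(ftt: int, vmdk_gb: float) -> dict:
    """Estimate replicas, minimum hosts, and raw capacity for a RAID 1 policy.

    Assumption (not stated in this document): mirroring keeps ftt + 1 full
    copies of the data and needs 2 * ftt + 1 hosts for copies and witnesses.
    """
    if ftt < 0:
        raise ValueError("FTT cannot be negative")
    replicas = ftt + 1
    return {
        "replicas": replicas,
        "min_hosts": 2 * ftt + 1,
        "raw_capacity_gb": replicas * vmdk_gb,
    }


# Example: a 500GB virtual disk protected with FTT=1.
print(raid1_requirements(ftt=1, vmdk_gb=500))
```

Under these assumptions, raising FTT trades raw capacity and host count for the ability to survive more simultaneous device failures.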

Environment

This section introduces the general design and configurations:

  Environment Overview

  Hardware Configuration

  Network Configuration

  Software

Overview

The recommended deployment model for VMware Cloud Foundation is the standard architecture model. In the standard model, all VMs required to operate VMware Cloud Foundation and its Workload Domains run in a dedicated management domain (Figure 1). NSX-T edges are used for network services that cannot run on the distributed routers. These services include North/South routing, load balancing, DHCP, VPN, and NAT. Depending on the scale and requirements of the CockroachDB instances, the NSX-T edges required for the Workload Domain can be deployed to a dedicated cluster to further dedicate resources and isolate CockroachDB.

The CockroachDB cluster is composed of multiple VMs, each running a single CockroachDB node. All CockroachDB VMs are deployed in a vSphere cluster running in a workload domain. Provided adequate resources, a single vSphere cluster can support multiple CockroachDB clusters.

Figure 1. Overview of VMware Cloud Foundation Running CockroachDB

Hardware

VMware Cloud Foundation must be deployed on supported server hardware. A complete list of supported hardware and server vendors can be found in the VMware Compatibility Guide (VCG) for vSAN. In addition, consider the following items for server selection:

  Because of the significant performance advantages of all-flash server configurations over hybrid, the use of all-flash servers is highly recommended. All-flash provides the higher throughput and lower latency required to run demanding applications in dense environments. Additionally, the use of at least two disk groups per server is highly recommended. Further details can be found in the vSAN Design Guide under Storage Design Considerations.

  NICs with advanced features such as LRO/LSO and RSS, which provide significant performance benefits to vSAN, NSX, and applications such as CockroachDB.

In the standard architecture model, a total of 4 servers are required for the management domain and an additional 3 per VI Workload Domain cluster. For optimal vSAN resilience the use of 4 or more servers per vSphere cluster is highly recommended. Additional details about the use of 3 node vSAN clusters can be found in the vSAN Design Guide under Cluster Design Considerations.

Table 1. Server Count per vSphere Cluster

vSphere Cluster | Server Count
Management | 4 (minimum)
Workload Domain Compute Cluster | 3 (minimum), 4+ recommended
Workload Domain Edge Cluster (optional) | 3 (minimum), 4+ recommended
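As a quick planning aid, the server counts above can be totaled with a few lines of code. This is only a sketch of the arithmetic in the standard architecture model; the function name and the example cluster lists are illustrative.

```python
MANAGEMENT_SERVERS = 4   # minimum for the management domain
MIN_PER_CLUSTER = 3      # minimum per VI Workload Domain cluster


def total_servers(workload_cluster_sizes):
    """Return the total server count: management domain plus each workload cluster."""
    for size in workload_cluster_sizes:
        if size < MIN_PER_CLUSTER:
            raise ValueError("each workload domain cluster needs at least 3 servers")
    return MANAGEMENT_SERVERS + sum(workload_cluster_sizes)


print(total_servers([3]))      # smallest deployment: one 3-server compute cluster
print(total_servers([4, 4]))   # recommended sizing with an optional edge cluster
```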

Network

The physical network plays an important role for both vSAN and CockroachDB. The physical network should meet VMware Cloud Foundation requirements and follow the best practices outlined in the vSAN Network Design Guide. For vSAN all-flash configurations, the minimum required network speed is 10Gbps. Because of the significant benefits to vSAN and CockroachDB, network speeds of 25Gbps or more are highly recommended.

Networking is deployed based on the VMware Cloud Foundation architecture. Both Management and Workload domains are configured with an NSX-T managed VMware vSphere Distributed Switch. Portgroups are used for management, vMotion, and vSAN traffic. CockroachDB application traffic uses an overlay-backed segment. Network traffic is load balanced over two physical NICs according to the teaming policies outlined in Table 2. Additional design considerations when deploying CockroachDB using multiple NSX-T segments can be found in Appendix B – Cockroach Network Topologies and NSX-T Edge Sizing.

Figure 2. Distributed Switches Overview

 

Table 2. Virtual Distributed Switch Teaming Policy

Port Group | Teaming Policy | VMNIC0 | VMNIC1
Management | Route based on Physical NIC load | Active | Active
vMotion | Route based on Physical NIC load | Active | Active
vSAN | Route based on Physical NIC load | Active | Active
CRDB | Load Balance Source | Active | Active

Software

CockroachDB on VMware Cloud Foundation requires the following software resources (Table 3).

Table 3. Software Resources

Software | Version | Purpose
VMware Cloud Foundation | 4.0 or above | A unified SDDC platform that brings together VMware ESXi, vSAN, NSX, and optionally, VMware vRealize® Suite components, into a natively integrated stack to deliver enterprise-ready cloud infrastructure for the private and public cloud. See BOM of VMware Cloud Foundation for details.
CockroachDB | 20.1 or above | Database
CentOS or RHEL | 7.6 or above; 8.0 or above | Operating system with support for the VMware Precision Clock. See Linux Driver for Precision Clock Virtual Device for details.

Platform Validation

Prior to deployment, it is recommended to validate the performance capabilities of the platform. HCIBench is the preferred tool to validate both overall and I/O specific profile performance using synthetic I/O. HCIBench provides the ability to run user-defined workloads as well as a series of pre-defined tests, known as the EasyRun suite. When leveraging EasyRun, the HCIBench appliance executes four different standard test profiles that sample system performance and report key metrics.

Beyond synthetic testing, it is advised to leverage the I/O tool designated by the software vendor. CockroachDB comes with built-in load generators for simulating different types of client workloads, printing per-operation statistics and totals after a specific duration or maximum number of operations. Users should explore which tests and parameters are best suited to replicate the I/O profiles matching their actual workload. Once tests and optimal parameters are identified, a baseline test should be conducted using a subset of the selected test cases. Running a limited subset allows the user to rapidly expose potential performance anomalies present in the system or configuration while reducing the time between test iterations.

To ensure CockroachDB operates as expected on VMware Cloud Foundation, it is validated against a suite of test cases jointly designed by VMware and Cockroach Labs. These test cases are designed to assess CockroachDB operation when subjected to vMotion and host failures.

Validation Environment

CockroachDB was validated on VMware Cloud Foundation using the following environment. Configuration of this environment strictly adheres to VMware Cloud Foundation deployment guidelines and 

Hardware Resources

The validation environment consisted of a total of 12 Dell PowerEdge R640 servers, each with two disk groups consisting of one cache-tier NVMe device and capacity-tier read-intensive SATA SSDs (Table 4). The environment is deployed according to the standard VMware Cloud Foundation architecture with an additional cluster for the NSX-T edges. Four servers are used for the management domain, four for the NSX-T edge cluster, and the remaining four for the VI Workload Domain running the CockroachDB cluster.

Note: Although the servers are configured with NICs capable of supporting up to 100GbE, the top of rack (TOR) switches only support a maximum speed of 40GbE per port. All server NICs therefore run at 40GbE, the maximum rate supported by the TOR switches.

Each server node in the cluster had the following configuration.

Table 4. Server Hardware Configuration

PROPERTY | SPECIFICATION
Server model name | Dell PowerEdge R640
CPU | 2 x Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, 48 core each
RAM | 768GB
Network adapter | 2 x Mellanox MT28800 ConnectX-5 1/10/25/40/50/100Gbps Ethernet Controller
Storage adapter | 1 x Dell HBA330 Mini Adapter
Disks | Cache: Dell Express Flash NVMe P4610; Capacity: Intel D3-S4510 RI SATA SSD

TOR networking for the test environment is provided by a pair of Dell S6000-ON switches. These switches provide both L2 switching and L3 routing for all the VLANs required by VMware Cloud Foundation.

Table 5. Dell S6000-ON Switch Characteristics

PROPERTY | SPECIFICATION
Switch model name | Dell S6000-ON
Number of ports | 32 x 40GbE QSFP+
Switching bandwidth | Up to 2.56Tbps non-blocking (full duplex)

Logical Network

Networking for the CockroachDB validation cluster is provided through a single overlay backed NSX-T segment with N-S routing through a dedicated pair of NSX-T edges running in the separate edge cluster.

Figure 3. CockroachDB Validation Logical Network

VM Configuration

See Table 6 for VM configurations. 

Table 6. Validation VM Configuration

PROPERTY | SPECIFICATION
vCPU | 16
Memory | 64GB
Disk | 1 x 500GB using vSAN RAID 1 FTT=1
NIC | 1 x VMXNET3
Additional Devices | Precision Clock

Test Cases

Baseline

A performance baseline is obtained by running a 6-hour TPCC workload configured with 1,000 warehouses against a 3 node CockroachDB cluster.
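A baseline run of this shape could be driven with CockroachDB's built-in cockroach workload generator. The sketch below only assembles the command lines and does not execute them; the connection URL, host name, and insecure mode are placeholders for illustration, and the exact flags used during validation are not given in this document.

```python
def tpcc_commands(sql_url: str, warehouses: int = 1000, duration: str = "6h"):
    """Build (but do not run) the init and run command lines for a TPC-C workload."""
    init_cmd = ["cockroach", "workload", "init", "tpcc",
                f"--warehouses={warehouses}", sql_url]
    run_cmd = ["cockroach", "workload", "run", "tpcc",
               f"--warehouses={warehouses}", f"--duration={duration}", sql_url]
    return init_cmd, run_cmd


# Hypothetical connection string; replace with a real node or load balancer address.
init_cmd, run_cmd = tpcc_commands("postgresql://root@crdb-node1:26257?sslmode=disable")
print(" ".join(init_cmd))
print(" ".join(run_cmd))
```

Keeping the warehouse count and duration as parameters makes it easy to run a shorter subset first, as suggested under Platform Validation.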

Results

The test completed successfully with no errors or warnings. CockroachDB exhibited consistent performance throughout the workload and achieved 12581.6 TPMC at a 90th percentile latency of 28.5ms.

Table 7. Baseline TPCC Results

Metric | Result
TPMC | 12581.6
Efficiency | 97.6%
90th Percentile Latency | 28.5ms

Figure 3. Baseline - SQL Queries

vMotion

This test validates CockroachDB tolerance to vMotion by simulating an extreme amount of migration while running a 6-hour TPCC workload. While the TPCC workload is running, a CockroachDB VM is selected at random every 2 minutes and migrated to another ESXi host.

Results

The test completed successfully with no errors or loss of availability (Figure 4, Figure 5). CockroachDB exhibited consistent performance throughout the test and achieved 12401.0 TPMC at a 90th percentile latency of 104.9ms. The increase in latency is expected given that the CockroachDB cluster is constantly subjected to node migration. With a vMotion occurring every 120 seconds and each vMotion taking an average of 40 seconds, the CockroachDB cluster is undergoing a VM migration for roughly a third of the test duration.
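The share of the test spent migrating follows directly from the interval and average duration quoted above. The short calculation below is a sketch that assumes migrations never overlap and that one starts at every 2-minute mark for the full 6 hours.

```python
from fractions import Fraction

TEST_DURATION_S = 6 * 60 * 60   # 6-hour TPCC workload
INTERVAL_S = 120                # one vMotion started every 2 minutes
AVG_VMOTION_S = 40              # average vMotion duration reported above

# Number of migrations started over the whole test.
migrations = TEST_DURATION_S // INTERVAL_S

# Fraction of each interval (and so of the whole test) spent migrating.
fraction_migrating = Fraction(AVG_VMOTION_S, INTERVAL_S)

print(migrations)
print(fraction_migrating)
```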

Table 8. vMotion Test TPCC Results

Metric | Baseline | Result | Delta
TPMC | 12581.6 | 12401.0 | -180.6
Efficiency | 97.6% | 96.4% | -1.2%
90th Percentile Latency | 28.5ms | 104.9ms | +76.4ms

Figure 4. vMotion - TPCC SQL Queries

Figure 5.  vMotion - TPCC SQL Errors

Host Failure with VMware vSphere High Availability

This test validates CockroachDB remains available during a host failure. The workload is a 1-hour TPCC test with 1,000 warehouses against a 3 node CockroachDB cluster. The vSphere cluster has vSphere HA enabled. Around the 15-minute mark an ESXi host running a CockroachDB node is abruptly powered down through the Out-Of-Band management interface (OOB).

Results

The test completed successfully, and vSphere HA restarted the node on one of the remaining ESXi hosts. During the test, CockroachDB did not exhibit unexpected behavior or performance anomalies beyond the anticipated impact of a CockroachDB node failure. At the point of failure, a momentary decrease in performance is observed, followed by a rapid recovery to the pre-failure rate (Figure 6). During the period with a missing CockroachDB node, cluster performance initially dropped to 55% and then settled around 66%, a number proportional to the remaining resources (2 of 3 CockroachDB nodes). A small number of SQL transaction errors were observed (Figure 7), consistent with the loss of the transactions in flight to the failed node at the moment of catastrophic failure.

Figure 6. Host Failure with HA - SQL Queries

Figure 7. Host Failure with HA - SQL Query Errors

Recommendations

This section provides some of the key high-level recommendations based on the joint work between VMware and Cockroach Labs.

Server Configuration

Hyperthreading

Hyperthreading technology allows a single physical processor core to behave like two logical processors. The processor can run two independent applications at the same time. While hyperthreading does not double the performance of a system, it can increase performance by better utilizing idle resources leading to greater throughput for certain important workload types.

Recommendation: Hyperthreading enabled

Power Management

ESXi offers power management features supported by the host hardware to adjust the balance between performance and power savings. The ESXi High Performance policy provides more deterministic performance at the cost of lower per-watt efficiency. High Performance policy also reduces maximum Turbo Boost frequencies by disabling deep C-States, which increases processor performance consistency and reduces wake up latencies.

Recommendation: Power management set to “OS Control” and ESXi power management to “High Performance”

ESXi

Time Configuration

To ensure consistent time synchronization between ESXi hosts and other components in the environment, ESXi hosts should synchronize their time and date to either NTP or PTP servers. CockroachDB vMotion support requires accurate and consistent time between all its nodes.

Recommendation: Automatic time synchronization enabled, and the same time source configured on all ESXi hosts

vSphere Cluster

vSphere High Availability

vSphere HA provides high availability for virtual machines by pooling the virtual machines and the hosts they reside on into a cluster. Hosts in the cluster are monitored and in the event of a failure, the virtual machines on a failed host are restarted on alternate hosts.

Recommendation: HA Enabled

VMware vSphere Distributed Resource Scheduler™

DRS works on a cluster of ESXi hosts and provides resource management capabilities like load balancing and virtual machine (VM) placement. DRS also enforces user-defined resource allocation policies at the cluster level, while working with system-level constraints.

Recommendation: DRS set to fully automated with conservative migration threshold or partially automated

DRS VM-VM Affinity Rules

Affinity rules can be used to specify whether a group of VMs should run on the same ESXi host or separately on different ESXi hosts. In most cases, CockroachDB VMs belonging to the same CockroachDB cluster should run on separate ESXi hosts to prevent the loss of an ESXi host from impacting more than one CockroachDB node. In deployments that leverage geo-partitioning, it may be possible or desirable to keep VMs together. Traffic between VMs on the same ESXi host may see improved throughput and latency since the traffic does not traverse the physical network. Regardless of design, affinity rules should be carefully evaluated and implemented to ensure the physical failure domains match CockroachDB expectations.

Recommendation: Create affinity or anti-affinity rules to enforce VM placement that matches failure domain expectations
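One way to audit placement against this recommendation is to look for ESXi hosts that run more than one node of the same CockroachDB cluster. The sketch below works on plain dictionaries; the VM names, host names, and cluster label are hypothetical, and in practice the inventory would come from vCenter.

```python
from collections import defaultdict


def colocated_nodes(placement, crdb_cluster_of):
    """Return {(crdb_cluster, host): [vms]} for hosts running 2+ VMs of one CockroachDB cluster."""
    groups = defaultdict(list)
    for vm, host in placement.items():
        groups[(crdb_cluster_of[vm], host)].append(vm)
    return {key: sorted(vms) for key, vms in groups.items() if len(vms) > 1}


# Hypothetical inventory: VM -> ESXi host, and VM -> CockroachDB cluster name.
placement = {"crdb-1": "esxi-01", "crdb-2": "esxi-02", "crdb-3": "esxi-02"}
crdb_cluster_of = {"crdb-1": "prod", "crdb-2": "prod", "crdb-3": "prod"}

print(colocated_nodes(placement, crdb_cluster_of))
```

An empty result means no single host failure would take out more than one node of any CockroachDB cluster in the inventory.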

vSAN Storage Policy

Storage Policy

The Number of Failures to Tolerate capability addresses the key customer and design requirement of availability. With FTT, availability is provided by maintaining replica copies of data, to mitigate the risk of a host failure resulting in lost connectivity to data or potential data loss.

Recommendation: RAID1 FTT=1 or greater

vSAN Deduplication and Compression

Deduplication and Compression can enhance space savings capabilities; however, for optimal performance we do not recommend enabling Deduplication and Compression.

Recommendation: Disable Deduplication and Compression.

vSAN Encryption

vSAN can perform data at rest encryption. Data is encrypted after all other processing, such as deduplication, is performed. Data at rest encryption protects data on storage devices, in case a device is removed from the cluster. Use encryption as per your company’s Information Security requirements.

Recommendation: Enable Encryption as required per your InfoSec

Virtual Machine

vCPU

Each vCPU is seen as a single physical CPU core by the VM's operating system. Cockroach Labs recommends a minimum of 4 vCPU per VM to ensure sufficient compute resources. In multi-socket systems, VMware recommends, when possible, setting the number of vCPU to fit on the fewest NUMA nodes[1]. For instance, in dual-socket systems, set the number of vCPU to be less than or equal to the number of cores per socket. If the sizing must span NUMA nodes, vSphere will automatically optimize.

Recommendation: Set the VM vCPU count between 4 and the number of cores per socket (pCPU).

Memory

Memory is the amount of memory presented to the guest operating system. Cockroach Labs sets a minimum ratio of 2GB per vCPU. The recommended ratio is at least 4GB per vCPU. For instance, a VM with 16 vCPU should have at least 64GB of memory.

Recommendation: Set at least 4GB of memory per vCPU.
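The vCPU and memory guidance above can be combined into a simple sizing check. This is a sketch only: the function name is illustrative, and the 24 cores per socket used in the examples is a hypothetical host value, not a figure taken from this document.

```python
def check_vm_sizing(vcpus: int, memory_gb: float, cores_per_socket: int) -> list:
    """Return a list of findings; an empty list means the sizing follows the guidance above."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    findings = []
    if vcpus < 4:
        findings.append("fewer than the 4 vCPU minimum")
    if vcpus > cores_per_socket:
        findings.append("vCPU count exceeds cores per socket, so the VM may span NUMA nodes")
    ratio = memory_gb / vcpus
    if ratio < 2:
        findings.append("below the 2GB per vCPU minimum")
    elif ratio < 4:
        findings.append("below the recommended 4GB per vCPU")
    return findings


# The validation VM shape (16 vCPU, 64GB) on a hypothetical 24-core socket.
print(check_vm_sizing(vcpus=16, memory_gb=64, cores_per_socket=24))
# An oversized VM with too little memory per vCPU.
print(check_vm_sizing(vcpus=32, memory_gb=96, cores_per_socket=24))
```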

Precision Clock

The VMware Precision Clock is a new type of virtual device available in ESXi 7.0 (hardware version 17 onwards) that provides virtual machines with access to the underlying ESXi host's system clock. Using the Precision Clock provides the VM with a reliable time source that is not affected by external VM operations such as vMotion.

Requirement: Configure a VMware Precision Clock

Operating System

PTP driver

The ptp_vmw driver is a required Linux driver for the VMware Precision Clock. This driver will be included in upcoming Linux distribution releases. If the driver is not bundled, the source can be downloaded from the VMware Linux Driver for Precision Clock Virtual Device site and then compiled for the current kernel.

Requirement: The ptp_vmw driver must be installed and loaded.
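Whether the driver is loaded can be checked from inside the guest by looking for ptp_vmw in the kernel's module list. The following is a minimal sketch that parses the text of /proc/modules; the sample content is made up for illustration, and a driver compiled directly into the kernel rather than loaded as a module would not appear in that file.

```python
def module_loaded(proc_modules_text: str, name: str = "ptp_vmw") -> bool:
    """Return True if `name` is listed as a loaded module in /proc/modules content."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first field of each /proc/modules line is the module name.
        if fields and fields[0] == name:
            return True
    return False


def ptp_vmw_loaded(path: str = "/proc/modules") -> bool:
    """Read the kernel's module list from `path` and look for the ptp_vmw driver."""
    with open(path, encoding="ascii", errors="replace") as handle:
        return module_loaded(handle.read())


# Illustrative /proc/modules content (addresses and sizes are invented).
sample = (
    "ptp_vmw 16384 0 - Live 0x0000000000000000\n"
    "ptp 20480 1 ptp_vmw, Live 0x0000000000000000\n"
)
print(module_loaded(sample))
```

On a CockroachDB VM, calling ptp_vmw_loaded() performs the same check against the live system.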

Chrony

Chrony is a versatile implementation of the Network Time Protocol (NTP) used to synchronize a system clock with NTP servers, reference clocks, or manual input using a wristwatch and keyboard. Configuring Chrony and setting the Precision Clock as the time source ensures the VM system clock is consistent with other VMs.

Requirement: Chrony enabled and configured to use the VMware Precision Clock.
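In chrony, a PTP hardware clock is configured as a reference clock with a refclock PHC directive in chrony.conf. The sketch below checks a configuration text for such a directive; the device path /dev/ptp0 and the option values in the sample line are assumptions for illustration and should be confirmed against the Chrony Configuration Manual and the actual device name in the guest.

```python
def has_phc_refclock(chrony_conf_text: str, device: str = "/dev/ptp0") -> bool:
    """Return True if an uncommented `refclock PHC <device>` directive is present."""
    for raw in chrony_conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!;%":
            continue  # blank line or chrony comment line
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "refclock" and fields[1] == "PHC" and fields[2] == device:
            return True
    return False


# Illustrative configuration; poll/dpoll/offset values are example settings only.
sample_conf = """\
# Use the VMware Precision Clock as the time source
refclock PHC /dev/ptp0 poll 3 dpoll -2 offset 0
"""
print(has_phc_refclock(sample_conf))
```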

CockroachDB

Clock Device Flag

The --clock-device flag for cockroach start and cockroach start-single-node identifies a PTP hardware clock for querying current time. This is needed in cases where the host clock is unreliable or prone to large jumps such as when performing a vMotion. For additional details about the CockroachDB time synchronization, see Appendix A – CockroachDB Timekeeping Architecture in vSphere 7.

Requirement: Cockroach must be started using the --clock-device flag set to the VMware Precision Clock.
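A start command that satisfies this requirement might be assembled as shown below. The sketch only builds the argument list; the host names and certificate directory are placeholders, and /dev/ptp0 is the example device path used in Table 9.

```python
def cockroach_start_args(join_hosts, advertise_addr,
                         clock_device="/dev/ptp0", certs_dir="certs"):
    """Build (but do not run) a `cockroach start` command line using the Precision Clock."""
    return [
        "cockroach", "start",
        f"--certs-dir={certs_dir}",
        f"--advertise-addr={advertise_addr}",
        f"--join={','.join(join_hosts)}",
        f"--clock-device={clock_device}",
    ]


# Hypothetical three-node cluster.
args = cockroach_start_args(
    join_hosts=["crdb-1:26257", "crdb-2:26257", "crdb-3:26257"],
    advertise_addr="crdb-1:26257",
)
print(" ".join(args))
```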

Conclusion

VMware Cloud Foundation delivers flexible infrastructure that enhances the capabilities of CockroachDB. With Storage Policy Based Management (SPBM), VMware Cloud Foundation can scale performance for both department and enterprise level clouds. Data-at-rest encryption meets both operational and regulatory compliance. NSX-T load balancing offers high availability by transparently distributing client load across a dynamic pool. CockroachDB on VMware Cloud Foundation with vSAN enables administrators to effortlessly scale their environments in real time, allowing enterprises to easily scale up and scale down as needed. The budget objectives of CTOs and CFOs can be achieved by adopting a platform that delivers simplicity, efficiency, and reliability.

About the Authors

Charles Lee, Solutions Architect in the VMware Cloud Platform Business Unit and Chen Wei, Staff Solutions Architect in the VMware Cloud Platform Business Unit wrote this report with contributions from the following members.

  • Alex Entin, Enterprise Architect, Cockroach Labs
  • Rachel Zheng, Product Line Marketing Manager in the Storage Product Marketing of the Cloud Platform Business Unit
  • Steven Tumolo, Staff Consulting Architect, VMware PSO

 

Appendix A – CockroachDB Timekeeping Architecture in vSphere 7

CockroachDB is a strongly consistent database that guarantees serializable SQL transactions. To achieve this, CockroachDB requires moderate levels of clock synchronization to preserve data consistency. For this reason, when a node detects that its clock is out of sync by more than the maximum offset (default 500ms), or out of sync with at least half of the other nodes in the cluster by 80% of the maximum offset (400ms by default), it spontaneously shuts down[2]. In a properly configured physical environment, this type of event is rare. However, in virtual environments with abstraction and workload mobility, time variation is much more likely.
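The 80% rule described above can be illustrated with a small function. This is a simplified reading of the rule as stated in the text, not CockroachDB's actual implementation; the function name and the sample offsets are illustrative.

```python
def should_shut_down(offsets_ms, max_offset_ms=500.0):
    """Apply the 80% rule to this node's measured clock offsets.

    offsets_ms holds the offset, in milliseconds, between this node's clock
    and each other node in the cluster. The node shuts down when at least
    half of those offsets exceed 80% of the maximum offset.
    """
    if not offsets_ms:
        return False
    threshold = 0.8 * max_offset_ms
    out_of_sync = sum(1 for offset in offsets_ms if abs(offset) > threshold)
    return 2 * out_of_sync >= len(offsets_ms)


print(should_shut_down([450, 420, 10, 5]))  # 2 of 4 peers beyond 400ms
print(should_shut_down([450, 10, 5]))       # only 1 of 3 peers beyond 400ms
```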

VMware vMotion allows the live migration of a VM from one physical host to another. Relocating a VM provides significant benefits to administrators by facilitating maintenance operations and allowing an even distribution of workloads across the hosts in a vSphere cluster. The process of live migrating a VM involves a number of steps that can momentarily impact the performance and timekeeping of the VM. At one critical phase during a vMotion, the VM is suspended to allow the transfer of the remaining memory pages to the target host. While suspended, the VM system clock is effectively paused and falls behind by roughly the delta between when the VM is suspended and when it resumes (Figure 8).

 

Figure 8. VM Clock Jump during vMotion

Upon resuming, one of three scenarios can occur:

  The time synchronization process adjusts the VM system clock before CockroachDB executes. When the CockroachDB process executes, its time is within tolerance of the other CockroachDB nodes and no action is taken.

  CockroachDB executes before the system clock is adjusted but the VM system time is within tolerance with the other CockroachDB nodes. Upon executing, CockroachDB sees a jump in time but no corrective actions are taken. Time synchronization eventually reconciles the system clock and there are no further effects.

  CockroachDB executes before the system clock is adjusted and the VM system time is out of tolerance with other CockroachDB nodes (Figure 9). CockroachDB initiates actions to prevent inconsistencies. Shortly after the time synchronization process runs, the system clock returns within tolerances. CockroachDB sees the time within tolerance and reverts back to normal operation. During the period where the system time is out of tolerance, performance is impacted and transaction retries can occur.
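The three outcomes above reduce to two questions: did time synchronization run before CockroachDB, and if not, was the clock still within tolerance? A minimal sketch of that decision, with hypothetical parameter names and values, is:

```python
def post_vmotion_scenario(clock_corrected_first: bool,
                          clock_lag_ms: float,
                          tolerance_ms: float) -> int:
    """Return which of the three scenarios (1, 2, or 3) applies after a VM resumes."""
    if clock_corrected_first:
        return 1  # clock adjusted before CockroachDB ran: no action taken
    if abs(clock_lag_ms) <= tolerance_ms:
        return 2  # CockroachDB sees a jump in time but stays within tolerance
    return 3      # out of tolerance: CockroachDB acts until the clock is corrected


# Hypothetical values: a 600ms pause against a 400ms tolerance.
print(post_vmotion_scenario(False, clock_lag_ms=600, tolerance_ms=400))
```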

 

Figure 9. System Clock Out of Tolerance

Recent vSphere releases introduce major enhancements to vMotion performance yet these improvements alone are not sufficient to eliminate the possibility of an undesirable race condition outcome involving the system clock. Other options to reduce the likelihood of CockroachDB taking action after a vMotion include tuning global vMotion parameters and CockroachDB tolerances. While both these approaches can further reduce the likelihood, they cannot completely eliminate the possibility and come at a cost to performance.

Solving the system clock race condition requires a different approach pioneered through advancements to both VMware and CockroachDB: the VMware Precision Clock device and CockroachDB external clock synchronization.

The release of virtual hardware version 17 in vSphere 7 introduces the Precision Clock virtual device. The Precision Clock allows a guest OS to directly access the system time of the underlying ESXi host. Unlike the virtual system clock running in the VM, querying the ESXi host system clock provides a reliable time source that is unaffected by operations to the VM, such as a vMotion. Beyond providing a dependable time source, the Precision Clock also offers a better achievable time accuracy by circumventing the virtual networking paths normally used by in-guest network-based time synchronization.

In tandem with the introduction of the VMware Precision Clock, CockroachDB 20.1 introduces the ability to query a Precision Clock device instead of the VM system clock. When enabled through the --clock-device flag CockroachDB queries exclusively the PTP clock to obtain its time reference.

 

Figure 10. Updated Timekeeping Architecture for CockroachDB in vSphere 7

In the updated timekeeping architecture both the guest time synchronization process (chrony) and CockroachDB are configured to use the VMware Precision Clock device (Figure 10). Setting both guest time synchronization and CockroachDB to the same time source ensures the time reference remains consistent for all processes running in the VM. As the CockroachDB VM is migrated from the source to the target ESXi host, the clock reference remains accurate and tightly bounded to the external clock source (Figure 11. CockroachDB vMotion in vSphere using Precision Clock).

Figure 11. CockroachDB vMotion in vSphere using Precision Clock

Enabling vMotion support for CockroachDB using the updated time synchronization architecture requires implementing the items outlined in Table 9. Joint testing between VMware and Cockroach Labs has validated that, when properly implemented, this architecture results in only a small transient increase in latency caused by the vMotion switchover time. This momentary increase in application latency is expected and in line with the normal impact of a vMotion.

Table 9. Cockroach vMotion Support Checklist

| Requirement | Version | Note |
|---|---|---|
| VMware vSphere | 7.0 or above | VMware Precision Clock introduced in virtual hardware version 17, released in vSphere 7. |
| VM virtual hardware | 17 or above | VM with virtual hardware version 17 or above, and a Precision Clock device added. See Add a Precision Clock Device to a VM. |
| ESXi time synchronization | n/a | All ESXi hosts configured to use time synchronization and set to the same time synchronization source(s). See Editing the Time Configuration of a Host. |
| CentOS or RHEL | 7.6 or above, or 8.0 or above | Operating system with support for the VMware Precision Clock, with the Precision Clock driver loaded and running. See Linux Driver for Precision Clock Virtual Device. |
| Chrony | n/a | Chrony installed and configured to use the PTP Hardware Clock (PHC). See Chrony Configuration Manual. |
| CockroachDB | 20.1 or above | CockroachDB launched with the external time synchronization flag set to the Precision Clock device (e.g. --clock-device=/dev/ptp0). See Cockroach Start. |
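
The checklist lends itself to an automated pre-flight check before vMotion is allowed for a CockroachDB VM. The sketch below is illustrative only: the function names and the shape of the `versions` dictionary are assumptions, and how the version strings are collected (vCenter, the guest OS, `cockroach version`) is left to the reader. It assumes chrony is pointed at the device with a `refclock PHC <device>` line in chrony.conf. The ESXi time synchronization requirement is not covered because it has to be verified on the hosts themselves.

```python
import re
from typing import Dict, List, Tuple

# Minimum versions taken from the checklist table.
MINIMUMS: Dict[str, Tuple[int, ...]] = {
    "vsphere": (7, 0),
    "vm_hardware": (17,),
    "os_release": (7, 6),     # CentOS or RHEL
    "cockroachdb": (20, 1),
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a string such as 'v20.1.3' or 'vmx-17' into a tuple of ints."""
    return tuple(int(n) for n in re.findall(r"\d+", text))


def chrony_uses_phc(conf: str, device: str) -> bool:
    """True if the chrony.conf text has an uncommented 'refclock PHC <device>' line."""
    for line in conf.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "refclock" and fields[1] == "PHC":
            # Driver options may follow the path after a colon.
            if fields[2].split(":")[0] == device:
                return True
    return False


def checklist_failures(versions: Dict[str, str], chrony_conf: str,
                       cockroach_args: List[str],
                       device: str = "/dev/ptp0") -> List[str]:
    """Return a list of unmet checklist items; an empty list means all checks passed."""
    failures = []
    for name, minimum in MINIMUMS.items():
        found = parse_version(versions.get(name, ""))
        # Pad short versions with zeros so that '7' compares equal to (7, 0).
        padded = found + (0,) * (len(minimum) - len(found))
        if padded < minimum:
            failures.append(f"{name}: need {minimum}, found {found or 'nothing'}")
    if not chrony_uses_phc(chrony_conf, device):
        failures.append(f"chrony: no 'refclock PHC {device}' line")
    if f"--clock-device={device}" not in cockroach_args:
        failures.append(f"cockroachdb: not started with --clock-device={device}")
    return failures
```

A caller would pass the collected version strings, the contents of chrony.conf, and the argument list of the running `cockroach` process, then refuse to enable vMotion for the VM if the returned list is not empty.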

 

Appendix B – Cockroach Network Topologies and NSX-T Edge Sizing

CockroachDB is a symmetric shared-nothing application that depends on a fast and reliable network to handle inter-node communication. The volume of inter-node traffic can be substantial and network bottlenecks will adversely impact cluster performance. Sizing of network resources is critical and must take into account steady state and exceptional conditions such as rebuilds.

When considering the flow of network traffic in NSX-T, the two major components of the architecture are Distributed Routers (DR) and Service Routers (SR). Distributed Routers, as their name implies, run on each transport node and provide distributable services such as handling E-W traffic. Service Routers are instantiated on the NSX-T edges and provide all other network services that are not provided by the Distributed Routers, including N-S routing. Because Service Routers are instantiated on the NSX-T edges, the edges must be sized according to N-S traffic requirements.

When deploying CockroachDB using overlay backed network segments, the choice of topology impacts the flow of traffic and areas to consider for possible bottlenecks. In this section, we outline the major deployment models and highlight their impact on inter-node traffic flow.

Single NSX-T segment

With CockroachDB nodes deployed to a single NSX-T segment, all inter-node communication flows E-W and is handled by the in-kernel NSX-T module running on each ESXi host (Figure 12). In this type of deployment, sizing of the NSX-T edges only needs to account for management traffic, client traffic, and network services such as DHCP or load balancing. This deployment does not have any potential logical networking hot spot with regard to inter-node traffic. As the number of ESXi hosts and CockroachDB nodes increases, the traffic load is distributed across all the ESXi hosts running CockroachDB VMs.


Figure 12. NSX-T Single Segment Logical Topology

Multiple NSX-T segments with a single transport zone

With CockroachDB nodes deployed across multiple NSX-T segments that share the same transport zone, inter-node communication flows E-W and is handled by the in-kernel NSX-T Distributed Router (DR) running on each ESXi host (Figure 13). In this type of deployment, sizing of the NSX-T edges only needs to account for management traffic, client traffic, and network services such as DHCP or load balancing. This deployment does not have any potential networking hot spot with regard to inter-node traffic. As the number of ESXi hosts and CockroachDB nodes increases, the traffic load is distributed across the ESXi hosts running CockroachDB VMs.


Figure 13. NSX-T Multiple Segments Single Transport Logical Topology

Multiple Transport Zones

When CockroachDB nodes are located in different transport zones, inter-node communication flows in both the E-W and N-S planes (Figure 14). Inter-node traffic between nodes in the same transport zone is handled by the Distributed Router (DR) and remains in the E-W plane. Inter-node traffic between nodes in different transport zones is handled by a Service Router (SR) and transits the N-S plane through the NSX-T edges. In this design, sizing of the NSX-T edges and the physical layer 3 network must account for both steady-state and exceptional inter-node traffic. If the cluster is expanded, the edge sizing needs to be re-evaluated.
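
To get a rough first-order feel for how much inter-node traffic might cross the NSX-T edges in this topology, the following sketch estimates the share of node pairs that sit in different transport zones. It assumes inter-node traffic is spread evenly across every pair of nodes, which is a simplification: actual traffic depends on replica placement, leaseholder location, and workload. The function names and the traffic figure are hypothetical, and the result is a starting point for sizing discussions rather than a sizing rule.

```python
from typing import Sequence


def cross_zone_fraction(nodes_per_zone: Sequence[int]) -> float:
    """Fraction of node pairs whose members are in different transport zones."""
    total = sum(nodes_per_zone)
    all_pairs = total * (total - 1) // 2
    if all_pairs == 0:
        return 0.0
    same_zone_pairs = sum(n * (n - 1) // 2 for n in nodes_per_zone)
    return (all_pairs - same_zone_pairs) / all_pairs


def estimated_edge_load(nodes_per_zone: Sequence[int],
                        inter_node_gbps: float) -> float:
    """Estimate the inter-node traffic (Gbps) that transits the edges (N-S)."""
    return inter_node_gbps * cross_zone_fraction(nodes_per_zone)


if __name__ == "__main__":
    # Hypothetical cluster: two transport zones with three nodes each and
    # 10 Gbps of aggregate inter-node traffic during a rebuild.
    zones = [3, 3]
    print(f"cross-zone share: {cross_zone_fraction(zones):.0%}")
    print(f"estimated N-S load: {estimated_edge_load(zones, 10.0):.1f} Gbps")
```

With two zones of three nodes, 9 of the 15 node pairs span zones, so under the uniform-traffic assumption about 60% of inter-node traffic would transit the edges. Re-running the estimate with the new node counts is one way to revisit edge sizing when the cluster is expanded.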


Figure 14. NSX-T E-W and N-S Routing

For additional in-depth information, see NSX-T: Routing where you need it.

 

 


 
