CockroachDB vMotion Support on vSphere 7 using Precise Timekeeping

April 19, 2021

CockroachDB is a scalable distributed SQL database built on a transactional and strongly consistent key-value store. Like its namesake, it is designed to be resilient and operate autonomously with minimal user intervention. CockroachDB independently detects problems and takes action to ensure data integrity. To accomplish this, CockroachDB continuously monitors its nodes and makes millisecond decisions in the event of an anomaly.

Making millisecond decisions in a distributed system requires reasonably tight clock synchronization between the nodes in a cluster. For this reason, if a node detects that its clock is too far out of sync with the other nodes, it either cannot join the cluster or, if it is already a member, shuts down immediately. Because CockroachDB offers a strong consistency model, this sensitivity to clock drift is required to guarantee the ordering of transactions.

CockroachDB guards against two types of clock drift:

  1. The time offset with any other node is greater than the configured maximum (500ms by default). In short, if a node's local time differs from the time on any other node in the cluster by more than the maximum offset, the node cannot participate in the cluster (Figure 1).
  2. The time offset with more than half of the other nodes exceeds 80% of the configured maximum (400ms by default). In this case, if a node's local time differs from the time on more than half of the other nodes by more than 80% of the maximum offset, the node cannot participate in the cluster (Figure 2).


Figure 1. Node clock offset out of tolerance


Figure 2. Node clock offset out of tolerance with 50% or more nodes
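The two checks described above can be sketched in a few lines of Python. This is only an illustration of the rules as this article states them, not CockroachDB's actual implementation; the function name `clock_within_tolerance` and its parameters are hypothetical.

```python
from typing import Iterable

MAX_OFFSET_MS = 500.0    # default maximum offset quoted in the text
MAJORITY_FRACTION = 0.8  # the 80% threshold from the second rule


def clock_within_tolerance(offsets_ms: Iterable[float],
                           max_offset_ms: float = MAX_OFFSET_MS) -> bool:
    """Return True if a node with these measured offsets may stay in the cluster.

    offsets_ms holds the node's clock offset to each other node, in milliseconds.
    Hypothetical helper: it mirrors the two rules as described in this article.
    """
    offsets = [abs(o) for o in offsets_ms]
    if not offsets:
        return True  # no peers to compare against

    # Rule 1: the offset with any single node exceeds the maximum.
    if any(o > max_offset_ms for o in offsets):
        return False

    # Rule 2: the offset with more than half of the nodes exceeds 80% of the maximum.
    threshold = MAJORITY_FRACTION * max_offset_ms
    over_threshold = sum(1 for o in offsets if o > threshold)
    return over_threshold <= len(offsets) / 2
```

For example, offsets of 410ms and 420ms against two of three peers trip the second rule even though neither exceeds 500ms.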

In a physical environment with properly configured time synchronization, clock jumps are not a common occurrence. Other than a configuration mistake, a problem with the time sources, or a hardware failure, there are few reasons for a sudden jump in the system clock. However, in a virtual environment with workload mobility and virtual devices, momentary discrepancies in a system's local time are quite common. In particular, the live migration of a VM from one host to another results in a short period where the system clock falls out of sync. In most cases, this is not an issue. However, in a distributed system with strict timing tolerances, this brief period can be enough to cause problems.

Performing a vMotion is a process with several steps. During the final steps of a vMotion, the VM is suspended so the remaining memory pages can transfer to the target host. In this brief period, typically a few hundred milliseconds, the VM system clock is paused and falls behind by the time required for the transfer (Figure 3). When the migration completes, the VM resumes, and the operating system restarts scheduling processes. These processes include CockroachDB and the operating system's clock synchronization service.


Figure 3. VM clock during vMotion

Depending on the ordering of the processes when the system resumes, one of three scenarios can occur:

  1. Time synchronization runs first and adjusts the VM system clock before CockroachDB executes. When CockroachDB runs, time is within tolerances, and there are no adverse actions.
  2. CockroachDB runs first, but the offset is within tolerance. CockroachDB sees a larger-than-usual jump in time, but no corrective actions occur. Time synchronization eventually updates the system clock with no further impact.
  3. CockroachDB runs first, and the system time is out of tolerance with other CockroachDB nodes (Figure 9). CockroachDB immediately initiates actions to prevent possible inconsistencies. Soon after, the time synchronization process runs, the system clock returns within tolerances, and the CockroachDB node reverts to normal operation. Despite the shortness of the event, we observe an impact on performance. 

The third scenario is the undesirable outcome that discourages the use of vMotion with CockroachDB. Not only does it result in degraded performance, but it can impact availability if too many nodes migrate simultaneously.
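To make the race concrete, the three scenarios can be expressed as a small decision function. This is a simplified sketch: it assumes the clock lag on resume equals the vMotion pause, ignores any offset that existed beforehand, and applies only the 500ms maximum-offset rule. The name `resume_scenario` is hypothetical.

```python
def resume_scenario(pause_ms: float,
                    sync_runs_first: bool,
                    max_offset_ms: float = 500.0) -> int:
    """Return which of the three resume scenarios (1, 2, or 3) applies.

    pause_ms        -- how long the VM was suspended during the final transfer
    sync_runs_first -- True if time synchronization is scheduled before CockroachDB
    Assumption: the VM clock is behind by exactly pause_ms when it resumes.
    """
    if sync_runs_first:
        return 1  # clock corrected before CockroachDB reads it
    if pause_ms <= max_offset_ms:
        return 2  # CockroachDB sees a jump, but it is within tolerance
    return 3      # CockroachDB reads a clock that is out of tolerance


# A 300ms pause is harmless either way; a 700ms pause depends on scheduling order.
print(resume_scenario(300, sync_runs_first=False))
print(resume_scenario(700, sync_runs_first=False))
```

The point of the sketch is that the outcome depends on scheduling order, which neither vMotion tuning nor the time synchronization service controls.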

Initial attempts to solve this problem explored tuning vMotion, time synchronization, and CockroachDB. None of these approaches alone, or in combination, proved satisfactory in solving the timekeeping race condition.

  • vSphere 7 introduces significant improvements to vMotion performance. However, despite the reduction in the overall vMotion time, the enhancements do not eliminate the root cause. Since the VMs still need to be paused for the final transfer, their system clocks still fall behind, and the race condition remains. Eliminating the root cause requires vMotion to guarantee the time suspended is consistently below the default CockroachDB maximum offset. Although we can tune the maximum allowable time for vMotion to complete the final transfer, this has caveats. The shorter the allowed vMotion switchover time, the greater the amount of active memory that needs to be transferred beforehand. Not only does this adversely impact vMotion performance because more data has to be transferred, but it also affects all VMs in the vSphere cluster. Worse yet, it can even result in a vMotion never completing if the transfer rate cannot keep up with active memory page use.
  • Solutions around time synchronization are also unsuccessful at addressing the root cause. The issue with traditional time synchronization approaches is not whether they are accurate enough or get the system clock back on track, but when this happens. Whether using chrony or VMware Tools, these processes run alongside CockroachDB. When the VM resumes, if the scheduler runs CockroachDB before the clock synchronization process, the time read is behind, and CockroachDB can panic.
  • CockroachDB has tunable parameters, including the maximum time offset. Increasing the maximum offset provides a larger window for the transfer to complete and reduces the likelihood of unwanted mitigation. Unfortunately, while this approach reduces the risk, it does not completely eliminate it. Worse yet, testing revealed that increases to the maximum time offset came with considerable penalties to CockroachDB performance.

Precise Time References

Solving the system clock race condition requires a different approach, made possible by advancements in both VMware vSphere and CockroachDB.

The release of virtual hardware version 17 in vSphere 7 introduces the new Precision Clock virtual device. The Precision Clock allows processes in the guest OS to query the system clock of the underlying ESXi host. The key difference from the VM's virtual system clock is that the ESXi host system clock remains unaffected by operations on the VM. Provided the source and target ESXi hosts are configured with the same time source, the Precision Clock offers a reliable global time source across all hosts (Figure 4).


Figure 4. CockroachDB vMotion using the Precision Clock
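As an illustration of what querying the Precision Clock from the guest can look like, the sketch below reads a PTP hardware clock device through the Linux dynamic POSIX clock convention, where a clock ID is derived from an open file descriptor. It assumes a Linux guest with the device exposed at /dev/ptp0 (the path used later in this article) and sufficient permissions to open it. How CockroachDB itself reads the device is not described here, so treat this as an assumption-laden example only; the helper names are hypothetical.

```python
import os
import time

CLOCKFD = 3  # constant used by Linux to mark a file-descriptor-based clock ID


def fd_to_clockid(fd: int) -> int:
    """Convert an open PTP device file descriptor into a dynamic clock ID."""
    return (~fd << 3) | CLOCKFD


def read_phc_seconds(device: str = "/dev/ptp0") -> float:
    """Read the current time, in seconds, from a PTP hardware clock device.

    Assumes the device exists and is readable by the calling user.
    """
    fd = os.open(device, os.O_RDONLY)
    try:
        return time.clock_gettime(fd_to_clockid(fd))
    finally:
        os.close(fd)
```

Because the value comes from the host rather than the guest's own system clock, a read issued right after the VM resumes is not affected by the pause.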

Building on this new feature, CockroachDB 20.1 introduces the --clock-device flag that instructs CockroachDB to rely exclusively on a Precision Clock device as its time source. Since the Precision Clock offers a reliable global time source and CockroachDB no longer needs the VM system clock, the ordering of the processes no longer matters. Even in what was previously our worst case, with CockroachDB running before the VM time synchronization process, CockroachDB always gets the correct time from the ESXi host system clock through the Precision Clock device. To ensure consistent time references across all the processes in the guest VM, the VM system clock synchronization (e.g. chrony) is also configured to use the Precision Clock device (Figure 5).


Figure 5. Updated Timekeeping Architecture for CockroachDB in vSphere 7
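A deployment script might assemble the CockroachDB command line along these lines. This is a minimal sketch: the --clock-device flag and the /dev/ptp0 path come from this article, --join is the standard `cockroach start` flag for listing peer addresses, and every other flag a real deployment needs (certificates, store paths, listen addresses) is deliberately left out. The function name and the host names in the usage line are hypothetical.

```python
from typing import List, Sequence


def cockroach_start_args(clock_device: str = "/dev/ptp0",
                         join: Sequence[str] = ()) -> List[str]:
    """Build an argument list for `cockroach start` that uses a Precision Clock device.

    Only the clock-related flag and an optional join list are included;
    security and storage flags are intentionally omitted from this sketch.
    """
    args = ["cockroach", "start", f"--clock-device={clock_device}"]
    if join:
        args.append("--join=" + ",".join(join))
    return args


# Hypothetical node addresses, for illustration only.
print(" ".join(cockroach_start_args(join=["node1:26257", "node2:26257"])))
```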

Requirements

Enabling vMotion support for CockroachDB requires deploying CockroachDB according to the guidelines outlined in the following table (Table 1).

Table 1. CockroachDB vMotion Support Checklist

| Requirement | Version | Note |
| --- | --- | --- |
| VMware vSphere | 7.0 or above | VMware Precision Clock introduced in virtual hardware version 17, released in vSphere 7. |
| VM virtual hardware | 17 or above | VM with virtual hardware version 17 or above, and a Precision Clock device added. See Add a Precision Clock Device to a VM. |
| ESXi time synchronization | n/a | All ESXi hosts configured to use time synchronization and set to the same time synchronization source(s). See Editing the Time Configuration of a Host. |
| CentOS or RHEL | 7.6 or above, 8.0 or above | Operating system with support for the VMware Precision Clock, with the Precision Clock driver loaded and running. See Linux Driver for Precision Clock Virtual Device. |
| Chrony | n/a | Chrony installed and configured to use the PTP Hardware Clock (PHC). See Chrony Configuration Manual. |
| CockroachDB | 20.1 or above | CockroachDB launched with the external time synchronization flag set to the Precision Clock device (e.g. --clock-device=/dev/ptp0). See Cockroach Start. |
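The checklist in Table 1 lends itself to a simple preflight check. The sketch below encodes the table's minimum versions and returns the requirements that are not met. The function names, parameter names, and the way the inputs are gathered are all hypothetical; version strings are assumed to be plain "major.minor" or "major.minor.patch" without prefixes.

```python
from typing import List, Tuple


def parse_version(text: str) -> Tuple[int, int]:
    """Parse 'major.minor[.patch]' into (major, minor); extra parts are ignored."""
    parts = text.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor)


def unmet_requirements(vsphere: str,
                       hw_version: int,
                       guest_os: str,
                       cockroachdb: str,
                       esxi_hosts_share_time_source: bool,
                       ptp_device_present: bool,
                       chrony_uses_phc: bool) -> List[str]:
    """Return the Table 1 requirements that the described deployment fails."""
    checks = [
        (parse_version(vsphere) >= (7, 0), "VMware vSphere 7.0 or above"),
        (hw_version >= 17, "VM virtual hardware 17 or above"),
        (esxi_hosts_share_time_source, "ESXi hosts set to the same time source(s)"),
        (parse_version(guest_os) >= (7, 6), "CentOS or RHEL 7.6 or above"),
        (ptp_device_present, "Precision Clock device present in the guest"),
        (chrony_uses_phc, "Chrony configured to use the PHC"),
        (parse_version(cockroachdb) >= (20, 1), "CockroachDB 20.1 or above"),
    ]
    return [label for ok, label in checks if not ok]
```

An empty list means every item in the table is satisfied; anything else names the rows that still need attention before relying on vMotion.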

Conclusion

The updated timekeeping architecture for CockroachDB offers a solution that is simple, efficient, and completely eliminates the root cause of our initial problem. When properly implemented, joint testing between VMware and Cockroach Labs has validated that this architecture eliminates the undesirable behavior with only a slight increase in SQL transaction latency during a vMotion. This temporary increase in latency is expected and in line with the normal operation of a vMotion.
