Should Automatic Rebalancing be Enabled in a vSAN Cluster?
vSAN 6.7 U3 introduced a new method for automatically rebalancing data in a vSAN cluster. Some customers have found it curious that this feature is turned off by default in 6.7 U3 as well as vSAN 7. Should it be enabled in a VCF or vSAN environment, and if so, why is it turned off by default? Let's explore what this feature is, how it works, and learn if it should be enabled.
Rebalancing in vSAN Explained
The nature of a distributed storage system means that data will be spread across participating nodes. vSAN manages all of this for you. Its cluster-level object manager is responsible not only for the initial placement of data, but also for ongoing adjustments to ensure that the data continues to adhere to the prescribed storage policy. Data can become imbalanced for many reasons: storage policy changes, host or disk group evacuations, adding hosts, object repairs, or overall data growth.
vSAN's built-in logic is designed to take a conservative approach when it comes to rebalancing. It wants to avoid moving data unnecessarily, since doing so would consume resources during the resynchronization process and may result in no material improvement. Similar to DRS in vSphere, the goal of vSAN's rebalancing is not to strive for perfect symmetry of capacity or load across hosts, but to adjust data placement to reduce the potential for resource contention. Less contention, in turn, means more consistent performance when accessing the data.
vSAN offers two basic forms of rebalancing:
- Reactive Rebalancing. This occurs when vSAN detects any storage device that is at or near 80% capacity utilization. vSAN then attempts to move some of the data to other devices that fall below this threshold. A more appropriate name for this might be "Capacity Constrained Rebalancing." This feature has always been an automated, non-adjustable capability.
- Proactive Rebalancing. This occurs when vSAN detects that any storage device is consuming a disproportionate amount of its capacity in comparison to other devices. By default, vSAN looks for any device whose capacity usage is 30% or more above that of any other device. A more suitable name for this might be "Capacity Symmetry Rebalancing." Prior to vSAN 6.7 U3, this feature was a manual operation, but it has since been introduced as an automated, adjustable capability.
Rebalancing activity only applies to the discrete devices (or disk groups) in question, and not the entire cluster. In other words, if vSAN detects a condition that is above the described thresholds, it will move the minimum amount of data from those disks or disk groups to achieve the desired result. It does not arbitrarily shuffle all of the data across the cluster. Both forms of rebalancing are based entirely on capacity usage conditions, not the load or activity of the devices.
The described data movement by vSAN will never violate the storage policies prescribed to the objects. vSAN's cluster-level object manager handles all of this so that you don't have to.
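The two capacity conditions described above can be illustrated with a short sketch. This is not vSAN's implementation. The `Device` type, the function names, and the reading of the 30% delta as percentage points above the least-used device are all assumptions made for illustration. The sketch also treats "at or near 80%" simply as "at or above 80%".

```python
from dataclasses import dataclass

# Values taken from the article; how vSAN applies them internally is assumed.
REACTIVE_THRESHOLD_PCT = 80.0  # device fullness that triggers reactive rebalancing
PROACTIVE_DELTA_PCT = 30.0     # default "Rebalancing Threshold %"


@dataclass
class Device:
    """Hypothetical stand-in for a capacity device or disk group."""
    name: str
    used_pct: float  # capacity utilization, 0-100


def reactive_candidates(devices):
    """Devices at or above the fixed fullness threshold."""
    return [d.name for d in devices if d.used_pct >= REACTIVE_THRESHOLD_PCT]


def proactive_candidates(devices, delta_pct=PROACTIVE_DELTA_PCT):
    """Devices whose usage exceeds the least-used device by delta_pct or more.

    Assumption: the delta is measured in percentage points of utilization.
    """
    if not devices:
        return []
    least_used = min(d.used_pct for d in devices)
    return [d.name for d in devices if d.used_pct - least_used >= delta_pct]
```

Note that both checks look only at capacity utilization, which mirrors the point above that neither form of rebalancing considers device load or activity.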
Manual Versus Automated Operations
Before vSAN 6.7 U3, Proactive Rebalancing was a manual operation. If vSAN detected a large variance, it would trigger a health alert condition in the UI, which would then present a "Rebalance Disks" button to remediate the condition. If clicked, a rebalance task would occur at an arbitrary time within the next 24 hours.
Earlier editions of vSAN didn't have the proper controls in place to provide this as an automated feature. Clicking the "Rebalance Disks" button left some users uncertain if and when anything would occur. With the advancement of a new scheduler and Adaptive Resync introduced in 6.7, as well as all-new logic introduced in 6.7 U3 to calculate resynchronization completion times, VMware changed this feature to be an automated process.
The toggle for enabling or disabling this cluster-level feature can be found in vCenter, under Configure > vSAN > Services > Advanced options > "Automatic Rebalance" as shown in Figure 1.
Figure 1. Configuring "Automatic Rebalance" in the "Advanced Options" of the cluster.
RECOMMENDATION: Keep the "Rebalancing Threshold %" entry to the default value of 30. Decreasing this value could increase the amount of resynchronization traffic and cause unnecessary rebalancing for no functional benefit.
The "vSAN Disk Balance" health check was also changed to accommodate this new capability. If vSAN detects an imbalance that meets or exceeds a threshold while automatic rebalance is off, it will provide the ability to enable the automatic rebalancing, as shown in Figure 2. The less-sophisticated manual rebalance operation is no longer available.
Figure 2. Remediating the health check condition when Automatic Rebalancing is off.
Once the Automatic Rebalance feature is enabled, the health check alarm for this balancing will no longer trigger and rebalance activity will occur automatically.
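The interaction between the health check and the "Automatic Rebalance" toggle can be summarized as a small decision function. This is a sketch of the behavior described above, not vSAN code. The function name, its parameters, and the returned strings are hypothetical.

```python
def disk_balance_outcome(max_delta_pct, auto_rebalance_enabled, threshold_pct=30.0):
    """Summarize what the article says happens for a given imbalance.

    max_delta_pct: largest capacity-usage gap observed between devices
                   (hypothetical input; vSAN computes this internally).
    threshold_pct: the "Rebalancing Threshold %" setting, default 30.
    """
    if max_delta_pct < threshold_pct:
        # Below the threshold: nothing to report, nothing to move.
        return "balanced"
    if auto_rebalance_enabled:
        # Feature on: no health alarm, rebalancing happens on its own.
        return "automatic rebalance"
    # Feature off: the health check fires and offers to enable the feature.
    return "health alert: offer to enable automatic rebalance"
```

The sketch also shows why lowering the threshold increases rebalancing activity: more imbalance values meet or exceed a smaller `threshold_pct`.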
Accommodating All Environments and Conditions
The primary objective of proactive rebalancing was to more evenly distribute the data across the discrete devices to achieve a balanced distribution of resources, and thus, improved performance. Whether the cluster is small or large, automatic rebalancing through the described hypervisor enhancements addresses the need for the balance of capacity devices in a scalable, sustainable way.
Other approaches are fraught with challenges that could easily cause the very issue that a user is trying to avoid. For example, implementing a time window for rebalancing tasks would assume that the associated resyncs would always impact performance, which is untrue. It would also assume the scheduled window would always be long enough to accommodate the resyncs, which would be difficult to guarantee. This type of approach would delay resyncs unnecessarily through artificial constraints, increase operational complexity, and potentially decrease performance.
Should Automatic Rebalancing Be Enabled?
Yes, it is recommended to enable the automatic rebalancing feature on your vSAN clusters. When the feature was added in 6.7 U3, VMware wanted to introduce the capability slowly to customer environments, so it was turned off by default, and it remains this way in vSAN 7. With the optimizations made to our scheduler and resynchronizations in recent editions, the feature will likely end up enabled by default at some point.
There may be a few rare cases in which one might want to temporarily turn off automatic rebalancing on the cluster. Adding a large number of additional hosts to an existing cluster in a short amount of time might be one of those possibilities, as well as perhaps nested lab environments that are used for basic testing. In most cases, automatic rebalancing should be enabled.
Viewing Rebalancing Activity
The design of vSAN's rebalancing logic emphasizes a minimal amount of data movement to achieve the desired result. How often do resynchronizations caused by rebalancing occur in your environment? The answer can be easily found in the disk group performance metrics of the host. Rebalance activity will show up under the "rebalance read" and "rebalance write" metrics. An administrator can easily view the VM performance during this time to determine if there was any impact on guest VM latency. Thanks to Adaptive Resync, even under the worst of circumstances, the impact on the VM will be minimal. In production environments, you may find that proactive rebalancing does not occur very often.
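If the rebalance and VM latency metrics are exported for offline review, the comparison described above can be sketched in a few lines. The input format here is an assumption: a list of `(rebalance_rate, vm_latency_ms)` pairs sampled at the same intervals, with a rate of zero meaning no rebalance traffic. The function name is hypothetical and not part of any vSAN tooling.

```python
from statistics import mean


def latency_by_rebalance_state(samples):
    """Average guest latency while rebalancing is active versus idle.

    samples: iterable of (rebalance_rate, vm_latency_ms) pairs, where
             rebalance_rate combines "rebalance read" and "rebalance write"
             (assumed export format).
    Returns None for a state that has no samples.
    """
    active = [latency for rate, latency in samples if rate > 0]
    idle = [latency for rate, latency in samples if rate == 0]
    return {
        "active_ms": mean(active) if active else None,
        "idle_ms": mean(idle) if idle else None,
    }
```

A small gap between the two averages would be consistent with the claim that rebalancing has minimal impact on guest latency. A simple average like this does not establish cause, so treat it as a first look rather than a conclusion.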
The automatic rebalancing feature found in VCF environments powered by vSAN 6.7 U3 and vSAN 7 is a powerful new way to ensure optimal performance through the proper balance of resources, and it can be enabled without hesitation.