Asymmetrical vSAN Clusters - What is Allowed, and What is Smart
Can hosts in a vSAN cluster contribute different amounts of resources? Yes, this is supported, but is it smart? Let's sort through the topic to better understand the challenges that asymmetry introduces, and the guidance that helps minimize its effects on your environment.
What is an asymmetrical cluster?
An asymmetrical cluster is a cluster of hosts where one or more hosts contribute a different amount of resources than the other hosts in the cluster. For vSphere clusters, "resources" could mean CPU, memory, and network, while vSAN clusters add storage capacity and performance to the definition. Both vSphere clusters and vSAN clusters have always supported asymmetrical designs. This offers tremendous flexibility as you introduce newer, more powerful hosts to a cluster.
Symmetrical clusters, in which the hosts contribute equal amounts of resources across the cluster, are superior. Why? Uniformity of nodes is a precondition for a cluster to behave optimally under normal operating conditions and in failure conditions. For vSphere clusters, asymmetrical configurations come with operational consequences.
- Normal operating conditions. Resources may be used inefficiently. For example, a host with more compute and memory resources than other hosts but equal amounts of network uplink bandwidth may be constrained by this limit.
- Host failure conditions. May result in insufficient resources. For example, a failure of the host with substantially more resources than other hosts in the cluster could leave the remaining hosts without enough resources to run the workloads that were on the failed host.
This is why uniformity of host resources in a cluster is a trait of good clustering principles.
Asymmetry with Distributed Storage
The uniformity of hosts in a cluster is even more important for vSAN clusters. Why? Distributed storage systems like vSAN have additional responsibilities not found in a vSphere cluster providing compute and memory resources.
- Data storage must adhere to anti-affinity rules to maintain resilience. A VM instance is usually free to roam anywhere on the cluster to use CPU and memory resources. Not so with data. Upon a failure of a host, some of the remaining hosts will be ineligible to hold some of the rebuilt data because they already hold the mirrored copy of that data. Placing both copies on the same host would violate resilience settings.
- Data and its use of storage capacity persist. Unlike CPU and memory demands that fluctuate from almost non-existent to high usage, and can eventually be served by the resources available albeit with decreased performance, the need for storage capacity persists. Even if there is very little demand for storage resources for the data, the data must always be available.
vSAN's object manager removes the complexity of placing data in a resilient manner by knowing where the data can be placed. It performs placement automatically by looking at the free capacity available across the hosts in the cluster, but rules out any host that already holds a copy of the same data, to ensure the resilience of the data across the cluster. This is known as anti-affinity, as shown in Figure 1.
Figure 1. Demonstrating vSAN's automated approach to ensuring the resilience of data.
At first glance, anti-affinity sounds overly restrictive to capacity utilization in a cluster. In reality, it is not, unless it is an extremely small cluster. Figure 1 shows a single object, but a VM is composed of many objects, which means there can be hundreds or thousands of objects in a cluster. Because of this, vSAN can distribute the data quite evenly across a cluster while maintaining anti-affinity.
To better understand the need to maintain sufficient storage capacity in failure scenarios, let us step through some examples that help illustrate symmetrical and asymmetrical cluster configurations.
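The placement rule described above can be illustrated with a small sketch. This is not vSAN's actual placement algorithm; it is a simplified model in which a replica goes to the host with the most free capacity, excluding any host that already holds a replica of the same object. The `Host` class, the `place_replica` function, and the host and object names are all hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical model of a host's capacity and the replicas it holds."""
    name: str
    capacity_gb: float
    used_gb: float = 0.0
    components: set = field(default_factory=set)  # IDs of objects with a replica here

    @property
    def free_gb(self) -> float:
        return self.capacity_gb - self.used_gb


def place_replica(hosts, object_id, size_gb):
    """Place one replica on the eligible host with the most free capacity.

    Anti-affinity: a host that already holds a replica of this object
    is never eligible, no matter how much free capacity it has.
    """
    eligible = [
        h for h in hosts
        if object_id not in h.components and h.free_gb >= size_gb
    ]
    if not eligible:
        raise RuntimeError(f"no eligible host for a replica of {object_id}")
    target = max(eligible, key=lambda h: h.free_gb)
    target.components.add(object_id)
    target.used_gb += size_gb
    return target


hosts = [Host("esx01", 1000), Host("esx02", 1000), Host("esx03", 1000)]
first = place_replica(hosts, "vmdk-1", 100)
second = place_replica(hosts, "vmdk-1", 100)
print(first.name, second.name)  # the two replicas always land on different hosts
```

Even in this toy model, the restriction only bites when the list of eligible hosts becomes short, which is why very small clusters are the ones most affected.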
The following examples will demonstrate how variables such as the percentage of capacity consumed, the number of hosts in a cluster, and the severity of asymmetry will affect capacity in host failure conditions. For simplicity, the examples will only show the effects of a single host exhibiting non-uniformity, and do not show the recommended free capacity for operations reserve and host rebuild reserve.
Recommendation: Use the vSAN Sizer to determine the correct amount of reserved capacity for a cluster, and enable the optional “Operations Reserve” and “Host Rebuild Reserve” in vCenter Server to ensure a cluster is operating within the recommended levels of free capacity.
Example #1: 5-host cluster with 50% capacity utilization and one host failure. Fully symmetrical. Figure 2 shows that upon the failure of any single host in the cluster, the base cluster capacity utilization on the remaining hosts would increase from 50% to over 62%.
Figure 2. Failure of a host in this symmetrical cluster increases utilization to over 62%.
Example #2: 5-host cluster with 50% capacity utilization and one host failure. One host has 2x the amount of storage capacity as the other hosts. Figure 3 shows that upon the failure of the host with more storage resources, the base cluster capacity utilization would increase from 50% to about 75%.
Figure 3. Failure of a host in this asymmetrical cluster increases utilization from 50% to about 75%.
Example #3: 5-host cluster with 50% capacity utilization and one host failure. One host has 3x the amount of storage capacity as the other hosts. Figure 4 shows that upon the failure of the host with even more asymmetry, the base cluster capacity utilization would increase from 50% to over 87%.
Figure 4. Failure of a host in this severely asymmetrical cluster increases utilization from 50% to over 87%.
Example #4: 5-host cluster with 66% capacity utilization and one host failure. One host has 3x the amount of storage capacity as the other hosts. Figure 5 shows that by a simple increase in the base capacity utilization (from 50% to 66%), the same failure condition noted above would run out of capacity resources, needing more than 115% of the capacity available.
Figure 5. Failure of a host in this severely asymmetrical cluster is unable to provide sufficient free space.
Example #5: 16-host cluster with 66% capacity utilization and one host failure. One host has 3x the amount of storage capacity as the other hosts. Now let's look at the impact of the above condition, but occurring on a cluster of 16 hosts instead of 5 hosts. Figure 6 shows that upon failure of that large host, the base capacity utilization would increase from 66% to just under 80%.
Figure 6. Failure of a host in this severely asymmetrical cluster with more hosts increases utilization to just under 80%.
The host failure examples above clearly show that the capacity consumed, the severity of asymmetry, and the host count of a cluster are all important factors in whether there are sufficient resources in the event of a host failure.
Recommendation: Pay attention to the Skyline health check alert "What if the most consumed host fails." It is an automated health check that actively simulates the results of a failure of the host with the most consumed resources and shows if the configuration is in an error state.
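The arithmetic behind Examples #1 through #5 can be reproduced with a few lines of Python. This is a minimal sketch under the same simplifications the examples use: capacity is consumed evenly across hosts, all data from the failed host must be rebuilt on the surviving hosts, and no operations reserve or host rebuild reserve is modeled. The function name `utilization_after_failure` and the relative capacity units (1 = a standard host) are assumptions made for illustration.

```python
def utilization_after_failure(capacities, utilization, failed_index):
    """Cluster utilization once the surviving hosts absorb all consumed data.

    capacities   -- relative raw capacity of each host (1 = a standard host)
    utilization  -- fraction of total cluster capacity consumed before the failure
    failed_index -- index of the host that fails
    """
    total = sum(capacities)
    used = total * utilization
    remaining = total - capacities[failed_index]
    return used / remaining


examples = {
    "Example #1": ([1, 1, 1, 1, 1], 0.50),   # 5 hosts, fully symmetrical
    "Example #2": ([2, 1, 1, 1, 1], 0.50),   # one host with 2x capacity
    "Example #3": ([3, 1, 1, 1, 1], 0.50),   # one host with 3x capacity
    "Example #4": ([3, 1, 1, 1, 1], 0.66),   # same cluster, fuller
    "Example #5": ([3] + [1] * 15, 0.66),    # 16 hosts, one with 3x capacity
}

for label, (caps, util) in examples.items():
    # Index 0 is the largest host in every asymmetrical example.
    after = utilization_after_failure(caps, util, failed_index=0)
    print(f"{label}: {util:.0%} -> {after:.1%}")
```

The results line up with the figures: 62.5%, 75%, 87.5%, about 115.5% (more data than the surviving hosts can hold), and about 79.2%.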
Is Asymmetry Okay?
For host failure scenarios, the correct answer to whether or not asymmetry is okay generally comes down to the following, ranked in order of importance.
- Percentage of capacity consumed. Perhaps the most overlooked variable of the discussion on asymmetry. One could have a severe level of asymmetry, but if only a small percentage of capacity is used, it will likely not be an issue. As demonstrated in Example #4, it can quickly become a problem as the capacity usage increases.
- The number of hosts in a cluster. The host count determines the percentage of cluster resources removed upon the failure of a host. This tends to be more problematic in clusters with very few hosts.
- The severity of the asymmetry. Does one host have just 50% more capacity than the other hosts in the cluster, or does it have 200% more? The severity of the asymmetry will be a factor in whether the cluster can tolerate a host failure.
The first two factors above also apply to a fully symmetrical cluster. The only difference is the third factor: the severity of the asymmetry. The effects of asymmetry on failure handling show up at the extremes: an extremely small cluster, an extremely full cluster, extreme asymmetry, or some combination of the three. A reasonable level of asymmetry, or ideally, a fully symmetrical cluster, greatly reduces or eliminates this issue, and reduces the complexity of operationalizing this type of cluster.
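One way to see how the three factors interact is to compute the highest utilization at which a cluster could still absorb the loss of its largest host. The sketch below is a simplification under stated assumptions: data is spread evenly, everything on the failed host must be rebuilt elsewhere, and no operations reserve or host rebuild reserve is set aside, so the result is a theoretical ceiling and not a recommended operating level. The function name and the sample clusters are hypothetical.

```python
def max_utilization_for_host_failure(capacities):
    """Highest pre-failure utilization that still fits after losing the largest host."""
    total = sum(capacities)
    return (total - max(capacities)) / total


clusters = {
    "5 hosts, symmetrical": [1, 1, 1, 1, 1],
    "5 hosts, one at 1.5x": [1.5, 1, 1, 1, 1],
    "5 hosts, one at 3x": [3, 1, 1, 1, 1],
    "16 hosts, one at 3x": [3] + [1] * 15,
}

for label, caps in clusters.items():
    print(f"{label}: ceiling of {max_utilization_for_host_failure(caps):.1%}")
```

Under these assumptions, the 5-host cluster with one 3x host has a ceiling of roughly 57%, which is why it fails at 66% in Example #4, while the same asymmetry in a 16-host cluster has a ceiling of roughly 83%.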
Recommendation: If the host capacities are asymmetrical to some degree, make sure the host count in the cluster is well beyond the minimum host count required for the storage policy. This will allow for increased flexibility in relocating this data in the event of a host failure.
Asymmetry of Disk Groups
Hosts using multiple disk groups should strive for uniformity of the disk groups within each host, but the asymmetry of disk groups within each host doesn’t have as dramatic an impact on failure handling, as a disk group is a smaller unit of storage contributing to the cluster.
Non-uniform disk group sizes within a host can pose some challenges in very small configurations, such as the 2-node cluster found in Figure 7. Each host in this example provides the same amount of capacity using 2 disk groups, but each host has one disk group that provides twice the capacity of the other disk group. In this scenario, with just 33% of the capacity utilized, the failure of a larger disk group may lead to an out-of-capacity or non-compliance condition, because the other host already holds the mirrored copy, leaving only the smaller disk group on the same host as a place to rebuild the data.
Figure 7. Failure of a disk group using an asymmetrical disk group configuration in a 2-Node cluster.
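The 2-node scenario can be checked with the same kind of arithmetic. This sketch assumes that data is spread across a host's disk groups in proportion to their size, and that the data from a failed disk group can only be rebuilt on the surviving disk group of the same host, since the other host already holds the mirrored copy. The function name and the relative sizes (2 and 1) are illustrative assumptions.

```python
def host_utilization_after_disk_group_failure(disk_groups, utilization, failed_index):
    """Utilization of a host's surviving disk groups after absorbing the failed group's data."""
    total = sum(disk_groups)
    used = total * utilization
    surviving = total - disk_groups[failed_index]
    return used / surviving


disk_groups = [2, 1]  # one disk group with twice the capacity of the other

for util in (0.25, 0.33, 0.40):
    after = host_utilization_after_disk_group_failure(disk_groups, util, failed_index=0)
    status = "fits" if after <= 1.0 else "out of capacity"
    print(f"{util:.0%} used, larger disk group fails -> {after:.0%} ({status})")
```

At 33% utilization the surviving smaller disk group would be about 99% full, which leaves essentially no headroom. If the smaller disk group fails instead (`failed_index=1`), the larger one ends up only about half full.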
Recommendation: If you are acquiring more or larger capacity devices for your vSAN hosts, strive to distribute those new storage devices across hosts, then distribute across disk groups within a host. This will prevent any asymmetry that comes from adding them in just one or two hosts.
A cluster is a boundary of shared resources, and because of that, an even level of resources across each host in the cluster is a good practice for any clustering resource type, whether it be CPU, memory, or storage resources in vSAN. VMware recommends uniformly configured hosts across a vSAN cluster, but vSAN can accommodate reasonable levels of asymmetry as new hardware is introduced into a cluster.