About a year ago, I published "New Design and Operation Considerations for vSAN 2-Node Topologies" to help users take advantage of the new features and functionality with 2-node clusters introduced with vSAN 7 U1. With new capabilities introduced to 2-node clusters in vSAN 7 U3, let's look at how to incorporate these latest enhancements into your design and sizing exercises.
The design, sizing, and operational considerations discussed in this post will focus on the introduction of secondary levels of resilience - sometimes referred to as "nested fault domains," witness host lifecycle management, and Adaptive Quorum Control.
2-Node Clusters using Nested Fault Domains - Sizing for Capacity
When designing a 2-node cluster, a typical assumption is that most of the workloads will be protected using a storage policy with a failures to tolerate (FTT) of 1 using a simple RAID-1 mirror. Thus, every gigabyte (GB) of data stored will consume 2x the capacity to ensure resilience. As a result, the effective capacity of a 2-node cluster has historically been about half of the total capacity advertised by the cluster, not accounting for small amounts of overhead, free space, etc. You will see this visually in the "What if Analysis" section of the vSAN Capacity view in vCenter Server.
Figure 1. Effective free space versus total cluster capacity
This made guidance for sizing 2-node clusters straightforward. The only exception to this rule was for VM object data that used an FTT=0 in combination with host affinity, but this non-resilient data placement scheme is rarely used enough to be incorporated into sizing exercises.
In vSAN 7 U3, sizing exercises should account for the potential use of secondary levels of resilience. This new feature allows the data to be resilient not only across the hosts but also within each host. By using multiple disk groups in each host, a storage policy can be prescribed that ensures the availability of data if a host fails and a disk group in the remaining host subsequently fails, as shown in Figure 2.
Figure 2. Storage policy prescribing host mirroring and a secondary level of resilience in a vSAN 2-node cluster
The following points summarize the considerations when sizing for capacity for a 2-node cluster.
- Data mirrored across hosts will consume 2x the capacity.
- Data mirrored across the hosts and mirrored within each host through nested fault domains will consume 4x the capacity.
- Data mirrored across the hosts and using RAID-5 for nested fault domains within each host will consume 2.66x the capacity.
The ratios above apply to only the data using the respective storage policy settings. Thanks to the flexibility of storage policies, it is not an all-or-nothing decision. One could easily have the majority of VMs using mirroring across hosts, while a few mission-critical VMs could use mirroring and a secondary level of resilience applied.
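As a rough illustration of how a mixed set of storage policies affects sizing, the ratios above can be applied per policy and summed. This is a minimal sketch, not a sizing tool: the policy labels and the function name are hypothetical, the RAID-5 multiplier is assumed to be 2 x 4/3 (the source rounds this to 2.66x), and overhead and free-space reserves are ignored.

```python
from fractions import Fraction

# Capacity multipliers taken from the bullets above. The policy labels are
# hypothetical names used only for this sketch.
CAPACITY_MULTIPLIER = {
    "host_mirror": Fraction(2),
    "host_mirror_nested_raid1": Fraction(4),
    # Assumption: 2 (host mirror) x 4/3 (RAID-5), which the post rounds to 2.66x.
    "host_mirror_nested_raid5": Fraction(8, 3),
}


def raw_capacity_needed(usable_gb_by_policy):
    """Return the raw GB consumed by a mix of policies.

    Ignores overhead and free-space reserves, so treat the result as a floor.
    """
    total = sum(
        CAPACITY_MULTIPLIER[policy] * gb
        for policy, gb in usable_gb_by_policy.items()
    )
    return float(total)


# Example mix: most VMs use host mirroring only, a few use nested fault domains.
mix = {
    "host_mirror": 8000,
    "host_mirror_nested_raid1": 1000,
    "host_mirror_nested_raid5": 1500,
}
print(raw_capacity_needed(mix))
```

Keeping the multipliers per policy makes it easy to see how much raw capacity a small number of mission-critical VMs adds when they use a secondary level of resilience.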
Due to the design of how vSAN automatically determines availability through quorum, a minimum of three disk groups per host is needed for mirroring. If the hosts had four disk groups, RAID-5 erasure coding could be used for the secondary level of resilience, but one would have to run through an analysis to see if the cost of an extra disk group would be offset by the capacity savings noted above. RAID-6 is not possible in this topology.
Note that the secondary level of resilience is ONLY available when mirroring the data across the hosts. If the storage policy is not set to "Host mirroring (2-node)" the secondary level of resilience cannot be used.
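The disk group requirements described above can be written down as a small check. The sketch below only encodes the rules stated in this post (three disk groups for mirroring, four for RAID-5, no RAID-6, and host mirroring as a prerequisite); the function name and return values are hypothetical and do not correspond to any vSAN API.

```python
def secondary_resilience_options(disk_groups_per_host, host_mirroring=True):
    """List the secondary levels of resilience a 2-node host layout could support.

    Encodes only the rules stated in the post; names here are illustrative.
    """
    # The secondary level of resilience requires "Host mirroring (2-node)".
    if not host_mirroring:
        return []

    options = []
    if disk_groups_per_host >= 3:
        options.append("RAID-1")  # mirroring within each host
    if disk_groups_per_host >= 4:
        options.append("RAID-5")  # erasure coding within each host
    # RAID-6 is not possible in this topology, so it is never returned.
    return options


print(secondary_resilience_options(2))  # []
print(secondary_resilience_options(3))  # ['RAID-1']
print(secondary_resilience_options(4))  # ['RAID-1', 'RAID-5']
```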
Recommendation: Use secondary levels of resilience selectively on specific VMs to limit the potential capacity and performance impacts.
2-Node Clusters using Nested Fault Domains - Sizing for Performance
Sizing for performance remains the same in 2-node clusters for vSAN 7 U3 when there is no secondary level of resilience being used. When this capability is used, there are some performance considerations to be aware of.
When using the secondary level of resilience capabilities in a 2-node cluster, the performance considerations are almost identical to that of using a secondary level of resilience in a stretched cluster, as described in "Performance with vSAN stretched Clusters."
- Writes to data mirrored across hosts will result in 2x the write amplification.
- Writes to data mirrored across hosts and mirrored within each host through nested fault domains will result in 4x the write amplification.
- Writes to data mirrored across hosts and using RAID-5 within each host through nested fault domains will result in 8x I/O amplification (reads and writes).
vSAN 7 U2 and vSAN 7 U3 introduced some impressive performance enhancements to erasure coding, so the cost in I/O amplification may be less than what is stated above. The benefit of these enhancements is not guaranteed, as they depend on the conditions of the workload.
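To get a feel for what these amplification factors mean for the back end, the worst-case figures from the list above can be applied to an estimated guest write rate per policy. This is a minimal sketch under stated assumptions: the policy labels and function name are hypothetical, and the factors are treated as upper bounds because the erasure coding enhancements may reduce the actual cost.

```python
# Worst-case I/O amplification per guest write, taken from the list above.
# The policy labels are hypothetical names used only for this sketch.
IO_AMPLIFICATION = {
    "host_mirror": 2,
    "host_mirror_nested_raid1": 4,
    "host_mirror_nested_raid5": 8,  # reads and writes combined
}


def backend_ios_per_second(guest_write_iops_by_policy):
    """Estimate worst-case back-end I/Os per second for a mix of policies."""
    return sum(
        IO_AMPLIFICATION[policy] * iops
        for policy, iops in guest_write_iops_by_policy.items()
    )


# Example: guest write IOPS split across the three policy types.
workload = {
    "host_mirror": 5000,
    "host_mirror_nested_raid1": 1000,
    "host_mirror_nested_raid5": 500,
}
print(backend_ios_per_second(workload))
```

In this example, the 500 write IOPS using RAID-5 generate as many back-end I/Os as the 1,000 write IOPS using nested mirroring, which is one reason to apply these policies selectively.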
The I/O amplification noted above with host mirroring and secondary levels of resilience is synchronous, meaning that an I/O will only be as fast as the slowest device involved, as described in the blog post referenced above. Therefore, aim to use NVMe-based devices (NAND flash or preferably Optane-based) at the buffer tier when performance is critical, and ensure that the devices used at the capacity tier perform well enough to keep up with the steady-state demands of the environment.
Recommendation: Use 25/100Gb NICs in the hosts in a 2-node cluster. Since they will be directly connected, the costs will be minimal while reducing the chance that the network is the point of contention in the environment.
The Case for Three Disk Groups in 2-Node Clusters
Even prior to vSAN 7 U3, a good case could be made for the use of three disk groups per host in a 2-node vSAN cluster. vSAN 7 U3 bolsters this case even more.
- It allows you to immediately begin using this new secondary level of resilience feature if it is needed in your environment.
- It will likely improve performance through more parallelization of I/O and increase write buffer capacity. These two aspects work closely together to help deliver I/O requests as fast as the applications demand them.
- Even for 2-node clusters that use only host-level mirroring of data, it provides more flexibility in capacity-constrained scenarios. Hosts using two disk groups have limited ability to disperse data elsewhere upon failure of a disk group, which can place additional strain on capacity and policy requirements. We made some improvements in this area as described in "New Design and Operation Considerations for vSAN 2-Node Topologies" but three disk groups accommodate these challenges more easily.
While three disk groups are not always necessary, they are a great fit for 2-node designs that must meet very high levels of availability and performance.
Lifecycle Management of the Witness Host Appliance
Lifecycle management of 2-node clusters (and stretched clusters) is improved with vSAN 7 U3 in that the virtual witness host appliance can now be managed and updated using vSphere Lifecycle Manager (vLCM). This can take some of the mystery out of the coordination of updates, as it will assume the responsibility of updating the witness host appliance and do so in the appropriate order.
Note that witness host appliances shared across more than one 2-node cluster cannot be managed by vLCM. In those cases, the witness host appliance must still be updated using VUM or replaced with a new installation of the witness host appliance.
Planned and Unplanned Events in 2-Node clusters
Sharing another benefit with stretched clusters, 2-node clusters in vSAN 7 U3 provide better availability during planned or unplanned events. If a host is taken offline for maintenance (or fails), vSAN will use Adaptive Quorum Control (AQC) to recalculate the rules of quorum on an object-by-object basis. Once it completes (which may take anywhere from several seconds to a few minutes), the data on the host will remain available and accessible if a subsequent maintenance event or outage occurs with the witness host appliance.
This feature does not imply data availability in the event of a simultaneous, or near-simultaneous double failure of a host and a witness appliance. An occurrence of that sort will result in data unavailability. The goal of this feature was to offer more flexibility and availability to the data during events that occur after one of the hosts is offline.
Recommendation: Test this new behavior in your lab to become familiar with how it works. The time it takes to prepare for the second event will vary depending on the number of objects to recalculate, as well as whether the outage events were planned or unplanned. Testing this can help you understand how long you may need to wait in your environment for AQC to complete.
vSAN 2-node clusters now offer more than just a simple platform for the edge. They are more powerful than ever. When your design exercises factor in the latest capabilities, they can accommodate all-new levels of resilience and performance for many of your mission-critical workloads.