vSAN Design Considerations - Fast Storage Devices Versus Fast Networking
Delivering the highest levels of performance in a data center is a challenge that has many facets. Determining which discrete hardware or software component is the most important, or a potential bottleneck, is difficult because all of the variables are inextricably tied to one another. This is one of the reasons why I structured Troubleshooting vSAN Performance using a framework that breaks down these influencing factors.
For many, architectures built around hyperconverged infrastructure add to this mystery. Which is more important for a performance-focused vSAN cluster: fast storage devices, or fast networking? This is a common question, and while the answer is that both are important, it takes some elaboration to understand how each impacts the environment differently.
I/O Processing in vSAN
One of vSphere's strengths is its ability to prioritize. The assortment of schedulers built into the hypervisor manages host CPU processes, ingress and egress network activity, and, as with vSAN, storage I/O. Since these are kernel-level processes, this prioritization is performed with extraordinary efficiency.
vSAN uses its own scheduler to identify and prioritize different types of storage I/O running through the stack. This is in part what makes the Adaptive Resync feature introduced in vSAN 6.7 so effective. Note that these are mechanisms to help prioritize I/O activity that is local to the host.
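The general idea of weighted prioritization between classes of I/O can be sketched in a few lines. To be clear, this is purely illustrative of weighted scheduling as a concept; the class names, weights, and mechanics here are hypothetical and are not vSAN's actual implementation:

```python
from collections import deque

class IOScheduler:
    """Toy weighted round-robin scheduler for two I/O classes,
    loosely analogous to balancing VM I/O against resync I/O.
    Illustrative only; not vSAN's real scheduler."""

    def __init__(self, weights):
        self.weights = weights                      # e.g. {"vm": 3, "resync": 1}
        self.queues = {c: deque() for c in weights}

    def submit(self, io_class, op):
        self.queues[io_class].append(op)

    def dispatch_round(self):
        """Dequeue up to `weight` ops per class each round,
        so the higher-weighted class gets more of the device."""
        dispatched = []
        for io_class, weight in self.weights.items():
            for _ in range(weight):
                if self.queues[io_class]:
                    dispatched.append(self.queues[io_class].popleft())
        return dispatched

sched = IOScheduler({"vm": 3, "resync": 1})
for i in range(5):
    sched.submit("vm", f"vm-io-{i}")
    sched.submit("resync", f"resync-io-{i}")
print(sched.dispatch_round())  # 3 VM ops dispatched, then 1 resync op
```

The point of the sketch: resync traffic still makes forward progress, but VM I/O is favored, which is the same trade-off an adaptive scheduler tunes dynamically as contention changes.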
Raw performance of storage devices varies significantly, even when we limit the discussion to solid-state. SATA, SAS, and NVMe devices have dramatic differences in top-end performance, as well as consistency. Even the very fastest NVMe devices using NAND flash are not the king of the performance hill, as newer technologies like 3D XPoint (e.g. Intel Optane) overcome some of the hurdles associated with NAND. The industry moves fast, but thanks to the architecture of vSAN, you can introduce these technologies in an incremental fashion as the market and the demands of the data center evolve.
When planning for a higher-performing vSAN cluster, fast storage devices at the buffer tier and at the capacity tier are a must. The storage devices are part of the "final mile" of the data path, and lower-performing devices may undermine your ability to meet required performance expectations.
Network switches and the interface cards that connect to them are what glue everything together. Networking plays a particularly important role with HCI because much of the I/O activity may have to extend beyond the local host. Unfortunately, the industry practice of referring to a switch specification simply by its maximum port speed dismisses all of the important details about the switchgear. Switchgear capabilities depend on other factors, such as the backplane bandwidth, the amount of port buffering available on the switch, and whether the ASICs are powerful and plentiful enough to keep up with the "packets per second" processing requirements of the environment. When I help others pinpoint the cause of their performance issues and ask about their switchgear, the responses unfortunately are often no longer than, "10 Gigabit." John Nicholson has a series of good posts that touch on this subject, and watching the VMworld session that he and Broc Yanda presented is also worth your time.
Another challenge with switchgear is that the typical life in production is longer than other assets in the data center. Longer life means that you have to be more aware of your future demands and invest in them perhaps sooner than you may wish.
How They Relate to Each Other
Host-based schedulers, like vSAN's, prioritize the I/O as it is being generated by the host, or coming into the hardware stack of the host. They have no network awareness, and assume that no other packets are out there in the ether, waiting to be sent to the host. But with undersized or poorly performing switches, this may be exactly what is occurring. If the network is experiencing saturation in some form (bandwidth, processing and buffering limits, etc.), then traffic becomes subject to TCP's congestion-handling behavior. Packet drop rates increase, and thus, so do packet retransmits. While TCP is robust, the steps it takes to handle oversaturation are costly to peak performance and consistency. Even worse, this saturation may be impacting every system connected to the switches.
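Just how costly packet loss is to TCP throughput can be approximated with the well-known Mathis model, which bounds steady-state throughput at roughly (MSS / RTT) x (C / sqrt(p)), where p is the loss rate and C is about 1.22 for standard TCP. A short sketch (the MSS, RTT, and loss figures below are illustrative, not measurements from any particular environment):

```python
import math

def mathis_throughput_mbps(mss_bytes, rtt_s, loss_rate, c=1.22):
    """Approximate TCP throughput ceiling per the Mathis model:
    BW <= (MSS / RTT) * (C / sqrt(p)), with C ~ 1.22 for standard TCP.
    Returns megabits per second."""
    bw_bytes_per_s = (mss_bytes / rtt_s) * (c / math.sqrt(loss_rate))
    return bw_bytes_per_s * 8 / 1e6

# Illustrative: a 0.5 ms RTT, with loss rising from 0.001% to 0.1%
for p in (1e-5, 1e-4, 1e-3):
    print(f"loss rate {p:.5f}: ~{mathis_throughput_mbps(1460, 0.0005, p):,.0f} Mb/s ceiling")
```

The notable property is the square root in the denominator: a 100x increase in packet loss cuts the throughput ceiling by 10x, which is why even modest drop rates at a saturated switch show up as dramatic and inconsistent storage performance.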
Some who contemplate the move up to higher-performing networking want assurance that they will gain immediate benefit and instantly use all of the bandwidth they just paid for. That is the wrong way to look at it, as resource utilization for CPU, networking, or storage rarely works that way. For vSAN environments, investing in properly sized switchgear minimizes or eliminates the inconsistent degradation of storage performance that results from a bottleneck at saturated switches.
For example, if a customer is currently using modest 10Gb switches and contemplating 25/100Gb, the value does not come from the utilization rate of the switches, but rather from preventing the network from becoming the point of contention that keeps the hosts from reaching their potential. Remember that as the performance capabilities of the hosts increase, so should your network.
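Back-of-the-envelope arithmetic makes the headroom argument concrete. Consider how long it takes to move a fixed amount of resync or rebuild data at different link speeds (all figures below are hypothetical, and the efficiency factor is a rough assumption, since real traffic shares the link with VM I/O):

```python
def transfer_time_minutes(data_tb, link_gbps, efficiency=0.7):
    """Rough time to move `data_tb` terabytes over a link at `link_gbps`,
    assuming only a fraction of line rate is achievable in practice.
    Purely illustrative; real resync traffic competes with VM I/O."""
    data_bits = data_tb * 1e12 * 8
    effective_bps = link_gbps * 1e9 * efficiency
    return data_bits / effective_bps / 60

for speed_gbps in (10, 25, 100):
    print(f"{speed_gbps} GbE: ~{transfer_time_minutes(2, speed_gbps):.0f} min to move 2 TB")
```

The faster links spend most of their life far below saturation, and that is the point: the headroom is what keeps resync events, rebuilds, and bursts of VM I/O from colliding at the switch.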
Optimizing performance is often about shifting the bottleneck to a location that is the easiest to identify and control. Investing in higher quality, faster-performing switchgear helps shift potential contention back to the host, where it is easier to control through sophisticated schedulers, and remedy through faster storage devices. Good switches do indeed have up-front costs, but given the longer lifecycles of switches, paired with the ability of vSAN to easily accommodate newer, faster hardware, it is a wise step for any data center design.