Using More GPUs in vSphere 8 for Machine Learning and HPC Applications with NVIDIA NVSwitch
Some of you may have recently experienced the power of very large machine learning models by trying the hugely popular ChatGPT from OpenAI. This chatbot is backed by an ML model based on GPT-3, with billions of parameters. A model of that size takes a very large number of GPUs to train, and it is a taste of what is coming to the enterprise in AI/ML. Data scientists in enterprises are exploring models that may not yet be at this huge scale, but they are certainly seeing increasingly large models, as measured by the number of model parameters, demanding more acceleration and more GPU power.
In a previous post, we talked about the concept of “vendor device groups” in VMware vSphere 8 for presenting a set of 2 to 4 NVLink-connected GPUs on a server to a VM. One important use case for this is ML model training, but it can also be put to other uses, such as analytics in HPC. In that article, we saw the NVLink bridge hardware that makes those direct point-to-point GPU connections, as shown here. To connect two A100 GPUs together in a server, we would use three NVLink bridges, as shown.
In this article, we take the device group concept to the next level, by exploring sets of 8 GPUs that are all connected using NVSwitch technology within a server – and presenting them as one device group to a VM. NVSwitch uses the NVLink protocol for communication between GPUs, so the two have a tight association.
NVIDIA NVSwitch hardware comes either in a chip format on the baseboard or as a standalone unit outside of a server. In this article, we will talk about the use of NVSwitch technology within one server. NVSwitch hardware appears in the HGX baseboard format from NVIDIA, which is available in modern servers from a number of OEMs. Two HGX baseboards can be connected together within one server.
Here is a server baseboard that has 8 GPUs on it with six NVSwitch devices linking them together. This is a standard configuration for this NVIDIA-supported board.
The NVLink ports on each GPU are used to connect them to the NVSwitch devices. This allows traffic to be routed from one GPU to any other GPU on the NVSwitch layout at high speed. Contrast this with the NVLink bridge hardware, where a point-to-point connection is made between GPUs. NVLink bridges support up to four fully-connected GPUs. NVSwitch supports 8 or more GPUs with fast connections.
With NVSwitch, data can be routed from any one GPU to any other GPU in the setup. With NVLink on the NVIDIA Hopper architecture GPUs, a pair of GPUs can exchange data at 450 GB/s in each direction, giving a total bidirectional bandwidth of 900 GB/s. On the A100 GPU, the unidirectional bandwidth is 300 GB/s, giving 600 GB/s total bidirectional bandwidth there.
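These aggregate figures come from multiplying the number of NVLink lanes per GPU by the per-lane rate. A minimal sketch of that arithmetic, using NVIDIA's published lane counts for NVLink 3 (A100) and NVLink 4 (H100):

```python
# Aggregate NVLink bandwidth per GPU = number of links x per-link
# unidirectional rate. Both generations run each link at 25 GB/s per
# direction; the generations differ in the number of links per GPU.

def nvlink_bandwidth(links: int, gb_per_link: float) -> tuple[float, float]:
    """Return (unidirectional, bidirectional) aggregate bandwidth in GB/s."""
    uni = links * gb_per_link
    return uni, uni * 2

a100 = nvlink_bandwidth(links=12, gb_per_link=25.0)   # NVLink 3
h100 = nvlink_bandwidth(links=18, gb_per_link=25.0)   # NVLink 4

print(f"A100: {a100[0]:.0f} GB/s each way, {a100[1]:.0f} GB/s total")
print(f"H100: {h100[0]:.0f} GB/s each way, {h100[1]:.0f} GB/s total")
```

This reproduces the 300/600 GB/s figures for the A100 and the 450/900 GB/s figures for Hopper quoted above.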
To make this kind of setup easier for the VMware system administrator to use with VMware vSphere 8 U1, the full set of 8 GPUs, along with two smaller subsets of the 8, is presented as device groups when we choose to add a PCIe device to a VM. The set of device groups is seen in the vSphere Client interface below. We used the earlier A100 40GB model in our lab tests here; different device group names would appear for the A100 80GB model or for the H100.
The vSphere Client user chooses one of the device groups shown above to give the VM the required amount of GPU power. The device groups shown here offer 2, 4 or 8 full 40c-profile vGPUs to the VM, in time-sliced mode, all operating over connections via the NVSwitch. The number after the word “Nvidia” in the Name field above indicates the number of GPUs in that device group, and the string after the “@” in the device group Name field is a vGPU profile denoting the full memory of that GPU. The benefits of the vGPU approach therefore apply here, including placement of the VM onto the most suitable host to satisfy its GPU requirements. These GPUs are not shared among VMs (normally one of the features of vGPU use); instead, they are fully assigned to one VM. This is the normal case when more than one GPU is given to a VM.
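To make the Name field convention concrete, here is a small sketch that pulls the GPU count and vGPU profile out of a device group name following the pattern just described. The example strings are illustrative, not copied from a live vSphere Client:

```python
import re

# The article describes the Name field as: the GPU count follows "Nvidia",
# and the vGPU profile follows the "@". The exact name strings below are
# hypothetical examples built on that pattern.
GROUP_NAME = re.compile(r"Nvidia:?\s*(?P<count>\d+)\s*@\s*(?P<profile>\S+)")

def parse_device_group(name: str) -> dict:
    """Extract the GPU count and vGPU profile from a device group name."""
    m = GROUP_NAME.search(name)
    if m is None:
        raise ValueError(f"not a recognized device group name: {name!r}")
    return {"gpu_count": int(m.group("count")),
            "vgpu_profile": m.group("profile")}

print(parse_device_group("Nvidia:4@grid_a100-40c"))
# -> {'gpu_count': 4, 'vgpu_profile': 'grid_a100-40c'}
```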
These device groups for GPUs are automatically discovered by the NVIDIA vGPU host driver software that is installed directly onto the vSphere host, and the information is passed to vSphere. Once that NVIDIA vGPU host driver is running, the vSphere administrator can assign one of the allowed device groups to a VM at configuration time. By doing that, the administrator is requesting that the VM be placed, when it is started up, onto a host that can support the device group. The vSphere software keeps the mapping of device group specifications to the various hosts that can support them.
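The placement idea can be sketched in a few lines: vSphere holds a mapping from device group specifications to the hosts that advertise them, and a VM configured with a group can only start on a matching host. The host names and group labels below are invented for illustration:

```python
# Hypothetical mapping of hosts to the device groups they can support,
# standing in for the inventory that vSphere maintains internally.
host_device_groups = {
    "esx-host-01": {"Nvidia:2@grid_a100-40c", "Nvidia:4@grid_a100-40c"},
    "esx-host-02": {"Nvidia:2@grid_a100-40c", "Nvidia:4@grid_a100-40c",
                    "Nvidia:8@grid_a100-40c"},
}

def eligible_hosts(requested_group: str) -> list[str]:
    """Hosts that advertise the requested device group."""
    return sorted(host for host, groups in host_device_groups.items()
                  if requested_group in groups)

print(eligible_hosts("Nvidia:8@grid_a100-40c"))   # ['esx-host-02']
```

Only the host with all 8 GPUs in one NVSwitch group qualifies for the 8-GPU request, while either host could take the smaller groups.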
In the VM’s guest operating system, the nvidia-smi tool shows all the GPUs that are available to it. The example output below shows a setup where four GPUs are allocated as one device group to a VM. Since these were allocated as a device group, we take one further step below to confirm that NVSwitch technology connects them together. The GPUs are in their default time-sliced mode, with MIG disabled; this is the normal case when multiple GPUs are assigned to a VM. Multiple GPUs may not be assigned to one VM when MIG is enabled.
The presence of the NVLink/NVSwitch connections between the VM's assigned GPUs can be seen in the VM's guest operating system using the command "nvidia-smi topo -m". This is shown here.
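A quick way to check such a matrix programmatically is to verify that every off-diagonal entry is an NV# link (for example, NV12 means 12 NVLink lanes between the pair). The sample matrix below is abbreviated and illustrative, not captured verbatim from a host:

```python
# Abbreviated, illustrative stand-in for "nvidia-smi topo -m" output on a
# 4-GPU device group; a real matrix has extra columns (CPU affinity, etc.).
SAMPLE = """\
      GPU0  GPU1  GPU2  GPU3
GPU0   X    NV12  NV12  NV12
GPU1  NV12   X    NV12  NV12
GPU2  NV12  NV12   X    NV12
GPU3  NV12  NV12  NV12   X
"""

def all_pairs_nvlinked(topo: str) -> bool:
    """True if every GPU pair is connected over NVLink (NV# entries)."""
    rows = [line.split() for line in topo.strip().splitlines()[1:]]
    for row in rows:
        for entry in row[1:]:
            if entry != "X" and not entry.startswith("NV"):
                return False   # pair connected only over PCIe/SYS, not NVLink
    return True

print(all_pairs_nvlinked(SAMPLE))  # True
```

If any pair showed SYS or PHB instead of NV#, traffic between those GPUs would cross PCIe rather than the NVSwitch fabric.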
The application developer can now proceed to make use of all four of these A100 GPUs for training their model in TensorFlow, PyTorch, or another framework of their choosing.
Here is a short (2.5 minute) demo of the device group functionality for assigning 2, 4 or all 8 GPUs to a VM on vSphere 8. There are callout notes on the screen to explain what is happening in the demo, rather than a voiceover.
The NVSwitch support described in this article is a function of the vSphere 8.0 Update 1 release and the NVIDIA 525.xx vGPU drivers (i.e. NVIDIA vGPU version 15) from the NVIDIA AI Enterprise Suite.
High-end machine learning models can easily span multiple physical GPUs because of their large memory needs and their demand for compute power. VMware and NVIDIA now enable data scientists to train their models across up to 8 GPUs in a single VM at once, where those GPUs communicate over high-bandwidth NVSwitch and NVLink connections. This demonstrates the joint commitment of VMware and NVIDIA to serving the needs of the most demanding machine learning and high-performance computing applications.