March 09, 2021

Multiple Machine Learning Workloads Using NVIDIA GPUs: New Features in vSphere 7 Update 2

NVIDIA and VMware are collaborating on many levels. Among the key technical innovations that have come from that collaboration are (a) certification of the NVIDIA AI Enterprise suite on vSphere, (b) support for the latest high-end A100 GPU on vSphere 7 Update 2 with Multi-Instance GPU enhancements, and (c) support for address-caching enhancements to PCIe devices in vSphere that accelerate peer-to-peer device communication within a host, giving a boost to GPUDirect RDMA. All of these will help you get the very best performance from your machine learning setup on vSphere 7 Update 2.

Machine learning/AI workloads are growing quickly across the application world, from medical imaging and research to customer behavior prediction and fraud prevention in finance, among many other examples. Here we look at three performance- and productivity-boosting enhancements appearing in vSphere 7 Update 2 as a result of VMware's collaboration with NVIDIA:

  • The NVIDIA AI Enterprise suite of software certified on vSphere;
  • Support for the latest generation of GPUs from NVIDIA based on the Ampere architecture;  
  • The new optimizations in vSphere for device-to-device peer communication over the PCIe bus, enabling increased performance with NVIDIA’s GPUDirect RDMA.

The NVIDIA AI Enterprise Suite - Certified on vSphere

The NVIDIA AI Enterprise suite is the result of the collaboration between the two companies to bring AI to the data center. NVIDIA AI Enterprise is an end-to-end, cloud-native suite of AI and data analytics software, optimized, certified, and supported by NVIDIA to run on VMware vSphere 7 Update 2 with NVIDIA-Certified Systems. It includes key enabling technologies and software from NVIDIA for rapid deployment, management, and scaling of AI workloads in the modern hybrid cloud. For more details on NVIDIA AI Enterprise, see this article.

Support for the Latest Generation of NVIDIA GPUs

The latest hardware accelerator for these ML workloads is the Ampere-series A100 GPU from NVIDIA. Its support for Multi-Instance GPU (MIG), now available in the vSphere 7 Update 2 release, is an important step forward for machine learning users and systems managers alike.

NVIDIA MIG

The powerful NVIDIA A100, in fact the most powerful GPU for ML that NVIDIA offers, is now fully supported for the first time on vSphere 7 Update 2. We gave you a technical preview of the main features of this new architecture in an earlier blog.

You now have full support for the A100 with MIG in virtual machines on vSphere 7 Update 2. This support covers both the more traditional time-sliced vGPU mode and the new MIG-backed vGPU mode, as well as the DirectPath I/O (passthrough) method. Both vGPU modes support vMotion and initial placement of VMs onto hosts by DRS.

The traditional time-sliced NVIDIA vGPU mode, which pre-dates MIG, does not provide strict hardware-level isolation between VMs that share a GPU. It schedules jobs onto collections of cores called streaming multiprocessors (SMs) according to administrator-selectable policies such as best effort, equal share, or fixed share. In time-sliced vGPU mode, all cores on the GPU may be used, as may all the hardware pathways through the caches and crossbars to the GPU’s framebuffer memory. You can find more technical detail about this form of vGPU support here.
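
To make the contrast with MIG (described below) concrete, here is a toy Python sketch of how time-sliced sharing behaves: at any instant, the whole GPU belongs to exactly one VM. The policy names match the NVIDIA scheduler names, but the logic is purely illustrative, not how the driver actually implements them.

```python
# Illustrative sketch only: real scheduling happens in the GPU and the
# NVIDIA host driver, not in user code.

def next_vm(vms, policy, turn):
    """Pick which VM gets the whole GPU for the next time slice."""
    if policy == "best effort":
        # Idle VMs give up their time slice, so busy VMs get extra turns.
        candidates = [vm for vm in vms if vm["busy"]] or vms
    else:
        # "equal share" / "fixed share": every VM keeps its place in the
        # rotation, whether or not it has work queued.
        candidates = vms
    return candidates[turn % len(candidates)]

vms = [{"name": "vm-a", "busy": True},
       {"name": "vm-b", "busy": False},
       {"name": "vm-c", "busy": True}]

for turn in range(6):
    vm = next_vm(vms, "best effort", turn)
    print(f"time slice {turn}: all SMs -> {vm['name']}")
```

The point to notice is that the VMs take turns: jobs from different VMs run in series, never at the same instant.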

When choosing the type of vGPU support a VM will have, you use the NVIDIA GRID vGPU profile choices in the vSphere Client. Here is the view of the set of vGPU profiles available on an A100-equipped host with the time-sliced vGPU drivers. The number before the final “c” in an NVIDIA vGPU profile name is the number of gigabytes of framebuffer memory (i.e., memory on the GPU itself) that the profile assigns to a VM. Here, we are choosing to have 5 GB assigned to our VM.

vGPU profile choice for time-sliced mode in vSphere Client

MIG mode for vGPUs, new in vSphere 7 Update 2, differs from the time-sliced approach above in one important way. A VM with a MIG-backed vGPU now has a specific, dedicated hardware allocation of streaming multiprocessors (SMs), framebuffer (on-GPU) memory and the various hardware paths to them.

These assigned hardware items (including GPU memory, SMs, crossbars, and caches), along with the hardware paths to those components, are isolated to that VM. This strict isolation is the key difference between the two modes of using a vGPU. Once MIG is enabled at the host server level, the administrator may choose from a set of MIG-backed vGPU profiles to determine that allocation. Here is that MIG-backed vGPU profile setup step in the vSphere Client. The choice of vGPU profile is made before the VM’s guest operating system boots up.

vGPU profile choice for MIG in vSphere Client
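
As a quick sanity check, you can query whether MIG mode is actually enabled on a GPU from a Linux system that has the NVIDIA driver installed. Here is a minimal sketch using the pynvml bindings to NVML; the exact return shape of nvmlDeviceGetMigMode can vary between pynvml versions, so treat this as an assumption to verify on your own setup.

```python
# Sketch: query MIG mode via NVML, assuming the pynvml package
# (pip install nvidia-ml-py) and an NVIDIA driver are present.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    current, pending = pynvml.nvmlDeviceGetMigMode(handle)
    print("MIG currently enabled:", current == pynvml.NVML_DEVICE_MIG_ENABLE)
except pynvml.NVMLError as err:
    print("MIG not supported/reported on this device:", err)
pynvml.nvmlShutdown()
```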

In the MIG-backed vGPU profile names above, the number immediately after “a100” indicates the number of slices of the GPU’s total SMs that are allocated to the VM by that profile. There are up to 7 such SM or "compute" slices available on the A100, each made up of 14 SMs.

The number immediately preceding the final “c” (for “compute”) in the profile name indicates the amount of framebuffer memory, in GB, that the profile allocates to the VM. You may allocate all 7 SM/compute slices and all 40 GB of memory (or 80 GB, if present) on a physical A100 to one VM if you wish, by using the “grid-a100-7-40c” profile in the list above for your VM. That VM then has complete ownership of the GPU and does not share it.

Because of this hardware-level isolation, the A100 can host sets of workloads simultaneously on one GPU while giving each workload a quality-of-service assurance, above and beyond what was available in time-sliced vGPU form. In time-sliced vGPU, jobs run in series; with MIG, there is true parallel job execution.

This is done by allocating one or more physical slices of the hardware exclusively to one VM, as a “GPU Instance”. There are 7 SM (compute) slices available in total, each made up of a set of 14 streaming multiprocessors. Each GPU memory slice holds 5 GB of the initial 40 GB of framebuffer memory, giving 8 memory slices in total.

Note: There is also a newer A100 model with 80 GB of framebuffer memory on the card.
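
Pulling the naming rules together, here is a small, hypothetical Python decoder for the profile names used in this post. It is a sketch based only on the conventions described above (compute slices of 14 SMs, 5 GB memory slices on the 40 GB A100), not an official NVIDIA utility, and it assumes profile names shaped like those shown in the screenshots.

```python
SMS_PER_COMPUTE_SLICE = 14   # A100: up to 7 compute slices x 14 SMs
GB_PER_MEMORY_SLICE = 5      # 40 GB A100: 8 memory slices x 5 GB

def decode_profile(name: str) -> dict:
    """Decode an A100 vGPU profile name such as 'grid-a100-4-20c'
    (MIG-backed) or 'grid-a100-5c' (time-sliced)."""
    numbers = [int(f) for f in name.rstrip("c").split("-") if f.isdigit()]
    if len(numbers) == 2:                    # MIG-backed: <slices>-<GB>c
        slices, mem_gb = numbers
        return {"mode": "MIG-backed",
                "compute_slices": slices,
                "sms": slices * SMS_PER_COMPUTE_SLICE,
                "memory_slices": mem_gb // GB_PER_MEMORY_SLICE,
                "framebuffer_gb": mem_gb}
    (mem_gb,) = numbers                      # time-sliced: <GB>c
    return {"mode": "time-sliced", "framebuffer_gb": mem_gb}

print(decode_profile("grid-a100-4-20c"))
# {'mode': 'MIG-backed', 'compute_slices': 4, 'sms': 56,
#  'memory_slices': 4, 'framebuffer_gb': 20}
print(decode_profile("grid-a100-5c"))
# {'mode': 'time-sliced', 'framebuffer_gb': 5}
```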

Allocate Differing Shares of the GPU to Different VMs and Workloads

We can give all 7 compute slices to one VM using a suitable vGPU profile, or we can allocate just one compute slice to a VM using a different vGPU profile. Other combinations are available in between, to mix and match allocations to workloads. Here, shown in green, is one example of a valid set of allocations, all on one physical GPU.

vGPU MIG Slices Choice from NVIDIA MIG User Guide

The vSphere Client user interface presents a choice of NVIDIA GRID vGPU profiles to the user, as seen previously. This provides a friendly front-end to the GPU Instance units seen in blue and green here: the vGPU profile chosen in the vSphere Client maps directly to a GPU Instance, and vSphere takes care of creating the right GPU Instance for you. The GPU Instance encapsulates the combination of compute slices and memory slices that the VM gets.

In the past, with time-sliced vGPUs, all VMs sharing a single GPU on a host got an equal share of that device. This changes with MIG. The chart above shows the slice sizes you can allocate at one time to different VMs in a MIG-backed vGPU setup. These are identified by the vGPU profile, which itself wraps a GPU Instance made up of a set of memory and compute slices combined together, as seen above. The GPU Instance also includes a compute engine that we won't delve into here.

Going from left to right across the diagram, you can allocate NVIDIA vGPU profiles to simultaneous VMs wherever the columns do not overlap. For example, you can combine two VMs on the same host (a combination we verify in a short code sketch further below):

  • One VM with a profile that has 4 of the total 7 SM slices and 20 of a total 40 GB of memory (i.e., 4 memory slices each of 5 GB)
  • Another VM that has just 1 of the 7 SM/compute slices and only 1 memory slice (5 GB of framebuffer memory).

You may well choose the one-compute-slice, 5 GB combination (1-5c) for several small inference jobs, while giving the 4-20c profile to a larger consumer of GPU compute and memory power, like model training. The 1-5c vGPU profile gives your VM 1 of the 7 fractions of the total SMs (each fraction being 14 SMs) and 5 GB of the framebuffer memory; this is the smallest allocation with MIG. The 4-20c vGPU profile gives the VM 56 SMs to work with (4 × 14) and 20 GB of the framebuffer memory on a 40 GB A100.
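
Here is the capacity arithmetic for that example as a minimal Python sketch. One caveat: this only checks that the compute-slice and memory totals fit on the card; the real MIG placement rules (which columns of the chart may be combined) are stricter, so a pass here is necessary but not sufficient.

```python
TOTAL_COMPUTE_SLICES = 7    # per A100
TOTAL_MEMORY_GB = 40        # 40 GB A100 model

def fits_on_one_a100(profiles):
    """profiles: list of (compute_slices, framebuffer_gb) pairs,
    e.g. the 4-20c profile is (4, 20)."""
    compute = sum(slices for slices, _ in profiles)
    memory = sum(mem_gb for _, mem_gb in profiles)
    return compute <= TOTAL_COMPUTE_SLICES and memory <= TOTAL_MEMORY_GB

# The example above: one 4-20c VM plus one 1-5c VM on the same GPU.
print(fits_on_one_a100([(4, 20), (1, 5)]))    # True: 5/7 slices, 25/40 GB
print(fits_on_one_a100([(4, 20), (4, 20)]))   # False: needs 8 compute slices
```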

With MIG, therefore, you have fine-grained control over how much GPU power your VM has, at both the compute level and the memory level, and you can share your GPU in unequal parts as well.

Enhancements to GPUDirect RDMA on vSphere

When users choose to spread their machine learning/AI application across multiple host servers and networked VMs, fast GPU-to-GPU communication across the network, for exchanging GPU memory content, becomes an essential component of the application’s performance. One important example of this is distributed training on datasets that are too large for one GPU, or on ML models that do not fit well within one GPU. In distributed training, the nodes exchange data at many intervals about weights and coefficients that are changing within the model.

This GPU-to-GPU communication over the network occurs at scale in a sophisticated distributed ML system, which may have hundreds of nodes in it.
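
For a sense of where this traffic comes from, here is a minimal multi-node training sketch using PyTorch's DistributedDataParallel with the NCCL backend. Nothing in the application code refers to GPUDirect RDMA: NCCL negotiates the fastest available transport underneath, and the gradient all-reduce in the backward pass is exactly the GPU-to-GPU exchange discussed here. The model and launch details are placeholders; the sketch assumes a launcher such as torchrun sets the usual RANK/WORLD_SIZE/LOCAL_RANK environment variables.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # NCCL carries the GPU-to-GPU traffic between nodes.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; a real job would load its own network and data.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(10):
        x = torch.randn(64, 1024, device=local_rank)
        loss = model(x).square().mean()
        optimizer.zero_grad()
        loss.backward()    # gradients are all-reduced across all nodes here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched on each host with a command along the lines of `torchrun --nnodes=2 --nproc_per_node=1 train.py` (plus rendezvous settings), every backward pass triggers the cross-node gradient exchange whose speed GPUDirect RDMA improves.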

Optimal communication between each GPU and the local network device within any one host server is essential to this distributed working. VMware vSphere 7 Update 2 enhances this communication through a PCIe mechanism named Address Translation Services (ATS). ATS allows local caching of each PCIe device’s virtual-to-physical address mappings, which speeds up peer-to-peer device communication within a host. It therefore enhances the rate at which data travels from the GPU to the network card locally, bypassing the CPU and main memory. That network traffic then goes on to another GPU on another host, with the same ATS speed-up on the remote side. This gives a significant performance boost when multiple servers using GPUDirect RDMA are present.

ATS, part of the PCIe standard, allows addressing to take place in a more efficient way, bypassing the IOMMU, provided the devices are communicating in peer-to-peer mode on the same PCIe bridge. This can give a significant boost to your distributed application's performance; some measurements of this performance gain are given here. ATS is enabled by the system administrator at the host server level, and the GPU and network card must be assigned to the same VM. The GPU may be in passthrough mode or (new in vSphere 7 Update 2) in vGPU mode, and the network card can have one of its SR-IOV virtual functions passed through to that VM. Support for vGPU-based VMs with this ATS performance enhancement is one of the important new features brought to users by vSphere 7 Update 2.
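
To see whether GPUDirect RDMA is actually being used by a training job, one practical approach is to raise NCCL's log level and watch which transport it selects on the first collective. The environment variable below is a real NCCL setting, but the exact log text varies by NCCL version, so treat this as a sketch to adapt to your environment.

```python
import os

# Must be set before NCCL initializes. With INFO logging, NCCL reports
# its chosen transport, e.g. lines mentioning "NET/IB" and "GDRDMA"
# when GPUDirect RDMA is in use (wording varies by version).
os.environ["NCCL_DEBUG"] = "INFO"

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# The first collective triggers transport selection and the log output.
t = torch.ones(1 << 20, device="cuda")
dist.all_reduce(t)
print(f"rank {dist.get_rank()}: all_reduce complete")
dist.destroy_process_group()
```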

