Technical Guides from VMware and Partners on Infrastructure for AI/Machine Learning

June 22, 2021

At last year’s VMworld conference, NVIDIA and VMware jointly announced a close partnership to democratize AI/ML for the enterprise. Early in 2021, the two companies previewed the NVIDIA AI Enterprise Suite of software on vSphere 7 Update 2. Since then, the two teams have been working closely together, and we are now seeing the first set of documents that explain the infrastructure setup needed to support the NVIDIA AI Enterprise Suite on vSphere. Follow-on documents are planned as well.

Landscape picture of the AI-Ready Enterprise offerings from VMware and NVIDIA

The big picture of the joint AI-Ready Enterprise Platform offering is shown here. It has several layers of components that serve the needs of both the data scientist/ML developer and the vSphere systems administrator. The tools within the NVIDIA AI Enterprise Suite, such as the Transfer Learning Toolkit and TensorRT, appeal to data scientists because they ease the development of ML-based applications. More information on these tools can be found at http://ngc.nvidia.com.

NVIDIA AI Enterprise Suite of Frameworks/Tools for Machine Learning

The vSphere Administrator can now also better serve the needs of the data scientist by using the pre-built infrastructure that is designed and validated by the two partners.

The first set of technical documents gives the vSphere systems administrator the tools to prepare the infrastructure-level setup for deployment of the NVIDIA AI Enterprise Suite in a scalable and performant way. This is mainly focused on the lowest portion of the suite shown above. The details of sizing the various NVIDIA components at the application level will come from the NVIDIA teams following these guides.

VMware Technical Deployment Guidance

The detailed AI-Ready Enterprise Platform on vSphere 7 technical deployment guide provides the infrastructure details and steps to implement virtualized GPUs, high-speed networking, and the supporting storage that lay the foundation for the NVIDIA AI Enterprise Suite. It describes both MIG-backed and time-sliced vGPUs in detail. The guide is divided into two sections. The single-node introduction covers the simplest form of deployment: an ML system that is self-contained on one vSphere host with a small collection of VMs. The more advanced multi-node design supports distributed ML training across multiple nodes/VMs through networking technologies such as RDMA over Converged Ethernet (RoCE) and PVRDMA, with underlying vSphere features like Address Translation Services (ATS) and Access Control Services (ACS) providing performance at the PCIe layer. This deployment document will be paired with further Day 2 Operations guidance in due course.
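To give a flavor of the host-level preparation involved, the sketch below shows the general kinds of commands used to put an ESXi host's GPU into vGPU mode and to partition an NVIDIA A100 with MIG. This is an illustrative, hardware-dependent fragment, not the authoritative procedure; the A100, GPU index 0, and the 2g.10gb MIG profile are assumptions here, and the deployment guide remains the reference for the exact steps and ordering.

```shell
# Illustrative only: preparing a host GPU for vGPU use and MIG partitioning.
# Assumes an NVIDIA A100 at GPU index 0 and the NVIDIA host driver installed.

# On the ESXi host: set the default graphics type to vGPU (SharedPassthru).
esxcli graphics host set --default-type SharedPassthru
# Restart the X.Org service so the change takes effect (a host reboot also works).
/etc/init.d/xorg restart

# Enable MIG mode on GPU 0 (a GPU reset or reboot may be required afterwards).
nvidia-smi -i 0 -mig 1
# Carve the GPU into two 2g.10gb GPU instances, creating compute instances too (-C).
nvidia-smi mig -i 0 -cgi 2g.10gb,2g.10gb -C
# List the resulting GPU instances to verify.
nvidia-smi mig -lgi
```

Whether MIG-backed or time-sliced vGPU profiles are the better fit depends on the isolation and sharing needs of the workloads, which the deployment guide discusses.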

VMware Sizing Guidance

The VMware design sizing guidance document takes a set of classic ML applications, such as ResNet-50, that we have used in the labs at VMware and matches them with the CPUs/GPUs and other infrastructure components that support them best. Your own application set may differ, but comparing it at the requirements level with these sizing examples gives you a feel for how a virtualized ML system will behave under load, and a good basis for constructing your own infrastructure to suit the application.
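As a rough illustration of the kind of requirements-level comparison described above, the sketch below estimates per-GPU memory for data-parallel ResNet-50 training. The ~25.6M parameter count for ResNet-50 is public; the per-image activation figure and the optimizer-state multiplier are assumptions you should replace with measurements from your own workload, and this arithmetic is not drawn from the VMware sizing document itself.

```python
def training_mem_gb(params_millions: float = 25.6,    # ResNet-50 weight count (public figure)
                    bytes_per_value: int = 4,         # FP32
                    state_copies: int = 3,            # weights + gradients + SGD momentum
                    act_mb_per_image: float = 100.0,  # assumed; measure for your model
                    batch_size: int = 64) -> float:
    """Back-of-envelope per-GPU memory estimate for data-parallel training."""
    model_gb = params_millions * 1e6 * bytes_per_value * state_copies / 1e9
    activations_gb = act_mb_per_image * batch_size / 1024
    return model_gb + activations_gb

if __name__ == "__main__":
    est = training_mem_gb()
    # Compare the estimate against the memory of the GPU (or MIG slice) you plan to use.
    print(f"Estimated per-GPU memory: {est:.1f} GB")
```

Running the defaults yields roughly 6.6 GB, which is how such a comparison would tell you whether a given GPU profile has headroom for the batch size you want.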

Dell and HPE Reference Documents

Along with the VMware technical guidance given in the two articles above, there are two recently published, detailed reference documents on deploying AI/ML on vSphere from our OEM partners, Dell and HPE. We recommend that you read these carefully for their designs and best practices in setting up your systems for machine learning with NVIDIA AI Enterprise. The server systems mentioned in these documents are also certified by NVIDIA, giving you confidence that the companies have tested and validated the software on these platforms.

Dell

The Dell Design Guide for Virtualizing GPUs for AI with Dell EMC Infrastructure describes recommended hardware and software configurations from Dell, with acceleration from NVIDIA, for AI/ML applications. It covers server and network sizing for different configurations.

Hewlett-Packard Enterprise

The HPE Reference Configuration for Enterprise AI Infrastructure on HPE ProLiant DL380 Gen10 Server is a good starting point for adopting AI/ML on your virtualized infrastructure. It delves into the details of suitable HPE hardware for this purpose, describing the compute, storage, networking, and acceleration aspects of an AI/ML design with a focus on the NVIDIA A100 as the core GPU in the architecture.

References

You can find more information on the NVIDIA AI Enterprise Suite of software at the following reference pages.

Blog: The AI-Ready Enterprise Platform: Unleashing AI for Every Enterprise

Blog: How Suite It Is: NVIDIA and VMware Deliver AI-Ready Enterprise Platform

Demo: NVIDIA AI Enterprise on VMware vSphere

Solution Brief: AI-Ready Enterprise Platform

Early Access: NVIDIA AI Enterprise Product Page (with early access signup)

Partner Site: VMware Partnership Page

Partner Site: NVIDIA Partnership Page

 
