Technical Guides from VMware and Partners on Infrastructure for AI/Machine Learning
At last year’s VMworld conference, NVIDIA and VMware jointly announced their close partnership on democratizing AI/ML for the enterprise. Early in 2021, the two companies jointly previewed the NVIDIA AI Enterprise Suite of software on vSphere 7 Update 2. Since then, the two teams have been working hard together, and we are now seeing the first set of documents that explain the infrastructure setup needed to support the NVIDIA AI Enterprise Suite on vSphere. Follow-on documents to these will be published as well.
The big picture of the joint AI-Ready Enterprise Platform offering is shown here. As you can see, it has several layers of components that serve the needs of both the data scientist/ML developer and the vSphere systems administrator. The tools within the NVIDIA AI Enterprise Suite, such as the Transfer Learning Toolkit and TensorRT, appeal to data scientists because they ease the development of ML-based applications. More on these tools can be found at http://ngc.nvidia.com
The vSphere Administrator can now also better serve the needs of the data scientist by using the pre-built infrastructure that is designed and validated by the two partners.
The first set of technical documents gives the vSphere systems administrator the tools to prepare the infrastructure-level setup for deploying the NVIDIA AI Enterprise Suite in a scalable and performant way. This is mainly focused on the lowest portion of the suite shown above. Details on sizing the various NVIDIA components at the application level will come from the NVIDIA teams in guides that follow these.
VMware Technical Deployment Guidance
The detailed AI-Ready Enterprise Platform on vSphere 7 technical deployment guide provides the infrastructure details and steps to implement the virtualized GPUs, high-speed networking, and supporting storage that lay the foundation for the NVIDIA AI Enterprise Suite. It describes both MIG-backed and time-sliced vGPUs in detail. The guide is divided into two sections. The first, a single-node introduction, covers the simplest form of deployment: an ML system that is self-contained on one vSphere host with a small collection of VMs. The second, a more advanced multi-node design, supports distributed ML training across multiple nodes/VMs through networking technologies such as RDMA over Converged Ethernet (RoCE) and PVRDMA, with underlying vSphere support for features like Address Translation Services (ATS) and Access Control Services (ACS) providing performance at the PCIe layer. This deployment document will be paired with further Day 2 Operations guidance in due course.
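To make the MIG-backed versus time-sliced distinction concrete, here is an illustrative host-side configuration fragment. This is a sketch only, not taken from the deployment guide: the GPU index, instance profile, and vGPU profile names below are assumptions chosen as typical A100 examples, so consult the NVIDIA vGPU documentation for the profiles valid on your hardware.

```shell
# Illustrative sketch (assumed values) -- enabling MIG mode on GPU 0 of an
# NVIDIA A100 host, using the standard nvidia-smi tool:
nvidia-smi -i 0 -mig 1

# List the GPU instance profiles the card supports:
nvidia-smi mig -lgip

# With MIG enabled, a MIG-backed vGPU profile (for example grid_a100-3-20c,
# an assumed profile name) can be assigned to a VM; with MIG disabled, a
# time-sliced profile (for example grid_a100-20c) shares the whole GPU
# among VMs in time slices instead of partitioning it spatially.
```

The practical difference: MIG-backed profiles give each VM a hardware-isolated slice of the GPU, while time-sliced profiles multiplex the full GPU across VMs in turn.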
VMware Sizing Guidance
The VMware design sizing guidance document takes a set of classic ML applications, such as ResNet-50, that we have used in the labs at VMware and matches them with the CPUs/GPUs and other infrastructure components that support them best. Your own application set may differ, but comparing it at the requirements level with these sizing examples gives you a feel for how a virtualized ML system will behave under load, and a good basis for constructing your own infrastructure to suit the application.
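The comparison the sizing guidance encourages can be reduced to simple throughput arithmetic. The sketch below is illustrative only: the throughput figures are placeholder numbers, not measured results from the guide, so substitute values from the sizing document or your own benchmarks.

```python
import math


def gpus_needed(target_images_per_sec: float, per_gpu_images_per_sec: float) -> int:
    """Round up the number of GPUs (or vGPU profiles) required
    to sustain a target training/inference throughput."""
    return math.ceil(target_images_per_sec / per_gpu_images_per_sec)


# Hypothetical example: a pipeline that must sustain 5,000 images/s,
# where one GPU was benchmarked at 1,400 images/s on ResNet-50.
print(gpus_needed(5000, 1400))  # -> 4
```

This ignores scaling overheads such as inter-node communication, which is exactly why the multi-node networking design (RoCE, PVRDMA) in the deployment guide matters for distributed training.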
Dell and HPE Reference Documents
Along with the VMware technical guidance given in the two articles above, there are two recently published, detailed reference documents on deploying AI/ML on vSphere from our OEM partners, Dell and HPE. We recommend that you read these carefully for their designs and best practices in setting up your systems for machine learning with NVIDIA AI Enterprise. The server systems mentioned in these documents are also certified by NVIDIA, giving you confidence that the companies have tested and validated the stack on these platforms.
The Dell Design Guide for Virtualizing GPUs for AI with Dell EMC Infrastructure describes recommended hardware and software configurations from Dell, with acceleration from NVIDIA, for AI/ML applications. It covers server and network sizing for a range of configurations.
The HPE Reference Configuration for Enterprise AI Infrastructure on HPE ProLiant DL380 Gen10 Server is a great starting point for adopting AI/ML on your virtualized infrastructure. It delves into the details of suitable HPE hardware for this purpose, describing the compute, storage, networking, and acceleration aspects of an AI/ML design with a focus on the NVIDIA A100 as the core GPU in the architecture.
You can find more information on the NVIDIA AI Enterprise Suite of software at the following reference pages:
Solution Brief: AI-Ready Enterprise Platform
Early Access: NVIDIA AI Enterprise Product Page (with early access signup)
Partner Site: VMware Partnership Page
Partner Site: NVIDIA Partnership Page