This article was co-authored by Joe Cullen, Technical Marketing, NVIDIA and Justin Murray, Technical Marketing, VMware.
Large Language Models are an essential component of Generative AI, which is why Enterprises are rushing to integrate these models into their own datacenters today. LLMs allow future-facing organizations to pivot and build smarter chatbots, create marketing material, summarize large documents and predict business conditions.
Enterprise Generative AI
ChatGPT is a very powerful tool, and many businesses are using it for serious projects. However, enterprises need the ability to customize foundational models with their own proprietary data, while controlling access to that data and scaling into production. A range of skills and focus areas, together with continuous improvement pipelines, ensures that enterprises get the response quality and model performance needed to deliver AI to their end users.
Figure 1: Customizing/tuning a foundation model to create an enterprise-specific model
Early-adopter businesses are finding that, as they use LLMs, their data scientists need to run many different experiments in order to:
(a) Choose the correct compute infrastructure for running model customization, while also giving due consideration to the inference workloads that will come later (and which will run on different platforms)
(b) Choose the most appropriate foundation model and the right dataset for customizing the model
(c) Enrich LLMs to answer questions from an Enterprise Knowledge Base
(d) Iterate and test models in a simple to use inference playground
(e) Scale out the production inference infrastructure quickly
The data scientist therefore has many moving parts to deal with, from the versions of Python and the many data science libraries that use it, all the way up to the different types of models they can use today. Industry and academic innovation is happening here, especially in the new model space, at a very fast pace. It is hard to keep up with these developments, even for the experts!
Because of this landscape, data scientists are being overwhelmed, as they have to re-build or re-factor their LLM platforms and infrastructure almost on a daily basis. When we tested LLMs in-house at VMware, for our own use, we needed to move entire projects from one lab setup to another - to get more GPU power, for example.
This phenomenon of changing software versions and changing infrastructure is very natural for data science work today and it is facilitated by the VMware Cloud Foundation. It is far easier and quicker to create VMs and containers in Kubernetes within those VMs, than to do such a big change on bare metal. This is a fast-moving environment. IT needs to provision those GPU-enabled VMs quickly and not wait around for a hardware purchase to complete.
The focus in this article is on two parts: (1) VMware Private AI Foundation with NVIDIA AI Software, which helps the user to customize, retrain and deploy their models using NVIDIA’s techniques and (2) Underlying virtualized infrastructure that makes rapid re-factoring easier for the data scientist and provides a robust, managed inference environment.
Production Ready Software for Generative AI
VMware and NVIDIA have collaborated for several years now on enabling virtualized GPUs and high-speed networking, and on developing ML tools and platforms for the data scientist to use. An outline of the jointly-developed VMware and NVIDIA reference architecture for Gen AI is shown below.
VMware Private AI Foundation
VMware Private AI Foundation with NVIDIA brings together VMware Cloud Foundation and NVIDIA’s AI software, making it easy to go from Gen AI development to production. VMware continues to drive technical innovations at the infrastructure layer to enable enterprises to customize and deploy LLMs for Generative AI. The key technical recommendations for Generative AI deployment are given in the VMware Baseline Reference Architecture document.
The key VMware infrastructure components should be very familiar to users of vSphere and VMware Cloud Foundation (VCF) today. For example, one of the vSphere technical requirements for customizing LLMs is to use multiple virtualized GPUs and high-speed networking, as described in the baseline reference architecture. LLM customization typically requires multiple vGPUs assigned to the VM, each using a full-size vGPU profile, which gives more GPU memory for model customization.
NOTE: The details such as the NVIDIA drivers and GPU/network operators are given in the baseline reference architecture.
The NVIDIA AI Enterprise software components used for customizing LLMs include frameworks and tools, as well as a set of Generative AI Knowledge Base Reference Workflows. Organizations can start their AI journey by using the open, freely available NVIDIA AI frameworks to experiment and pilot. But when they’re ready to move from pilot to production, enterprises can easily transition to a fully managed and secure AI platform with an NVIDIA AI Enterprise subscription. This gives enterprises deploying business-critical AI the assurance of business continuity with NVIDIA Enterprise Support and access to NVIDIA AI experts. The NVIDIA AI Enterprise software suite is fully supported on vSphere. The following sections describe the NVIDIA AI Enterprise tools and software components for Generative AI.
NVIDIA AI Workbench
NVIDIA AI Workbench is an easy-to-use toolkit that allows a data scientist to seamlessly connect to multiple virtual compute clusters. Data scientists will often start working locally, experimenting with an LLM. But as the AI project gets more complex, they need much more memory and compute power to customize the LLM. By using AI Workbench they are able to quickly and easily migrate their project to a more powerful VMware cluster. The toolkit automatically creates the project's environment and builds the container with all dependencies, including Jupyter, so that it is optimized to run on the virtual cluster. The data scientist can then use this GPU-accelerated environment to customize their LLM.
The NVIDIA NeMo Framework
The NVIDIA NeMo framework is a game-changer for enterprises looking to leverage generative AI. It is an end-to-end, cloud-native framework that is used to build, customize, and deploy generative AI models anywhere. It includes training and inferencing containers, which contain libraries such as PyTorch Lightning that can be used for p-tuning a foundation model.
NeMo allows enterprises to continuously refine LLMs with techniques such as p-tuning and reinforcement learning from human feedback (RLHF). This flexibility makes it possible to develop functional skills, focus on specific domains, and prevent inappropriate responses.
NeMo integrates seamlessly with the NVIDIA Triton Inference Server to accelerate the inference process, delivering cutting-edge accuracy, low latency, and high throughput. As part of the NVIDIA AI Enterprise software suite, NeMo is backed by a team of dedicated NVIDIA experts providing unparalleled support and expertise.
Once an LLM has been customized, it can be optimized for low latency and high throughput inference using NVIDIA TensorRT-LLM. TensorRT-LLM is a toolkit to assemble optimized solutions to perform LLM inference. The Python API can be leveraged to define models and compile efficient TensorRT engines for NVIDIA GPUs. Additionally, Python and C++ components can be used to build runtimes to execute those engines. TensorRT-LLM supports multi-GPU and multi-node configurations (through MPI). TensorRT-LLM also includes Python and C++ backends for NVIDIA Triton Inference Server to assemble solutions for LLM online serving.
Triton Management Service (TMS), exclusively available to NVIDIA AI Enterprise subscribers, simplifies orchestrating and scaling Triton Inference Server pods on Kubernetes in production. Triton Inference Server pods are where your model is deployed into production.
TMS makes deploying Triton models on Kubernetes (K8s) easy, even with limited Kubernetes experience. It helps manage dynamic loading/unloading of models, model auto scaling and better overall model management.
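Once a model is running in a Triton Inference Server pod, applications reach it over Triton's HTTP/REST inference protocol. The following is a minimal sketch of how a client might assemble such a request. The host name, model name and input tensor names (`text_input`, `max_tokens`) are hypothetical placeholders, since the real names depend on how your model is configured. The request is only built here, not sent.

```python
import json
import urllib.request


def build_infer_request(base_url, model_name, prompt, max_tokens):
    """Assemble (but do not send) an HTTP inference request for Triton.

    The tensor names below are illustrative assumptions; check the
    configuration of your own deployed model for the actual names.
    """
    body = {
        "inputs": [
            {"name": "text_input", "shape": [1, 1],
             "datatype": "BYTES", "data": [prompt]},
            {"name": "max_tokens", "shape": [1, 1],
             "datatype": "INT32", "data": [max_tokens]},
        ]
    }
    return urllib.request.Request(
        url=f"{base_url}/v2/models/{model_name}/infer",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical service address and model name, for illustration only.
request = build_infer_request(
    "http://triton.example.internal:8000", "enterprise-llm",
    "Summarize our travel policy.", 128,
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

In a Kubernetes deployment, the base URL would normally be the address of the Service that fronts the Triton pods rather than an individual pod.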
NVIDIA Generative AI Knowledge Base Q&A Workflow
The NVIDIA Generative AI Knowledge Base Questions and Answering AI Workflow is a reference example of how to use the aforementioned NVIDIA AI Enterprise components to build a Generative AI chatbot. By leveraging these components, the chatbot is able to accurately answer domain-specific questions, based on the enterprise knowledge-base entities.
The NVIDIA Generative AI Knowledge Base Questions and Answering AI Workflow takes an existing open-source community foundation model (Llama 2) and customizes it with prompt tuning (p-tuning). Adapting an existing foundation model is a low-cost way for enterprises to generate accurate responses for their specific use cases. Once the model has been p-tuned, it is chained to a vector database by using LangChain. This allows multiple LLM applications to talk to the LLM, with answers that are based on real enterprise data sources.
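To make the idea of chaining a model to a vector database more concrete, here is a toy sketch of the retrieval step in plain Python. It does not use LangChain or a real vector database: the bag-of-words `embed` function, the `ToyVectorStore` class and the sample documents are all invented stand-ins for illustration. The point is only the flow: embed the question, find the closest knowledge-base entry, and place it in the prompt that is sent to the LLM.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self, documents):
        self.entries = [(doc, embed(doc)) for doc in documents]

    def search(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]),
                        reverse=True)
        return [doc for doc, _ in ranked[:k]]


def build_prompt(question, store, k=1):
    """Combine retrieved enterprise context with the user's question."""
    context = "\n".join(store.search(question, k))
    return (f"Answer using only this context:\n{context}\n\n"
            f"Question: {question}\nAnswer:")


# Invented knowledge-base entries, for illustration only.
docs = [
    "Expense reports must be submitted within 30 days of travel.",
    "The VPN client is installed from the internal software portal.",
    "GPU clusters are reserved through the platform request form.",
]
store = ToyVectorStore(docs)
prompt = build_prompt("How are GPU clusters reserved?", store)
print(prompt)
```

In the actual workflow, the resulting prompt would be passed to the p-tuned model; here it is simply printed.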
This AI reference workflow contains:
- NVIDIA AI Workbench
- NeMo Framework for Training and Inference
- NVIDIA Triton Management Service (TMS)
- TensorRT LLM (TRT-LLM) for low latency and high throughput inference for LLMs
- LangChain and vector database
- Cloud native deployable bundle packaged as Helm charts
- Guidance on performing training and customization of the AI solution to fit your specific use case. For example, Low Rank Adaptation (LoRA) fine-tuning or p-tuning.
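As a rough illustration of why techniques like LoRA are attractive for customization, the arithmetic below counts the trainable parameters LoRA adds. For a frozen weight matrix of size d_out x d_in, LoRA trains two small matrices of sizes d_out x r and r x d_in, so it adds r * (d_in + d_out) parameters. The hidden size (4096) and layer count (32) are those of Llama 2 7B. The rank of 8 and the choice to adapt two projection matrices per layer are assumptions made for this example, not settings taken from the workflow.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters LoRA adds next to one frozen d_out x d_in weight matrix."""
    return rank * (d_in + d_out)


hidden_size = 4096       # Llama 2 7B hidden dimension
num_layers = 32          # Llama 2 7B transformer layers
adapted_per_layer = 2    # assumption: adapt two 4096 x 4096 projections
rank = 8                 # assumption: a commonly used small rank

per_matrix = lora_trainable_params(hidden_size, hidden_size, rank)
total = per_matrix * adapted_per_layer * num_layers
base_params = 7_000_000_000  # nominal size of the frozen base model

print(f"per adapted matrix: {per_matrix:,}")
print(f"total trainable:    {total:,}")
print(f"share of base model: {100 * total / base_params:.3f}%")
```

Under these assumptions only a small fraction of a percent of the parameters is trained, which is what keeps this style of customization low-cost compared with retraining the full model.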
VMware Infrastructure for Generative AI - an Outline
In keeping with the VMware+NVIDIA AI-Ready Enterprise strategy, VMware continues to drive technical innovations at the infrastructure layer to enable enterprises to train and deploy LLMs for Generative AI. The key technical recommendations for Generative AI deployment are provided within VMware's Baseline Reference Architecture document and we look at a few of those new features here.
LLM customization and inference pipelines are orchestrated using Kubernetes. This is the orchestrator technology platform of choice for VMware and NVIDIA since many LLM tools and platforms make use of Kubernetes technology today. A combination of GPU-capable nodes and non-GPU-capable nodes in your Kubernetes cluster is recommended as well. The following screenshot illustrates a simpler version of a Gen AI cluster, when deployed on vSphere and VMware Cloud Foundation (VCF).
Tanzu Kubernetes Clusters (TKCs) provide enterprises with the ability to create LLM workload clusters very quickly and to adjust their size and content as needed. This provides the flexibility to accommodate various data scientist teams, where each team needs separate ML toolkits or platform versions. The teams can co-exist on the same hardware, but each team has its own TKC as a self-contained sandbox. With suitable permissions, the devops or administrator user can create a new cluster with one “kubectl apply -f” command, and make changes to the clusters in a very similar way.
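The sketch below shows roughly what the file passed to “kubectl apply -f” could contain, generated here as JSON from Python (kubectl accepts JSON as well as YAML). The layout follows our understanding of the v1alpha3 TanzuKubernetesCluster API and should be checked against the release you run. Every name and value (cluster, namespace, VM classes, storage class, Tanzu Kubernetes release) is a hypothetical placeholder. It defines one GPU-capable node pool and one non-GPU node pool, in line with the recommendation above.

```python
import json


def tkc_manifest(name, namespace, tkr_name, storage_class,
                 gpu_vm_class, gpu_workers, cpu_vm_class, cpu_workers):
    """Build a TanzuKubernetesCluster manifest as a plain dict."""

    def pool(pool_name, replicas, vm_class):
        return {"name": pool_name, "replicas": replicas,
                "vmClass": vm_class, "storageClass": storage_class}

    return {
        "apiVersion": "run.tanzu.vmware.com/v1alpha3",
        "kind": "TanzuKubernetesCluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"topology": {
            "controlPlane": {
                "replicas": 3,
                "vmClass": cpu_vm_class,
                "storageClass": storage_class,
                "tkr": {"reference": {"name": tkr_name}},
            },
            "nodePools": [
                pool("gpu-workers", gpu_workers, gpu_vm_class),
                pool("cpu-workers", cpu_workers, cpu_vm_class),
            ],
        }},
    }


# All values below are hypothetical placeholders.
manifest = tkc_manifest(
    name="llm-team-a", namespace="data-science",
    tkr_name="example-tkr-release", storage_class="example-storage-policy",
    gpu_vm_class="example-vgpu-class", gpu_workers=2,
    cpu_vm_class="example-cpu-class", cpu_workers=3,
)
print(json.dumps(manifest, indent=2))
```

Saving the printed output to a file and applying it would be the one-command cluster creation described above. Resizing a team's cluster then amounts to changing a `replicas` value and applying the file again.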
LLM Foundation models may not fit into the space allowed by one GPU’s framebuffer. For that reason, LLM customization workloads require several GPUs attached to their worker VM. The “VM class” mechanism in vSphere makes this very easy to do. A VM can be assigned one, two or more virtual GPUs, using NVIDIA vGPU profiles. This type of multi-GPU arrangement is also very easy to do on regular, non-Kubernetes VMs in the VMware vSphere Client.
In the example screenshot below, an administrator or devops person is choosing a device group that will be added to a VM.
This device group can represent 2, 4 or 8 GPUs. Additionally, vSphere can now discover that those GPUs are using NVLink and NVSwitch at the hardware layer. These features make it very easy to construct a powerful VM that can handle very large model sizes.
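A back-of-the-envelope calculation can help when choosing how many GPUs to attach. The sketch below estimates the memory taken by the model weights alone and picks the smallest GPU count from 1, 2, 4 or 8 that covers it. The 2 bytes per parameter (16-bit weights), the 80 GB of memory per GPU and the 20% headroom factor are assumptions for illustration. Real customization jobs also need memory for activations, gradients and optimizer state, so treat the result as a lower bound rather than a sizing rule.

```python
import math

GPU_COUNTS = (1, 2, 4, 8)  # a single vGPU, or a 2-, 4- or 8-GPU device group


def weights_gb(num_params, bytes_per_param=2):
    """Memory for the weights alone, in decimal GB (16-bit by default)."""
    return num_params * bytes_per_param / 1e9


def gpus_needed(num_params, gpu_memory_gb=80, headroom=1.2):
    """Smallest supported GPU count whose combined memory fits the weights.

    Returns None when even 8 GPUs are not enough, which points towards
    spreading the job across multiple servers.
    """
    required_gb = weights_gb(num_params) * headroom
    minimum = math.ceil(required_gb / gpu_memory_gb)
    for count in GPU_COUNTS:
        if count >= minimum:
            return count
    return None


for label, params in [("7B", 7_000_000_000),
                      ("13B", 13_000_000_000),
                      ("70B", 70_000_000_000)]:
    print(label, f"{weights_gb(params):.0f} GB of weights ->",
          gpus_needed(params), "GPU(s)")
```

With these assumptions, a 70-billion-parameter model carries about 140 GB of weights, which already exceeds one 80 GB GPU before any working memory is counted.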
GPUs that leverage NVSwitch/NVLink have a very high-bandwidth connection directly between all the GPUs. This allows up to 600 GB/second bidirectional bandwidth on Ampere-class GPUs and up to 900 GB/second bidirectional bandwidth on the Hopper-class (H100) GPUs. These levels of speed are needed when large models are being trained across the set of GPUs available in the HGX class of machines from vendors. The following diagram further illustrates NVSwitch/NVLink-based GPUs.
Multi-Node Training/Customization of Models
In some cases, the data science users will want to distribute the model customization across multiple servers with different VMs (referred to as “multi-node training”). This approach is used if the model does not fit within the combined GPU memory of one server's collection of GPUs. The infrastructure for multi-node training involves both GPUs and high-speed networking cards, like the ConnectX-7 and BlueField, which have been tested and validated on VMware vSphere and VMware Cloud Foundation. NVIDIA recommends 200 Gb/s networking bandwidth between servers/nodes for multi-node training, particularly for east-west traffic, since it carries the model gradients between the training participants. For north-south networking traffic, i.e. for inference requests coming in from outside the enterprise, the recommendation is to use the BlueField technology, thus offloading many of the security functions from the CPU.
The following graphic illustrates a multi-node training compute node with four GPUs, ConnectX-7 and BlueField high-speed networking.
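To give a feel for why the inter-node bandwidth matters, the arithmetic below compares how long it would take to move one full set of gradients over a 200 Gb/s network link and over the 900 GB/s NVLink figure quoted earlier for Hopper-class GPUs. The 7-billion-parameter model with 16-bit gradients is an assumed example. The calculation is deliberately simplistic: it ignores the collective-communication algorithm, latency, overlap with computation, and the fact that the NVLink figure is a bidirectional aggregate.

```python
def transfer_seconds(payload_bytes, link_bytes_per_second):
    """Idealized time to push a payload through a link at full rate."""
    return payload_bytes / link_bytes_per_second


BYTES_PER_SECOND_PER_GBIT = 1e9 / 8         # 1 Gb/s expressed in bytes/s

network_link = 200 * BYTES_PER_SECOND_PER_GBIT   # 200 Gb/s = 25 GB/s
nvlink_hopper = 900e9                             # 900 GB/s, as quoted above

# Assumption: full fine-tuning of a 7B-parameter model, 2 bytes per gradient.
gradient_bytes = 7_000_000_000 * 2

t_network = transfer_seconds(gradient_bytes, network_link)
t_nvlink = transfer_seconds(gradient_bytes, nvlink_hopper)

print(f"gradient payload:  {gradient_bytes / 1e9:.0f} GB")
print(f"over 200 Gb/s:     {t_network:.2f} s")
print(f"over NVLink:       {t_nvlink:.4f} s")
print(f"ratio:             {t_network / t_nvlink:.0f}x")
```

Even at the recommended 200 Gb/s, the link between servers is far slower than the links between GPUs inside one server, which is why multi-node training is reserved for models that do not fit within a single server's GPUs.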
For Kubernetes environments on the VMware platforms, NVIDIA supplies the Network Operator as part of the NVIDIA AI Enterprise suite. The Network Operator eases the installation and ongoing management of the networking drivers (e.g., MOFED and the peer-to-peer drivers) on the relevant Kubernetes nodes. More technical details on that multi-node, distributed setup are given in the Baseline Reference Architecture document from VMware.
This blog captures the key technologies from NVIDIA and VMware that together form a solid basis for your machine learning, LLM and generative AI developments. Together the two companies’ technologies offer a robust solution for LLM customization and model deployment for production inference.
In this article, we have reviewed the NVIDIA AI Enterprise software suite that supports all of the above. That software depends on robust and scalable infrastructure to drive it. VMware Cloud Foundation and vSphere together provide a platform capable of the quick deployment and rapid change that data scientists need to do their work effectively. VMware and NVIDIA are partnering to support you in this new and exciting field of generative AI.