May 06, 2024

What’s New in the VMware Private AI Foundation with NVIDIA – GA Release

With today's announcement that VMware Private AI Foundation with NVIDIA is now generally available, we describe some additional updates that our engineers delivered while the GA version of this VCF add-on product was being prepared. We cover two new GPU-monitoring dashboards in VCF Operations, along with a set of example PowerCLI scripts that can be customized to set up the environment for VMware Private AI Foundation with NVIDIA.

When we announced the initial availability of VMware Private AI Foundation with NVIDIA at the NVIDIA GTC Conference in March 2024, we published a technical blog that described the main technical features at the VMware layer, shown in the blue section of the architecture diagram below.


VMware Private AI Foundation with NVIDIA - a high-level architecture picture

VMware Private AI Foundation with NVIDIA is an add-on product to VMware Cloud Foundation. For a detailed walkthrough of the functionality, check out this technical overview. A core principle of VMware Private AI Foundation with NVIDIA is that enterprises will choose a pre-trained model, bring it on-premises into their VCF data center, and either fine-tune it with private data or use it in a Retrieval Augmented Generation design for their business applications.

Along with today's GA announcement, here are the additional updates that our engineers implemented while the GA version of the product was being prepared for delivery.

1. More Dashboards for GPU Metrics and Heatmaps in VCF Operations 

At initial availability in March 2024, we showed you some of the VCF Operations capabilities for monitoring GPU core and GPU memory usage levels. Now, at GA, we have two new dashboard views of GPU usage across clusters of host servers.

First, VCF Operations now shows you a dashboard summary of the GPU-equipped clusters in your environment. As seen here, we can quickly determine that GPU memory is at a premium in this cluster.

VCF Operations dashboard showing a numeric summary of GPU-equipped clusters


Second, the VMware engineers added visibility into the temperature at which the GPUs are operating, along with memory usage and compute utilization. This is important because it gives us early warning of GPU overheating conditions that can occur under high application load. That high-level dashboard is shown here.

VCF Operations "GPU Overview" Dashboard showing heatmaps


In this second VCF Operations dashboard, we see three panels measuring different key aspects of the GPUs across multiple clusters and multiple host servers. Each colored box represents a server, and the servers are grouped into their named clusters. The cluster names are shown in the grey boxes at the top of each panel, with the host names beneath them. The three panels measure GPU temperature (where red indicates a high temperature), GPU memory usage (where red indicates low usage), and GPU compute utilization (where red indicates low utilization). We chose red for low values in the latter two panels because they draw your attention to GPUs that are not being used to their full capacity.

By selecting a green GPU tile in this dashboard, we can drill down to see the detailed topology of the GPU's location, such as the data center, cluster, and host where the GPU is present, along with GPU properties (instance name and device type, seen in the bottom-right panel). This is shown below, where we have highlighted one panel for one host under the vcfw1-m02-cl01 cluster. The Topology panel in the lower middle section of the screen gives the full view of that GPU's location, providing a precise location for any GPU that we are investigating for potential issues.

We can also drill down in a different way, using the Object Browser to navigate from a vCenter Server (one of many that may exist in a VCF environment) to a cluster, to a host, to an individual GPU device, and then to its temperature or other metrics for more detail. Note that temperature is presented for individual GPUs, not aggregated at the cluster level.
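As a complement to the dashboards, you can also enumerate the GPU devices on your hosts from the command line. The sketch below is not part of the example scripts described later; it is a minimal, hypothetical PowerCLI fragment (the server and cluster names are placeholders), assuming a connected PowerCLI session with the Get-VMHostPciDevice cmdlet available:

```powershell
# Connect to the vCenter Server for the workload domain (placeholder address).
Connect-VIServer -Server "vcf-w01-vc01.example.com"

# Walk the hosts in a GPU-equipped cluster and list NVIDIA display devices.
# DeviceClass DisplayController covers GPUs; filter on the vendor name.
foreach ($vmHost in Get-Cluster -Name "vcfw1-m02-cl01" | Get-VMHost) {
    Get-VMHostPciDevice -VMHost $vmHost -DeviceClass DisplayController |
        Where-Object { $_.VendorName -match "NVIDIA" } |
        Select-Object @{N = "Host"; E = { $vmHost.Name }}, Name, VendorName
}
```

This lists only the physical PCI devices; the vGPU profiles carved out of them are managed by the NVIDIA vGPU Manager discussed in the next section.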

VCF Operations drill-down from vCenter Server to cluster to host to GPU to temperature

There are many more features for comparing host GPU usage across your clusters and for troubleshooting issues; you will find them in the VCF Operations documentation.


2. Automation of the Infrastructure Setup for VMware Private AI with NVIDIA 

To help our users deploy the new GA release of VMware Private AI Foundation with NVIDIA, VMware's VCF engineering team produced a set of four example PowerCLI scripts. These example scripts can be downloaded from the GitHub page, customized, and then used in your own environment.

The four initial example PowerCLI scripts help you automate the setup of the infrastructure prerequisites for running the VMware Private AI Foundation with NVIDIA suite. A fully deployed VMware Private AI Foundation with NVIDIA infrastructure hosts a set of NVIDIA AI Enterprise containers/microservices that support Large Language Models and Retrieval Augmented Generation (RAG) designs. These include the NVIDIA Inference Microservice (NIM) and the NVIDIA NeMo Retriever microservice, among several others, as seen in the architecture picture at the start of this blog. These components are not deployed directly by the new PowerCLI example scripts. Instead, they are downloaded onto a Deep Learning VM automatically at the VM's first boot, where that Deep Learning VM is created through a VCF Automation process.

The example PowerCLI scripts focus on setting up the underlying infrastructure. They take us all the way to the point of being ready to deploy Deep Learning VMs from the pre-loaded content library. The administrator manages that content library, which holds an entry for the Deep Learning VM image. In this way, the PowerCLI scripts prepare the VCF environment for subsequently downloading and running the containers/microservices, such as NVIDIA NIM or NeMo Retriever, that are key parts of VMware Private AI Foundation with NVIDIA.

The example PowerCLI scripts help you with:

  1. Creating a VCF workload domain from a set of VCF host servers
  2. Installing the NVIDIA vGPU Manager (host driver) on your hosts using vSphere Lifecycle Manager (vLCM)
  3. Creating an NSX Edge Cluster within a workload domain, which serves as the external access mechanism connecting your AI workload to the outside world
  4. Enabling the Workload Control Plane (i.e., the Kubernetes Supervisor Cluster) for the vCenter Server within the workload domain. The Supervisor Cluster is used to create Deep Learning VMs or Kubernetes nodes from the template in the VMware Private AI Foundation with NVIDIA content library. This script also creates that content library, along with a set of VM class objects. Where the VM class objects reference GPUs, they use the example vGPU profiles present in the PowerCLI script; you would edit these entries to match the vGPU profiles that apply to your situation. When creating a Deep Learning VM or a node in a Kubernetes cluster, the end user chooses a VM class to associate with the VMs in their deployment. The script also associates the content library with the appropriate VM class instances.
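To give a flavor of what step 4 involves, here is a minimal, hypothetical PowerCLI fragment for creating the content library that will hold the Deep Learning VM image. It is not an excerpt from the published scripts, and the server, datastore, and library names are placeholders you would replace with your own values:

```powershell
# Connect to the workload domain's vCenter Server (placeholder address).
Connect-VIServer -Server "vcf-w01-vc01.example.com"

# Create a local content library on a datastore in the workload domain.
# The administrator later uploads the Deep Learning VM image into it.
$datastore = Get-Datastore -Name "vcf-w01-vsan01"
New-ContentLibrary -Name "private-ai-dlvm-library" -Datastore $datastore

# Confirm the library exists before associating it with VM classes and
# the Supervisor Cluster configuration.
Get-ContentLibrary -Name "private-ai-dlvm-library" | Select-Object Name, Id
```

The published example scripts go further, creating the VM class objects and tying them to this library; the fragment above shows only the content library step.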

The PowerCLI scripts must be customized for your own environment before use. Examples of the items within the scripts that you customize for your own installation are:

- local IP addresses
- storage policy names
- workload domain names
- NSX Edge Cluster-specific names
- content library names and VM image names
- required vGPU profiles
- vCPU counts for VMs
- RAM sizes for VMs

among other customizable items.
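One common pattern is to gather these customizable items into a single variables section at the top of each script, so that environment-specific values live in one place. The fragment below is a hypothetical sketch of such a section; every name and value is a placeholder for illustration, not taken from the published scripts:

```powershell
# Environment-specific settings, edited once per installation.
# All values below are placeholders for illustration.
$Config = @{
    VCenterServer      = "vcf-w01-vc01.example.com"
    WorkloadDomainName = "vcf-w01"
    StoragePolicyName  = "vSAN Default Storage Policy"
    EdgeClusterName    = "vcf-w01-ec01"
    ContentLibraryName = "private-ai-dlvm-library"
    DlvmImageName      = "dlvm-ubuntu-22.04"
    VGpuProfile        = "nvidia_a100-40c"   # replace with a profile valid for your GPUs
    VCpuCount          = 16
    MemoryGB           = 64
}
```

Keeping these values in one block makes re-running the scripts against a new set of hardware a matter of editing a single section.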

Once you have edited and organized the scripts for your own VCF environment, they provide an automated way to reproduce that environment on other sets of hardware (or to re-create it), making VMware Private AI Foundation with NVIDIA even easier to implement for your data science teams.


In the newly released generally available version of the VMware Private AI Foundation with NVIDIA suite, we see additional functionality in the form of (1) new GPU monitoring capabilities in VCF Operations and (2) new example PowerCLI scripts to help you set up the infrastructure underlying your data science containers. These features, together with the powerful automation and Deep Learning VMs described in an earlier blog, will speed your way toward deploying LLMs and GenAI-based applications on VMware Cloud Foundation.

More Information

VMware Private AI Foundation with NVIDIA Guide

VMware Private AI Foundation with NVIDIA Server Guidance

VMware Private AI Foundation with NVIDIA - a Technical Overview

VMware Private AI Foundation with NVIDIA - Solution Brief

VMware Private AI Foundation with NVIDIA - Overview

VMware Private AI Starter Pack for Retrieval Augmented Generation Systems

Aria Automation 8.16.2 - Private AI Automation Services for NVIDIA

Automation Services for VMware Private AI

Retrieval Augmented Generation Basics for the Data Center Admin

Data Modernization with VMware Data Services Manager
