Deploying Run:ai Atlas Platform on VMware Tanzu for Accelerated AI Infrastructure
Introduction
In today's competitive landscape, organizations are increasingly embracing artificial intelligence (AI) to unlock new possibilities and gain a competitive edge. However, deploying and managing a scalable AI infrastructure can be complex and time-consuming. This QuickStart Guide aims to provide a streamlined approach to deploying Run:ai's Atlas Platform on VMware Tanzu®, enabling organizations to expedite the setup of their AI infrastructure and maximize their AI initiatives' potential.
Run:ai has collaborated closely with VMware to unleash the power of their combined AI infrastructure capabilities. By deploying Run:ai's Atlas Platform on VMware Tanzu, organizations can leverage the scalability and agility of Kubernetes alongside the robust virtualization capabilities of VMware Tanzu. This integration empowers IT teams with granular control and visibility over their AI and ML infrastructure, simplifying the underlying complexities. Simultaneously, researchers and data scientists can utilize their preferred data science tools within a cloud-native AI environment, fostering innovation and accelerating time-to-market for AI solutions.
This Deployment Guide provides a comprehensive walkthrough for deploying Run:ai's Atlas Platform on NVIDIA AI Enterprise leveraging a VMware Tanzu Kubernetes cluster. Following the step-by-step instructions in this guide, IT organizations can confidently set up Run:ai's Atlas Platform on VMware. The guide explores technical intricacies, highlights essential considerations, and provides best practices for a successful implementation.
Business Challenges of GPU-Enabled Workloads
Managing GPU-enabled workloads across various environments, including bare metal, containers, virtual machines, or cloud, poses significant challenges for businesses. One of the primary challenges is the complexity of resource allocation and management. GPU resources are scarce and expensive, and efficiently distributing them among multiple workloads requires careful planning and optimization. Additionally, different environments have different requirements and configurations, making it difficult to ensure consistent performance and compatibility across platforms. Moreover, organizations face challenges related to visibility and control over GPU utilization, as it can be challenging to monitor and allocate resources effectively, leading to underutilization or contention issues. Lastly, there is a need for seamless integration and portability across environments, enabling workload mobility and flexibility while maintaining performance and security. Addressing these challenges is crucial to ensure smooth operations, maximize resource utilization, and drive the success of GPU-enabled workloads in today's demanding business landscape.
Run:ai Atlas Platform
Run:ai's Atlas Platform revolutionizes AI infrastructure management by abstracting workloads from the underlying GPU and compute infrastructure. It creates a shared pool of resources that can be dynamically provisioned, enabling enterprises to fully utilize GPU compute across distributed teams. With Run:ai, data science and IT teams gain comprehensive control and real-time visibility into job run-time, queueing, and GPU utilization. This visibility extends beyond real-time monitoring, as the platform also provides historical metrics of cluster resources, empowering enterprises to make informed and data-driven decisions.
Figure 1. Run:ai on VMware Tanzu
The platform offers a virtual pool of resources, allowing IT leaders and data scientists to view and allocate compute resources seamlessly across multiple sites, whether on-premises or in the cloud. Built on top of Kubernetes, Run:ai integrates effortlessly with leading open-source frameworks commonly used by data scientists, ensuring a smooth integration into existing workflows.
It simplifies the process of creating, managing, and sharing development environments by enabling data scientists to self-provision workspaces in a secured way with their preferred model development tools (for example, Jupyter Notebooks, Weights & Biases and many more), the required compute resources and the data they need.
A core focus of Run:ai is to drive better utilization from compute clusters, enabling organizations to effectively allocate and utilize their GPUs, thus maximizing their return on investment (ROI) on hardware resources. By optimizing GPU utilization and resource allocation, Run:ai empowers enterprises to achieve greater efficiency and cost-effectiveness in their AI infrastructure.
NVIDIA NVAIE
NVIDIA NVAIE (NVIDIA AI Enterprise) is a comprehensive suite of software and hardware solutions designed to accelerate AI workloads across industries. It provides organizations with a powerful and scalable platform to leverage the capabilities of AI and deep learning. VMware, in collaboration with NVIDIA, offers a powerful combination of technologies that leverage vGPUs and Multi-Instance GPU (MIG) to optimize AI workloads on VMware vSphere® and VMware Tanzu.
Figure 2. NVIDIA AI Enterprise
NVIDIA NVAIE introduces the concept of virtual GPUs (vGPU), enabling multiple VMs or containers to share a single physical GPU. This technology enables organizations to achieve efficient resource utilization and enhanced flexibility within their VMware environment. By leveraging vGPUs, organizations can effectively consolidate their AI workloads, maximizing the utilization of GPU resources.
Taking the capabilities of vGPUs further, NVIDIA NVAIE introduces the MIG feature. With MIG, a single physical GPU can be partitioned into multiple smaller instances, each with its own dedicated resources. This fine-grained allocation allows organizations to run multiple AI workloads simultaneously on a single GPU, ensuring isolation and guaranteed performance. Together, vGPU and MIG empower VMware vSphere users to achieve optimal resource allocation and scalability for their AI workloads.
VMware vSphere with Tanzu
VMware vSphere with Tanzu transforms traditional virtualization infrastructure into a robust platform for containerized workloads. The Tanzu Kubernetes Grid Service (TKGS) facilitates the creation and management of Tanzu Kubernetes clusters natively within vSphere, seamlessly integrating Kubernetes capabilities with the reliable features of vSphere. With vSphere's networking, storage, security, and high availability features, organizations achieve better visibility and operational simplicity for their hybrid application environments.
Figure 3. vSphere with Tanzu
vSphere with Tanzu enables organizations to run application components in containers or virtual machines (VMs) on a unified platform, streamlining workload management and deployment. By converging containers and VMs, IT teams can leverage the benefits of both paradigms while maintaining a cohesive and scalable infrastructure.
Run:ai Architecture
Figure 4. Run:ai Architecture
Run:ai is installed as a Kubernetes operator on a Tanzu Kubernetes Grid (TKG) cluster. This TKG cluster provides the vSphere infrastructure resources and the associated GPUs. Researchers submit machine learning workloads through the Run:ai command line interface (CLI), through the GUI, or by sending YAML files directly to Kubernetes.
Run:ai supports managing multiple clusters: a single SaaS tenant can register multiple TKG clusters, either across different vSphere environments or within a single environment. This gives customers the flexibility to design vSphere clusters according to their business use cases and needs, so that different hardware profiles and GPU models can be consumed based on specific business rules.
There are two main components of Run:ai’s architecture: Run:ai’s cluster and control plane.
Figure 5. Run:ai Cluster
Run:ai Cluster
The Run:ai cluster contains all the components that are deployed on the TKG cluster. These components run as Kubernetes pods and provide the local logic and functionality of Run:ai. Here is an example of the pods deployed as part of the Run:ai cluster on Tanzu Kubernetes:
The Run:ai scheduler extends the default Kubernetes scheduler and applies business rules based on project quotas when scheduling the workloads submitted by researchers and data scientists. Another key function is fractional GPU management, which lets researchers allocate part of a GPU rather than a whole GPU. This can be combined with several technologies: Dynamic MIG (which leverages the MIG capabilities of Ampere-based GPUs), vGPU (part of NVIDIA's NVAIE stack), and passthrough technologies such as Dynamic DirectPath I/O. The Run:ai agent, along with other monitoring components such as Prometheus, forwards monitoring statistics to the Run:ai control plane.
A Run:ai cluster is installed in its own Kubernetes namespace, named runai, and is deployed via Helm. The deployment is covered later in the installation process section of this document. If the control plane is a SaaS tenant, the cluster requires internet access to communicate back to it.
Apart from the comprehensive suite of tools that Run:ai provides, the platform is designed to adapt to the preferred utilities of your data scientists and researchers. The power of integrating Run:ai into your workflow is highlighted by its ability to significantly improve the scheduling and orchestration of any K8s container-based workload. As a result, researchers can continue to use the tools they are familiar with, but with the added advantage of the enhanced GPU management and scheduling capabilities provided by Run:ai. The platform has demonstrated seamless compatibility with tools such as Kubeflow, MLflow, Airflow, JupyterHub, and others.
Run:ai Control Plane
The Run:ai control plane plays a crucial role in collecting and consolidating monitoring and performance metrics associated with computing infrastructure and ongoing tasks from the deployed Run:ai clusters. This component enables the integration of numerous clusters to a single backend, providing a unified view for managing and monitoring multiple clusters.
Figure 6. Run:ai Control Plane
For SaaS customers, Run:ai takes full responsibility for maintaining the control plane. Moreover, it serves as a single point for implementing administrative modifications to the platform. These include the integration with Single Sign-On (SSO) systems, managing projects, setting resource quotas, user creation, and others. Additionally, Run:ai offers an on-premises control plane deployment for those customers working in air-gapped environments.
The cluster transmits pertinent data and metrics to the control plane, enabling effective control and providing insightful visualizations through dashboards. Rest assured that Run:ai adheres to stringent data retention policies on its SaaS platform. Run:ai ensures no deep-learning artifacts are transmitted to the cloud, thus all code, images, container logs, training data, models, checkpoints, and similar data remain securely within the boundaries of corporate firewalls.
Deploying Run:ai on vSphere with Tanzu and NVAIE
Before we begin the deployment process, it is important to first verify that the prerequisites are met and to address the VMware infrastructure assumptions made in this document. Below we reproduce the table of requirements for your convenience but you can get the latest version from this Run:ai web page.
Table 1. Run:ai Prerequisites
Prerequisite | Details | Used in this guide
Kubernetes Version | Run:ai currently supports Kubernetes versions 1.21 through 1.26. | v1.24.4+vmware.1-vsc0.1.1-20884059, included with vSphere 8 GA.
NVIDIA GPU Operator | Run:ai requires NVIDIA GPU Operator version 1.9 or 22.9.x. | Version 22.9.1
Ingress Controller | Run:ai requires an ingress controller as a prerequisite. The Run:ai cluster installation configures one or more ingress objects on top of the controller. | NGINX
Prometheus | If not already installed on your cluster, install the full kube-prometheus-stack through the Prometheus community Operator. | Version 46.4.2
Domain Name | You must supply a domain name as well as a trusted certificate for that domain. | N/A
Hardware | Run:ai System Nodes: To reduce downtime and save CPU cycles on expensive GPU machines, Run:ai recommends that production deployments contain two or more worker machines designated for Run:ai software. The nodes do not have to be dedicated to Run:ai, but Run:ai needs the following resources: ● 4 CPUs ● 8GB of RAM ● 50GB of disk space | This quick-start guide does not use dedicated worker nodes because it is not a production-grade architecture; all the TKC components run on the same GPU-enabled worker nodes.
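The Kubernetes version range in Table 1 can be checked against a Tanzu Kubernetes release string before a cluster is built. The following is a minimal Python sketch, not part of the Run:ai or VMware tooling: the function names are hypothetical, and the range is simply the 1.21 through 1.26 window quoted in the table, which may change in later Run:ai releases.

```python
import re
from typing import Tuple

# Supported window as quoted in Table 1 of this guide (an assumption that
# should be re-checked against the current Run:ai prerequisites page).
SUPPORTED_MIN = (1, 21)
SUPPORTED_MAX = (1, 26)


def kubernetes_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as 'v1.24.4+vmware.1-...'."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_supported(version: str) -> bool:
    """Return True when the major.minor pair falls inside the supported window."""
    return SUPPORTED_MIN <= kubernetes_minor(version) <= SUPPORTED_MAX


# Version string listed in the 'Used in this guide' column.
print(is_supported("v1.24.4+vmware.1-vsc0.1.1-20884059"))
```

Only the major and minor numbers are compared, so patch levels and VMware build suffixes are ignored.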
Table 2. VMware Infrastructure Assumptions
Prerequisite | Details | Used in this guide
vSphere | For this guide, we assume that a working vSphere 8 (or newer) environment is properly designed and installed, ready to deploy WCP. | ESXi version 8.0, Build 20513097. vCenter version 8.0, Build 20920323.
VMware vSphere Distributed Switch™ (vDS) networking for Workload Management | For this guide, we use vDS networking. Other networking architectures, such as NSX or ALB for load balancing, are supported but are outside the scope of this document. | N/A
HAProxy Load Balancer | VMware provides an implementation of the open source HAProxy load balancer that you can use in your vSphere with Tanzu environment. If you are using vSphere Distributed Switch (vDS) networking for Workload Management, you can install and configure the HAProxy load balancer. | v0.2.0
GPU-Enabled ESXi hosts | Our goal in this guide is to enable a TKG cluster for GPU-based workloads and test Run:ai’s main GPU functionalities, Dynamic MIG and GPU Fractions. | For this guide, we used an ESXi cluster with NVIDIA V100s (Volta architecture) and A100s (Ampere architecture).
Tanzu Networks | This QuickStart guide uses HAProxy as the load balancer solution with a 3-interface configuration: a management network, a single workload network, and a dedicated network for frontend communication. | Refer to vSphere 8 Install and Configure the HAProxy Load Balancer Guide.
Tanzu CLI | The Tanzu CLI communicates with the management cluster to create and manage workload clusters on the cloud provider. This guide assumes that you have the Tanzu CLI installed and ready. | Refer to the Install the Tanzu CLI and Other Tools guide for more information.
Helm | This guide assumes that the Helm package manager is installed on the system where the Tanzu CLI will be used. | Refer to the Installing Helm documentation.
NVIDIA NGC API Key and DLS/CLS | |
Installation Process
There are multiple phases that need to be followed before installing Run:ai on a Tanzu Kubernetes Cluster. We will start by preparing the underlying vSphere infrastructure, then move on to the Tanzu Supervisor. Finally, we will cover the preparation and installation process on the TKC cluster.
Phase 1. vSphere Preparation
We start by configuring the NVIDIA GPU in vGPU mode and then cover how to configure MIG. Note that the examples in this guide use passthrough with Run:ai’s Dynamic MIG technology; the procedure to set up Dynamic MIG is covered later in this document.
Log in to the vSphere Client, navigate to vCenter > Datacenter > Cluster > Host > Configure > Graphics, and click Host Graphics.
Under the Host Graphics, click Edit and choose Shared Direct and Spread VMs across GPUs. Reboot the host after you change the settings.
Alternatively, you can access the ESXi host server either using the ESXi shell or through SSH and execute the following command:
esxcli graphics host set --default-type SharedPassthru
After the ESXi host has rebooted, verify that the settings have taken effect by running the following command:
esxcli graphics host get
Installing NVIDIA Virtual GPU Manager
Follow this procedure for each ESXi server with a GPU device installed.
Start by putting the ESXi host into maintenance mode: right-click the host and select Maintenance Mode > Enter Maintenance Mode.
SSH into the ESXi server as root and remove any older version of the NVIDIA Virtual GPU Manager vSphere Installation Bundle, if installed:
esxcli software vib remove -n `esxcli software vib list | grep "NVIDIA-AIE\|NVIDIA-VMware_ESXi" | awk '{print $1}'`
Install the NVIDIA Virtual GPU Manager VIB (VMware Installation Bundle) by running the following command:
esxcli software vib install -v Absolute_Path_of_Directory_of_the_VIB_File/NVIDIA-AIE*.vib
Reboot the ESXi host and exit maintenance mode. Once the host has rebooted, verify that the NVIDIA kernel driver can successfully communicate with the physical GPUs in the system by running the nvidia-smi command without any options.
nvidia-smi
If successful, the above command lists all the GPUs in the system (example of a host with 2x NVIDIA A100 GPUs):
NVIDIA MIG
The following procedure is optional and should only be followed when MIG is required or if you plan to use Run:ai’s Dynamic MIG technology. Time-sliced is the default NVIDIA vGPU mode; it does not offer complete hardware-level isolation between VMs that share a GPU. The streaming multiprocessor (SM) arranges tasks onto groups of cores in accordance with user-accessible methods like fair-share, equal share, or best effort. In time-sliced NVIDIA vGPU, all of the GPU's cores, as well as all of the hardware routes through the cache and cross-bars to the GPU's frame buffer memory, are potentially employed. Please check the references section for more information on NVIDIA MIG.
To enable MIG on a particular GPU on the ESXi host server, run the following command:
nvidia-smi -i 0 -mig 1
where -i refers to the physical GPU ID, since there may be more than one physical GPU on a machine. If no GPU ID is specified, then MIG mode is applied to all the GPUs on the system.
Continue with a second command as follows:
nvidia-smi --gpu-reset
That second command resets the GPU only if no running processes are currently using it. If nvidia-smi --gpu-reset reports that running processes are using the GPU, stop those processes and then reboot the ESXi server.
After the reboot, once the host is back up in maintenance mode, issue the command:
nvidia-smi
and you should see that MIG is now enabled:
Phase 2. vSphere with Tanzu Installation and Configuration
There are a few prerequisites for us to enable Workload Management on vSphere: a content library to store Tanzu Images and a VM Storage Policy. For the purposes of this guide, we will create a single storage policy that aligns with the available shared storage. In this example, we used "nfs-gold" to map to the NFS datastore assigned to the ESXi cluster.
We will create two content libraries, one for HAProxy and a second one for TKR images. To create a library, log in to the vSphere Client and select Content Libraries from the top menu.
Next, click the “+” (plus sign) to open the New Content Library wizard. We will need to specify the Content Library Name and any required Notes. Next, select the vCenter Server that will manage the new library. Once the library Name and Notes have been entered, click Next to continue.
Here is where the Content Library setup and configuration take place. We can either set this library to be a Local Content Library or we can Subscribe to another library by providing the Subscription URL of that Published Library.
In this example, we choose Subscribed Content Library and subscribe it to the following URL: http://wp-content.vmware.com/v2/latest/lib.json.
Select a storage location for the library’s contents. This can be a vSphere datastore that exists in inventory, or an SMB or NFS path to storage. In this example, an NFS datastore is used as the backing datastore for the Content Library. Choose the datastore and then click Next to continue. Review the settings chosen for the Content Library, then click Finish to complete. Repeat the process, this time selecting “Local Content Library”. This second library is used to simplify the HAProxy deployment; finish by uploading the HAProxy template to it.
HAProxy Deployment
Deploy an HAProxy appliance to serve as the load balancer for the Supervisor environment. NSX and ALB are also supported but are outside the scope of this guide.
Select the "Frontend" configuration during the template deployment process. This configures three interfaces for the HAProxy appliance: Management, Workload, and Frontend.
Follow all the wizard steps. Here is a summary of the settings configured for this guide, based on the available VLANs:
1.2. Permit Root Login = True
1.3. TLS Certificate Authority Certificate (ca.crt) =
1.4. TLS Certificate Authority Private Key (ca.key) =
2.1. Host Name = haproxy.local
2.2. DNS = 10.142.7.1,10.132.7.1
2.3. Management IP = 10.203.80.68/24
2.4. Management Gateway = 10.203.80.253
2.5. Workload IP = 10.203.86.10/25
2.6. Workload Gateway = 10.203.86.125
2.7. Additional Workload Networks =
2.8. Frontend IP = 10.203.86.130/25
2.9. Frontend Gateway = 10.203.86.253
3.1. Load Balancer IP Ranges, comma-separated in CIDR format (For example 1.2.3.4/28,5.6.7.8/28) = 10.203.86.144/28 --- NOTICE THIS IS A SUBSET OF VLAN 1005 (SUBNETTING)
3.2. Dataplane API Management Port = 5556
3.3. HAProxy User ID = admin
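The relationship between the Frontend network (setting 2.8) and the Load Balancer IP Ranges (setting 3.1) can be sanity-checked before the appliance is deployed. The sketch below uses Python's standard ipaddress module and the values from the summary above; the function name is hypothetical, and the check assumes that the load balancer CIDR is expected to sit inside the Frontend subnet, as the note on setting 3.1 suggests.

```python
import ipaddress


def load_balancer_hosts(frontend_interface: str, lb_cidr: str):
    """Return the first and last usable address of the load balancer CIDR.

    Raises ValueError if the CIDR is not contained in the frontend subnet.
    """
    frontend = ipaddress.ip_interface(frontend_interface)
    lb_network = ipaddress.ip_network(lb_cidr)
    if not lb_network.subnet_of(frontend.network):
        raise ValueError(f"{lb_network} is not inside {frontend.network}")
    hosts = list(lb_network.hosts())  # excludes network and broadcast addresses
    return hosts[0], hosts[-1]


# Values taken from settings 2.8 and 3.1 above.
first, last = load_balancer_hosts("10.203.86.130/25", "10.203.86.144/28")
print(first, last)
```

For the values used in this guide, the usable addresses run from 10.203.86.145 to 10.203.86.158, which is the range entered later as IP Address Ranges for Virtual Servers.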
After HAProxy has been deployed, connect to the appliance via SSH and copy the contents of /etc/haproxy/ca.crt. This will be used later during the Workload Management deployment phase:
Phase 3. Workload Management Configuration
Now that all the required prerequisites are in place, it is time to enable Workload Management on vSphere. From the home menu, select Workload Management.
Select a licensing option for the Supervisor Cluster. If you have a valid Tanzu Edition license, click Add License to add the license key to the license inventory of vSphere.
If you do not have a Tanzu edition license yet, enter the contact details so that you can receive communication from VMware and click Get Started.
The evaluation period of a Supervisor Cluster lasts for 60 days. Within that period, you must assign a valid Tanzu Edition license to the cluster. If you add a Tanzu Edition license key, you can assign that key within the 60-day evaluation period once you complete the Supervisor Cluster setup.
On the Workload Management screen, click Get Started again. Select a vCenter Server system, select vCenter Server Network, and click Next.
Select a cluster from the list of compatible clusters. On the Control Plane Size page, select the size for the Kubernetes control plane VMs that will be created on each host from the cluster and click Next.
Select the storage policy created earlier in this guide and click Next. Complete the information related to your HAProxy deployment. The Data Plane API Address should be the management IP address with port 5556 appended, the IP Address Ranges for Virtual Servers should be part of the Load Balancer IP Ranges specified during the HAProxy deployment, and the Server Certificate Authority should be the contents of the HAProxy certificate obtained earlier.
Option | Details | Used in this guide
Data Plane API Address | The IP address and port of the HAProxy Data Plane API. This component controls the HAProxy server and runs inside the HAProxy VM. This is the management network IP address of the HAProxy appliance. | 10.203.80.68:5556
User name and Password | The user name and password that are configured with the HAProxy OVA file. You use these credentials to authenticate with the HAProxy Data Plane API. | N/A
IP Address Ranges for Virtual Servers | Range of IP addresses that is used in the Workload Network by Tanzu Kubernetes clusters. This IP range comes from the list of IPs that were defined in the CIDR you configured during the HAProxy appliance deployment. Typically, this will be the entire range specified in the HAProxy deployment, but it can also be a subset of that CIDR because you may create multiple Supervisor Clusters and use IPs from that CIDR range. This range must not overlap with the IP range defined for the Workload Network in this wizard. The range must also not overlap with any DHCP scope on this Workload Network. | 10.203.86.145 - 10.203.86.158
Server Certificate Authority | The certificate in PEM format that is signed or is a trusted root of the server certificate that the Data Plane API presents. We copied the contents of this certificate (/etc/haproxy/ca.crt) in a previous step. | N/A
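The rule that the virtual server range must not overlap the Workload Network can be verified with a few lines of Python. This is an illustrative sketch only: the function name is hypothetical, and the Workload Network is derived from the Workload IP (10.203.86.10/25) configured on the HAProxy appliance earlier in this guide.

```python
import ipaddress


def range_overlaps_network(first: str, last: str, interface: str) -> bool:
    """Return True if the inclusive range [first, last] touches the subnet
    that the given interface address (IP/prefix) belongs to."""
    start = ipaddress.ip_address(first)
    end = ipaddress.ip_address(last)
    network = ipaddress.ip_network(interface, strict=False)
    # Two intervals overlap when each one starts before the other one ends.
    return start <= network.broadcast_address and network.network_address <= end


# Virtual server range and Workload IP used in this guide.
print(range_overlaps_network("10.203.86.145", "10.203.86.158", "10.203.86.10/25"))
```

A result of False means the two ranges are disjoint, which is what the wizard expects. The same helper can be pointed at a DHCP scope, expressed as a first and last address, to check the second rule in the table.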
On the Management Network screen, configure the parameters for the network that will be used for Kubernetes control plane VMs.
Option | Details | Used in this guide
Network | Select a network that has a VMkernel adapter configured for the management traffic. | Management Network
Starting Control IP address | Enter an IP address that determines the starting point for reserving five consecutive IP addresses for the Kubernetes control plane VMs as follows: ▪ An IP address for each of the Kubernetes control plane VMs. ▪ A floating IP address for one of the Kubernetes control plane VMs to serve as an interface to the management network. The control plane VM that has the floating IP address assigned acts as a leading VM for all three Kubernetes control plane VMs. The floating IP moves to the control plane node that is the etcd leader in this Kubernetes cluster. This improves availability in the case of a network partition event. ▪ An IP address to serve as a buffer in case a Kubernetes control plane VM fails and a new control plane VM is being brought up to replace it. | 10.203.80.161
Subnet Mask | Only applicable for static IP configuration. Enter the subnet mask for the management network. | 255.255.255.0
DNS Servers | Enter the addresses of the DNS servers that you use in your environment. If the vCenter Server system is registered with an FQDN, you must enter the IP addresses of the DNS servers that you use with the vSphere environment so that the FQDN is resolvable in the Supervisor Cluster. | N/A
DNS Search Domains | Enter domain names that DNS searches inside the Kubernetes control plane nodes, such as corp.local, so that the DNS server can resolve them. | eng.vmware.com
NTP | Enter the addresses of the NTP servers that you use in your environment, if any. | N/A
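Because the Starting Control IP address reserves five consecutive addresses, it helps to list them and confirm that they all fit inside the management subnet before running the wizard. The following Python sketch does this with the values from the table above; the function name is hypothetical, and treating the network and broadcast addresses as unusable is an assumption of this example.

```python
import ipaddress


def control_plane_reservation(start: str, netmask: str, count: int = 5):
    """List the consecutive addresses reserved from the starting IP and
    verify that each one is a usable host address in the management subnet."""
    first = ipaddress.ip_address(start)
    network = ipaddress.ip_network(f"{start}/{netmask}", strict=False)
    reserved = [first + offset for offset in range(count)]
    for address in reserved:
        unusable = address in (network.network_address, network.broadcast_address)
        if address not in network or unusable:
            raise ValueError(f"{address} is not a usable host in {network}")
    return reserved


# Starting IP and subnet mask used in this guide.
for address in control_plane_reservation("10.203.80.161", "255.255.255.0"):
    print(address)
```

With the values in this guide, the reservation covers 10.203.80.161 through 10.203.80.165, so none of those addresses should be assigned elsewhere on the management network.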
You can leave the default IP address for Services value in place or enter the desired network. Enter an appropriate IP address for your DNS Server.
Option | Details | Used in this guide
IP address for Services | Enter a CIDR notation that determines the range of IP addresses for Tanzu Kubernetes clusters and services that run inside the clusters. | 10.96.0.0/23 (auto-assigned)
DNS Servers | Enter the IP addresses of the DNS servers that you use with your environment, if any. | N/A
In the Workload Network page, enter the settings for the network that will handle the networking traffic for Kubernetes workloads running on the Supervisor Cluster.
Option | Details | Used in this guide
Portgroup | Select the port group that will serve as the Primary Workload Network to the Supervisor Cluster. | workload-network
DNS Servers | Enter the IP addresses of the DNS servers that you use with your environment, if any. | N/A
Phase 4. Supervisor Configuration
After configuring Workload Management, the next step is to create a vSphere Namespace in which to deploy a TKG cluster; a separate Kubernetes namespace for the NVIDIA GPU Operator is created inside that cluster later. Choose the specific Supervisor, provide a name for the namespace (runai-poc in this example), select the desired network, and then click Create:
Confirm that the newly created vSphere Namespace is ready:
Next, configure permissions for this namespace by setting administrator@vsphere.local as the owner. Click Manage Permissions, then click Add, select the SSO vsphere.local domain as the identity source, search for administrator, and select the Owner role:
Confirm that administrator@vsphere.local is listed:
Now create a VM class for the NVIDIA A100 GPU in MIG mode. Click the namespace that was just created, then under VM Service click Go to VM Service, and then click Create VM Class:
Configure the desired resources for the VM class and mark the checkbox PCI Devices, then click Next:
Click on Add PCI Device and select NVIDIA vGPU from the dropdown:
Select the hardware model, set the GPU sharing mode to MIG, and enter the desired amount of memory and number of GPUs. For this example we used 1 GPU with 40GB in MIG mode. Click Next:
Review and confirm the configured settings and click Finish.
Tanzu Kubernetes Cluster Creation
We will now create a Tanzu Kubernetes Cluster. This cluster provides the resources required to run workloads and the ability to deploy the NVIDIA GPU Operator. Here is the YAML used in this guide; adjust it according to your needs:
apiVersion: run.tanzu.vmware.com/v1alpha3
kind: TanzuKubernetesCluster
metadata:
  name: vgpu
  namespace: runai-poc
  annotations:
    run.tanzu.vmware.com/resolve-os-image: os-name=ubuntu
spec:
  topology:
    controlPlane:
      replicas: 3
      storageClass: nfs-gold
      vmClass: guaranteed-large
      tkr:
        reference:
          name: v1.23.8---vmware.2-tkg.2-zshippable
      volumes:
      - name: etcd
        mountPath: /var/lib/etcd
        capacity:
          storage: 4Gi
    nodePools:
    - name: nodepool-v100-passthrough
      replicas: 1
      storageClass: nfs-gold
      vmClass: v100pass
      tkr:
        reference:
          name: v1.23.8---vmware.2-tkg.2-zshippable
      volumes:
      - name: containerd
        mountPath: /var/lib/containerd
        capacity:
          storage: 70Gi
      - name: kubelet
        mountPath: /var/lib/kubelet
        capacity:
          storage: 70Gi
    - name: nodepool-a100-mig
      replicas: 1
      storageClass: nfs-gold
      vmClass: a100mig
      tkr:
        reference:
          name: v1.23.8---vmware.2-tkg.2-zshippable
      volumes:
      - name: containerd
        mountPath: /var/lib/containerd
        capacity:
          storage: 70Gi
      - name: kubelet
        mountPath: /var/lib/kubelet
        capacity:
          storage: 70Gi
  settings:
    storage:
      defaultClass: nfs-gold
A few notes about this example YAML:
- The guest OS (gOS) for the nodes can be set to Ubuntu, which is the only gOS supported by the GPU Operator. Line 7 has the annotation "run.tanzu.vmware.com/resolve-os-image: os-name=ubuntu", which sets all the nodes (worker and control plane) to Ubuntu. This requires zshippable TKR images.
- This YAML deploys two types of workers, one with the "v100pass" VM class and one with the "a100mig" VM class. This is not required; we have both because we used this YAML to test MIG and passthrough. If you only want to deploy MIG-based workloads, comment out the lines of the passthrough node pool.
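The two node pools in the manifest differ only in their name and VM class, so the repeated section can be generated rather than edited by hand. The Python sketch below is one hypothetical way to do that with the standard library: the helper name and its defaults are assumptions of this example, the values are copied from the YAML above, and the output is JSON, which YAML 1.2 parsers accept as valid YAML.

```python
import json

# Tanzu Kubernetes release name copied from the manifest in this guide.
TKR_NAME = "v1.23.8---vmware.2-tkg.2-zshippable"


def node_pool(name, vm_class, replicas=1, storage_class="nfs-gold", volume_size="70Gi"):
    """Build one nodePools entry with the containerd and kubelet volumes
    used by both GPU node pools in this guide."""
    volumes = [
        {"name": volume, "mountPath": f"/var/lib/{volume}", "capacity": {"storage": volume_size}}
        for volume in ("containerd", "kubelet")
    ]
    return {
        "name": name,
        "replicas": replicas,
        "storageClass": storage_class,
        "vmClass": vm_class,
        "tkr": {"reference": {"name": TKR_NAME}},
        "volumes": volumes,
    }


# Example: a MIG-only deployment keeps just the a100mig pool.
print(json.dumps({"nodePools": [node_pool("nodepool-a100-mig", "a100mig")]}, indent=2))
```

The printed fragment corresponds to the nodePools key under spec.topology; the rest of the manifest stays as shown above.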
Phase 5. NVIDIA GPU Operator Deployment
Once the Tanzu Kubernetes Cluster has been created, create a namespace called gpu-operator on that cluster. This is where all the NVIDIA GPU Operator components will be deployed:
Log into the system where you have the Tanzu CLIs deployed (where you can run kubectl), and execute the following:
kubectl vsphere login --server=https://10.203.86.146 --tanzu-kubernetes-cluster-name vgpu --vsphere-username administrator@vsphere.local --insecure-skip-tls-verify
Create a namespace called "gpu-operator":
kubectl create namespace gpu-operator
Confirm that the namespace was created successfully and is listed:
kubectl get namespace
Create an empty vGPU license configuration file:
sudo touch gridd.conf
After creating this file, generate and download an NLS client license token (.tok) file from your NVIDIA licensing server, and rename the .tok file to client_configuration_token.tok.
Next, create a ConfigMap. A ConfigMap allows you to store non-confidential data in key-value pairs. Create the licensing-config ConfigMap object in the gpu-operator namespace. Both the vGPU license configuration file and the NLS client license token will be added to this ConfigMap.
kubectl create configmap licensing-config -n gpu-operator --from-file=gridd.conf --from-file=<path>/client_configuration_token.tok
You can confirm that the contents of the configmap were successfully populated by doing a describe:
kubectl describe configmaps licensing-config -n gpu-operator
Now we need to create a pull secret in the gpu-operator namespace. A Secret is an object that contains a small amount of sensitive data such as a password, a token, or a key. Such information might otherwise be put in a Pod specification or in a container image. Using a Secret means that you do not need to include confidential data in your application code. We will use this secret to pull the required image(s) from NVIDIA's private NGC registry:
export REGISTRY_SECRET_NAME=ngc-secret
export PRIVATE_REGISTRY=nvcr.io/nvaie
kubectl create secret docker-registry ${REGISTRY_SECRET_NAME} \
--docker-server=${PRIVATE_REGISTRY} \
--docker-username='$oauthtoken' \
--docker-password=${NGC_API_KEY} \
--docker-email='YOUREMAIL' \
-n gpu-operator
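The command above expands ${NGC_API_KEY}, which the guide does not set explicitly; if the variable is empty, the secret will not contain a usable password. A minimal guard, assuming the key is kept in a shell variable of that name (the function name check_ngc_key is hypothetical), could be run before creating the secret:

```shell
# Hypothetical pre-flight check: fail early when NGC_API_KEY is unset or empty.
check_ngc_key() {
  if [ -z "${NGC_API_KEY:-}" ]; then
    echo "NGC_API_KEY is not set; export it before creating the secret" >&2
    return 1
  fi
}

# Example usage:
#   export NGC_API_KEY=<your NGC API key>
#   check_ngc_key && kubectl create secret docker-registry ...
```

The same variable is reused later when adding the NVIDIA AI Enterprise Helm repository, so checking it once covers both steps.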
Add the NVIDIA AI Enterprise Helm repository, where password is the NGC API key for accessing the NVIDIA Enterprise Collection that you generated:
helm repo add nvaie https://helm.ngc.nvidia.com/nvaie \
--username='$oauthtoken' --password=${NGC_API_KEY} \
&& helm repo update
Create a ClusterRoleBinding that allows authenticated users to run a privileged set of workloads using the default PSP vmware-system-privileged:
kubectl create clusterrolebinding psp:authenticated --clusterrole=psp:vmware-system-privileged --group=system:authenticated
Install the NVIDIA GPU Operator by executing the following command:
helm install --wait gpu-operator nvaie/gpu-operator-3-0 -n gpu-operator --set driver.repository=nvcr.io/nvaie --set operator.repository=nvcr.io/nvaie --set driver.imagePullPolicy=Always --set mig.strategy=mixed
Confirm the helm chart is listed after the deployment process by executing the following command:
helm list -n gpu-operator
Pulling the container images from NVIDIA's NGC registry and starting all the pods that make up the GPU Operator can take 10-15 minutes. Confirm that all pods reach a Running or Completed status:
kubectl get pods -n gpu-operator
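Scanning the pod list by eye is error-prone when many pods are starting. As a small sketch, the filter below reads the table printed by kubectl get pods and prints only the pods that are in neither state; the function name pods_not_ready is hypothetical, and the filter assumes the default column layout (NAME, READY, STATUS, ...).

```shell
# Hypothetical filter: read `kubectl get pods` output on stdin and print
# "<pod-name> <status>" for every pod that is neither Running nor Completed.
pods_not_ready() {
  # NR > 1 skips the header row; $3 is the STATUS column in the default layout.
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Example usage:
#   kubectl get pods -n gpu-operator | pods_not_ready
```

Empty output means every pod has reached one of the two expected states.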
Phase 6. Run:ai Installation and Configuration
Once the Tanzu Kubernetes cluster is configured and ready with the GPU Operator, it is time to install Run:ai. This section assumes that you already have a Run:ai tenant, internet access, and the certificates to be used.
Access your tenant and click Clusters on the left-hand side, then click on New Cluster, enter a name for your new cluster and click on Create.
A new window for cluster installation will appear. Select "on premise" (since this is an on-premise Tanzu environment) and enter the specific URL for your tenant; for this guide we are using https://vmware.run.ai:
Copy the command from the Install Run:AI on a Cluster, Step 4 screen, edit the cert and key paths accordingly and execute it on the OS where you have kubectl installed. Here is an example of how it looks:
kubectl create ns runai && kubectl create secret tls -n runai runai-cluster-domain-tls-secret --cert /home/octo/runai/certs/runai-poc-com.crt --key /home/octo/runai/certs/runai-poc-com.key
This command creates the "runai" namespace and, in that namespace, a TLS secret based on the certificate provided. As described earlier, a Kubernetes Secret keeps sensitive data out of Pod specifications, container images, and application code; in this case it stores your organizational TLS certificate and its key for your cluster URL.
Now copy the curl command from step 5 and execute it. This curl command retrieves all the cluster values and writes a YAML file as output; here is an example of such a YAML:
Add the required helm repositories for nginx, Run:ai and Prometheus and do a Helm repo update:
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo add runai https://run-ai-charts.storage.googleapis.com
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Once the repositories are available, we can proceed to install the required charts for nginx and Prometheus:
helm install nginx-ingress ingress-nginx/ingress-nginx --namespace nginx-ingress --create-namespace
helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring --create-namespace --set grafana.enabled=false
Confirm that nginx is set as an ingress service:
kubectl get service -A | grep ingress
Now that all the prerequisites for Run:ai are in place, we will proceed with the Run:ai Helm installation, using the values YAML file that was created with the curl command in a previous step:
helm upgrade -i runai-cluster runai/runai-cluster -n runai -f nameofyouryaml.yaml --create-namespace
Confirm that an ingress object was created for the Run:ai researcher pod:
At this point you should have the Run:ai cluster pods running in the runai namespace as well as Prometheus running in the monitoring namespace:
kubectl get pods -n runai
kubectl get pods -n monitoring
Confirm that the new cluster is showing as connected on the Run:ai web portal. Also verify that Grafana is populating the graphs on the web portal from information provided by Prometheus:
Next, confirm that the worker nodes of the Tanzu Kubernetes cluster that have vGPUs assigned are showing correctly. To do so, navigate to Nodes:
We need to configure authentication for the GPU cluster so that users are validated against the identity source configured on Run:ai. There are three options: the built-in identity service of the Run:ai platform, a SAML provider, or an OIDC provider. For this guide, users authenticate against the internal identity service. This step enables authentication of the Kubernetes API server against Run:ai's identity service. On the left-hand pane click on General:
We will copy the contents of Server Configuration and add them to the /etc/kubernetes/manifests/kube-apiserver.yaml file present on the Tanzu Kubernetes cluster control plane VMs. First, we need the SSH password for the Kubernetes control plane VMs. We start by listing the secrets in the vSphere namespace where the Kubernetes cluster was deployed; for this guide it is runai-poc:
kubectl get secrets -n runai-poc
For this specific cluster, called vgpu, there is a secret called vgpu-ssh-password. Its value is base64-encoded, so decode it to get the password in plain text by executing the following command:
kubectl -n runai-poc get secrets vgpu-ssh-password -o jsonpath='{.data.ssh-passwordkey}' | base64 -d
Here’s an example of the output:
Open an SSH session to the Kubernetes cluster control plane VMs with user vmware-system-user and the password that you just decoded in the previous step:
Edit the file /etc/kubernetes/manifests/kube-apiserver.yaml and add the Server Configuration contents from Run:ai under the "- command:" section. Reboot the VM after saving the changes:
Navigate to the job section of Run:ai’s web console, and click + new job:
At this point, Run:ai's SaaS control plane should be able to communicate with the researcher service pod, allowing you to start a new job:
We will now install the Run:ai Administrator Command-line Interface (Administrator CLI), which allows performing administrative tasks on the Run:ai cluster. The Administrator CLI should be run with the admin kubeconfig file. It will allow us to do multiple things, including labeling worker nodes as dynamic-MIG capable.
Open an SSH session to the system where you have all your utilities (Tanzu CLI) and enter the following commands:
wget --content-disposition https://app.run.ai/v1/k8s/admin-cli/linux
chmod +x runai-adm
sudo mv runai-adm /usr/local/bin/runai-adm
Once installed, confirm the version with "runai-adm version":
NVIDIA MIG allows GPUs based on the NVIDIA Ampere architecture (such as NVIDIA A100) to be partitioned into separate GPU Instances.
- When divided, each portion acts as a fully independent GPU.
- The division is static, in the sense that you must call NVIDIA API or the nvidia-smi command to create or remove the MIG partition.
- The division is both of compute and memory.
- The division has fixed sizes: up to 7 units of compute and memory. The various MIG profiles can be found in the NVIDIA documentation. A typical profile is MIG 2g.10gb, which provides 2/7 of the compute power and 10GB of RAM.
- Reconfiguration of MIG profiles on the GPU requires administrator permissions and the draining of all running workloads.
Run:ai provides a way to dynamically provision a MIG partition:
- Using the same experience as the Fractions technology, if you know that your code needs 4GB of RAM, you can use the flag --gpu-memory 4G to specify the portion of the GPU memory that you need. Run:ai will call the NVIDIA MIG API to generate the smallest possible MIG profile for your request, and allocate it to your container.
- MIG is configured on the fly according to workload demand, without needing to drain workloads or to involve an IT administrator.
- Run:ai will automatically deallocate the partition when the workload finishes. This happens in a lazy fashion in the sense that the partition will not be removed until the scheduler decides that it is needed elsewhere.
- Run:ai provides an additional flag to dynamically create the specific MIG partition in NVIDIA terminology. As such, you can specify --mig-profile 2g.10gb.
- In a cluster with multiple MIG-enabled GPUs, you can choose, at the node level, to use dynamic MIG on some nodes and Run:ai's fractional technology on others.
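To make the "smallest possible MIG profile" idea concrete, the sketch below picks the first profile that satisfies a memory request from the profiles of an A100 40GB. It only illustrates the selection rule described above and is not Run:ai's implementation; the function and variable names are hypothetical.

```shell
# MIG profiles of an A100 40GB, ordered from smallest to largest.
A100_40GB_PROFILES="1g.5gb 2g.10gb 3g.20gb 4g.20gb 7g.40gb"

# Illustrative only: print the smallest profile with at least $1 GB of memory.
smallest_mig_profile() {
  requested_gb="$1"
  for profile in $A100_40GB_PROFILES; do
    mem="${profile#*.}"   # "2g.10gb" -> "10gb"
    mem="${mem%gb}"       # "10gb"    -> "10"
    if [ "$mem" -ge "$requested_gb" ]; then
      echo "$profile"
      return 0
    fi
  done
  echo "no profile offers ${requested_gb}GB" >&2
  return 1
}

# Example usage: a 4GB request (as with --gpu-memory 4G) fits in 1g.5gb.
#   smallest_mig_profile 4
```

A request of 12GB would land on 3g.20gb under this rule, since 2g.10gb is too small.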
To use Dynamic MIG, the GPU Operator must be installed with the flag mig.strategy=mixed; this guide uses that flag as part of the GPU Operator installation process. If the GPU Operator is already installed, edit the clusterPolicy by running:
kubectl patch clusterPolicy cluster-policy -n gpu-operator --type=merge -p '{"spec":{"mig":{"strategy": "mixed"}}}'
Run:ai relies on labels to identify the worker nodes that will be used for dynamic MIG. List the current worker nodes:
kubectl get nodes
Use Run:ai's admin CLI to enable dynamic MIG mode on the nodes with the following command:
runai-adm set node-role --dynamic-mig-enabled <node-name>
Next, label the nodes with the label node-role.kubernetes.io/runai-mig-enabled=true:
kubectl label node <node-name> node-role.kubernetes.io/runai-mig-enabled=true
Now we will configure MIG on the worker node itself. For this guide we disabled MIG at the ESXi level and passed one of the A100 GPUs through with Dynamic DirectPath I/O; the other one is set to vGPU. This allows the MIG manager pod (part of Run:ai) to manage the MIG profiles dynamically. For this purpose, we need to open a bash session in the nvidia-driver-daemonset pod running on the A100 worker node that was configured as passthrough. First, identify the node where this pod is running:
kubectl get pods -n gpu-operator -o wide
Once you have identified the correct pod, open a bash session in that container with the following command:
kubectl exec nvidia-driver-daemonset-g56tf -it -n gpu-operator -- bash
You can confirm that MIG is not enabled on the pod with nvidia-smi:
We will enable MIG with the following command:
nvidia-smi -i 0 -mig 1
--gpu-reset is not supported in the container, so a reboot of the worker node is required. Get the password by decoding the SSH secret for the vgpu cluster, then open an SSH session to the node with user vmware-system-user and reboot it.
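The password lookup mentioned here is the same one used earlier for the control plane VMs. As a minimal sketch, the two helpers below build the secret name from the cluster name and decode a base64 value; both function names are hypothetical, and the naming pattern is taken from the vgpu-ssh-password secret shown in this guide.

```shell
# Hypothetical helper: secret that stores the SSH password of a cluster,
# following the "<cluster>-ssh-password" pattern seen for the vgpu cluster.
ssh_secret_name() {
  printf '%s-ssh-password\n' "$1"
}

# Hypothetical helper: Secret values are base64-encoded (not encrypted),
# so decoding the value read on stdin is enough to obtain the plain text.
decode_secret_value() {
  base64 -d
}

# Example usage for the cluster and namespace of this guide:
#   kubectl -n runai-poc get secret "$(ssh_secret_name vgpu)" \
#     -o jsonpath='{.data.ssh-passwordkey}' | decode_secret_value
```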
Once the worker node has rebooted, reconnect to the nvidia-driver-daemonset pod assigned to the worker node that has the MIG-capable GPU. Give the nvidia-device-plugin-daemonset some time to reach a running state. Once inside the bash session of the driver daemonset, confirm with nvidia-smi that MIG now shows as enabled:
A few extra steps are needed. First, manually create the first MIG profile within that container with nvidia-smi. This lets the MIG manager initialize the GPU; the profile can then be dynamically changed by Run:ai:
nvidia-smi mig -cgi 3g.20gb -C
If Dynamic MIG is enabled, you need to stop the NVIDIA GPU Operator MIG Manager daemonset from running so that Run:ai's MIG manager and NVIDIA's MIG manager are not in a race condition. By default, the nvidia-mig-manager daemonset runs 1 copy of the pod; the following command brings that down to 0 by patching the daemonset with a node selector that no node matches:
kubectl -n gpu-operator patch daemonset nvidia-mig-manager -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'
Confirm that the change to the daemonset has taken effect:
kubectl get daemonset -n gpu-operator
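To read the result without scanning the whole table, a small filter can extract the DESIRED count of a single daemonset. This is a sketch that assumes the default kubectl column layout (NAME first, DESIRED second); the function name daemonset_desired is hypothetical.

```shell
# Hypothetical filter: read `kubectl get daemonset` output on stdin and print
# the DESIRED column ($2) of the daemonset whose NAME ($1) matches the argument.
daemonset_desired() {
  awk -v name="$1" '$1 == name { print $2 }'
}

# Example usage:
#   kubectl get daemonset -n gpu-operator | daemonset_desired nvidia-mig-manager
```

After the patch above, the value printed for nvidia-mig-manager should be 0, because no node matches the new selector.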
Delete the MIG manager pod in the runai namespace; a new copy will be created automatically:
kubectl delete pod runai-mig-manager-cltdc -n runai
Dynamic MIG jobs should be available now. Go back to the web console and create a job with a MIG profile:
Confirm that the MIG profile was dynamically set by describing the pod where the job was scheduled. On the web console, click the job, go to Pods, and take note of the pod:
Describing the assigned pod gives us a list of annotations, including the assigned MIG profile: nvidia.com/mig-7g.40gb:1
Alternatively, you can open a bash session in the nvidia-driver-daemonset pod and confirm with nvidia-smi:
nvidia-smi -lgi
Phase 7. Validation
Run:ai provides multiple Quick Start Guides. The purpose of these guides is to get you acquainted with an aspect of Run:ai in the simplest possible form. You can follow any of the Quick Start documents below to learn more about a particular functionality:
- Unattended training sessions
- Using GPU Fractions
- Distributed Training
- Hyperparameter Optimization
- Over-Quota, Basic Fairness & Bin Packing
- Fairness
- Inference
- Dynamic MIG
To validate that your Run:ai cluster is working properly, we will launch a series of unattended training sessions.
Setup
- Follow the Install the Run:ai Command-line Interface procedure to get ready to submit jobs to execution.
- Log in to the Projects area of the Run:ai user interface.
- Add a Project named "team-a".
- Allocate 2 GPUs to the Project.
Run Workload
At the command-line run:
runai config project team-a
runai submit train1 -i gcr.io/run-ai-demo/quickstart -g 1
This starts an unattended training Job for team-a with an allocation of a single GPU. The Job is based on a sample docker image, gcr.io/run-ai-demo/quickstart. We named the Job train1.
Follow up on the Job's progress by running:
runai list jobs
The result:
Typical statuses you may see:
- ContainerCreating - the docker container is being downloaded from the cloud repository
- Pending - the Job is waiting to be scheduled
- Running - the Job is running
- Succeeded - the Job has ended
A full list of Job statuses can be found in the Run:ai documentation.
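Instead of re-running the list command by hand, the status check can be polled. The helper below is a generic sketch: it re-runs any command until its output matches a pattern or the attempts run out. The function name wait_for_output is hypothetical, and the example pattern assumes that the job name and its status appear on the same line of the runai list jobs output, with the name first.

```shell
# Hypothetical helper: wait_for_output PATTERN ATTEMPTS DELAY_SECONDS COMMAND...
# Re-runs COMMAND until its output contains PATTERN (a grep regular expression).
# Returns 0 on a match, 1 if ATTEMPTS runs complete without one.
wait_for_output() {
  pattern="$1"
  attempts="$2"
  delay="$3"
  shift 3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" 2>/dev/null | grep -q "$pattern"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example usage: poll every 10 seconds, up to 30 times, until train1 is Running.
#   wait_for_output 'train1.*Running' 30 10 runai list jobs
```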
To get additional status on your Job, run:
runai describe job train1
View Logs
Run the following:
runai logs train1
You should see a log of a running deep learning session:
View status on the Run:ai User Interface
- Open the Run:ai user interface.
- Under "Jobs" you can view the new Workload:
The image we used for training includes the Run:ai Training library. Among other features, this library allows the reporting of metrics from within the deep learning Job, such as progress, accuracy, loss, and epoch and step numbers.
- Progress can be seen in the status column above.
- To see other metrics, press the settings wheel on the top right and select additional deep learning metrics from the list.
Under Nodes you can see node utilization:
Stop Workload
Run the following:
runai delete job train1
This stops the training workload. You can verify this by running runai list jobs again.
Conclusion
In conclusion, this guide provides an in-depth walkthrough for deploying Run:ai's Atlas Platform on NVIDIA AI Enterprise using a VMware Tanzu Kubernetes cluster. It underscores the benefits of this integration, including enhanced control over AI infrastructure and improved GPU resource utilization. By leveraging Run:ai, VMware Tanzu, and NVIDIA AI Enterprise, organizations can overcome challenges associated with GPU-enabled workloads, maximize return on investment, and streamline their AI operations. Run:ai's flexible architecture supports multiple clusters, integrates with popular data science tools, and enhances GPU management, leading to optimized workflows. Lastly, the Run:ai control plane ensures stringent data retention policies and secure handling of all data. This combination of technologies offers a potent solution to streamline AI infrastructure management, boost efficiency, and accelerate AI initiatives.
About the Authors
Agustin Malanco is currently working as an Emerging Workloads Solutions Architect for the Office of the CTO. He has worked for VMware since 2011 across many different roles, ranging from Pre-sales, multiple Solution Architecture positions and as a Technical Product Manager. Agustin has 13+ years of professional experience and a proven track record across multiple VMware products and datacenter technologies. He is currently one of the 6 individuals in the world to hold 3 VCDX (#141) certifications and serves as a VCDX Panelist.
Enrique Corro has been with VMware for 17 years and has a master’s degree in data science from the University of Illinois. He currently works as a staff engineer at VMware’s Office of the CTO. Enrique focuses on helping VMware customers run their ML workloads on VMware technologies. He also works in different initiatives for ML adoption within VMware products, services, and operations.