Kubeflow Configuration

Configuration

Architecture

The Tanzu Kubernetes cluster was provisioned on top of vSphere and consisted of multiple worker nodes, each implemented as a virtual machine. Worker nodes without an associated vGPU ran the Kubeflow components, while worker nodes equipped with a vGPU ran the pods with GPU requirements. The NVIDIA GPU operator v1.9.1 was installed in the Tanzu Kubernetes cluster to allow users to manage the GPU nodes in the cluster. One ReadWriteMany (RWM) persistent volume from the vSAN file service was configured for shared data.
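As an illustration of how a pod with GPU requirements lands on a vGPU-equipped worker node, the following minimal Python sketch builds a Pod manifest that requests the `nvidia.com/gpu` extended resource, which is the resource name advertised on GPU nodes once the NVIDIA GPU operator is running. The pod name and container image are placeholders, not values from this solution.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Pod manifest that requests NVIDIA GPUs as an extended resource."""
    if gpus < 1:
        raise ValueError("a GPU pod must request at least one GPU")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources are requested under "limits"; the
                    # scheduler only places the pod on a node that still has
                    # an unallocated nvidia.com/gpu.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical pod name and image, used only for illustration.
    manifest = gpu_pod_manifest("gpu-smoke-test", "example.registry/cuda-sample:latest")
    print(json.dumps(manifest, indent=2))
```

Because JSON is also valid YAML, the printed manifest can be saved to a file and applied with `kubectl apply -f`. Pods that do not request `nvidia.com/gpu` can still be scheduled on any node, so keeping Kubeflow components off the GPU nodes would need additional scheduling constraints that this sketch does not cover.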

Figure 1: Solution Architecture

Hardware Resource

Server

A minimum of three servers that are approved on both the VMware Hardware Compatibility List and the NVIDIA Virtual GPU Certified Servers List are required.

GPU

A minimum of one NVIDIA GPU installed in one of the servers:

  • Ampere class GPU (A100, A30, A40, or A10). A100 and A30 are MIG capable and recommended; A40 is mainly focused on graphics.
  • Turing class GPU (T4)
  • Additional supported GPUs can be found here

In our validation environment, we used the following GPU resource: 1 x NVIDIA Ampere A100 40GB PCIe/server.

| Property | Specification |
| --- | --- |
| Server Model | Dell VxRail P670F |
| CPU | 2 x Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz, 28 cores each |
| RAM | 512GB |
| Network Resources | 1 x Intel(R) Ethernet Controller E810-XXV, 25Gbit/s, dual ports; 1 x NVIDIA ConnectX-5 Ex, 100Gbit/s, dual ports |
| Storage Resources | 1 x Dell HBA355i disk controller; 2 x P5600 1.6TB as vSAN Cache Devices; 8 x 3.84TB Read Intensive SAS SSDs as vSAN Capacity Devices |
| GPU Resources | 1 x NVIDIA Ampere A100 40GB PCIe |

Software Resource

Table 1: Software List

| Software | Version |
| --- | --- |
| vSphere | 7.0 update 3c |
| Tanzu Kubernetes Release | v1.20.8+vmware.2 |
| NVAIE | 1.1 |
| Kubeflow | v1.5 |
| Helm | 3.7.2 |

Network Design

The 25GbE NICs were used for vSphere management, vMotion, vSAN, and the Tanzu Kubernetes Grid management network. The 100GbE NICs were used for the vSAN file service and the Tanzu Kubernetes Grid workload network. With this layout, the workload network was physically separated from the management and vSAN networks, and the workload cluster used the higher network bandwidth both for node-to-node traffic and for reading and writing data on the vSAN file share.

Figure 2: Network

vSphere Configuration

In this solution, a vSphere cluster should be pre-configured with vSAN enabled, and the ESXi hosts in the cluster should have NVIDIA GPUs installed.

Enable vGPU on ESXi Hosts

The vSphere administrator can follow this article to install the NVIDIA Virtual GPU Manager from the NVAIE 1.1 package and enable vGPU support on the ESXi hosts that have GPUs installed.

Configure vSAN File Service for Network File System (NFS)

With the vSAN file service enabled, we can create native vSAN NFS File shares without extra storage on the vSphere cluster. Most machine learning platforms need a data lake, a centralized repository to store all the structured and unstructured data. The Tanzu Kubernetes cluster can be configured with an NFS-backed ReadWriteMany (RWM) persistent volume across the pods to share and store data.

In the cluster view, go to Configure -> vSAN -> File Service, click Enable, and follow the wizard to enable the file service.

Figure 4: Configure File Service

In this solution, we created a vSAN NFS file share MLData with a size of 1TB. The storage policy was configured with RAID 1 with StripeWidth=8 to guarantee the best performance by distributing the data across all the vSAN disk groups while not compromising data resiliency.

Figure 5: NFS File Share

Note: ReadWriteMany (RWM) volumes are not natively supported with the vSAN file service in the current version. An RWM persistent volume can be configured by following Using ReadWriteMany Volumes on TKG Clusters. See the example here.
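One common way to expose an existing NFS share to pods is a statically provisioned PersistentVolume paired with a PersistentVolumeClaim. The Python sketch below generates such a pair under that assumption; it is not necessarily the exact procedure in the linked example. The NFS server address, export path, and namespace are hypothetical placeholders that would need to match the MLData file share in a real environment.

```python
import json


def nfs_rwx_volume(name: str, server: str, path: str,
                   size: str = "1Ti", namespace: str = "default"):
    """Return (PersistentVolume, PersistentVolumeClaim) manifests for an NFS share."""
    pv = {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": size},
            "accessModes": ["ReadWriteMany"],
            # Retain keeps the data on the share if the claim is deleted.
            "persistentVolumeReclaimPolicy": "Retain",
            # An empty storage class disables dynamic provisioning for this PV.
            "storageClassName": "",
            "nfs": {"server": server, "path": path},
        },
    }
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteMany"],
            "storageClassName": "",
            # Bind explicitly to the PV defined above.
            "volumeName": name,
            "resources": {"requests": {"storage": size}},
        },
    }
    return pv, pvc


if __name__ == "__main__":
    # Hypothetical server address, export path, and namespace.
    for manifest in nfs_rwx_volume("mldata", "192.0.2.10", "/MLData",
                                   namespace="kubeflow-user"):
        print(json.dumps(manifest, indent=2))
```

Pods in the claim's namespace can then mount the claim by name, and several pods can read and write the share at the same time because both objects declare the ReadWriteMany access mode.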

For more information regarding vSAN file service, visit the link here.

Provision the Tanzu Kubernetes Cluster

While provisioning the Tanzu Kubernetes cluster, we defined the control plane and worker nodes as follows.

Table 3: Tanzu Kubernetes Cluster Definition

| Role | Replicas | Storage Class | VM Class | Tanzu Kubernetes Release (TKR) |
| --- | --- | --- | --- | --- |
| Control Plane | 3 | vsan-r1 | best-effort-small | v1.20.8---vmware.1-tkg.2 |
| GPU Worker Nodes | 6 | vsan-r1 | gpuclass-a100 | v1.20.8---vmware.1-tkg.2 |
| Non-GPU Worker Nodes | 3 | vsan-r1 | best-effort-xlarge | v1.20.8---vmware.1-tkg.2 |

Additionally, for each worker node, we configured a 50GB storage volume for container storage and a 50GB volume for the kubelet. The YAML file we used in this example for Tanzu Kubernetes cluster deployment can be found here.
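The cluster definition in Table 3 can be sketched as a TanzuKubernetesCluster manifest. The Python snippet below assembles one, assuming the `run.tanzu.vmware.com/v1alpha2` API shape; it is an illustration and not the YAML file used in this solution. The cluster name, namespace, node pool names, and the two volume mount paths are assumptions; the replica counts, VM classes, storage class, and TKR name come from the table.

```python
import json

TKR_NAME = "v1.20.8---vmware.1-tkg.2"  # from Table 3
STORAGE_CLASS = "vsan-r1"              # from Table 3


def node_volumes(size: str = "50Gi") -> list:
    """Two per-node volumes; the mount paths are assumptions, not from the source."""
    return [
        {"name": "containerd", "mountPath": "/var/lib/containerd",
         "capacity": {"storage": size}},
        {"name": "kubelet", "mountPath": "/var/lib/kubelet",
         "capacity": {"storage": size}},
    ]


def node_pool(name: str, replicas: int, vm_class: str) -> dict:
    """One worker node pool with the shared storage class, TKR, and volumes."""
    return {
        "name": name,
        "replicas": replicas,
        "vmClass": vm_class,
        "storageClass": STORAGE_CLASS,
        "volumes": node_volumes(),
        "tkr": {"reference": {"name": TKR_NAME}},
    }


def cluster_manifest(name: str, namespace: str) -> dict:
    """Assemble the cluster: 3 control plane nodes, 6 GPU and 3 non-GPU workers."""
    return {
        "apiVersion": "run.tanzu.vmware.com/v1alpha2",
        "kind": "TanzuKubernetesCluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "topology": {
                "controlPlane": {
                    "replicas": 3,
                    "vmClass": "best-effort-small",
                    "storageClass": STORAGE_CLASS,
                    "tkr": {"reference": {"name": TKR_NAME}},
                },
                "nodePools": [
                    # Pool names are hypothetical.
                    node_pool("gpu-workers", 6, "gpuclass-a100"),
                    node_pool("cpu-workers", 3, "best-effort-xlarge"),
                ],
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical cluster name and vSphere namespace.
    print(json.dumps(cluster_manifest("kubeflow-tkc", "kubeflow-ns"), indent=2))
```

Splitting the workers into two node pools keeps the vGPU VM class confined to the GPU workers, so the non-GPU pool can be resized without consuming GPU capacity.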

Configure a Node in a Tanzu Kubernetes Cluster with vGPU Access

To configure a Tanzu Kubernetes cluster with vGPU access, the vSphere administrator should first enable Workload Management in the vSphere Client, create the supervisor cluster, create a content library subscribed to https://wp-content.vmware.com/v2/latest/lib.json, create the VM classes with vGPU access, and create a new namespace configured with those VM classes. Visit the link here for more details on the procedures and steps involved in this section.

In this solution, we configured the Tanzu Supervisor Cluster with HAProxy v0.2.0 for load balancing. We added the pre-defined best-effort-small, best-effort-large, and best-effort-2xlarge VM classes to the namespace. To give the worker nodes vGPU access, we created a VM class named gpuclass-a100 with the following specifications and added it to the namespace:

Figure 3: VM Class Details

From the storage perspective, we added two vSAN storage policies to the namespace. The first, vsan-r1, was configured with RAID 1. The second, stripe, was configured with RAID 5 and StripeWidth=8 to maximize performance for the Tanzu Kubernetes cluster worker nodes.

Install the NVIDIA GPU Operator

After the Tanzu Kubernetes cluster is up and running, the developer logs in to the cluster and follows the instructions in this link to install the NVIDIA GPU operator. The installation requires the developer to provide the NVIDIA CLS or DLS license token and the NGC account information. Refer to the NVIDIA Licensing Guide here.

In this solution, we installed GPU operator v1.9.1. The script we used for installing the NVIDIA GPU operator can be found here.
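After the operator is installed, a quick way to confirm that the GPU worker nodes are usable is to check which nodes report an allocatable `nvidia.com/gpu` resource. The following Python sketch does this against the JSON structure returned by `kubectl get nodes -o json`; the sample data and node names are made up for illustration.

```python
import json


def gpu_nodes(node_list: dict) -> dict:
    """Map node name to allocatable nvidia.com/gpu count, for nodes that expose GPUs."""
    result = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports resource quantities as strings, for example "1".
        count = int(allocatable.get("nvidia.com/gpu", "0"))
        if count > 0:
            result[node["metadata"]["name"]] = count
    return result


if __name__ == "__main__":
    # Hypothetical, trimmed-down stand-in for `kubectl get nodes -o json` output.
    sample = {
        "items": [
            {"metadata": {"name": "gpu-worker-0"},
             "status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "1"}}},
            {"metadata": {"name": "cpu-worker-0"},
             "status": {"allocatable": {"cpu": "4"}}},
        ]
    }
    print(json.dumps(gpu_nodes(sample), indent=2))
```

In a live cluster, the sample dictionary would be replaced by the parsed output of the kubectl command. With the cluster definition above, six nodes would be expected in the result.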

Monitoring Tools

Kubeflow Central Dashboard

The Kubeflow central dashboard provides quick access to the Kubeflow components deployed in your cluster. From the dashboard, you can see a list of recent pipelines, notebooks, metrics, and an overview of your jobs as they are processed. See Central Dashboard to learn more.

vSAN Performance Service

The vSAN Performance Service is for monitoring the performance of the vSAN environment and helping users to investigate potential problems. The performance service collects and analyzes performance statistics and displays the data in a graphical format. You can use the performance charts to manage your workload and determine the root cause of problems.

 
