Kubeflow Configuration
Configuration
Architecture
The Tanzu Kubernetes cluster was provisioned on top of vSphere and consisted of multiple worker nodes, each implemented as a virtual machine. Worker nodes without a vGPU were used for Kubeflow components, and worker nodes equipped with a vGPU were used for pods with GPU requirements. The NVIDIA GPU operator v1.9.1 was installed in the Tanzu Kubernetes cluster to manage the GPU nodes. One ReadWriteMany (RWM) persistent volume backed by the vSAN file service was configured for shared data.
Figure 1: Solution Architecture
Hardware Resource
Server
A minimum of three servers that are approved on both the VMware Hardware Compatibility List and the NVIDIA Virtual GPU Certified Servers List are required.
GPU
A minimum of one NVIDIA GPU installed in one of the servers:
- Ampere class GPU (A100, A30, A40, or A10). The A100 and A30 are MIG capable and recommended; the A40 is mainly focused on graphics.
- Turing class GPU (T4)
- Additional supported GPUs can be found here
In our validation environment, we used the following GPU resource: 1 x NVIDIA Ampere A100 40GB PCIe/server.
PROPERTY | SPECIFICATION |
Server Model | Dell VxRail P670F |
CPU | 2 x Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz, 28 cores each |
RAM | 512GB |
Network Resources | 1 x Intel(R) Ethernet Controller E810-XXV, 25Gbit/s, dual ports; 1 x NVIDIA ConnectX-5 Ex, 100Gbit/s, dual ports |
Storage Resources | 1 x Dell HBA355i disk controller; 2 x P5600 1.6TB as vSAN Cache Devices; 8 x 3.84TB Read Intensive SAS SSDs as vSAN Capacity Devices |
GPU Resources | 1 x NVIDIA Ampere A100 40GB PCIe |
Software Resource
Table 1: Software List
Software | Version |
vSphere | 7.0 update 3c |
Tanzu Kubernetes Release | v1.20.8+vmware.2 |
NVAIE | 1.1 |
Kubeflow | v1.5 |
Helm | 3.7.2 |
Network Design
The 25GbE NICs were used for vSphere management, vMotion, vSAN, and the Tanzu Kubernetes Grid management network. The 100GbE NICs were used for the vSAN file service and the Tanzu Kubernetes Grid workload network. As a result, the workload cluster traffic was physically separated from the management and vSAN networks, and the workload cluster used the higher network bandwidth both for node-to-node traffic and for reading and writing data on the vSAN file share.
Figure 2: Network
vSphere Configuration
In this solution, a vSphere cluster should be pre-configured with vSAN enabled, and the ESXi hosts in the cluster should have NVIDIA GPUs installed.
Enable vGPU on ESXi Hosts
The vSphere administrator can follow this article to install the NVIDIA Virtual GPU Manager from the NVAIE 1.1 package and enable vGPU support on the ESXi hosts that have GPUs installed.
Configure vSAN File Service for Network File System (NFS)
With the vSAN file service enabled, we can create native vSAN NFS file shares without extra storage on the vSphere cluster. Most machine learning platforms need a data lake: a centralized repository that stores all the structured and unstructured data. The Tanzu Kubernetes cluster can be configured with an NFS-backed ReadWriteMany (RWM) persistent volume that pods mount concurrently to share and store data.
In Cluster Configure->vSAN->File Service, click Enable and follow the wizard to enable the File Service.
Figure 4: Configure File Service
In this solution, we created a vSAN NFS file share named MLData with a size of 1TB. The storage policy was configured with RAID 1 and StripeWidth=8 to maximize performance by distributing the data across all the vSAN disk groups without compromising data resiliency.
Figure 5: NFS File Share
Note: RWM volumes are not natively supported with the vSAN File Service in the current version. An RWM persistent volume can instead be configured by following Using ReadWriteMany Volumes on TKG Clusters. See the example here.
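As a rough illustration of what a statically provisioned, NFS-backed RWM volume looks like, the following Python sketch builds a PersistentVolume and a matching PersistentVolumeClaim and prints them as JSON, which `kubectl apply -f -` accepts. It is not the manifest used in this solution. The NFS server name, export path, object names, and namespace are hypothetical placeholders, and the 1Ti size is an assumed stand-in for the 1TB MLData share.

```python
import json


def nfs_rwm_manifests(server, share_path, name="mldata", size="1Ti", namespace="default"):
    """Build a static NFS PersistentVolume and a claim bound to it.

    All argument values are illustrative assumptions, not values from the solution.
    """
    pv = {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": f"{name}-pv"},
        "spec": {
            "capacity": {"storage": size},
            # ReadWriteMany lets pods on different nodes mount the volume read-write.
            "accessModes": ["ReadWriteMany"],
            "persistentVolumeReclaimPolicy": "Retain",
            "nfs": {"server": server, "path": share_path},
        },
    }
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": f"{name}-pvc", "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteMany"],
            # An empty storage class name disables dynamic provisioning,
            # so the claim binds to the pre-created volume named below.
            "storageClassName": "",
            "volumeName": pv["metadata"]["name"],
            "resources": {"requests": {"storage": size}},
        },
    }
    return pv, pvc


if __name__ == "__main__":
    # Hypothetical NFS endpoint and export path; substitute the values
    # shown for the file share in the vSphere Client.
    pv, pvc = nfs_rwm_manifests("vsan-fs.example.local", "/MLData")
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": [pv, pvc]}, indent=2))
```

Pods then reference the claim by name in their `volumes` section; because the access mode is ReadWriteMany, several pods can mount it at the same time.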
For more information regarding vSAN file service, visit the link here.
Provision the Tanzu Kubernetes Cluster
While provisioning the Tanzu Kubernetes cluster, we defined the control plane and worker nodes as follows.
Table 3: Tanzu Kubernetes Cluster Definition
Role | Replicas | Storage Class | VM Class | Tanzu Kubernetes Release (TKR) |
Control Plane | 3 | vsan-r1 | best-effort-small | v1.20.8---vmware.1-tkg.2 |
GPU Worker Nodes | 6 | vsan-r1 | gpuclass-a100 | v1.20.8---vmware.1-tkg.2 |
Non-GPU Worker Nodes | 3 | vsan-r1 | best-effort-xlarge | v1.20.8---vmware.1-tkg.2 |
Additionally, for each worker node, we configured a 50GB storage volume for the container runtime and a 50GB volume for the kubelet. The YAML file we used in this example for Tanzu Kubernetes cluster deployment can be found here.
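To make the cluster definition concrete, the sketch below assembles a TanzuKubernetesCluster manifest from the values in Table 3 and prints it as JSON. It is an approximation rather than the deployment file referenced above. It assumes the `run.tanzu.vmware.com/v1alpha2` API and its field names, containerd as the container runtime with the conventional `/var/lib/containerd` and `/var/lib/kubelet` mount paths, and 50Gi as the volume size. The cluster name, namespace, and node pool names are hypothetical.

```python
import json

# Values taken from the cluster definition table.
TKR_NAME = "v1.20.8---vmware.1-tkg.2"
STORAGE_CLASS = "vsan-r1"


def node_volumes(size="50Gi"):
    """Separate volumes for the container runtime and the kubelet (assumed paths)."""
    return [
        {"name": "containerd", "mountPath": "/var/lib/containerd", "capacity": {"storage": size}},
        {"name": "kubelet", "mountPath": "/var/lib/kubelet", "capacity": {"storage": size}},
    ]


def node_pool(name, replicas, vm_class):
    """One worker node pool; `name` is a hypothetical label for the pool."""
    return {
        "name": name,
        "replicas": replicas,
        "vmClass": vm_class,
        "storageClass": STORAGE_CLASS,
        "tkr": {"reference": {"name": TKR_NAME}},
        "volumes": node_volumes(),
    }


def cluster_manifest(name, namespace):
    """Assemble the manifest; field names assume the v1alpha2 TKC API."""
    return {
        "apiVersion": "run.tanzu.vmware.com/v1alpha2",
        "kind": "TanzuKubernetesCluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "topology": {
                "controlPlane": {
                    "replicas": 3,
                    "vmClass": "best-effort-small",
                    "storageClass": STORAGE_CLASS,
                    "tkr": {"reference": {"name": TKR_NAME}},
                },
                "nodePools": [
                    node_pool("gpu-workers", 6, "gpuclass-a100"),
                    node_pool("cpu-workers", 3, "best-effort-xlarge"),
                ],
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical cluster name and vSphere namespace.
    print(json.dumps(cluster_manifest("kubeflow-tkc", "kubeflow-ns"), indent=2))
```

Keeping the GPU and non-GPU workers in separate node pools mirrors the table: each pool carries its own VM class, so only the `gpuclass-a100` nodes receive a vGPU.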
Configure a Node in a Tanzu Kubernetes Cluster with vGPU Access
To configure a Tanzu Kubernetes cluster with vGPU access, the vSphere administrator should first enable Workload Management in the vSphere Client, then create the supervisor cluster and a content library subscribed to https://wp-content.vmware.com/v2/latest/lib.json. Next, the administrator creates the VM classes with vGPU access and a new namespace configured with those VM classes. Visit the link here for more details on the procedures and steps involved in this section.
In this solution, we configured the Tanzu Supervisor Cluster with haproxy v0.2.0 for load balancing. We added the pre-defined best-effort-small, best-effort-large, and best-effort-2xlarge VM classes to the namespace. To give the worker nodes vGPU access, we created and added a VM class (named gpuclass-a100) with the following specifications to the namespace:
Figure 3: VM Class Details
From the storage perspective, we added two vSAN storage policies to the namespace. The first, vsan-r1, was configured with RAID 1. The second, stripe, was configured with RAID 5 and StripeWidth=8 to maximize performance for the Tanzu Kubernetes cluster worker nodes.
Install the NVIDIA GPU Operator
After the Tanzu Kubernetes cluster is up and running, the developer logs in to the cluster and follows the instructions in this link to install the NVIDIA GPU operator. The installation requires the developer to provide the NVIDIA CLS or DLS license token and the NGC account information. Refer to the NVIDIA Licensing Guide here.
In this solution, we installed GPU operator v1.9.1. The script we used for installing the NVIDIA GPU operator can be found here.
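Once the operator is running, one common way to confirm that the vGPU worker nodes are schedulable is to launch a short-lived pod that requests a GPU and runs `nvidia-smi`. The sketch below only generates such a pod manifest as JSON and is not part of the installation script mentioned above. The pod name and image reference are hypothetical placeholders. It assumes the GPU is advertised under the `nvidia.com/gpu` resource name, which is what the NVIDIA device plugin uses by default.

```python
import json


def gpu_test_pod(image, name="gpu-smoke-test", gpus=1):
    """Build a one-shot pod that requests GPUs and runs nvidia-smi.

    `image` must be supplied by the caller; any CUDA-capable image available
    to the cluster should do. `name` is an illustrative placeholder.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Run once and stop, so the pod ends in Succeeded or Failed.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "cuda",
                    "image": image,
                    "command": ["nvidia-smi"],
                    # Requesting the extended resource forces scheduling
                    # onto a node that advertises a GPU.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical image reference; replace with an image your cluster can pull.
    print(json.dumps(gpu_test_pod("registry.example.local/cuda-base:latest"), indent=2))
```

If the pod stays in Pending, the GPU resource is likely not being advertised by any node, which usually points back to the operator installation or licensing steps.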
Monitoring Tools
Kubeflow Central Dashboard
The Kubeflow central dashboard provides quick access to the Kubeflow components deployed in your cluster. It shows a list of recent pipelines, notebooks, and metrics, along with an overview of your jobs as they are processed. See Central Dashboard to learn more.
vSAN Performance Service
The vSAN Performance Service monitors the performance of the vSAN environment and helps users investigate potential problems. The performance service collects and analyzes performance statistics and displays the data in a graphical format. You can use the performance charts to manage your workload and determine the root cause of problems.
Check out the solution Home Page for more information.