Kubeflow Technical Component

Technology Overview

The technology components in this solution are:

VMware vSphere
VMware Tanzu Kubernetes Grid
VMware vSAN File Service
Kubeflow

VMware vSphere

VMware vSphere is the industry’s leading virtualization and workload platform. vSphere 7 brings efficiency, scale, and security to AI and modern applications. AI infrastructure becomes part of an IT-managed environment in which specific GPU accelerators, compute, storage, and network resources can be provisioned for AI workload needs.

vSphere 7 supports modern GPUs such as the NVIDIA Ampere-based A100, including enhancements to performance-boosting GPUDirect communications. vSphere also supports NVIDIA Multi-Instance GPU (MIG) technology, which partitions a GPU to increase utilization while strictly separating the virtual machines (VMs) that share the GPU hardware.

With vSphere 7, developers and DevOps teams can use Kubernetes commands to provision VMs on hosts or Tanzu Kubernetes Grid clusters with vGPUs. This helps customers build and run their AI apps on GPU-enabled hardware using a self-service model, and gives them the means to build scalable AI applications.

VMware Tanzu Kubernetes Grid

VMware Tanzu Kubernetes Grid (TKG) provides organizations with a consistent, upstream-compatible, regional Kubernetes substrate that is ready for end-user workloads and ecosystem integrations. You can deploy Tanzu Kubernetes Grid across software-defined datacenters (SDDC) and public cloud environments, including vSphere, Microsoft Azure, and Amazon EC2.

Tanzu Kubernetes Grid provides the services that a production Kubernetes environment requires, such as networking, authentication, ingress control, and logging. It can simplify operations of large-scale, multi-cluster Kubernetes environments and keep your workloads properly isolated. It also automates lifecycle management, which reduces risk and lets you shift your focus to more strategic work.

This document describes the use of VMware Tanzu Kubernetes Grid Service to support machine learning workloads that are distributed across the nodes and servers in the cluster. The Tanzu Kubernetes Grid Service provides self-service lifecycle management of Tanzu Kubernetes clusters. You use the Tanzu Kubernetes Grid Service to create and manage Tanzu Kubernetes clusters in a declarative manner that is familiar to Kubernetes operators and developers.
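As a rough illustration of this declarative style, the following Python sketch builds a cluster specification as a plain dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. It assumes the `run.tanzu.vmware.com/v1alpha1` form of the `TanzuKubernetesCluster` resource; the cluster name, namespace, Kubernetes version, VM class, and storage class are hypothetical placeholders rather than values from this solution, and the schema may differ between vSphere releases.

```python
import json


def build_cluster_manifest(name, namespace, k8s_version, vm_class,
                           storage_class, control_plane_count=3,
                           worker_count=3):
    """Return a TanzuKubernetesCluster manifest as a plain dict.

    Assumes the v1alpha1 schema; every argument value is supplied by the
    caller and nothing here is taken from the validated solution.
    """
    def node_pool(count):
        # Control plane and workers share the same three fields in v1alpha1.
        return {"count": count, "class": vm_class,
                "storageClass": storage_class}

    return {
        "apiVersion": "run.tanzu.vmware.com/v1alpha1",
        "kind": "TanzuKubernetesCluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "distribution": {"version": k8s_version},
            "topology": {
                "controlPlane": node_pool(control_plane_count),
                "workers": node_pool(worker_count),
            },
        },
    }


# Hypothetical example values, chosen only for illustration.
manifest = build_cluster_manifest(
    name="kubeflow-cluster",
    namespace="ml-namespace",
    k8s_version="v1.21",
    vm_class="best-effort-large",
    storage_class="vsan-default-storage-policy",
    worker_count=4,
)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and applying it in the Supervisor Cluster namespace would request the cluster. The desired state lives in the manifest, so changing `worker_count` and re-applying is how the cluster would be scaled.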

VMware vSAN File Service

vSAN helps reduce the complexity of monitoring and maintaining infrastructure and enables administrators to rapidly provision a file share in a single workflow for Kubernetes-orchestrated cloud native applications. See the VMware vSAN documentation and the VMware vSAN 7.0 Update 3 Release Notes for more information.

vSAN File Service is a layer that sits on top of vSAN to provide file sharing services. It currently supports SMB, NFSv3, and NFSv4.1 file shares, and it hosts the file shares directly on the vSAN cluster. See vSAN File Services.

The NFS feature of vSAN File Service was used to provide ReadWriteMany (RWX) volumes for this solution.
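A ReadWriteMany volume is requested through an ordinary Kubernetes PersistentVolumeClaim whose access mode is `ReadWriteMany`. The sketch below builds such a claim as a Python dictionary and prints it as JSON. The claim name, namespace, storage class name, and size are hypothetical; the storage class is assumed to map to a vSAN storage policy on a cluster where vSAN File Service is enabled.

```python
import json


def build_rwx_claim(name, namespace, storage_class, size="10Gi"):
    """Return a PersistentVolumeClaim manifest requesting a shared volume.

    ReadWriteMany lets pods on different nodes mount the volume read-write
    at the same time, which is what a file-share backed volume allows.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical example values, chosen only for illustration.
claim = build_rwx_claim(
    name="shared-training-data",
    namespace="kubeflow-user",
    storage_class="vsan-default-storage-policy",
    size="50Gi",
)
print(json.dumps(claim, indent=2))
```

Several notebook or training pods can then reference the same claim name in their volume definitions to share datasets or model checkpoints.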

Kubeflow

Kubeflow is a free and open-source end-to-end machine learning platform designed to enable machine learning pipelines to orchestrate complicated workflows running on Kubernetes. Kubeflow provides components for each stage in the machine learning lifecycle, from exploration through to training and deployment.

This drawing is courtesy of the Kubeflow project website.

Figure 1: Kubeflow Application

Table 1 lists the main pillars of Kubeflow.

Table 1: Kubeflow Main Pillars

Component Name	Description
Central Dashboard	The central user interface (UI) in Kubeflow.
Kubeflow Notebooks	Kubeflow Notebooks provides a way to run web-based development environments inside your Kubernetes cluster by running them inside pods.
Kubeflow Pipelines	Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning workflows based on containers.
Katib	Katib is a project that is agnostic to machine learning frameworks. It can tune hyperparameters of applications written in any language of the users’ choice and natively supports many machine learning frameworks, such as TensorFlow, MXNet, PyTorch, XGBoost, and others.
Training Operators	Training of machine learning models in Kubeflow through operators.
KServe	KServe allows you to serve your models as scalable APIs and to perform canary releases.
Multi-Tenancy	Multi-user isolation and identity access management (IAM)

These Kubeflow components support multi-user isolation: Central Dashboard, Notebooks, Pipelines, AutoML (Katib), and KServe. Resources created by the notebooks (for example, training jobs and deployments) also inherit the same access.
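In upstream Kubeflow, multi-user isolation is commonly organized around a `Profile` resource, which gives a user a dedicated namespace. The sketch below builds such a resource as a Python dictionary. It is an assumption that this solution manages users this way; the profile name and owner email are hypothetical, and the API version shown (`kubeflow.org/v1`) may differ in older Kubeflow releases.

```python
import json


def build_profile(profile_name, owner_email):
    """Return a Kubeflow Profile manifest for a single user.

    The profile name also becomes the name of the namespace that the
    Kubeflow profile controller creates for the owner.
    """
    return {
        "apiVersion": "kubeflow.org/v1",  # assumed; check your release
        "kind": "Profile",
        "metadata": {"name": profile_name},
        "spec": {"owner": {"kind": "User", "name": owner_email}},
    }


# Hypothetical example values, chosen only for illustration.
profile = build_profile("data-science-team-a", "user@example.com")
print(json.dumps(profile, indent=2))
```

Notebooks, pipelines runs, and other resources the user creates would then live in that namespace, which is how they come to share the same access boundary.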

Kubeflow can organize loosely coupled microservices as a single unit and deploy them to a variety of locations, including a laptop, on-premises infrastructure, or the cloud. It is a platform for data scientists who want to build and experiment with machine learning pipelines, and for machine learning engineers and operational teams who want to deploy machine learning systems to various environments for development, testing, and production-level serving. See the Kubeflow website for more information.

Check out the solution Home Page for more information.

