A typical machine learning (ML) workflow includes stages such as data verification, feature engineering, model training, and scalable deployment. Kubeflow provides a collection of cloud native components for developing, automating, and maintaining all stages of the ML process in a Kubernetes cluster, either on-premises or in the cloud.
VMware vSphere 7 delivers Artificial Intelligence (AI) and Developer-Ready infrastructure, scales without compromise, and simplifies operations, all of which help enterprises adopt AI. The VMware and NVIDIA AI-Ready Enterprise software suite is an end-to-end, cloud-native suite of AI tools and frameworks, optimized and exclusively certified by NVIDIA to run on VMware vSphere. The suite handles the complexity associated with AI and ML efforts, giving organizations the confidence to update their infrastructure for AI and to use AI to transform their business.
This paper provides general design and deployment guidance for running Kubeflow on VMware vSphere® 7 with VMware Tanzu® Kubernetes Grid™, with GPU access enabled by NVIDIA Artificial Intelligence Enterprise (NVAIE). We also validate the core component functions to demonstrate that Kubeflow enables repeatable and reproducible machine learning workflows that can be shared among teams such as data scientists, machine learning engineers, and DevOps.