Kubeflow Validation

Kubeflow Function Validation

Introduction

Kubeflow allows a notebook-based modeling system to easily integrate with the data preparation on a local data lake or in the cloud in a similar way. Kubeflow supports multi-tenant machine learning environments by managing the container orchestration aspect of the infrastructure that enables simple and effective sharing.

We validated the core functions from Notebooks to Pipelines and model serving and showcased an integrated end-to-end Pipeline example:

  • Kubeflow Notebooks
  • Run TensorFlow example
  • Run PyTorch example
  • Run Pipeline example
  • KServe inference example
  • End-to-end Pipeline example

Kubeflow Notebooks

Kubeflow Notebooks provides a way to run web-based development environments inside your Kubernetes cluster by running inside pods. It provides several default images. System administrators can provide customized notebook images for their organization with required packages pre-installed.

Access control is managed by Kubeflow’s RBAC, enabling easier notebook sharing across the organization. Users can create notebook containers directly in the cluster.

Creating a Kubeflow Notebook

Data scientists can create notebook servers for their data preparation and model development.

To spin up a notebook, perform the following steps:

Click the Central Dashboard Notebooks tab and click New Notebook:

Graphical user interface, application</p>
<p>Description automatically generated

Figure 1: New Notebook Wizard

Note: Kubeflow uses “limits” in pod requests to provision GPUs onto the notebook pods (details about scheduling GPUs can be found in the Kubernetes Documentation). If we want to enable GPU on your notebook, in the GPU drop-down list, specify any “GPU Vendor” devices that your notebook server requests. In our environment, as Figure 1 shows, we select NVIDIA GPU.

We can configure a ReadWriteMany persistent volume according to Using ReadWriteMany Volumes on TKG Clusters. See example here.

Note: RWM volume is not natively supported with vSAN File Services in the current version.

Figure 2 shows the “external-nfs-pvc” volume in the Volumes Web UI, which is provisioned in the configuration section, also other ReadWriteOnce volumes in the user namespace are in the list.

Graphical user interface, application</p>
<p>Description automatically generated

Figure 2: RWM Volume Backed by vSAN File Service

We can attach the existing RWM volume to the new notebook.

Graphical user interface, application</p>
<p>Description automatically generated

Figure 3: Attach the Existing Volume to New Notebook

For more information, see the notebooks quickstart guide.

Run TensorFlow Example

We use the BERT for TensorFlow Jupyter Notebook for testing. Bidirectional Embedding Representations from Transformers (BERT) is a method of pre-training language representations, which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. NVIDIA's BERT is an optimized version of Google's official implementation. The notebook provides a worked example for utilizing the BERT for TensorFlow model scripts.

After deploying the tensorflow-cuda image notebook, click on CONNECT, since the scripts are based on TensorFlow 1.15 version, either change some of the deprecated API to new ones or build a customized image on the same tensorflow version to make the code pass.

We chose the notebook server image with tensorflow+cuda 11.
Graphical user interface, application, Teams</p>
<p>Description automatically generated

Follow the steps to run an example use case of the BERT model for end-user applications.

Figure 4 shows inference using GPU.
Graphical user interface, text, application</p>
<p>Description automatically generated

Figure 4: BERT_Jupyter Notebook using GPU

Figure 5  is an example prediction result for using the BERT for TensorFlow.
Text</p>
<p>Description automatically generated with medium confidence

Figure 5: Prediction Result

Run Pytorch YOLOV5 Example

We use YOLOV5 to verify the inference and validation, which is a family of object detection architectures and models pre-trained on the COCO dataset.

Notes: Require customized image to have pycocotools library installed (which needs gcc library installed, this is not included in the default kubeflow notebook images).

In our validation, the GPU is NVDIA A100, we installed pytorch with cuda v 11.3. In the Notebook, we first installed below:

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

pip3 install pycocotools

Then we followed the tutorial notebook to run the validation and inference case.

Text</p>
<p>Description automatically generated

Figure 6: YOLOV5 Validation on coco Dataset Screenshot Using A100 MIG

A picture containing text, person, screenshot</p>
<p>Description automatically generated

Figure 7: YOLOV5 Inference Screenshot

Run Pipeline Example

A Kubeflow Pipeline is a portable and scalable definition of a machine learning workflow, based on containers. Kubeflow Pipelines are reusable end-to-end machine learning workflows composed of a set of input parameters and a list of the steps using the Kubeflow Pipelines SDK.

You can follow https://www.kubeflow.org/docs/components/pipelines/tutorials/build-pipeline/ to upload a compiled pipeline.

Kubeflow Pipelines offers a few samples that you can use to try out the pipelines quickly.

To run a basic pipeline, perform the following steps:

  1. From the Kubeflow Pipeline UI

Click the name of the sample XGBoost-iterative model training in Figure 8, the source code is

https://github.com/kubeflow/pipelines/tree/master/samples/core/train_until_good

Graphical user interface, diagram</p>
<p>Description automatically generated

Figure 8: Example XGBoost Training Pipeline

A component in a pipeline can be responsible for data preprocessing, data transformation, model training, and so on.

The Artifacts include Pipeline packages, views, and large-scale metrics (time series). Use large-scale metrics to debug a pipeline run or investigate an individual run’s performance. Kubeflow Pipeline installation stores the artifacts in an artifact store Minio server by default. Below is the pipeline running log stored in Artifacts:

Graphical user interface, text, application, email</p>
<p>Description automatically generated

Figure 9: Artifact Stores in MinIO Server

The lineage explorer displays the running flow of pipeline components:

Graphical user interface</p>
<p>Description automatically generated

Figure 10: Artifacts for a Pipeline Running Log Lineage Explorer

From the Kubeflow-user-example-com namespace, we can also see the pipeline pod’s status is completed.

Text</p>
<p>Description automatically generated

Figure 11: Pipeline Pods Status Become Completed

For more details, see Kubeflow pipeline introduction.

KServe Inference Example

KServe enables serverless inferencing on Kubernetes and provides performant, high abstraction interfaces for common machine learning frameworks like TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX to handle production model serving use cases. For more details, visit the KServe website.

KServe provides a simple Kubernetes CRD to allow deploying single or multiple trained models onto model servers such as TFServing, TorchServe, ONNXRuntime, and Triton Inference Server. See samples for more information.

We validated the basic inference service which loads a simple iris machine learning model, sends a list of attributes, and prints the prediction for the class of iris plant, see the YAML file.

kubectl apply -f isvc.yaml -n kubeflow-user-example-com

The inference service will be ready as the figure shows.

Graphical user interface, text, application</p>
<p>Description automatically generated

Text</p>
<p>Description automatically generated

Figure 12: inference Service Becomes Ready

You can also check the inference service from the Model Servers tab in Figure 13.

Graphical user interface, text, application, email</p>
<p>Description automatically generated

Figure 13: Deploy the Inference Service in the Model Servers Tab

Figure 14 is the screenshot of Model server details including service URL, Storage URI, and Predictor type.

Graphical user interface, text, application, email</p>
<p>Description automatically generated

Figure 14: Inference Service Details

Figure 15 is an example notebook to do prediction using the deployed inference service.

Graphical user interface, text, application, email</p>
<p>Description automatically generated

Figure 15: Call Inference Service for Prediction Example

Check the examples running KServe on Istio/Dex to access the endpoint outside the cluster.

End-to-End Pipeline Example

We validated an integrated MNIST end-2-end pipeline test to perform the following tasks:

  • Hyperparameter tuning using Katib
  • Distributive training with the best hyperparameters using TFJob
  • Serve the trained model on local pvc using KServe

Before validation, make sure to set a default storageclass and install the python libraries in the requirements.txt, also you can set the parameters in the settings.py.

As shown in Figure 16, start the pipeline ./runner.sh

Figure 16: Kickoff the E2E Pipeline

We can monitor the pipeline running from the central dashboard.Graphical user interface, application</p>
<p>Description automatically generated

Figure 17: E2E mnist Pipeline Graph

Figure 17 shows the component running steps.

The first step is the Experiments to tune Hyperparameter using Katib. The Experiment uses a "random" algorithm and TFJob for the Trial's worker.

Chart, radar chart</p>
<p>Description automatically generated

Figure 18: Katib AutoML Hyperparameter Tuning

Then the pipeline created a pvc to store the model. Next is the TFJob runs the Chief and Worker with 1 replica, and last is serving the model using the KServe inference service. And the pipeline runs status changed from running to success.

Graphical user interface, application, Teams</p>
<p>Description automatically generated

Figure 19:  Pipeline Execution Steps and Status

 

Filter Tags

Document