Kubeflow Best Practice

Best Practices

The following recommendations provide best practices and sizing guidance for running Kubeflow on the AI-Ready platform on vSphere 7 with Tanzu.

  • Tanzu Kubernetes Grid:
    • Start with a smaller Tanzu Kubernetes cluster with fewer GPU worker nodes, because Kubeflow component pods do not consume GPU resources and need only limited CPU resources. As more workloads run in the system, you can dynamically resize the Tanzu Kubernetes cluster by adding GPU and non-GPU worker nodes; the NVIDIA GPU Operator automatically manages newly added GPU worker nodes.
    • Customize and pre-allocate enough CPU and memory resources for the Tanzu Kubernetes cluster. Refer to Performance Best Practices for Kubernetes with VMware Tanzu for sizing guidance for Tanzu Kubernetes Grid.
  • vSAN Storage:
    • Using the vSAN file service for ReadWriteMany persistent volumes makes it easy to scale out the file share while benefiting from the security, failure tolerance, performance, and capacity-saving features of vSAN. The architecture can also be rebalanced easily by changing the storage policy of the file share.
    • Set Failures to Tolerate (FTT) to 1 failure – RAID 1 (Mirroring). If space saving is a priority, use RAID 5 instead. For a large file share, use a stripe policy.
    • Enable vSAN Trim/Unmap to allow space reclamation for persistent volumes.
  • Kubeflow:
    • Use the latest stable Kubeflow version that matches the versions of the Kubernetes cluster and related tools.
    • Request enough CPU and RAM for notebooks and pods that run resource-intensive machine learning workloads.
    • If Kubeflow is deployed in an environment with restricted internet access, use a private registry.
    • For GPU-enabled jobs, the CUDA version in a prebuilt image may not be compatible with your cluster, so you may need to build a matching image.
    • Kubeflow is a loosely-coupled platform. You can use individual components to serve your specific needs in the machine learning workflow.
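
As a rough illustration of the resource-request recommendation in the Kubeflow list above, the following Python sketch builds a pod manifest with explicit CPU, memory, and GPU settings. The function name `build_notebook_pod`, the pod name, the registry host, and the sizes are all hypothetical placeholders. The `resources.requests` and `resources.limits` fields are standard Kubernetes, and `nvidia.com/gpu` is assumed to be the resource name exposed on the GPU worker nodes. Treat the values as a starting point to adjust after profiling your own workload.

```python
import json


def build_notebook_pod(name, image, cpu, memory, gpus=0):
    """Return a Kubernetes Pod manifest (as a dict) with explicit resources.

    All argument values are illustrative; size them for your own workload.
    """
    resources = {
        # Requests reserve capacity on the node for scheduling purposes.
        "requests": {"cpu": cpu, "memory": memory},
        # Limits cap what the container may consume at run time.
        "limits": {"cpu": cpu, "memory": memory},
    }
    if gpus > 0:
        # GPUs are extended resources and are specified under limits;
        # Kubernetes uses the limit as the request when none is given.
        resources["limits"]["nvidia.com/gpu"] = gpus

    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {"name": name, "image": image, "resources": resources}
            ]
        },
    }


if __name__ == "__main__":
    # Hypothetical names and sizes, for illustration only. The image is
    # pulled from a private registry, as suggested for restricted networks.
    pod = build_notebook_pod(
        name="training-notebook",
        image="registry.example.internal/ml/notebook:latest",
        cpu="4",
        memory="16Gi",
        gpus=1,
    )
    print(json.dumps(pod, indent=2))
```

Setting requests equal to limits, as this sketch does, is one possible design choice that keeps a resource-intensive notebook from being squeezed by neighboring pods; a lower request with a higher limit may suit bursty workloads better.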
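
The space trade-off between RAID 1 and RAID 5 mentioned in the vSAN Storage list above can be estimated with simple arithmetic. The sketch below assumes the commonly documented layouts for FTT=1 on vSphere 7: RAID 1 keeps two full copies of the data, and RAID 5 uses three data components plus one parity component. These multipliers, the function name `raw_capacity_gib`, and the example share size are assumptions for illustration, and the estimate ignores other overheads such as slack space and metadata.

```python
# Assumed raw-capacity multipliers for FTT=1:
#   RAID 1 (Mirroring): 2 full copies of the data          -> 2x
#   RAID 5 (Erasure Coding): 3 data + 1 parity components  -> 4/3x
CAPACITY_MULTIPLIER = {
    "RAID1": 2.0,
    "RAID5": 4.0 / 3.0,
}


def raw_capacity_gib(usable_gib, raid_level):
    """Estimate the raw vSAN capacity consumed by a file share.

    usable_gib: capacity the file share presents to its consumers.
    raid_level: "RAID1" or "RAID5" (both with FTT=1 in this sketch).
    """
    if raid_level not in CAPACITY_MULTIPLIER:
        raise ValueError(f"unsupported RAID level: {raid_level}")
    return usable_gib * CAPACITY_MULTIPLIER[raid_level]


if __name__ == "__main__":
    share_gib = 300  # hypothetical file share size
    for level in ("RAID1", "RAID5"):
        raw = raw_capacity_gib(share_gib, level)
        print(f"{level}: about {raw:.0f} GiB raw for {share_gib} GiB usable")
```

Under these assumptions RAID 5 consumes roughly one third less raw capacity than RAID 1 for the same usable size, which is the space saving the recommendation refers to.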

Check out the solution Home Page for more information.
