Tanzu Kubernetes Grid Service Troubleshooting Deep Dive - Part 1
DevOps teams get on-demand access to Kubernetes clusters through the Tanzu Kubernetes Grid Service (TKG) by submitting a declarative YAML specification to the Supervisor cluster Kubernetes API. The TKG Service reconciles that specification into a set of VMs configured as Kubernetes nodes and joined together to form a Tanzu Kubernetes Grid cluster (TKC). Forming the cluster involves a fairly complicated set of interactions between custom resources created in the Supervisor cluster, custom controllers that act on those resources, and provider interfaces that implement the underlying compute, networking and storage for the cluster. For the VI Admin whose day job is not Kubernetes, knowing where to look for failure information and how to interpret it can be overwhelming. Because TKG uses a Kubernetes cluster (the Supervisor cluster) to lifecycle manage clusters that DevOps can spin up on their own (Tanzu Kubernetes Grid clusters, or TKCs), troubleshooting requires some understanding of how the Supervisor cluster is being used and of the Kubernetes constructs that support it.
The Kubernetes API
The API is the core of the Kubernetes control plane. It exposes a REST interface that DevOps can use to interact with built-in Kubernetes objects. These built-in objects, such as Deployments, Pods, Services, ConfigMaps and Secrets, are associated with controllers, each responsible for ensuring that the actual state of its objects stays in sync with the desired state defined in the object specification. More information on working with Kubernetes objects can be found HERE.
The API provides more than just interactions with standard objects. It is extensible. Developers can create Custom Resources by defining new endpoints in the API that provide customization not available in the default Kubernetes deployment. Once the resources are installed, users interact with them through kubectl in the same way they interact with built-in resources. By themselves, Custom Resources are not particularly useful: they only store structured information taken from the specification that was submitted. Custom controllers can be written that reconcile those custom resources into something interesting.
As an example, the VMware VM Service defines a custom resource, through a Custom Resource Definition (CRD), called VirtualMachine. Users can submit a YAML specification with kind VirtualMachine to the API, and the API will create the object. A custom controller watches for the creation of VirtualMachine objects and makes API calls to vSphere to reconcile each resource into an actual VM with the configuration defined in the custom resource object. A more expansive introduction to custom resources in the Kubernetes API can be found HERE.
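To make the reconcile idea concrete, here is a minimal sketch in Python. It is a toy model, not the VM Service controller: the class names, the `cpus` and `memory_gb` fields, and the in-memory `FakeInfrastructure` are all assumptions made up for illustration. The only point is the pattern of comparing desired state (the spec) with actual state and acting on the difference.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualMachineResource:
    """Toy stand-in for a custom resource (hypothetical, not the real schema)."""
    name: str
    spec: dict                                   # desired state submitted by the user
    status: dict = field(default_factory=dict)   # written back by the controller


class FakeInfrastructure:
    """Hypothetical in-memory stand-in for the platform the controller calls."""

    def __init__(self):
        self.vms = {}

    def get(self, name):
        # Returns the current configuration, or None if the VM does not exist.
        return self.vms.get(name)

    def create_or_update(self, name, config):
        self.vms[name] = dict(config)


def reconcile(resource, infra):
    """Drive actual state toward resource.spec; return True if a change was made."""
    actual = infra.get(resource.name)
    changed = actual != resource.spec
    if changed:
        # Actual state is missing or has drifted, so bring it in line with the spec.
        infra.create_or_update(resource.name, resource.spec)
    # Surface a summary of the outcome on the resource itself.
    resource.status = {"phase": "Ready"}
    return changed


infra = FakeInfrastructure()
vm = VirtualMachineResource("demo-vm", {"cpus": 2, "memory_gb": 4})
changed_first = reconcile(vm, infra)    # VM does not exist yet, so it is created
changed_second = reconcile(vm, infra)   # already in sync, so nothing is changed
```

Running reconcile repeatedly is safe because it only acts when actual and desired state differ. That property is what lets a controller retry after a failure.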
Cluster API is an open source Kubernetes sub-project that brings a declarative API to the lifecycle management of Kubernetes clusters. This means that DevOps can deploy and manage their Kubernetes clusters using the same declarative patterns they use for application workloads. Cluster API also defines a set of provider interfaces that allow plug and play of the underlying infrastructure in support of the cluster. Things like virtual machines, networks, storage or load balancers can be defined through the spec and reconciled on the developer's preferred platform. The Tanzu Kubernetes Grid (TKG) Service is an implementation of Cluster API that supports vSphere and other cloud platforms as providers. Fundamental to a Cluster API implementation is an existing Kubernetes cluster, referred to as the Management cluster. The Management cluster contains all of the Custom Resource Definitions (CRDs) and controllers used to reconcile an instantiated TKC. The Supervisor Cluster is the Cluster API Management cluster for the TKG Service. Detailed information on the Cluster API project is available HERE.
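The plug-and-play provider idea can be illustrated with a small Python sketch. This is only an analogy: real Cluster API providers are sets of CRDs and controllers, not Python classes, and every name below (`InfrastructureProvider`, `create_machine`, the two fake providers, the ID formats) is hypothetical. What it shows is that the cluster-level logic stays the same while the infrastructure behind it is swapped.

```python
from typing import List, Protocol


class InfrastructureProvider(Protocol):
    """Hypothetical provider contract used only for this illustration."""

    def create_machine(self, name: str, role: str) -> str:
        """Create a machine and return an opaque provider-specific ID."""
        ...


class FakeVSphereProvider:
    def create_machine(self, name: str, role: str) -> str:
        return f"vsphere://{name}"


class FakeCloudProvider:
    def create_machine(self, name: str, role: str) -> str:
        return f"cloud://{name}"


def provision_cluster(cluster_name: str, control_plane: int, workers: int,
                      provider: InfrastructureProvider) -> List[str]:
    """Create the requested machines through whichever provider is plugged in."""
    machine_ids = []
    for i in range(control_plane):
        machine_ids.append(
            provider.create_machine(f"{cluster_name}-control-plane-{i}", "control-plane"))
    for i in range(workers):
        machine_ids.append(
            provider.create_machine(f"{cluster_name}-worker-{i}", "worker"))
    return machine_ids


# Same cluster definition, two different infrastructure back ends.
on_vsphere = provision_cluster("demo", 1, 2, FakeVSphereProvider())
on_cloud = provision_cluster("demo", 1, 2, FakeCloudProvider())
```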
Tanzu Kubernetes Grid Service
Deployment of a new cluster starts with the TanzuKubernetesCluster (TKC) custom resource and the TKG controller. DevOps users submit a TKC specification, and the TKG controller generates specifications for the custom resources defined in Cluster API. Those resources are in turn reconciled into the specifications needed to create the appropriately configured VMs and to push OS configuration into them so that they become Kubernetes cluster nodes. What is important to know at this point is that specifications are pushed down from the TKC resource into the appropriate lower-level resources, and those lower-level resources surface status information in summary form back up the stack. So troubleshooting starts with the TKC resource and then drills into the lower-level resources for more detail. An overview of those resources and the cluster creation milestones will be in Part 2 of this series.
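The "specs flow down, status flows up" pattern, and the top-down troubleshooting approach it suggests, can be sketched in Python. The resource tree here is deliberately simplified and the failure message is invented. The real set of lower-level resources is covered in Part 2, so treat the kinds and names as placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    """Toy resource with its own readiness plus child resources it generated."""
    kind: str
    name: str
    ready: bool = True
    message: str = ""
    children: List["Resource"] = field(default_factory=list)

    def summary_ready(self) -> bool:
        # Status flows up: a parent is ready only if it and all children are ready.
        return self.ready and all(c.summary_ready() for c in self.children)


def find_failures(resource, path=()):
    """Start at the top and drill down to the deepest not-ready resources."""
    here = path + (f"{resource.kind}/{resource.name}",)
    failing_children = [c for c in resource.children if not c.summary_ready()]
    if failing_children:
        found = []
        for child in failing_children:
            found.extend(find_failures(child, here))
        return found
    if not resource.ready:
        return [(" -> ".join(here), resource.message)]
    return []


# Simplified, illustrative hierarchy with one failing worker machine.
worker = Resource("Machine", "demo-worker-0", ready=False,
                  message="VM creation failed")  # invented message
control_plane = Resource("Machine", "demo-control-plane-0")
cluster = Resource("Cluster", "demo", children=[control_plane, worker])
tkc = Resource("TanzuKubernetesCluster", "demo", children=[cluster])

failures = find_failures(tkc)
```

In this model the TKC reports not ready without saying why; the detail lives on the failing lower-level resource, which is why the drill-down is needed.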
For more information and to see how this works, check out the first video in the Tanzu Kubernetes Grid Troubleshooting deep dive series:
Quick Links to Entire Troubleshooting Blog/Video Series.
Abstract: Define Kubernetes custom resources and Cluster API, and show how they are used to lifecycle manage TKG clusters. Focus on how that impacts troubleshooting.
Abstract: Show how the TKC is decomposed into a set of resources that are an implementation of Cluster API. Focus on how that impacts troubleshooting.
Abstract: Describe cluster creation milestones and identify common failures and remediation at each level.
Special Thanks to Winnie Kwon, VMware Senior Engineer. Her engineering documents and willingness to answer many questions were the basis for the creation of this series of blogs/videos.