July 20, 2021

Tanzu Kubernetes Grid Service Troubleshooting Deep Dive - Part 2

This is the second in a series of blogs and supporting videos that dive into the components of the TKG Service and provide a roadmap for troubleshooting failures. The focus here is on the Cluster API and VirtualMachine resources that are generated through the cluster creation milestones; these will serve as a baseline for the troubleshooting videos that follow.

The Tanzu Kubernetes Grid (TKG) cluster is made up of a set of virtual machines configured as Kubernetes nodes and joined into a cluster. The creation and desired-state maintenance of those clusters involves a hierarchy of Kubernetes custom resources and controllers deployed into the Supervisor Cluster. The first blog/video in this TKG troubleshooting series focused on an introduction to Kubernetes custom resources and the Cluster API open source project, along with the creation of the Tanzu Kubernetes Cluster (TKC) resource. The Part 1 blog/video is available here. The TKC and its associated TKG Controller are the top level in the hierarchy of resources that reconcile the TKG cluster. This Part 2 blog/video will look at how the TKC is decomposed into a set of resources that are an implementation of Cluster API. It will also look at how the Cluster API resources are further reconciled into VirtualMachine resources that hold the configuration required to create VMs on vSphere and form a Kubernetes cluster. To jump directly to the Part 2 video, click here.

Hierarchy of Custom Resources and Controllers

The simplified view is that the Tanzu Kubernetes Cluster (TKC) resource is reconciled by the TKG Controller into a set of resources or objects that are an implementation of Cluster API. The TKG Controller also creates the vSphere provider-specific VirtualMachine resources. The reconciliation process for all of these resources encompasses a set of controllers that are responsible for monitoring health and updating the appropriate configuration and status for the particular set of resources they watch. As we drill into troubleshooting in the next video in the series, you will see that each of the controllers keeps a log of its activities. Transitions that happen to individual resources can be found by describing that resource. Additionally, we have implemented a pattern where lower-level activity is reflected into higher-level objects in summary form. So troubleshooting starts by describing the TKC resource and potentially viewing the log for the TKG Controller.
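
As a concrete starting point, a first pass at a troubled cluster might look like the sketch below. The cluster and namespace names are placeholders in the style used throughout this blog; the vmware-system-tkg namespace is where the TKG Controller pods run in our environment, but verify the exact pod names in your release:

kubectl describe tkc "ClusterName" -n "NamespaceName"       # summary status and events rolled up from lower-level resources
kubectl get pods -n vmware-system-tkg                       # locate the TKG Controller pods
kubectl logs "TKGControllerPodName" -n vmware-system-tkg    # activity log for the TKG Controller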

TKG Controller Creates Cluster API Objects

The TKG Controller reconciles the TKC resource into a set of YAML specifications that define the Cluster API custom resources. The objects are logically grouped into Cluster, Control Plane, and Workers. Further, Cluster API separates the generic specification that is independent of the particular provider platform (vSphere, Azure, AWS, etc.) from, in our case, vSphere-specific configuration. The Cluster object contains the definition of the cluster itself and makes reference to the WCPCluster and KubeadmControlPlane objects, which hold, respectively, the vSphere-specific configuration and the kubeadm configuration used to turn the virtual machines into Kubernetes nodes. The WCPMachineTemplate contains the definition for the underlying virtual machines that will become the control plane nodes. Note that WCP stands for Workload Control Plane and is the engineering name for the vCenter service that implements the Supervisor Cluster. More interesting from a troubleshooting perspective than the Spec is the Status section of the objects. The Cluster reports on the status of control plane and infrastructure availability, while the WCPCluster is focused on the specifics of infrastructure availability, particularly networking and load balancing. So if you saw an infrastructure error at the TKC resource level or the Cluster resource level, you might check the WCPCluster resource for more detail.
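
To see all of these generated objects at once, you can query them in a single command. This is a sketch using placeholder names; the resource short names come from the CRDs registered in the Supervisor Cluster, and kubectl api-resources will confirm them if your release differs:

kubectl get cluster,wcpcluster,kubeadmcontrolplane,wcpmachinetemplate -n "NamespaceName"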

Cluster

Status:
  Conditions:
    Last Transition Time:     2021-07-13T18:46:40Z
    Status:                   True
    Type:                     Ready
    Last Transition Time:     2021-07-13T18:46:40Z
    Status:                   True
    Type:                     ControlPlaneReady
    Last Transition Time:     2021-07-13T18:38:15Z
    Status:                   True
    Type:                     InfrastructureReady
  Control Plane Initialized:  true
  Control Plane Ready:        true
  Infrastructure Ready:       true
  Observed Generation:        3
  Phase:                      Provisioned

WCPCluster

Status:
  Conditions:
    Last Transition Time:  2021-07-13T18:38:14Z
    Status:                True
    Type:                  Ready
    Last Transition Time:  2021-07-13T18:38:06Z
    Status:                True
    Type:                  ClusterNetworkReady
    Last Transition Time:  2021-07-13T18:38:14Z
    Status:                True
    Type:                  LoadBalancerReady
    Last Transition Time:  2021-07-13T18:38:06Z
    Status:                True
    Type:                  ResourcePolicyReady
  Ready:                   true
  Resource Policy Name:    tkg-cluster
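
Both Status blocks above come from describing the resources directly. A minimal sketch, using the same placeholder style:

kubectl describe cluster "ClusterName" -n "NamespaceName"      # generic view: ControlPlaneReady, InfrastructureReady
kubectl describe wcpcluster "ClusterName" -n "NamespaceName"   # vSphere view: networking, load balancing, resource policy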

The Control Plane objects are the next level of detail that makes up the Cluster. Each of the control plane nodes is a Machine in Cluster API. There will be a generic Machine resource and a provider-specific machine resource called WCPMachine for each control plane node. There is also a KubeadmConfig resource that holds an abstraction of the kubeadm configuration that will be used to set up the nodes as Kubernetes cluster members. Worker nodes are configured through a very similar process to control plane nodes.
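
A quick, hedged way to enumerate these per-node objects (placeholder namespace; expect one Machine/WCPMachine pair per node):

kubectl get machine,wcpmachine -n "NamespaceName"   # generic and provider-specific resources for each node
kubectl get kubeadmconfig -n "NamespaceName"        # bootstrap configuration used to join nodes to the cluster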

Let's check the status of the Control Plane Machine by issuing the kubectl describe machine "ControlPlaneNodeName" command. This is where you can see that the infrastructure is ready and that each of the Kubernetes control plane components is healthy (controller manager, API server, etcd, scheduler, etc.). Errors that show up in the infrastructure conditions would necessitate a look at the corresponding WCPMachine. Notice that the WCPMachine status contains the VM IP and the condition of the virtual machine. As we will see in a moment, this information is reflected back to this resource from the VirtualMachine resource that is the source of truth for the virtual machine.

Control Plane Machine

Status:
  Bootstrap Ready:  true
  Conditions:
    Last Transition Time:  2021-07-13T18:45:12Z
    Status:                True
    Type:                  Ready
    Last Transition Time:  2021-07-13T18:56:53Z
    Status:                True
    Type:                  APIServerPodHealthy
    Last Transition Time:  2021-07-13T18:38:24Z
    Status:                True
    Type:                  BootstrapReady
    Last Transition Time:  2021-07-17T10:58:54Z
    Status:                True
    Type:                  ControllerManagerPodHealthy
    Last Transition Time:  2021-07-18T13:16:49Z
    Status:                True
    Type:                  EtcdMemberHealthy
    Last Transition Time:  2021-07-13T18:56:53Z
    Status:                True
    Type:                  EtcdPodHealthy
    Last Transition Time:  2021-07-13T18:56:50Z
    Status:                True
    Type:                  HealthCheckSucceeded
    Last Transition Time:  2021-07-13T18:45:12Z
    Status:                True
    Type:                  InfrastructureReady
    Last Transition Time:  2021-07-13T18:56:49Z
    Status:                True
    Type:                  NodeHealthy
    Last Transition Time:  2021-07-13T18:56:53Z
    Status:                True
    Type:                  SchedulerPodHealthy
  Infrastructure Ready:    true
  Last Updated:            2021-07-13T18:56:50Z
  Node Ref:
    API Version:        v1
    Kind:               Node
    Name:               tkg-cluster-control-plane-xztwk
    UID:                74b8ce8c-a282-4caf-9545-abd3696361ba
  Observed Generation:  3
  Phase:                Running

Control Plane WCPMachine

Status:
  Conditions:
    Last Transition Time:  2021-07-13T18:45:12Z
    Status:                True
    Type:                  Ready
    Last Transition Time:  2021-07-13T18:45:12Z
    Status:                True
    Type:                  VMProvisioned
  Ready:                   true
  Vm ID:                   4214a174-360d-09f4-2672-6bf3bc323984
  Vm Ip:                   192.168.120.8
  Vmstatus:                ready


Once the Machine objects are ready, the VirtualMachine custom resources are created and the VM Service (also known as the VM Operator Controller) reconciles those resources into API calls to vCenter that instantiate the actual VMs. Note the difference between a VirtualMachine resource, which holds the specification and status of the VM, and the actual virtual machine created in vCenter. This has been a point of confusion for those new to Kubernetes and the VM Service.
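
One way to keep the two straight: the VirtualMachine resource is what you query with kubectl in the Supervisor Cluster namespace, while the VM itself appears in the vCenter inventory. For example:

kubectl get virtualmachines -n "NamespaceName"   # the Kubernetes resources: desired spec plus last observed status
# The actual VMs live in vCenter; correlate them using the Unique ID (MOID) shown in the resource status.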

The VirtualMachine Status is very specific to a vSphere-deployed VM. kubectl describe vm "VMName" shows things like which host the VM is deployed on, power state, IP address, and the Managed Object Reference ID (MOID).

Virtual Machine

Status:
  Bios UUID:              4214a174-360d-09f4-2672-6bf3bc323984
  Change Block Tracking:  false
  Conditions:
    Last Transition Time:  2021-07-13T18:48:59Z
    Status:                True
    Type:                  Ready
    Last Transition Time:  2021-07-13T18:38:26Z
    Status:                True
    Type:                  VirtualMachinePrereqReady
  Host:                    esx-02a.corp.local
  Instance UUID:           50143365-56f2-636e-8207-40a932645926
  Network Interfaces:
    Connected:  true
    Ip Addresses:
      192.168.120.8/24
      fe80::250:56ff:fe94:a347/64
    Mac Address:  00:50:56:94:a3:47
    Connected:    true
    Ip Addresses:
      fe80::748b:89ff:fee1:2fda/64
    Mac Address:  76:8b:89:e1:2f:da
    Connected:    true
    Ip Addresses:
      fe80::fc5e:bff:feea:a037/64
    Mac Address:  fe:5e:0b:ea:a0:37
    Connected:    true
    Ip Addresses:
      fe80::3c67:bfff:fe47:8886/64
    Mac Address:  3e:67:bf:47:88:86
    Connected:    true
    Ip Addresses:
      fe80::ece4:b8ff:fe48:5c32/64
    Mac Address:  ee:e4:b8:48:5c:32
    Connected:    true
    Ip Addresses:
      fe80::6cf6:88ff:fe55:62ed/64
    Mac Address:  6e:f6:88:55:62:ed
    Connected:    true
    Ip Addresses:
      fe80::40f2:69ff:fe7a:6d51/64
    Mac Address:  42:f2:69:7a:6d:51
  Phase:          Created
  Power State:    poweredOn
  Unique ID:      vm-5021
  Vm Ip:          192.168.120.8
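
The Unique ID above is the vCenter Managed Object ID, which makes it easy to correlate a VirtualMachine resource with the VM in the vSphere inventory. A hedged one-liner (the field name uniqueID is inferred from the describe output above, so verify with -o yaml):

kubectl get vm "VMName" -n "NamespaceName" -o jsonpath='{.status.uniqueID}'   # returns something like vm-5021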

Cluster API Controllers

This blog is not meant to be an exhaustive explanation of the reconciliation of each custom resource by the controllers that watch them, but rather to provide insight into the hierarchy of resources and basic navigation up and down the stack when attempting to troubleshoot an issue. When an error surfaces in the TKC resource, describing the appropriate lower-level custom resources often provides the root cause. Sometimes that is not the case and further investigation needs to be done in the controller that reconciles the resource. You see this information through the kubectl logs "PodName" -n "NamespaceName" command. In this case it is useful to have some understanding of which controllers act on a particular resource. Let's take a high-level look at a couple of these controllers. Note that most controllers are deployed with three replicas in order to increase availability. You may need to check the logs for multiple pods to determine which one is the leader and is writing log info.

The Cluster API Controller (CAPI) reconciles all of the Cluster API resources except the WCP-specific objects. Reconciliation includes ensuring the health and desired state of the objects as well as reflecting the appropriate information between the various objects.

The Cluster API Controller for WCP (CAPW) reconciles the WCPMachine resources and creates the VirtualMachine resources. It also interacts with VirtualNetwork resources and can be a starting point for troubleshooting networking issues.
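
If networking looks like the culprit, a reasonable first step is to list the VirtualNetwork objects in the cluster's namespace. These resources are present on NSX-T based Supervisor Clusters; names and kinds may differ with other networking stacks:

kubectl get virtualnetwork -n "NamespaceName"                      # network objects CAPW interacts with
kubectl describe virtualnetwork "NetworkName" -n "NamespaceName"   # conditions and status detail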

The Virtual Machine Operator Controller reconciles the VirtualMachine resources into virtual machines through API calls to vCenter. The VM Operator makes use of the VirtualMachineImage resource, which holds the base image for the VM, and the VirtualMachineClass resource, which defines the available resources (vCPU, RAM, and soon GPUs) for the VM.

vmware-system-capw                          capi-controller-manager-644998658d-9mg7d                          2/2     Running     2          5d19h
vmware-system-capw                          capi-controller-manager-644998658d-ffgdt                          2/2     Running     0          4d14h
vmware-system-capw                          capi-controller-manager-644998658d-kxk6v                          2/2     Running     3          5d19h
vmware-system-capw                          capi-kubeadm-bootstrap-controller-manager-65b8d5c4dc-b9q76        2/2     Running     0          4d14h
vmware-system-capw                          capi-kubeadm-bootstrap-controller-manager-65b8d5c4dc-rqbfg        2/2     Running     3          5d19h
vmware-system-capw                          capi-kubeadm-bootstrap-controller-manager-65b8d5c4dc-vxv7m        2/2     Running     0          5d19h
vmware-system-capw                          capi-kubeadm-control-plane-controller-manager-8565c86bbf-9ss6p    2/2     Running     3          5d19h
vmware-system-capw                          capi-kubeadm-control-plane-controller-manager-8565c86bbf-c22jl    2/2     Running     0          4d14h
vmware-system-capw                          capi-kubeadm-control-plane-controller-manager-8565c86bbf-dwhsc    2/2     Running     1          5d19h
vmware-system-capw                          capw-controller-manager-58d769bd99-w2h7p                          2/2     Running     2          5d19h
vmware-system-capw                          capw-controller-manager-58d769bd99-w56hq                          2/2     Running     0          4d14h
vmware-system-capw                          capw-controller-manager-58d769bd99-z2rft                          2/2     Running     3          5d19h
vmware-system-vmop                          vmware-system-vmop-controller-manager-7578487c6f-85tkc            2/2     Running     1          5d19h
vmware-system-vmop                          vmware-system-vmop-controller-manager-7578487c6f-pxd88            2/2     Running     6          5d19h
vmware-system-vmop                          vmware-system-vmop-controller-manager-7578487c6f-wbth6            2/2     Running     0          4d14h
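
Using the listing above, checking a controller's logs is a matter of walking the replicas until you find the leader. A sketch with the CAPW pod names shown above (the 2/2 READY column means two containers per pod, so --all-containers saves guessing the container name):

kubectl logs capw-controller-manager-58d769bd99-w2h7p -n vmware-system-capw --all-containers | tail -50
# If this replica shows no recent activity, repeat for the other two; only the leader writes reconcile logs.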

The goal of this blog was not to make you an expert on Cluster API and the vSphere with Tanzu implementation, but to build a basic understanding of the structure of the custom resources. The TanzuKubernetesCluster (TKC) is the top-level resource. kubectl describe tkc "ClusterName" is the troubleshooting starting point and will tell you the status of the cluster. Depending on where error messages show up, you will generally look at the WCPMachine and VirtualMachine objects for more detail. Errors in these components - or their absence entirely - mean you go to the associated controller and check the logs for a root cause. The following video will walk through some of the material in this blog and look at the resources in a live environment. Follow-on videos in this series will look at common errors and walk through some troubleshooting scenarios.

Tanzu Kubernetes Grid (TKG) Troubleshooting Deep Dive Part 2 Video

Quick Links to Entire Troubleshooting Blog/Video Series.

Abstract: Define Kubernetes custom resources and Cluster API, and show how they are used to lifecycle TKG clusters. Focus on how that impacts troubleshooting.

Blog: Troubleshooting Tanzu Kubernetes Grid Clusters - Part 1

Video: Troubleshooting Tanzu Kubernetes Grid Clusters - Part 1

Abstract: Show how the TKC is decomposed into a set of resources that are an implementation of Cluster API.  Focus on how that impacts troubleshooting

Blog: Troubleshooting Tanzu Kubernetes Grid Clusters - Part 2

Video: Troubleshooting Tanzu Kubernetes Grid Clusters - Part 2

Abstract: Walk through the cluster creation milestones and identify common failures and remediation at each level. Focus on how that impacts troubleshooting.

Blog: Troubleshooting Tanzu Kubernetes Grid Clusters - Part 3

Video: Troubleshooting Tanzu Kubernetes Grid Clusters - Part 3

Special Thanks to Winnie Kwon, VMware Senior Engineer. Her engineering documents and willingness to answer many questions were the basis for the creation of this series of blogs/videos.
