August 09, 2021

Tanzu Kubernetes Grid Service Troubleshooting Deep Dive - Part 3

This is the third in a series of blogs and supporting videos that dives into the components of the TKG Service and provides a roadmap for troubleshooting failures. The focus here is the cluster creation milestones and troubleshooting possible misconfigurations or failures at each level.

The Tanzu Kubernetes Grid (TKG) Service enables DevOps teams to deploy upstream-aligned, fully conformant Kubernetes clusters through straightforward YAML specifications.  Those specifications reference underlying infrastructure managed by IT Admin teams.  Failures in cluster deployment are often related to misconfiguration, access controls or resource constraints in that infrastructure.  Consequently, IT Admins are drawn into troubleshooting activity that requires some understanding of Kubernetes and Cluster API.

The first blog/video in this TKG Troubleshooting series focused on an introduction to Kubernetes custom resources and the Cluster API open source project, along with the creation of the Tanzu Kubernetes Cluster (TKC) resource.  The Part 1 blog/video is available here.

The second blog/video looked at how the TKC is decomposed into a set of resources that are an implementation of Cluster API.  It also looked at how the Cluster API resources are further reconciled into VirtualMachine resources that hold the configuration required to create VMs on vSphere and form a Kubernetes cluster.  The Part 2 blog/video is here.

This blog/video will look at the cluster creation milestones and identify common failures at each level.  It will also look at the custom resources and controllers that might hold relevant error messages and provide potential resolutions.

Cluster Creation Milestones

 

Milestone 1: TanzuKubernetesCluster (TKC) Validation

The cluster specification is parsed and sent to the Supervisor cluster Kubernetes API.  The first step is to ensure that the user presents a valid token that authenticates them to access the API.  Users log in to be authenticated and receive the token.  Admission controllers have been added to the API to check that the resources defined in the specification are valid.  These checks include verifying that the number of control plane nodes is either 1 or 3, that the storage policy has been assigned to the namespace, and that the Kubernetes version maps to a valid VM image available to the cluster.
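As a reference for what those checks are validating, here is a minimal TanzuKubernetesCluster spec sketch.  The names, VM class, storage class and version below are illustrative assumptions; each must match something actually assigned to (or available in) your vSphere Namespace.

apiVersion: run.tanzu.vmware.com/v1alpha1
kind: TanzuKubernetesCluster
metadata:
  name: tkg-cluster                               # illustrative cluster name
  namespace: demo-namespace                       # must be an existing vSphere Namespace
spec:
  distribution:
    version: v1.20.2                              # must resolve to a VM image available to the cluster
  topology:
    controlPlane:
      count: 1                                    # validated: must be 1 or 3
      class: best-effort-small                    # VM class; validated later, at VirtualMachine creation
      storageClass: vsan-default-storage-policy   # validated: must be assigned to the namespace
    workers:
      count: 2
      class: best-effort-small
      storageClass: vsan-default-storage-policy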


Once the admission controller checks are complete, the TKC resource is created.  As you will see in milestone 2, the TKC is reconciled by the TKG controller and various CAPI/CAPW objects are created.  CAPW handles the creation of VirtualMachine resources and that is the point where the VMClass gets validated.   This can be confusing because the TKC will be created, but cluster reconciliation will fail at VirtualMachine resource creation.  I've included it in milestone 1 because it is a validation of the specification.


This is a good point to talk about TKC status.  Status provides a comprehensive view into the progress of cluster creation.  Lower-level resources are designed to "reflect" their status up the stack to higher-level resources.  TKC status will show a summary of error messages that might have occurred lower in the stack and should be the starting point for all troubleshooting.  kubectl describe tkc "cluster name" will show the status.

Status:
  Conditions:
    Last Transition Time:  2021-07-29T14:58:09Z
    Status:                True
    Type:                  AddonsReady
    Last Transition Time:  2021-07-29T14:58:50Z
    Status:                True
    Type:                  ControlPlaneReady
    Last Transition Time:  2021-07-27T17:50:11Z
    Message:               1/1 Control Plane Node(s) healthy. 2/2 Worker Node(s) healthy
    Status:                True
    Type:                  NodesHealthy
  Phase:                   running

 

Milestone 2: Cluster API Objects are Created.

The TKG Controller has reconciled the TKC resource and created the CAPI/CAPW objects (discussed in more detail in blog/video 2 of this series).  These objects are all created at once, and you can list the most useful ones with kubectl get machines,wcpmachines.  Since the creation of these resources doesn't involve any underlying infrastructure, it is difficult to come up with a scenario where this fails other than an issue with the TKG controller itself.  (Note that reconciliation of these resources after creation most definitely involves the infrastructure.)  You can view the controllers with kubectl get pods -n vmware-system-tkg and the logs with kubectl logs "controller name" -n vmware-system-tkg -c manager.  Note that most controllers run as ReplicaSets with three replicas.  Log information is only written to the leader, so you may have to try the log command against all three pods.
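For example, to list the Cluster API resources for a cluster in one command (this assumes your kubectl context is already set to the vSphere Namespace, as with the other commands in this post):

kubectl get clusters,machines,wcpmachines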

kubectl get pods -n vmware-system-tkg
NAME                                                    READY   STATUS    RESTARTS   AGE
vmware-system-tkg-controller-manager-668565499-5qkkc   2/2     Running   16         13d
vmware-system-tkg-controller-manager-668565499-k56xv   2/2     Running   1          3d4h
vmware-system-tkg-controller-manager-668565499-wqs6q   2/2     Running   8          8d

kubectl logs vmware-system-tkg-controller-manager-668565499-k56xv -n vmware-system-tkg -c manager
I0809 04:34:14.022061       1 main.go:72] entrypoint "msg"="Starting guest cluster controller"  "buildnumber"="18025172" "buildtype"="dev" "version"="1.4-129-g3540131"
I0809 04:34:14.022530       1 main.go:155] entrypoint "msg"="creating manager"  "buildNumber"="18025172" "buildType"="dev" "options"={"podNamespace":"vmware-system-tkg","podName":"vmware-system-tkg-controller-manager","metricsAddr":"127.0.0.1:8087","cloudProvider":"external","guestClusterClientTimeout":10000000000,"nodeRemediationTimeout":3600000000000,"maxConcurrentReconciles":3,"syncPeriod":600000000000,"leaderElectionEnabled":true} "version"="1.4-129-g3540131"
I0809 04:34:15.889234       1 listener.go:44] controller-runtime/metrics "msg"="metrics server is starting to listen"  "addr"="127.0.0.1:8087"
I0809 04:34:15.889887       1 manager.go:83]  "msg"="Feature state switch config"  "WCP_GuestCluster_Antrea"="true" "WCP_GuestCluster_MM"="true" "WCP_GuestCluster_NCP"="false" "WCP_GuestCluster_Update"="true"
I0809 04:34:15.890009       1 guest_cluster_controller.go:111] vmware-system-tkg-controller-manager/addons-controller "msg"="Adding watch on TanzuKubernetesCluster objects"  
I0809 04:34:15.890770       1 guest_cluster_controller.go:122] vmware-system-tkg-controller-manager/addons-controller "msg"="Adding watch on CAPI Cluster resources"  
I0809 04:34:15.891039       1 guest_cluster_controller.go:111] vmware-system-tkg-controller-manager/authsvc-controller "msg"="Adding watch on TanzuKubernetesCluster objects"  
I0809 04:34:15.891055       1 guest_cluster_controller.go:122] vmware-system-tkg-controller-manager/authsvc-controller "msg"="Adding watch on CAPI Cluster resources"  
I0809 04:34:15.891134       1 authsvc_controller.go:87] vmware-system-tkg-controller-manager "msg"="Adding watch on ConfigMap objects"  
I0809 04:34:15.891700       1 authsvc_controller.go:127] vmware-system-tkg-controller-manager "msg"="Finished adding watches"  
I0809 04:34:15.891730       1 guest_cluster_controller.go:111] vmware-system-tkg-controller-manager/defaultpsp-controller "msg"="Adding watch on TanzuKubernetesCluster objects"  
I0809 04:34:15.891744       1 guest_cluster_controller.go:122] vmware-system-tkg-controller-manager/defaultpsp-controller "msg"="Adding watch on CAPI Cluster resources"  
I0809 04:34:15.891850       1 tanzukubernetescluster_controller.go:219] vmware-system-tkg-controller-manager/tanzukubernetescluster-spec-controller "msg"="Adding watch on TanzuKubernetesCluster objects"  
I0809 04:34:15.891944       1 tanzukubernetescluster_controller.go:86] vmware-system-tkg-controller-manager/tanzukubernetescluster-status-controller "msg"="Adding watch on TanzuKubernetesCluster objects"  
I0809 04:34:15.892084       1 guest_cluster_controll

Milestone 3: First Control Plane Node Virtual Machine Created  

The creation of the Virtual Machines that make up the cluster nodes begins with the creation of the first control plane node.  CAPW creates a single VirtualMachine resource, and the VM Operator (VMop) controller reconciles it into an actual virtual machine.  Run kubectl get vm to see VirtualMachine resources.  The VMop controller must be able to authenticate to vCenter and have access to the appropriate resources.  Run kubectl get pods -n vmware-system-vmop to see the VMop controller pods.  Errors in the creation of the VM will show up in the VMop logs: kubectl logs "controller name" -n vmware-system-vmop -c manager.  Some additional resource and event checks are sketched after the log excerpts below.

kubectl get vm
NAME                                         POWERSTATE   AGE
tkg-cluster-control-plane-xmst5              poweredOn    12d

kubectl get pods -n vmware-system-vmop
NAME                                                     READY   STATUS         RESTARTS   AGE
vmware-system-vmop-controller-manager-7578487c6f-n4g5q   2/2     Running        10         8d
vmware-system-vmop-controller-manager-7578487c6f-qhmq7   2/2     Running        2          3d4h
vmware-system-vmop-controller-manager-7578487c6f-vc457   2/2     Running        18         13d

kubectl logs vmware-system-vmop-controller-manager-7578487c6f-qhmq7 -n vmware-system-vmop -c manager
I0808 16:40:42.823695       1 main.go:113] entrypoint "msg"="Starting VM Operator controller"  "buildnumber"="18066634" "buildtype"="dev" "commit"="46b9cb0" "version"="1.4-159-g46b9cb0"
I0808 16:40:42.824403       1 main.go:230] entrypoint "msg"="Configuring rate limiter"  "QPS"=500 "burst"=1000
I0808 16:40:42.824429       1 main.go:238] entrypoint "msg"="Profiler listening for requests"  "profiler-address"=":8073"
I0808 16:40:42.824505       1 main.go:244] entrypoint "msg"="wait for webhook certificates"  
I0808 16:40:42.829899       1 main.go:258] entrypoint "msg"="creating controller manager"  
I0808 16:40:43.228966       1 listener.go:44] controller-runtime/metrics "msg"="metrics server is starting to listen"  "addr"="127.0.0.1:8083"
I0808 16:40:43.231290       1 server.go:121] controller-runtime/webhook "msg"="registering webhook"  "path"="/default-validate-vmoperator-vmware-com-v1alpha1-virtualmachine"
I0808 16:40:43.231403       1 server.go:121] controller-runtime/webhook "msg"="registering webhook"  "path"="/default-mutate-vmoperator-vmware-com-v1alpha1-virtualmachine"
I0808 16:40:43.231513       1 server.go:121] controller-runtime/webhook "msg"="registering webhook"  "path"="/default-validate-vmoperator-vmware-com-v1alpha1-virtualmachineclass"
I0808 16:40:43.231560       1 server.go:121] controller-runtime/webhook "msg"="registering webhook"  "path"="/default-mutate-vmoperator-vmware-com-v1alpha1-virtualmachineclass"
I0808 16:40:43.231991       1 server.go:121] controller-runtime/webhook "msg"="registering webhook"  "path"="/default-validate-vmoperator-vmware-com-v1alpha1-virtualmachineservice"
I0808 16:40:43.232138       1 server.go:121] controller-runtime/webhook "msg"="registering webhook"  "path"="/default-mutate-vmoperator-vmware-com-v1alpha1-virtualmachineservice"
I0808 16:40:43.232245       1 server.go:121] controller-runtime/webhook "msg"="registering webhook"  "path"="/default-validate-vmoperator-vmware-com-v1alpha1-virtualmachinesetresourcepolicy"
I0808 16:40:43.232288       1 server.go:121] controller-runtime/webhook "msg"="registering webhook"  "path"="/default-mutate-vmoperator-vmware-com-v1alpha1-virtualmachinesetresourcepolicy"
I0808 16:40:43.232311       1 main.go:267] entrypoint "msg"="starting controller manager"  
I0808 16:40:43.234301       1 internal.go:356] controller-runtime/manager "msg"="starting metrics server"  "path"="/metrics"
I0808 16:40:43.235251       1 server.go:141] controller-runtime/webhook/webhooks "msg"="starting webhook server"  
I0808 16:40:43.236334       1 certwatcher.go:127] controller-runtime/certwatcher "msg"="Updated current TLS certificate"  
I0808 16:40:43.236573       1 server.go:199] controller-runtime/webhook "msg"="serving webhook server"  "host"="" "port"=9878
I0808 16:40:43.309823       1 certwatcher.go:83] controller-runtime/certwatcher "msg"="Starting certificate watcher"  
I0808 16:40:43.309976       1 leaderelection.go:242] attempting to acquire leader lease  vmware-system-vmop/vmware-system-vmop-controller-manager-runtime...
kubectl logs vmware-system-vmop-controller-manager-7578487c6f-vc457 -n vmware-system-vmop -c manager |more
I0807 09:15:36.988076       1 main.go:113] entrypoint "msg"="Starting VM Operator controller"  "buildnumber"="18066634" "buildtype"="dev" "commit"="46b9cb0" "version"="1.4-159-g46b9cb0"
I0807 09:15:36.989239       1 main.go:230] entrypoint "msg"="Configuring rate limiter"  "QPS"=500 "burst"=1000
I0807 09:15:36.989286       1 main.go:238] entrypoint "msg"="Profiler listening for requests"  "profiler-address"=":8073"
I0807 09:15:36.989323       1 main.go:244] entrypoint "msg"="wait for webhook certificates"  
I0807 09:15:36.989392       1 main.go:258] entrypoint "msg"="creating controller manager"  
I0807 09:15:37.190415       1 listener.go:44] controller-runtime/metrics "msg"="metrics server is starting to listen"  "addr"="127.0.0.1:8083"
I0807 09:15:37.191729       1 server.go:121] controller-runtime/webhook "msg"="registering webhook"  "path"="/default-validate-vmoperator-vmware-com-v1alpha1-virtualmachine"
I0807 09:15:37.191896       1 server.go:121] controller-runtime/webhook "msg"="registering webhook"  "path"="/default-mutate-vmoperator-vmware-com-v1alpha1-virtualmachine"
I0807 09:15:37.192189       1 server.go:121] controller-runtime/webhook "msg"="registering webhook"  "path"="/default-validate-vmoperator-vmware-com-v1alpha1-virtualmachineclass"
I0807 09:15:37.192264       1 server.go:121] controller-runtime/webhook "msg"="registering webhook"  "path"="/default-mutate-vmoperator-vmware-com-v1alpha1-virtualmachineclass"
I0807 09:15:37.192406       1 server.go:121] controller-runtime/webhook "msg"="registering webhook"  "path"="/default-validate-vmoperator-vmware-com-v1alpha1-virtualmachineservice"
I0807 09:15:37.192478       1 server.go:121] controller-runtime/webhook "msg"="registering webhook"  "path"="/default-mutate-vmoperator-vmware-com-v1alpha1-virtualmachineservice"
I0807 09:15:37.192637       1 server.go:121] controller-runtime/webhook "msg"="registering webhook"  "path"="/default-validate-vmoperator-vmware-com-v1alpha1-virtualmachinesetresourcepolicy"
I0807 09:15:37.192706       1 server.go:121] controller-runtime/webhook "msg"="registering webhook"  "path"="/default-mutate-vmoperator-vmware-com-v1alpha1-virtualmachinesetresourcepolicy"
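If the controller logs don't point to a root cause, the VirtualMachine resource itself and the namespace events often carry the underlying vCenter error.  A sketch of those checks, using the VM name from the kubectl get vm output above and assuming your context is set to the vSphere Namespace:

kubectl describe vm tkg-cluster-control-plane-xmst5
kubectl get events --sort-by=.lastTimestamp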

Milestone 4:  First Control Plane Node Powered On.  

The TKC Status field or kubectl get vm will show the power state.  Failure to power on the VM is usually a resource issue.  That issue can be at the vCenter level or in resource limits placed on the vSphere Namespace.  You might also need to check HA Admission Control to see if failover resources have impacted VM placement or power-on.  If you don't see a root cause in the TKC or VM resources, check the VMop controller logs.  A couple of quick resource checks are sketched after the output below.

kubectl get vm
NAME                                         POWERSTATE   AGE
tkg-cluster-control-plane-xmst5              poweredOn    12d
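As a starting point for the namespace-limit case, you can look at the resource quota applied to the vSphere Namespace and re-check the TKC status for the surfaced error (again assuming your context is set to that namespace; the cluster name is the one used throughout this post):

kubectl describe resourcequota        # CPU/memory/storage limits placed on the namespace
kubectl describe tkc tkg-cluster      # the TKC status may also surface the power-on error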

Milestone 5: First Control Plane Node Gets IP Address.  

You will want to check that the cluster has been assigned a Virtual IP (VIP) that provides ingress to the cluster through the load balancer, and that the control plane node itself has an IP.  kubectl get services shows the internal cluster IP and the load balancer VIP.  kubectl get endpoints will show the control plane node endpoints that the load balancer will route to.  IPs assigned in both outputs show that network assignment is working properly.  (Note that this does not necessarily mean the underlying networks are routing properly.)  A quick reachability check is sketched after the output below.

kubectl get services
NAME                                TYPE           CLUSTER-IP    EXTERNAL-IP       PORT(S)          AGE
tkg-cluster-control-plane-service   LoadBalancer   10.96.0.251   192.168.130.130   6443:30431/TCP   12d

kubectl get endpoints
NAME                                ENDPOINTS   AGE
tkg-cluster-control-plane-service   192.168.120.9      12d
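One quick way to confirm that the VIP actually routes to the control plane is to hit the Kubernetes API port from a machine that can reach the load balancer network.  An unauthorized (HTTP 403) response is fine here; any TLS answer from the API server means the path through the load balancer works.  The address below is the EXTERNAL-IP from the service output above.

curl -k https://192.168.130.130:6443/version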

 

Milestone 6: First Control Plane Kubernetes Bootstraps Successfully

Until this milestone, the work has been all about getting the VM placed onto storage, powered on and connected to the appropriate networks.  Here the focus is bootstrapping the Kubernetes components within the VM.  As described in more detail in video 2 of the series, the kubeadm configuration is delivered through cloud-init to drive this bootstrap process.  Errors happening here are generally related to connectivity or failures in one of the controllers and are more difficult to troubleshoot.  The starting point is to ssh into the cluster node (those steps are in the documentation and in the video for this blog).  tail /var/log/cloud-init.log |grep SUCCESS will show whether cloud-init finished successfully.  You can also determine whether the Kubernetes processes are running with ps -e |grep kube.  Finally, the Kubernetes agent, kubelet, logs to the system journal, which you can read with sudo journalctl -u kubelet.  A consolidated set of these node-level checks is sketched after the cloud-init output below.

tail /var/log/cloud-init.log |grep SUCCESS 
2021-07-27 17:35:49,124 - handlers.py[DEBUG]: finish: modules-final/config-power-state-change: SUCCESS: config-power-state-change ran successfully
2021-07-27 17:35:49,169 - handlers.py[DEBUG]: finish: modules-final: SUCCESS: running modules for final
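Putting those node-level checks together, a minimal sequence on the control plane node looks like this (run after ssh'ing in as described in the documentation; the journalctl and tail options are standard flags added only to keep the output readable):

tail /var/log/cloud-init.log | grep SUCCESS          # did cloud-init complete its final modules?
ps -e | grep kube                                    # are kubelet, kube-apiserver, etc. running?
sudo journalctl -u kubelet --no-pager | tail -n 50   # most recent kubelet log entries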

 

Milestone 7: All Nodes Created and Bootstrapped Successfully

I've consolidated a few steps here.  Once Kubernetes is running in the node, post-configuration add-ons are applied.  The add-on status is part of the Status/Conditions section of the TKC resource and can include things like CNI, CSI, metrics, logging, etc.  The rest of the cluster nodes go through a similar creation and bootstrap process as the first control plane node.  As nodes are created, they are bootstrapped and joined to the cluster.  Their status is reflected into the TKC, but is also available in their respective machine/wcpmachine resources.  Once this process completes, the TKC will show the cluster PHASE as Running, and you can log in to the cluster (an example login command follows the status output below).

Status:
  Addons:
    Conditions:
      Last Transition Time:  2021-07-27T17:35:34Z
      Status:                True
      Type:                  Provisioned
    Name:                    CoreDNS
    Type:                    DNS
    Version:                 v1.7.0_vmware.7
    Conditions:
      Last Transition Time:  2021-07-27T17:35:35Z
      Status:                True
      Type:                  Provisioned
    Name:                    kube-proxy
    Type:                    Proxy
    Version:                 1.20.2+vmware.1
    Conditions:
      Last Transition Time:  2021-07-27T17:45:21Z
      Status:                True
      Type:                  Provisioned
    Name:                    defaultpsp
    Type:                    PSP
    Version:                 v1.20.2+vmware.1-tkg.2.3e10706
    Conditions:
      Last Transition Time:  2021-07-29T14:58:08Z
      Status:                True
      Type:                  Provisioned
    Name:                    antrea
    Type:                    CNI
    Version:                 v0.11.3_vmware.1
    Conditions:
      Last Transition Time:  2021-07-29T14:58:09Z
      Status:                True
      Type:                  Provisioned
    Name:                    pvcsi
    Type:                    CSI
    Version:                 vsphere70u2-dade960-7edda53
    Conditions:
      Last Transition Time:  2021-07-29T14:58:08Z

 

The following video will walk through some of the material in this blog and look at the resources in a live environment.

Tanzu Kubernetes Grid (TKG) Troubleshooting Deep Dive Part 3 Video

 

Quick Links to Entire Troubleshooting Blog/Video Series.

Abstract: Define Kubernetes custom resources and Cluster API, and show how they are used to lifecycle TKG clusters.  Focus on how that impacts troubleshooting.

Blog: Troubleshooting Tanzu Kubernetes Grid Clusters - Part 1

Video: Troubleshooting Tanzu Kubernetes Grid Clusters - Part 1

Abstract: Show how the TKC is decomposed into a set of resources that are an implementation of Cluster API.  Focus on how that impacts troubleshooting

Blog: Troubleshooting Tanzu Kubernetes Grid Clusters - Part 2

Video: Troubleshooting Tanzu Kubernetes Grid Clusters - Part 2

Abstract: Walk through the cluster creation milestones and identify common failures and remediations at each level.

Blog: Troubleshooting Tanzu Kubernetes Grid Clusters - Part 3

Video: Troubleshooting Tanzu Kubernetes Grid Clusters - Part 3

Special Thanks to Winnie Kwon, VMware Senior Engineer. Her engineering documents and willingness to answer many questions were the basis for the creation of this series of blogs/videos.
