vSphere with Tanzu - Highly Available Kubernetes
First Publication: Dec 03, 2020
By Michael West
Technical Product Manager, VMware
December 2020
vSphere with Tanzu delivers on-demand, production-ready Kubernetes as part of vSphere. That integration provides highly available and resilient Kubernetes clusters to DevOps teams by leveraging capabilities built into vSphere and Kubernetes itself. Enabling vSphere with Tanzu on a vSphere cluster creates a Kubernetes control plane and turns the ESXi hosts into Kubernetes Nodes capable of running pods directly. We call this a Supervisor Cluster and view it as an enabling technology for layering both VMware and third-party Operators into cloud services directly consumable by DevOps teams.
The Tanzu Kubernetes Grid Service, or TKG, is implemented as a set of Controllers and Custom Resources deployed on the Supervisor Cluster to provide lifecycle management of TKG clusters defined through DevOps specifications. TKG clusters are upstream-aligned, fully conformant Kubernetes clusters that can be deployed directly by DevOps teams through a simple declarative specification. It's important to ensure that both Supervisor and TKG clusters are resilient against infrastructure failures and other unplanned outages in the system. Let's look at the elements of the cluster implementations that increase this resiliency.
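Before looking at those elements, it helps to see what such a declarative specification looks like in practice. The fragment below is a minimal TanzuKubernetesCluster manifest of the kind a DevOps engineer might apply against the Supervisor Cluster with kubectl; the cluster name, namespace, VM class, storage class and version string are placeholders that will vary by environment.

  apiVersion: run.tanzu.vmware.com/v1alpha1
  kind: TanzuKubernetesCluster
  metadata:
    name: tkg-cluster-01              # placeholder cluster name
    namespace: dev-namespace          # a Supervisor Namespace created by the vSphere admin
  spec:
    distribution:
      version: v1.18                  # Tanzu Kubernetes release to deploy
    topology:
      controlPlane:
        count: 3                      # three control plane nodes for availability
        class: best-effort-small      # VM class; depends on what is published to the namespace
        storageClass: vsan-default-storage-policy   # placeholder storage class
      workers:
        count: 3
        class: best-effort-small
        storageClass: vsan-default-storage-policy

The TKG Service controllers reconcile this specification into control plane and worker node VMs, and later changes to the spec are rolled out in the same declarative way.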
vSphere with Tanzu High Availability
Let’s start by looking generically at Public Cloud solutions, as they offer a template for providing infrastructure and critical service isolation. Public Cloud resources are hosted in geographic regions that are segmented into Availability Zones (AZ) that provide a set of isolated physical infrastructure and services. Resources can be regional or AZ-based; however, most are contained in a single availability zone. For example, the Virtual Machine nodes that make up a Kubernetes cluster might be AZ-specific and would only be able to attach a persistent volume that was in the same zone. Static IP addresses or VM Images might be regional and available across zones. AZs provide a layer of abstraction over the physical infrastructure supporting them. That infrastructure is designed to ensure isolation from most physical and software infrastructure failures. Most AZs are designed with their own power, cooling, networking and control planes, isolated from other zones.
vSphere resources are hosted on physical infrastructure that can be isolated into vSphere Clusters and managed by vCenter. For the purpose of comparison with Public Cloud, we can think of a vSphere Cluster as an Availability Zone (AZ). ESXi hosts are grouped into physical datacenter racks, connected together - and to other racks - via a series of network switches. Organizations make decisions about whether to align racks to vSphere clusters or to spread a cluster’s hosts across racks. One-to-one mapping between clusters and racks provides isolation and low-latency communication between apps running in the cluster, at the possible expense of availability in the case of a switch failure.
vSphere with Tanzu currently supports the deployment of Supervisor and Tanzu Kubernetes Grid (TKG) clusters into a single vSphere cluster. Think of this as a single-AZ deployment. A single-AZ deployment does not mean giving up high availability. As described below, vSphere provides a set of capabilities that promote the availability of TKG clusters running in a single AZ.
Let’s look at the capabilities in the platform that support High Availability, then define a set of failure scenarios within a vSphere with Tanzu deployment and break out the approaches for increasing availability in each.
Cluster Multi-Node Control Plane with Host Anti-Affinity and HA
The Supervisor Cluster provides the Kubernetes API for deploying both vSphere Pods and TKG clusters. It is a system service that must be available for the operation of TKG clusters and of Services deployed as Kubernetes Operators. To ensure availability, we have implemented a multi-controller configuration. Both Supervisor and TKG cluster Controllers are configured in a stacked deployment, with the Kube-API server and etcd available on each VM. Each Control Plane is fronted by a Load Balancer that routes traffic to the Kube-API servers on each controller node. Scheduling of Control Plane nodes is done through VMware vSphere® Distributed Resource Scheduler™ (DRS). In vSphere 7.0, DRS is faster and lighter, ensuring rapid placement across hosts and using a soft anti-affinity compute policy to separate the Controllers onto different hosts when possible. Anti-affinity ensures that the failure of a single host takes down at most one controller, so the cluster remains available.
Further, vSphere HA provides high availability for failed controllers by monitoring for failure events and powering the VMs back on. In the event of a host failure, the VMs – both Supervisor and TKG nodes – are restarted on another available host, dramatically reducing the time the still-available cluster runs with diminished capacity. Supervisor Controllers are lifecycle managed by the vSphere ESX Agent Manager (EAM) and cannot easily be powered off by Admins; however, if they are powered off, they will immediately be restarted on the same host.
Cluster System Pods and Daemons
Systemd is the init system for the Photon OS Linux distribution that powers the Supervisor and TKG Cluster nodes. Systemd provides process monitoring and will automatically restart any critical daemon that fails, such as the Kubernetes kubelet.
Kubernetes System Pods, VMware Custom Resource Pods and Controllers are deployed as ReplicaSets (or DaemonSets in the context of TKG clusters), watched by Kubernetes controllers and recreated in the case of failure, with the kubelet and containerd restarting individual failed containers. The processes and pods that make up the Supervisor and TKG clusters are most often deployed with multiple replicas, so they continue to be available even while a failed Pod is being reconciled. The use of DaemonSets versus ReplicaSets differs between Supervisor and TKG clusters, but the availability of the individual system pods is similar; the sketch below illustrates the DaemonSet pattern.
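The same self-healing pattern is available to any workload. The fragment below is a generic DaemonSet sketch, not one of the VMware-managed manifests, showing how the controller keeps one copy of a pod on every node and recreates it if the pod fails or a node is replaced; the names and image are illustrative only.

  apiVersion: apps/v1
  kind: DaemonSet
  metadata:
    name: example-node-agent          # hypothetical agent, for illustration only
    namespace: kube-system
  spec:
    selector:
      matchLabels:
        app: example-node-agent
    template:
      metadata:
        labels:
          app: example-node-agent
      spec:
        tolerations:
        - key: node-role.kubernetes.io/master   # also schedule onto control plane nodes
          effect: NoSchedule
        containers:
        - name: agent
          image: registry.example.com/node-agent:1.0   # placeholder image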
Supervisor and TKG Cluster Controller Load Balancer
As previously mentioned, Supervisor and TKG clusters are deployed as a set of three Controllers. In the case of TKG clusters, developers can choose a single-controller deployment if resources are constrained. Access to the Kube-API is directed to a Load Balancer Virtual IP and is distributed across the available API endpoints, so the loss of a Control Plane node does not impact the availability of the cluster. NSX-based Load Balancers run in Edge Nodes that can be deployed in an Active/Active configuration, ensuring the availability of the Virtual IP that Developers will use to access the clusters.
Cluster Storage Availability
Customers have the option of enabling VMware vSAN to provide highly available shared storage to both the Supervisor and TKG clusters. vSAN combines storage that is local to the hosts into a virtual storage array that can be configured to meet the redundancy needs of the environment. RAID levels are implemented across the hosts and provide availability in cases where individual hosts fail. vSAN is native to the ESXi hypervisor, so it does not run as a set of storage appliances that might compete with Kubernetes cluster VMs for resources.
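A DevOps user consumes that redundancy simply by requesting a PersistentVolumeClaim against a StorageClass that the vSphere administrator has backed with a vSAN storage policy. The claim below is a sketch; the claim name and StorageClass name are placeholders that depend on the policies published to the namespace.

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: app-data                                   # placeholder claim name
  spec:
    accessModes:
    - ReadWriteOnce
    storageClassName: vsan-default-storage-policy    # placeholder; surfaced from a vSAN storage policy
    resources:
      requests:
        storage: 20Gi

The RAID level and failure-tolerance settings live in the storage policy itself, so the same claim can be made more or less resilient without changing the application manifest.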
vSphere Pods Availability
Customers that need the isolation of a Virtual Machine from a security and performance perspective, while still wanting to orchestrate the workload via Kubernetes, can choose vSphere Pods. System and third-party Operators can also be deployed as vSphere Pods. They run in a container runtime Virtual Machine directly on ESXi hosts and behave just like any other pod. Availability is managed through standard Kubernetes objects and follows the Kubernetes pattern for handling pod failure: create a Deployment with multiple replicas, and the controller will watch for failed Pods and recreate them, as sketched below. Host failures would cause the vSphere Pods to be recreated on new hosts. This pattern does not use vSphere HA, but instead takes advantage of Kubernetes controllers to handle failures.
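As a sketch of that pattern, the Deployment below asks for three replicas; the controller continuously reconciles toward that count, so a failed vSphere Pod, or one lost along with its host, is simply recreated on a remaining node. Names, namespace and image are placeholders.

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: web-frontend                # placeholder name
    namespace: dev-namespace          # a Supervisor Namespace, for illustration
  spec:
    replicas: 3                       # the controller keeps three Pods running at all times
    selector:
      matchLabels:
        app: web-frontend
    template:
      metadata:
        labels:
          app: web-frontend
      spec:
        containers:
        - name: web
          image: registry.example.com/web:1.0   # placeholder image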
Infrastructure and Software Failure Scenarios
We have looked at the core capabilities within the platform. Now we will define a set of failure scenarios within a vSphere with Tanzu deployment and look at the approaches for increasing availability in each scenario.
Kubernetes Cluster component failure
- Kubernetes Node Process Failure
- Kubernetes System Pod Failure
- Virtual Machine Powered Off
When a controller or worker node becomes unavailable, any system pods are recreated on one of the other available nodes. Failed Linux processes are monitored and restarted by the systemd init system. This is true for TKG Control Plane and worker nodes as well as the Supervisor control plane. Supervisor Worker nodes are actual ESXi hosts, do not run systemd, and handle process failure differently. Virtual Machines that become powered off are automatically restarted. The control plane is deployed as a set of 3 controllers, with a Load Balancer Virtual IP as the Kube-API endpoint, to ensure that the cluster is still available during the time needed to remediate the outage. Remediation of TKG worker nodes is handled through integrated use of the MachineHealthCheck capability in Cluster API v1alpha3, sketched below.
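In vSphere with Tanzu the MachineHealthCheck objects are created and managed by the service itself, but the generic Cluster API v1alpha3 resource below sketches what such a check expresses: which machines to watch, how long a new node may take to start, and how long a node may stay unhealthy before remediation. Names and timeouts are illustrative only.

  apiVersion: cluster.x-k8s.io/v1alpha3
  kind: MachineHealthCheck
  metadata:
    name: tkg-cluster-01-worker-mhc   # illustrative name
    namespace: dev-namespace
  spec:
    clusterName: tkg-cluster-01
    nodeStartupTimeout: 10m           # allow new nodes this long to join before remediation
    selector:
      matchLabels:
        cluster.x-k8s.io/cluster-name: tkg-cluster-01
    unhealthyConditions:              # remediate when the Node Ready condition is lost
    - type: Ready
      status: Unknown
      timeout: 5m
    - type: Ready
      status: "False"
      timeout: 5m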
Host Outage
- Host Failure
- Host Restart
Unavailability of an ESXi host may be a short reboot or may stem from an actual hardware failure. On reboot, all Controller and Node VMs are automatically restarted on the host. If the reboot takes longer than the HA heartbeat timeout threshold, the VMs are scheduled for placement on other available hosts and powered on. Anti-affinity rules will be maintained if resources allow it. In either case, the clusters are still available because they have multiple controller nodes placed on separate ESXi hosts, are accessed via a load balancer, and can survive the loss of one controller.
Rack Outage
- Top of Rack switch Failure
Hosts are organized into physical racks in the datacenter. Hosts in a rack are connected via a Top of Rack switch and further connected to additional aggregation switches. Loss of a switch, storage array or power could make an entire rack unavailable. Availability of a Supervisor or TKG cluster is dependent upon how hosts are aligned with racks. Some customers will place all hosts in a vSphere Cluster in the same rack, essentially making that physical rack an Availability Zone (AZ). Rack failure would make the Supervisor Cluster and all TKG clusters unavailable. Another option is to create vSphere Clusters from hosts that reside in multiple racks. This configuration ensures that a single rack failure does not make all hosts in the vSphere Cluster unavailable and increases the likelihood that the control plane remains available.
The Distributed Resource Scheduler (DRS) is not currently rack aware. This means that anti-affinity rules for Control Plane VMs would ensure host separation, but not necessarily ensure that those hosts are in separate racks. Work is in progress to provide rack-aware scheduling that would improve this availability.
Storage Array Failure
Supervisor and TKG clusters require shared storage so that persistent volumes are accessible to Pods that might be placed on any node – which could reside on any host. Both vSAN and traditional storage arrays support various RAID configurations, which ensure the availability of data in the case of failure or corruption of individual disks. vSAN aggregates local storage attached to the physical hosts and can remain available through a host failure or, depending on configuration, even a rack failure.
Network Failure
- NSX Edge Failure
Physical redundancy is built into most production network environments, providing availability through physical hardware failures. When deployed with NSX-T, Supervisor and TKG clusters run on overlay networks that are supported by the underlying physical network. NSX Edge VMs host the Load Balancer VIPs and route traffic to the cluster control plane nodes. NSX Edges can be deployed in an Active/Active configuration that keeps the network available in the case of a Host or VM failure that impacts a single NSX Edge.
vCenter Outage
vCenter Server availability is critical to the fully functional operation of Supervisor and TKG clusters. In the case of a vCenter failure, all Virtual Machines, both controllers and nodes, will continue to run and will be accessible, albeit with degraded functionality. Application Pods would continue to run, but deployment and management of the system would be limited. Authentication requires the Single Sign-On service to return an auth token to the kubectl config file, so only users with a non-expired token would be able to access the system. Objects like Load Balancers and PersistentVolumeClaims require interaction with vCenter and could not be lifecycle managed.
vCenter can be protected with vCenter High Availability. vCenter High Availability (vCenter HA) protects not only against host and hardware failures but also against vCenter Server application failures. Using automated failover from active to passive, vCenter HA supports high availability with minimal downtime. This would dramatically increase the availability of the Supervisor and TKG clusters.
There is extensive work in progress to decouple core services from vCenter Server and push them down to the cluster level. The goal is to provide always-on infrastructure that continues to function even when vCenter has an outage.