Ask any VMware customer what they value about being able to virtualize workloads in vSphere and security will always be high on the list: everything from the robustness of ESXi to resource isolation, hardware attestation, backups, encryption, quality of service, fault tolerance, anomaly detection and more. Security is a broad topic that is critical to VMware's success in gaining the trust of our customers.
So, when we set out to design Kubernetes integration with vSphere and ESXi, it was always going to be more than just a fancy interface to a virtual K8S cluster. There are obvious conceptual similarities between the vSphere and Kubernetes control planes, but also enough distinct strengths that we felt they were worth combining into something that gives you the best of both worlds. The outcome of this is the concept of the “Supervisor” K8S cluster that can be enabled on a vSphere 7 ESXi cluster.
This blog post examines the concept and the role of the Supervisor in vSphere and explores the inherent differences between ESXi and Linux when it comes to securing applications. It goes on to examine how the inherent security posture of ESXi can be exploited with hardware-isolated “vSphere Pods” and finishes with a look at the challenges this presents to Kubernetes conformance testing.
The Supervisor cluster is a privileged K8S cluster managed by vSphere that significantly augments the capabilities of the vSphere cluster.
The high-level benefits can be summarized as:
- The ability to apply the Kubernetes desired state model with controllers and CRDs natively to vSphere compute, storage and networking
- Provision of Kubernetes-as-a-Service in which virtualized Tanzu Kubernetes Grid (TKG) clusters can be created on-demand by defining the cluster specification as a custom resource.
- Paravirtual capabilities that enhance the integration with TKG Kubernetes clusters such as single sign-on, RBAC, persistent volumes, load balancing and high availability.
- The ability to manage ESXi hosts as if they are Kubernetes nodes and therefore deploy Kubernetes applications directly to vSphere using hardware-isolated pods.
While these benefits can be succinctly expressed, the significance of the capabilities they open up is beyond the scope of this blog post. The lifecycle management of Kubernetes clusters using ClusterAPI CRDs allows vSphere administrators to offer self-service Kubernetes to their development teams. Creation, expansion, upgrade and deletion of TKG Kubernetes clusters can be done by simply creating or modifying a YAML object using kubectl talking to vSphere, vastly simplifying lifecycle management of Kubernetes on vSphere.
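As a rough illustration of what "creating or modifying a YAML object" can look like, the sketch below builds a cluster specification as a plain dictionary and expands the cluster by changing a single field. The apiVersion, kind and field names are assumed here rather than taken from this post, and the class, storage class, version and namespace values are placeholders, so check them against your own environment. kubectl accepts JSON as well as YAML, so the printed output could be piped to `kubectl apply -f -`.

```python
import copy
import json


def tkg_cluster(name, namespace, workers):
    """Build a cluster custom resource as a plain dict.

    The schema (apiVersion, kind, field names) is an assumption, and the
    class, storage class and version values are placeholders.
    """
    node = {"class": "best-effort-small", "storageClass": "example-storage-policy"}
    return {
        "apiVersion": "run.tanzu.vmware.com/v1alpha1",
        "kind": "TanzuKubernetesCluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "distribution": {"version": "v1.18"},
            "topology": {
                "controlPlane": dict(node, count=1),
                "workers": dict(node, count=workers),
            },
        },
    }


def scale_workers(cluster, workers):
    """Return a copy of the resource with a new desired worker count."""
    scaled = copy.deepcopy(cluster)
    scaled["spec"]["topology"]["workers"]["count"] = workers
    return scaled


if __name__ == "__main__":
    cluster = tkg_cluster("dev-cluster", "dev-team", workers=3)
    # Re-applying the modified object is what requests the expansion.
    print(json.dumps(scale_workers(cluster, 5), indent=2))
```

The point of the sketch is the desired state model: nothing is "executed" to add nodes, the specification simply changes and the controllers reconcile towards it.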
It is no exaggeration to say that Kubernetes is making vSphere significantly more extensible and as a company we have only just started to take advantage of this opportunity.
The purpose of this blog post, however, is to focus on the fourth benefit listed above, specifically as it relates to security and conformance.
Securing Modern Workloads
The task of securing a VM is well-understood at this point, although doing it effectively still requires intentional work at every layer of the stack. In particular given that each VM runs a full operating system, the attack surface is larger than is ideal for most applications. Indeed, this is one of the motivations behind the perennial interest in Unikernels.
While containers theoretically promise “just enough OS” both by convention and design to allow DevOps engineers to avoid creating snowflake appliances, in reality they simply punt the problem to the configuration and security of the node running the containers, which can itself end up becoming the snowflake. Arguably, the illusion of isolation that containers present can lead to complacency, opening up privilege-escalation attack vectors that can quickly compromise entire clusters.
That’s not to say that containers cannot be adequately secured. Just refer to the hundreds of pages of CIS security benchmarks that have been written on the topic. No, it’s not that there’s a fundamental design problem, but rather there is undeniably a complexity problem.
Complexity is the enemy of security
When you go on vacation and secure your home, you take comfort in the simplicity of knowing your doors are locked and the alarm is set. If every window had 10 different mechanisms, all of which had to be correctly oriented in order to prevent ingress, you’d be forgiven for having nagging doubts as you rush out of the door.
The more complex our software becomes, the more we tend to create abstractions and layers to make it easier to reason about. The complexity hasn’t gone away, but rather gets pushed to one side where it stagnates until we're forced to investigate it when something breaks. Default configurations in particular develop their own kind of pervasive inertia, persisting long after anyone can remember why they were ever chosen. So, if the layer we’re installing or configuring is not secure by default, the onus is squarely on the installer to not only understand it but do all the extra work necessary to get it to function securely.
This is actually a strong argument for automation. If the configuration and lifecycle management of your infrastructure is automated, it’s more likely to be homogeneous and easier to patch. Laborious and error-prone tasks such as certificate management can also be simplified.
That being said, your automation is only as good as the security of the software and configuration it’s applying. When you’re dealing with something as complex as virtualized Kubernetes, you need to trust that it’s going to be secure by default and that your automation will hide some of that complexity.
We can break down the security of Kubernetes, much like we can with vSphere, into 3 domains:
- Node security
- Control plane security
- Application security
I mainly want to focus in this post on Node Security, since this is where some of the most interesting distinctions lie. What I’m interested in exploring is where Kubernetes and vSphere differ in their approaches to security, examples where we’ve used our experience to meld the two and how you can take some of those opinionated “secure by design” approaches and apply them to your own Kubernetes environment.
The node is a critical isolation domain in any cluster. It should guarantee isolation to the containers or VMs running in it, while at the same time isolating itself from other nodes in the system and importantly from the control plane it depends on. This is a lot of responsibility for the humble node!
The vSphere approach to node security differs significantly from the approach taken by Kubernetes. ESXi is not a general-purpose OS and has a “share nothing” approach to virtualization. There is no shared filesystem, no shared TCP stack, there is no shared buffer cache, no shared kernel and no shared secrets. No user software runs on the host other than VMs, which can be thought of as opaque processes. It goes without saying that hardware virtualization provides a high level of isolation between VMs themselves.
A Kubernetes node, on the other hand, shares a lot with the containers it runs, both in ways that are inherent to the design of a container and in optional ways that can give the container privileged access to the host and/or vice versa. It is this flexibility, and the need to understand the impact of a node’s configuration, that places significant responsibility in the hands of the K8S Cluster Admin to secure the cluster.
ESXi as a Kubernetes Node
There have been various efforts and initiatives over the years to deliver hardware-isolated containers and VMware was the first to market with a Docker API for vSphere that ran containers as VMs. No matter how lightweight those VMs are however, these implementations miss one significant use-case: The ability to run multiple weakly-isolated containers together in order to deliver a single function or service. This was solved by the Pod concept introduced by Kubernetes and as such, engineering focus at VMware shifted to the notion of the hardware-isolated pod.
I don’t want to get diverted on the specifics of the “CRX” native container implementation in ESXi, but suffice it to say that it was the motivation for thinking about whether it made sense for the Supervisor to be able to manage ESXi hosts as Kubernetes nodes. When we considered the opportunities around scheduler integration, this architectural approach made a lot of sense and that led us to the implementation of Spherelet, our Kubelet implementation for ESXi.
All of which is background to the topic of how ESXi can function as a Kubernetes node and how we took the “share nothing” architecture and made it work.
The storage on a Kubernetes node is assumed by design to be a shared filesystem that accomplishes multiple tasks. This includes:
- The configuration state for the operating system. For example, Kubelet expects /etc/resolv.conf to be there to configure DNS
- Configuration state for the control plane. Kubelet needs to be able to authenticate with the control plane and so the necessary keys, certificates and URL are on the filesystem by default
- Container images. This can be configured to be a separate disk if desired but cannot be a shared filesystem with another node.
- Container ephemeral state. The container engine on a Kubernetes node uses a layered filesystem such that writes to a container root filesystem are stored in a directory on the node.
- Volume mounts. There are some types of volume that allow containers to read and write directly to the host filesystem. ConfigMap and Secret volumes are read-only, whereas emptyDir and hostPath are read-write.
- Logging. Container logs, Kubelet logs, audit logs etc.
Given the comprehensive nature of the kinds of state stored on the node and given that this state can be mapped in and out of containers, it’s critical to consider the consequences of it being compromised.
This is one area where the “share nothing” approach forced by ESXi delivers far stronger isolation, but also forced some significant implementation differences to deliver the same function. Most notably:
- Each pod gets its own private ephemeral storage disk. We still use the same layered filesystem under the covers, but it is not readable by other pods or even the host. The same is true for emptyDir volumes. Containers within a pod can of course share the ephemeral storage.
- Container images are not stored on the node. Each image is a disk stored on a vSphere datastore and the disks are attached and mounted as necessary. Images cannot be pulled via a shell on the node, only via authenticated access to the vSphere control plane.
- Secret volumes are never stored on disk, not even the private ephemeral disk of the pod. They are pushed to RAMFS on the pod and as such are guaranteed to be ephemeral.
- ConfigMap volumes are pushed to the pod and persisted to the private ephemeral disk
- Containers log to the private ephemeral disk
- Spherelet logs to the private ESXi filesystem like any other ESXi system service
- HostPath volumes are disabled. There is no bind mounting or host filesystem, hence no such concept.
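The list above can be condensed into a small lookup. The sketch below is illustrative only: it takes a pod manifest as a plain dictionary and reports where each volume's data would live on an ESXi node according to the points above. The placement strings and the example pod are made up for the illustration.

```python
# Where each volume type's data lives on an ESXi node, per the list above.
PLACEMENT = {
    "secret": "RAM-backed filesystem inside the pod",
    "configMap": "private ephemeral disk of the pod",
    "emptyDir": "private ephemeral disk of the pod",
    "hostPath": None,  # disabled: there is no host filesystem to mount
}


def volume_placement(pod):
    """Map each volume name to where its data would live (None = unsupported).

    A Kubernetes volume is a name plus one type-specific key, so the type is
    taken to be the first key that is not "name". Types the list does not
    mention are reported as "not covered here".
    """
    result = {}
    for volume in pod["spec"].get("volumes", []):
        kinds = [key for key in volume if key != "name"]
        kind = kinds[0] if kinds else ""
        result[volume["name"]] = PLACEMENT.get(kind, "not covered here")
    return result


# Hypothetical pod used only to exercise the lookup.
example_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "example"},
    "spec": {
        "volumes": [
            {"name": "creds", "secret": {"secretName": "db-password"}},
            {"name": "scratch", "emptyDir": {}},
            {"name": "host-logs", "hostPath": {"path": "/var/log"}},
        ]
    },
}

for name, where in volume_placement(example_pod).items():
    print(name, "->", where or "unsupported")
```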
Note that the isolation not only delivers data privacy but also limits the impact of disk space or inodes being exhausted. Pod ephemeral storage in Kubernetes can be configured to be limited in size and that is a very good practice given the potential “noisy neighbor” impact of running the host out of disk space.
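As a minimal sketch of that practice, the function below scans a pod manifest (as a plain dictionary) for containers that set no `ephemeral-storage` limit. The pod, image names and the 1Gi value are hypothetical.

```python
def containers_without_storage_limit(pod):
    """Return the names of containers that set no ephemeral-storage limit."""
    missing = []
    for container in pod["spec"].get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        if "ephemeral-storage" not in limits:
            missing.append(container["name"])
    return missing


# Hypothetical pod: one container is limited, the other is not.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web"},
    "spec": {
        "containers": [
            {
                "name": "app",
                "image": "example.com/app:1.0",
                "resources": {"limits": {"ephemeral-storage": "1Gi"}},
            },
            {"name": "log-shipper", "image": "example.com/shipper:1.0"},
        ]
    },
}

print(containers_without_storage_limit(pod))  # ['log-shipper']
```

A check like this could run in a CI pipeline before manifests are applied, so that the limit becomes part of the default configuration rather than something added after an incident.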
Taken as a whole, ESXi forces a storage model in which everything is secure and private by default. The only functional difference is the inability to provide support for HostPath volumes. As the Kubernetes documentation states, “This is not something that most Pods will need”, as the main use-case is to allow a pod privileged access to the host for the purposes of monitoring or controlling it. This is antithetical to the “share nothing” model.
As with storage, networking isolation between VMs and the ESXi host they’re running on is fundamental to the design. The standard networking on ESXi makes use of the underlying networking fabric to provide L2 isolation, and VLANs can be abstracted as configurable “port groups” to which VM virtual NICs (vNICs) can be connected.
ESXi has its own private “vmkernel NICs” that can be configured for special network traffic such as management, SAN, live migration etc. ESXi has its own firewall to prevent access to and from the host on any port other than that which is necessary for the functioning of the cluster. NSX is a further enhancement that can provide more advanced SDN capabilities such as per-vNIC level firewalls.
In simple terms, the ESXi host functions as a conduit for VM network access, but a vNIC network stack is private to a VM and an L3 domain is accessible only to the other VMs on the same port group. The ESXi host’s IP and network traffic will be L2 isolated on a management network inaccessible to application traffic.
So, what does this mean for ESXi as a Kubernetes node? Well, the most obvious limitation is that the idea of the ESXi host serving application traffic via an IP:port combination is not only impossible, but completely antithetical to the “share nothing” design. Kubernetes allows this via the NodePort concept, which is one way in which a Kubernetes Service can be manifested.
Fortunately, NodePort is pretty much only used as an internal implementation detail and is not necessary for any of the production networking function Kubernetes provides:
- A ClusterIP service is accessible only to other pods in the cluster and doesn’t require NodePort
- A LoadBalancer service is achieved using an external Load Balancer configured in vSphere
- An Ingress controller will have its own external IP for managing ingress
The only other functional limitation ESXi presents to Kubernetes is the HostNetwork concept, which is a hangover from the Docker days. It allows a container to share the network stack of the host, which might have some limited use in monitoring, but makes no sense in a production environment.
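To make the two networking limitations concrete, here is a small sketch that inspects a Pod or Service manifest (as a plain dictionary) and names any feature that depends on sharing the node's network: `hostNetwork: true` on a Pod, or `type: NodePort` on a Service. The function name and example manifests are hypothetical; the field names are standard Kubernetes ones.

```python
def node_network_features(manifest):
    """List features in a Pod or Service manifest that rely on the node's network."""
    found = []
    kind = manifest.get("kind")
    spec = manifest.get("spec", {})
    if kind == "Pod" and spec.get("hostNetwork"):
        found.append("hostNetwork")
    if kind == "Service" and spec.get("type") == "NodePort":
        found.append("NodePort")
    return found


# Hypothetical manifests used only to exercise the check.
monitoring_pod = {"kind": "Pod", "spec": {"hostNetwork": True, "containers": []}}
node_port_service = {"kind": "Service", "spec": {"type": "NodePort", "ports": [{"port": 80}]}}
load_balancer_service = {"kind": "Service", "spec": {"type": "LoadBalancer", "ports": [{"port": 80}]}}

print(node_network_features(monitoring_pod))         # ['hostNetwork']
print(node_network_features(node_port_service))      # ['NodePort']
print(node_network_features(load_balancer_service))  # []
```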
Much has been written over the years about Containers vs VMs and the industry has largely settled on the narrative that they’re complementary. In other words, VMs make good container hosts because they’re a strong isolation boundary and they’re much easier to manage than physical hardware.
What’s interesting about the concept of the strongly isolated pod is that the exact same argument can be made for the complementary nature of containers and VMs, with the isolation boundary having just been drawn in a more granular fashion. The big security win here of course is that the operating system kernel is shared only with containers that are intended to be weakly isolated and together form a single deployable unit.
Such a design choice, however, means that cross-cutting capabilities must be applied differently. The concept of the DaemonSet that runs containers that extend the capabilities of a Kubernetes node is predicated on container features that allow privileged access to the host. This extensibility has provided the scope for a rich ecosystem of third-party tools to flourish, which is wonderful. However, in a world in which there is no such thing as privileged access to the host, how can this be achieved?
There are really two possible approaches as it stands today.
1. The scope of privilege is limited to the pod, not the node. Instead of deploying your cross-cutting capability as a DaemonSet, you deploy it as a privileged sidecar container. This can work for a subset of extensions, most obviously those dealing with monitoring. However, it can't work for extensions that make assumptions about implementation details of the node.
2. Deploy a virtual Kubernetes cluster with your extensions and use PodSecurityPolicies, namespaces, node labels, QoS configuration and whatever else you need to achieve the degree of isolation you want.
This is why vSphere with Tanzu offers you both options: The ability to deploy applications and services to vSphere directly as secure first-class-citizens AND the ability to easily manage virtual K8S clusters that can support all the ecosystem extensions.
We see these capabilities as being completely complementary. The Kubernetes admin uses the same APIs, the same endpoint and the same YAML format to deploy both K8S clusters and vSphere services which coexist within the same resource and management domain. A perfect example of this would be a container registry that needs to serve multiple K8S clusters. You need it to be persistent, secure and performant, so it makes perfect sense to deploy it as a native pod.
Control Plane Security
Control plane security is easy to overlook, but it is absolutely critical to the security of the system as a whole. Even though ESXi has no knowledge of or access to what’s running in a VM, the vSphere control plane does. The Kubernetes control plane allows you to run processes inside of existing containers as root, which is an even more powerful capability.
This really could be a blog post all on its own, but for now let’s boil it down to who can access the control plane and how they can access the control plane.
The who question is largely solved through a combination of authentication, role-based access control (RBAC) and security policies. Configuring any of these things manually is a challenge and so they often get applied reactively after the fact, rather than proactively as part of the default configuration.
This is one area where Tanzu Kubernetes Grid Service shines because it uses a “least privilege” approach out of the box, automates away many of the difficult tasks and integrates with vSphere SSO throughout. In practice, this means that the K8S admin needs to explicitly authorize actions in a cluster and can easily control aspects such as whether containers can be run as root.
Reducing the Attack Surface
The reality of the way Kubernetes is designed is that every component in the system is a client of the control plane. Kubelet is a great example – it must authenticate to the control plane just like any user for the cluster to function. In order to do that, it needs a certificate and URL which is most likely stored on the filesystem of the node. With execution permissions on kubectl and read permissions on the certificate, it would be easy to connect to the API server from a node using Kubelet’s identity. As such, preventing root access to the node and configuring appropriate file permissions is very important.
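As a small sketch of the file-permission point, the check below reports whether a credential file is accessible to anyone other than its owner. The two paths are assumptions based on a typical kubeadm-style layout and may well differ on your nodes.

```python
import os
import stat


def is_private(path):
    """Return True if the file grants no permissions to group or others."""
    mode = os.stat(path).st_mode
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


if __name__ == "__main__":
    # Assumed locations; they vary by distribution and installer.
    candidates = (
        "/etc/kubernetes/kubelet.conf",
        "/var/lib/kubelet/pki/kubelet-client-current.pem",
    )
    for path in candidates:
        if os.path.exists(path):
            status = "owner-only" if is_private(path) else "accessible beyond owner"
            print(path, status)
```

This only covers the file mode. It says nothing about who can become the owning user, which is why restricting root access to the node matters just as much.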
It is also possible to configure a pod such that the credentials and address of the control plane are made available to it. This is a very useful feature for some purposes, but it’s important to ensure that appropriate ServiceAccounts are used so that this is exceptional, rather than the default.
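One way to make this exceptional rather than the default is the standard `automountServiceAccountToken` field, which can be set on a ServiceAccount or on an individual pod. The sketch below models how the two settings combine, following the precedence described in the upstream Kubernetes documentation: the pod's setting wins, then the ServiceAccount's, and the token is mounted if neither is set. The function name and example objects are hypothetical.

```python
def mounts_api_credentials(pod, service_account=None):
    """Return True if the pod would have an API token mounted automatically.

    Precedence: pod spec, then ServiceAccount, then the default (mount).
    """
    pod_setting = pod["spec"].get("automountServiceAccountToken")
    if pod_setting is not None:
        return pod_setting
    if service_account is not None:
        sa_setting = service_account.get("automountServiceAccountToken")
        if sa_setting is not None:
            return sa_setting
    return True


# Hypothetical objects: a locked-down ServiceAccount and two pods using it.
restricted_sa = {
    "apiVersion": "v1",
    "kind": "ServiceAccount",
    "metadata": {"name": "no-api-access"},
    "automountServiceAccountToken": False,
}
ordinary_pod = {"kind": "Pod", "spec": {"serviceAccountName": "no-api-access"}}
operator_pod = {
    "kind": "Pod",
    "spec": {"serviceAccountName": "no-api-access", "automountServiceAccountToken": True},
}

print(mounts_api_credentials(ordinary_pod, restricted_sa))  # False
print(mounts_api_credentials(operator_pod, restricted_sa))  # True
```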
Application security is too broad a topic to cover in any depth here, but some high-level considerations include:
- Container image security – The Harbor project has all of these features:
  - Authentication and access control to the image registry
  - Assessing images for vulnerabilities and gating deployment on severity
  - Reduce the attack surface by minimizing the contents of the image
- Anomaly detection – This is one area where Carbon Black comes into play
- High availability – This is important for nodes as well as pods
  - Region / Zone support is coming in CAPV to help distribute nodes
- Secret management – Where are your secrets persisted and how can they be accessed?
- NetworkPolicies – An important consideration in how attacks can proliferate
  - Default to block everything and require access to be made explicit
  - Limit the ability of containers to install new binaries from outside
- PodSecurityPolicies – Allow you to restrict the deployment of certain kinds of container
- Limit container ephemeral storage to prevent accidental or rogue disk space issues
- Limit container memory/CPU to prevent over-consumption
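To illustrate the "default to block everything" advice for NetworkPolicies, the sketch below generates a deny-all policy for a namespace using the standard `networking.k8s.io/v1` schema. The policy name and namespace are placeholders, and the policy only has an effect if the cluster's network plugin enforces NetworkPolicies.

```python
import json


def default_deny(namespace):
    """Build a NetworkPolicy that selects every pod in a namespace and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty selector matches all pods in the namespace.
            "podSelector": {},
            # Both directions are restricted and no allow rules are listed.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML.
    print(json.dumps(default_deny("dev-team"), indent=2))
```

With this in place, each permitted flow has to be described by a further, more specific policy, which is what makes access explicit.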
The Cloud Native Computing Foundation (CNCF) is the vendor-neutral home for many open-source projects like Kubernetes. When it comes to Kubernetes, the CNCF provides a process for confirming that your Kubernetes implementation is conformant to the specification. Vendors prepare their product, run the tests and submit the results, and a CNCF reviewer approves the certification.
The question of Kubernetes Conformance when it comes to the Supervisor is one which we haven’t addressed directly to date, but merits discussion given the design choices discussed above. We run conformance tests internally against every build and we pay close attention to the results.
The Supervisor control plane is upstream Kubernetes and has two sets of nodes it manages. It has a set of “standard” Kubernetes nodes which run the api-server, etcd and all the controllers that provide the vSphere extensions that make up the product. Additionally, each of the ESXi hosts in the vSphere cluster register as nodes and a custom Kubernetes scheduler extension deploys “native pods” to those nodes.
Of course, if we were to run the conformance tests against the “standard” nodes only, the Supervisor would be as fully conformant as any upstream K8S implementation. However, it’s important to us that our native pods and our ESXi nodes meet the same standards for conformance as the Linux nodes. This matters because we have no desire to fragment the Kubernetes standards, we want Pods on vSphere to “just work” and frankly we feel that the ecosystem is strengthened by innovation.
That said, due to the “share nothing” design of ESXi, we cannot support HostNetwork, HostPath or NodePort for these nodes, all of which are tested as part of conformance. As discussed in detail above, these capabilities require a break of isolation that ESXi and vSphere simply cannot allow. Aside from that, the Supervisor passes 100% of the remaining tests.
We are making our conformance results publicly available in GitHub along with more details to help customers understand the functional differences.
The ongoing shift towards modern container-based workloads is not just a question of how we design applications or even how we make them more efficient. It’s also changing how we think about infrastructure and our expectations of how we consume it and what it should offer.
VMware has only just begun on this journey of providing different ways for our customers to consume vSphere and Kubernetes is at the heart of that transformation.
Infrastructure that’s secure by design and which offers flexibility in how those security features are consumed is the best place to start when it comes to securing modern applications.