vSAN RA - Modern Applications - Kubernetes on VMware vSAN
Digital transformation is driving a new application development and deployment approach called cloud native. Kubernetes is an open source platform that automates cloud native application deployment and operations. It eliminates many of the manual processes involved in deploying and scaling containerized applications. Kubernetes gives the customer a platform to schedule and run containers on clusters of physical or virtual machines. With the release of VMware vSphere® 6.7 Update 3, Cloud Native Storage (CNS) builds on the legacy of the earlier vSphere Cloud Provider for Kubernetes and is paired with the new Container Storage Interface (CSI) for vSphere. CNS aims to improve container volume management and provide deep insight into how container applications running on top of vSphere infrastructure are consuming the underlying vSphere Storage.
VMware Hyperconverged Infrastructure (HCI), powered by VMware vSAN™, is the market leader in low cost and high performance next-generation HCI solutions. vSAN delivers TCO reduction benefits over traditional 3-tiered architecture by eliminating infrastructure silos. vSAN also allows customers to achieve infrastructure agility and scalability by converging traditional compute and storage silos onto industry-standard servers, dramatically simplifying operations. vSAN is a bundled component in VMware Cloud Foundation™.
Cloud Native Storage through the CSI driver on vSAN is natively integrated into VMware vCenter®. It is a solution that provides comprehensive data management for both stateless and stateful applications. When you use cloud native storage, you can create containerized stateful applications capable of surviving container restarts and outages. Stateful containers leverage storage exposed by vSphere that can be provisioned using Kubernetes primitives such as persistent volume, persistent volume claim, and storage class for dynamic provisioning.
The CSI driver on vSAN is an interface that exposes vSphere storage to containerized workloads managed by Kubernetes as well as other container orchestrators. It enables vSAN and other types of vSphere storage to support stateful applications. On Kubernetes, the CSI driver is used with the out-of-tree vSphere Cloud Provider Interface (CPI). The CSI driver is shipped as a container image and must be deployed by the cluster administrator.
In this solution, we provide designed test scenarios, deployment procedures, and best practices to run containerized workloads on the VMware vSAN platform.
This reference architecture is based on vSphere 6.7 U3 and VMware Cloud Foundation 3.9; however, the tests are generic and applicable to any environment running Kubernetes on vSphere.
Cloud Native Storage through the CSI driver on vSAN allows cloud native applications to persist data on all supported vSphere storage backends with the following benefits:
- Unified management for today’s and tomorrow’s applications: tight integration with Kubernetes through the use of a CSI driver for vSphere storage such as vSAN. Persistent volumes used by these applications can be easily managed by a vSphere administrator, just as easily as managing a virtual machine’s virtual disks. The tight integration with Kubernetes brings operational consistency that increases the productivity and efficiency of both developers and IT administrators:
- Consistent definition of storage policies through the integration between Kubernetes StorageClass primitive (what developers use to dynamically provision persistent volumes for their applications) and vSAN Storage Policy-Based Management policies (what vSphere administrators use to define storage policies and assign to StorageClasses).
- Consistent labeling of container volumes, allowing admins and developers to talk in the same language and to have consistent visibility into container volumes.
- Consistent method of troubleshooting between virtual machines and container volumes.
- Dynamic and automatic provisioning of persistent volumes: developers can use Kubernetes APIs that they are familiar with to provision storage resources.
- Standardizing on the deployment of storage for Kubernetes while providing the flexibility of using various Kubernetes distributions, with the consistent processes and tools in the cloud environment.
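The StorageClass-to-SPBM integration described above can be sketched as follows: a vSphere administrator defines a storage policy in vCenter, and a developer references it by name from a Kubernetes StorageClass. This is a minimal illustrative example; the class name and policy name are hypothetical, and only the `csi.vsphere.vmware.com` provisioner and the `storagepolicyname` parameter reflect the vSphere CSI driver.

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: mongodb-sc                     # hypothetical class name used by developers
provisioner: csi.vsphere.vmware.com    # vSphere CSI driver
parameters:
  # SPBM policy created by the vSphere administrator; the name is an example
  storagepolicyname: "vSAN Default Storage Policy"
```

Any PersistentVolumeClaim that references this StorageClass is dynamically provisioned as a vSphere volume compliant with the named SPBM policy.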
This reference architecture is a showcase of vSAN for operating and managing containerized workloads in a fully integrated SDDC environment. If you deploy the workload in VMware Cloud Foundation, it can help simplify and accelerate the necessary virtual infrastructure deployment. The key solution results are summarized as follows:
- Demonstrate linear performance scalability for containerized workloads with CNS running on native Kubernetes.
- Evaluate the impact of data services on performance.
- Ensure resiliency and availability against various failures and evaluate the performance impact before and after the failures.
This solution is intended for cloud-native application administrators and storage architects involved in planning, designing, or administering Kubernetes on vSAN.
The solution technology components are listed below:
- VMware vSphere
- VMware vSAN
- VMware NSX-T and NSX-T Container Plug-in
- Kubernetes 1.14.6
- MongoDB 3.6.9
Compute virtualization (VMware vSphere), storage virtualization (VMware vSAN), and network virtualization (VMware NSX®) are integrated into a single platform, VMware Cloud Foundation 3.9, which can be deployed on premises as a private cloud or run as a service within a public cloud. This documentation focuses on the private cloud use case.
VMware vSphere is the next-generation infrastructure for next-generation applications. It provides a powerful, flexible, and secure foundation for business agility that accelerates the digital transformation to cloud computing and promotes success in the digital economy. vSphere 6.7 Update 3 supports both existing and next-generation applications through its:
- Simplified customer experience for automation and management at scale
- Comprehensive built-in security for protecting data, infrastructure, and access
- Universal application platform for running any application anywhere
With VMware vSphere, customers can run, manage, connect, and secure their applications in a common operating environment, across clouds and devices.
VMware vSAN is the industry-leading software powering VMware’s software defined storage and HCI solution. vSAN helps customers evolve their data center without risk, control IT costs and scale to tomorrow’s business needs. vSAN, native to the market-leading hypervisor, delivers flash-optimized, secure storage for all of your critical vSphere workloads, and is built on industry-standard x86 servers and components that help lower TCO in comparison to traditional storage. It delivers the agility to easily scale IT and offers the industry’s first native HCI encryption.
The vSAN 6.7 U3 release provides performance improvements and availability SLAs on all-flash configurations with deduplication enabled. Latency-sensitive applications see better performance in terms of predictable I/O latencies and increased sequential I/O throughput. Rebuild times on disk and node failures are shorter, which provides better availability SLAs. The 6.7 U3 release also supports Cloud Native Storage, which provides comprehensive data management for stateful applications. With Cloud Native Storage, vSphere persistent storage integrates with Kubernetes.
vSAN simplifies day-1 and day-2 operations, and customers can quickly deploy and extend cloud infrastructure and minimize maintenance disruptions. Stateful containers orchestrated by Kubernetes can leverage storage exposed by vSphere while using standard Kubernetes volume, persistent volume, and dynamic provisioning primitives.
VMware NSX-T and NSX-T Container Plug-in
VMware NSX-T Data Center is focused on emerging application frameworks and architectures that have heterogeneous endpoints and technology stacks. In addition to vSphere hypervisors, these environments include other hypervisors such as KVM, as well as containers and bare metal.
VMware NSX-T is designed for management, operations and consumption by development organizations. NSX-T allows IT and development teams to choose the technologies best suited for their applications.
In this solution, we use VMware NSX-T to control the networking during Kubernetes operations. NSX-T simplifies container networking and increases security by providing high availability, automated provisioning, micro-segmentation, load balancing, and security policy.
NSX-T Container Plug-in (NCP) provides integration between NSX-T and container orchestrators such as Kubernetes, as well as integration between NSX-T and container-based PaaS (platform as a service) products.
The main component of NCP runs in a container and communicates with NSX Manager and with the Kubernetes control plane. NCP monitors changes to containers and other resources and manages networking resources such as logical ports, switches, routers, and security groups for the containers by calling the NSX API.
Kubernetes is an open source platform that automates container operations. It eliminates many of the manual processes involved in deploying and scaling containerized applications. Kubernetes provides a platform to schedule and run containers on clusters of physical or virtual machines.
MongoDB is a document-oriented database. The data structure is composed of field and value pairs. MongoDB documents are similar to JSON objects.
The values of fields may include other documents, arrays and arrays of documents.
The advantages of using documents are:
- Documents correspond to native data types in many programming languages.
- Embedded documents and arrays reduce the need for expensive joins.
- Dynamic schema supports fluent polymorphism.
This section introduces the resources and configurations:
- Architecture diagram
- Hardware resources
- Software resources
- Network configuration
- Virtual machine configuration
- Test tool and workload
In this solution, we deployed the underlying infrastructure, Kubernetes, and cloud native storage on VMware vSAN.
With NSX-T 2.5, the NSX-T Controller is integrated into NSX-T Manager. We deployed 3 NSX-T Managers for high availability and deployed the NSX-T Edges on VMware ESXi™ hosts leveraging VDS port groups.
We deployed the Kubernetes cluster on 4-node All-Flash vSAN cluster with NSX-T. Each Kubernetes cluster contains at least one VM acting as the primary node and multiple VMs acting as the worker nodes. We can deploy more than one Kubernetes cluster based on the physical resources assigned to the underlying vSphere cluster. CNS is already fully integrated into vSphere—there is nothing additional to install. Its purpose is to coordinate persistent volume operations on Kubernetes node VMs as requested by the CSI driver. Thus, you just need to install the CSI driver into Kubernetes.
Figure 1. Native (Open Source) Kubernetes on VMware vSAN Architecture
In this solution, we used a total of four all-flash servers, each configured with two vSAN disk groups, and each disk group consisting of one cache-tier write-intensive NVMe SSD and three capacity-tier read-intensive SAS SSDs.
Each ESXi host in the vSphere cluster had the following configuration:
Table 1. Hardware Configuration
Server model name
CPU: 2 x Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 24 cores each
Network: 2 x 10Gb NIC
Storage controller: 1 x Dell H730 mini adapter
Cache tier: 2 x 1.6TB write-intensive NVMe SSDs
Capacity tier: 6 x 375GB read-intensive SAS SSDs
Table 2 shows the software resources used in this solution.
Table 2. Software Resources
VMware vCenter Server and ESXi (6.7 Update 3): the ESXi cluster hosts the virtual machines and provides the vSAN cluster. VMware vCenter Server provides a centralized platform for managing VMware vSphere environments.
VMware vSAN (6.7 Update 3): vSAN is the key component in VMware Cloud Foundation to provide low-cost and high-performance next-generation HCI solutions.
VMware NSX-T and NSX-T Container Plug-in: NSX-T is deployed automatically by VMware Cloud Foundation. It is designed for networking management and operation.
Kubernetes (1.14.6): Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.
CentOS 7.6 is used as the guest operating system of the testing clients and Kubernetes node VMs.
YCSB Benchmark Tool
YCSB is a popular Java open-source specification and program suite developed at Yahoo to compare the relative performance of various NoSQL databases.
MongoDB is a document-oriented database. In this solution, we use MongoDB as containerized workload running in Kubernetes Pods.
Figure 2 depicts an NSX-T Edge VM hosted on one of the ESXi hosts and shows how it leverages the VDS port groups configured for different VLANs. The Edge node VM had a total of 3 vNICs: one vNIC was dedicated to management traffic, and the other two interfaces were assigned to the DPDK fast path; one fast path interface was used for sending overlay traffic and the other was used for uplink traffic towards the TOR (Top of Rack) switches.
Figure 2. NSX-T Edge VM Deployed on VDS Port Groups
Some important points in Figure 2:
- VDS had dedicated pNICs, P1 and P2. The VDS port groups were used for Edge VM deployment. All the VMkernel interfaces like management, vMotion, and storage used VDS as well.
- No VLAN tagging was done on the Edge uplink profile because VLAN tags were applied at the port group level.
- Three VLAN backed port groups were defined. Port group Edge-Mgmt-PG was configured for failover order teaming policy with pNIC P1 as active and pNIC P2 as standby.
- Port group Edge-Transport-PG was attached to only one pNIC. It was also configured with the failover order teaming policy.
- Port group Edge-Uplink-PG for Edge node 1 and 2 had only one pNIC each, pNIC P1 and P2 respectively.
NSX VDS (N-VDS) is the next-generation virtual distributed switch installed by NSX-T Manager on transport nodes such as ESXi hosts. Its job is to forward traffic between components running on the transport nodes. The Edge node is a transport node and has a TEP IP; it sends and receives overlay traffic from compute hosts on a pNIC via Edge-Transport-PG. A compute host is also a transport node and has a TEP IP like the Edge node. It uses a pNIC to send overlay traffic to other compute hosts and to Edge nodes.
On each physical server, two Network Interface Card (NIC) ports were used. All NIC ports were attached to the N-VDS. Management, vMotion, and the NSX-T host overlay transport zone used NIC 1 as the active uplink and NIC 2 as the standby uplink. The vSAN VMkernel network used NIC 2 as the active uplink and NIC 1 as the standby uplink.
NSX-T requires MTU to be at least 1,600 for the physical switches and NICs. vSAN supports MTU 1,500 and MTU 9,000 and the performance tests show that larger MTU settings can help vSAN improve throughput. Based on the above requirements, we set the MTU to 9,000 in the whole environment to reduce the physical network management complexity and achieve a higher performance.
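With jumbo frames set end to end, one quick way to validate the MTU 9,000 path from an ESXi host is vmkping with the don't-fragment bit set. This is a sketch; the VMkernel interface name and target IP are placeholders for your vSAN network.

```shell
# 8972 bytes of ICMP payload = 9000-byte MTU minus 28 bytes of IP/ICMP headers.
# -I selects the VMkernel interface, -d sets "don't fragment".
vmkping -I vmk2 -d -s 8972 192.168.30.12
```

If the ping fails with the don't-fragment bit set but succeeds without it, a switch or NIC in the path is still configured with a smaller MTU.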
We deployed the NCP plugin, which controls the NSX-T networking during Kubernetes operations. Kubernetes is divided into namespaces, each containing multiple pods. Each namespace created in Kubernetes triggers a dedicated NSX-T segment, and the namespace therefore defines a network address and IP range to be assigned to the segment.
NSX-T must be set up with an IP block that can be divided to provide a new subnet for each namespace that is created. Each pod created within a namespace then automatically receives its IP address from the corresponding subnet. All Kubernetes pods in the cluster are source NATed on the Tier 1 gateway.
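The IP block and NAT pool mapping above is typically wired up in NCP's configuration. The fragment below is purely illustrative; exact option names vary across NCP releases, so treat it as a sketch rather than a definitive ncp.ini.

```ini
# Illustrative ncp.ini fragment (assumed option names; verify against your NCP version)
[nsx_v3]
container_ip_blocks = IP-BLOCK       # carved into a per-namespace subnet
external_ip_pools = pod-NAT-POOL     # SNAT addresses for pod egress traffic
```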
Each namespace that gets an IP pool out of "IP-BLOCK" is source NATed to an IP address picked from the “pod-NAT-POOL”. The next part of the IP addressing provides external connectivity for each Kubernetes service of type LoadBalancer and is defined as an IP pool rather than an address block. Figure 3 shows the network configuration of Kubernetes with NSX-T.
Figure 3. Network Configuration of Kubernetes with NSX-T
Each Kubernetes node VM is configured with two virtual network adapters. vNIC1 is connected to the k8s_management logical switch, which is attached to a Tier 1 router for access to the Kubernetes nodes. vNIC2 is connected to the k8s_transport logical switch, which is not connected to any Tier 1 router and provides a transport network.
Virtual Machine Configuration
The virtual machines were grouped by different roles on VMware vSAN. Table 4 describes the virtual machine configuration in this solution.
Table 4. Virtual Machine Configuration
- Kubernetes Master Node
- Kubernetes Worker Node
- YCSB Test Client
In this solution, we deployed the NSX-T Edges on ESXi hosts leveraging VDS port groups. For the NSX-T Edge VMs, we used the ‘small’ deployment profile.
Test Tool and Workload
We used the YCSB benchmark tool to evaluate the performance of the containerized MongoDB application. We used YCSB workload A as summarized:
- Workload A (update-heavy workload): 50%/50% mix of reads and writes
Some configuration parameters are in Table 5.
Table 5. YCSB Parameter setting
Threads on each client
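A typical YCSB invocation for this kind of test loads the dataset and then runs workload A against the mongos router. This is a sketch; the MongoDB URL, record count, operation count, and thread count are placeholders, not the exact values used in the tests.

```shell
# Load phase: populate the ycsb database through mongos (URL is a placeholder)
./bin/ycsb load mongodb -s -P workloads/workloada \
    -p mongodb.url="mongodb://mongos.example:27017/ycsb" \
    -p recordcount=10000000 -threads 32

# Run phase: workload A, a 50/50 read/update mix
./bin/ycsb run mongodb -s -P workloads/workloada \
    -p mongodb.url="mongodb://mongos.example:27017/ycsb" \
    -p operationcount=10000000 -threads 32
```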
The solution included the following tests:
- Functionality test: create and deploy a StatefulSet application with persistent volumes. The CSI driver creates VMDKs that meet the application’s storage requirements and dynamically attaches them to the Kubernetes nodes in the cluster. Use the vSphere Client to access and monitor the container volumes that back the application. Validate the node VM failure and Kubernetes pod failure scenarios.
- Performance test of MongoDB: validate the baseline performance of running MongoDB as pods in a Kubernetes cluster with CSI volumes, and the scaling performance of MongoDB by increasing the number of MongoDB clusters from 1 to 2 to 4. Test different failure types, including host failure, disk failure, and disk group failure, with data services such as RAID 5 and deduplication/compression enabled, to verify the impact on performance.
- Backup and restore test: use Velero for Kubernetes backup and restore of the persistent volumes.
Note: See Deploying a Kubernetes Cluster on vSphere with CSI and CPI for detailed procedures.
Functionality Test—Create a StatefulSet application with Persistent Volumes
We deployed MongoDB workloads as an example to illustrate storage class creation, MongoDB service creation, and StatefulSet creation on the Kubernetes cluster.
On the primary node, verify that all nodes have joined the cluster with the proper setup of NCP plugin and CSI driver.
1. Verify that all worker nodes join the primary node and are ready.
2. Verify the NCP plugin is set on all nodes.
3. Verify vSphere Cloud Controller Manager and vsphere-csi-node is running on all nodes and ProviderID is set on all nodes.
4. Create and deploy a StorageClass that references a vSphere SPBM policy, and a Kubernetes service for the application. The service provides a networking endpoint for the application. Finally, create and deploy a StatefulSet that specifies the number of replicas to be used for the application.
5. In the vSphere Client, click Monitor->Container Volumes, observe the container volumes availability and monitor the storage policy compliance status.
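Step 4 above can be sketched with a headless Service and a StatefulSet whose volumeClaimTemplates provision CSI volumes per replica. This is an illustrative manifest, not the exact one used in the tests: the service name, labels, replica startup command, volume size, and the `mongodb-sc` StorageClass name are assumptions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mongodb-service
spec:
  clusterIP: None            # headless service: stable per-pod DNS for the StatefulSet
  ports:
  - port: 27017
    targetPort: 27017
  selector:
    role: mongo
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongod
spec:
  serviceName: mongodb-service
  replicas: 3
  selector:
    matchLabels:
      role: mongo
  template:
    metadata:
      labels:
        role: mongo
    spec:
      containers:
      - name: mongod-container
        image: mongo:3.6.9
        command: ["mongod", "--bind_ip", "0.0.0.0", "--replSet", "MainRepSet"]
        ports:
        - containerPort: 27017
        volumeMounts:
        - name: mongodb-persistent-storage-claim
          mountPath: /data/db
  volumeClaimTemplates:
  - metadata:
      name: mongodb-persistent-storage-claim
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: mongodb-sc    # hypothetical StorageClass backed by an SPBM policy
      resources:
        requests:
          storage: 1Gi
```

Each replica gets its own PVC named `<template>-<statefulset>-<ordinal>`, for example `mongodb-persistent-storage-claim-mongod-0`.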
Functionality Test—Scale out the StatefulSet Application
We easily scaled out the StatefulSet’s replicas by using this command:
$ kubectl scale statefulset mongod --replicas=5
The persistent volume was automatically created and automatically attached to the created pod.
In the vSphere client, we also observed 5 container volumes.
When we scaled the application down from 5 pods to 3, the pods that get removed are the ones with the highest ordinals. Pod mongod-0 is the first pod created and is also the last one to get removed.
$ kubectl scale statefulset mongod --replicas=3
When we scaled down the application, the PVCs were not removed to protect data. Run the kubectl delete pvc command manually if you want to remove the persistent volumes that are no longer used by pods.
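The cleanup mentioned above can be sketched as follows; the PVC names are examples following the `<volumeClaimTemplate>-<statefulset>-<ordinal>` pattern, assuming a scale-down from 5 replicas to 3.

```shell
# List the claims left behind after the scale-down
kubectl get pvc

# Delete only the claims no longer used by any pod (highest ordinals first)
kubectl delete pvc mongodb-persistent-storage-claim-mongod-4
kubectl delete pvc mongodb-persistent-storage-claim-mongod-3
```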
Delete the application pod and PVC
After deleting a pod, the pod is re-created and restarted. The persistent volume is not removed; the newly created pod mounts the existing PV.
You have to remove the pod first if you want to remove its PVC. Otherwise, the PVC is left with a status of “Terminating” indefinitely because the PVC mongodb-persistent-storage-claim-mongod-2 is being used by the pod mongod-2.
Functionality Test—Kubernetes Node VM Failure
We powered off the node on which the pod of the StatefulSet was running. After 40 seconds, the node entered a NotReady state. After another 5 minutes, the pod was marked for deletion and entered a Terminating state.
The pod was not scheduled on a new node, and there was no attempt to detach the PV from the current node and re-attach it to a new node. We then used the kubectl delete node command to force delete the node that was powered off. The pod got deleted and changed from Terminating to Pending, waiting for the PV to be attached and mounted before it could start. Initially, we noticed some errors in the pod’s events: the volume could not be attached to the new node because it was still attached to the original node. After about 6 minutes, new events on the pod indicated that the attach of the PV succeeded.
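The recovery sequence can be sketched as follows; the node name is hypothetical.

```shell
kubectl get nodes                   # the powered-off node shows NotReady
kubectl delete node k8s-worker-2    # force-delete the failed node so the pod can reschedule
kubectl get pods -o wide -w         # watch the pod move from Terminating to Pending to Running
kubectl describe pod mongod-0       # events show the PV detach/attach progress
```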
MongoDB Performance Test
We deployed sharded MongoDB clusters on the Kubernetes cluster. Each Kubernetes cluster contained 1 primary node and 3 worker nodes. For each MongoDB cluster, we deployed 1 ‘configdb’ StatefulSet to act as the internal configuration database of MongoDB. There were also 3 ‘shard’ StatefulSets as MongoDB’s actual data nodes and 1 ‘mongos’ StatefulSet as the routing service of MongoDB. We made sure the 3 shard StatefulSet pods were distributed on different node VMs.
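Distributing the shard pods across node VMs can be enforced with a pod anti-affinity rule in the shard StatefulSet's pod template. This is a sketch; the `role: mongo-shard` label is a hypothetical name for whatever label the shard pods carry.

```yaml
# Pod template fragment: never co-locate two shard pods on the same node VM
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: role
          operator: In
          values:
          - mongo-shard
      topologyKey: kubernetes.io/hostname
```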
We used YCSB and workload type of ‘workload A (50% write/50% read)’ as described in the Test Tool and Workload section.
We started by deploying only 1 Kubernetes cluster and only 1 MongoDB cluster. Then we deployed another 1 Kubernetes cluster and MongoDB cluster to test the scalability. Finally, there were 4 clusters running and being tested in parallel.
With the number of clusters growing from 1 to 2 to 4, the throughput (ops/sec) increased from 10,751 to 17,407 and to 28,004. The increasing rate was 62% from 1 cluster to 2 clusters and 61% from 2 clusters to 4 clusters.
Meanwhile, the average read latency grew from 2.66ms to 3.38ms to 4.5ms. The increasing rate was 27% from 1 cluster to 2 clusters and 33% from 2 clusters to 4 clusters. The average write latency grew from 3.29ms to 4.14ms to 5.3ms. The increasing rate was 25% from 1 cluster to 2 clusters and 28% from 2 clusters to 4 clusters.
Figure 4. MongoDB Performance
Performance Test—Data Services
The native Kubernetes node VMs have the same operating system, but the persistent volumes containing the data had a lower deduplication ratio. We deployed 4 MongoDB clusters. Each cluster contained one primary VM and 3 worker VMs. The deduplication and compression ratio was 1.34x, saving 926.75GB of space.
Note: The testing results might vary based on different testbeds and workloads.
We changed the Failures to tolerate method of the persistent volumes to RAID 5, which saved 0.53TB of space. We evaluated the data services’ impact on performance by comparing the data service results to the baseline results.
Figure 5. MongoDB Performance Result Comparison
From the results, the ops/sec decreased as the failure tolerance method changed from RAID 1 to RAID 5, as expected: about an 18% drop in performance. Meanwhile, RAID 5 increased the update latency. Enabling deduplication and compression had limited performance impact during the tests because most of the data was in the cache layer.
We validated three failure scenarios for the resiliency test: disk failure, disk group failure, host failure.
In this failure test section, we used the same parameter settings as the one-Kubernetes-cluster baseline test. We monitored the workload, and when it entered a steady state, we manually injected a disk error on a capacity SSD that hosted MongoDB test data and measured the performance after the disk failure.
We then repeated the test, injecting a disk error on a cache SSD instead, which caused an entire disk group failure, and measured the performance after the disk group failure.
Lastly, we shut down a host running VMs that did not host the mongos (routing service of MongoDB) persistent volume, and measured the performance after the host outage.
Figure 6. Resiliency Test Result
From the results, the average performance after the disk failure dropped about 5%. However, with the storage policy set to FTT=1, the objects can still survive and serve I/O, so from the VM’s perspective there is no interruption of service. In the disk group failure scenario, the vSAN disk group fails and the objects with a permanent error are put in a degraded state, which automatically triggers rebuilding on another disk group. The rebuild bandwidth can impact performance during the test: the average performance dropped about 10% due to the rebuild process.
When we manually shut down a host, there was no obvious performance drop. VMware vSphere HA restarted the impacted VMs on another host.
Because we did not shut down the host running the routing service of MongoDB, there was no interruption of service.
Backup and Restore Test
We backed up and restored the Kubernetes cluster objects using the Velero project. First, we downloaded and installed the Velero CLI on the local machine. Velero uses any S3 API-compatible object storage as its backup location. Minio is a small, easy-to-deploy S3 object store that you can run on Kubernetes or elsewhere. To install Minio on the cluster for use as a backup repository, we can use Helm, a package manager for Kubernetes, which simplifies the installation to creating a YAML file for the configuration.
Velero can be installed via a Helm chart. With Velero deployed to the cluster, we used Velero to back up and restore an Nginx application with PVs on the Kubernetes cluster running the CSI plugin.
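An installation of this kind can be sketched as follows. The chart repository, plugin version, bucket name, and Minio endpoint are illustrative assumptions, not the exact values used in this solution; the `velero install` flags shown are standard Velero CLI options.

```shell
# Deploy Minio as an in-cluster S3-compatible backup target (chart/values illustrative)
helm repo add minio https://helm.min.io/
helm install minio --namespace velero --set buckets[0].name=velero minio/minio

# Point Velero at the Minio endpoint using the AWS (S3) provider plugin
velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.0.0 \
    --bucket velero \
    --secret-file ./credentials-velero \
    --backup-location-config region=minio,s3ForcePathStyle="true",s3Url=http://minio.velero.svc:9000
```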
From the figure, we observe specs for:
- An Nginx namespace called nginx-example
- A LoadBalancer Service that exposes port 80
- A 50MB persistent volume claim, using the nginxn-sc StorageClass
The NGINX welcome page is displayed.
We perform a backup of the nginx-example namespace using the velero command-line client:
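A sketch of the backup step; the backup name is an example.

```shell
velero backup create nginx-backup --include-namespaces nginx-example
velero backup describe nginx-backup    # check progress and completion status
```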
Now delete the nginx-example namespace; this deletes everything in the namespace, including the LoadBalancer service and the persistent volume:
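The deletion step above is a single command:

```shell
# Removes the namespace and everything in it, including the Service and PVC
kubectl delete namespace nginx-example
```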
We can now perform the restore procedure, once again using the velero client:
Here we use “create” to create a velero restore object from the nginx-backup, and then we can check the status of the restored object.
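The restore step described above can be sketched as:

```shell
velero restore create --from-backup nginx-backup
velero restore get                  # verify the restore completed
kubectl get all -n nginx-example    # the namespace, service, and pods are back
```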
Velero can also be used to schedule regular backups of the Kubernetes cluster for disaster recovery using the velero schedule command, and to migrate resources from one cluster to another.
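A scheduled backup can be sketched as follows; the schedule name, cron expression, and retention period are illustrative.

```shell
# Back up the namespace daily at 01:00 and keep each backup for 72 hours
velero schedule create nginx-daily --schedule="0 1 * * *" \
    --include-namespaces nginx-example --ttl 72h
```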
The following recommendations provide the best practices and sizing guidance to run containerized application on VMware vSAN with Cloud Native Storage.
- Compute consideration:
- Enable HA and DRS feature of the cluster
- Use VM/Host rules to force the VMs of the NSX-T Edge cluster and the worker nodes of each Kubernetes cluster to reside on separate physical hosts
- Use pod-level anti-affinity rules to force shard StatefulSet pods to be distributed on different node VMs
- Network consideration:
- Separate the physical networks using different network ranges and VLAN IDs for management, vSphere vMotion, vSAN, VM network, and host overlay network
- Set the MTU to 9,000 for all the physical switches
- Storage consideration:
- Optionally enable the deduplication and compression feature to save space
- For different types of workloads, create different vSAN storage policies to match their demands
- Consider using different “Failures to tolerate” options with RAID 5 or RAID 6 to improve space efficiency and protection

Conclusion
The integration between Kubernetes and Cloud Native Storage through the CSI driver on vSAN enables developers to provision persistent storage for
Kubernetes on vSphere on-demand in an automated fashion. It also allows IT administrators to manage container volumes through the Cloud Native Storage UI within VMware vCenter as if they were VM volumes. Developers and IT administrators can now have a consistent view of container volumes and troubleshoot at the same level.
Deploying Kubernetes at scale is made simple through the integration between open source Kubernetes and VMware SDDC technologies, which delivers consistent VM and container management and allows developers and IT administrators to collaborate on building and running modern applications.
About the Author
Yimeng Liu, Solutions Architect in the Solutions Architecture team of the Hyperconverged Infrastructure (HCI) Business Unit, wrote the original version of this paper. Ka Kit Wong, Victor Chen, Myles Gray, Rachel Zheng, and Kyle Gleed of the HCI Business Unit in VMware also contributed to the paper contents.