Running VMware Tanzu RabbitMQ on VMware Tanzu Kubernetes Grid
Note: This solution provides general design and deployment guidelines for running VMware Tanzu RabbitMQ® for Kubernetes on VMware Tanzu® Kubernetes Grid™. It is showcased in this paper running on Dell EMC VxRail. The reference architecture applies to any compatible hardware platforms running Tanzu Kubernetes Grid.
Executive Summary
Business Case
Messaging and streaming software serves as the central nervous system integrating distributed applications and services. VMware Tanzu RabbitMQ® is a messaging and streaming service, allowing customers to deploy and manage RabbitMQ at scale. Tanzu RabbitMQ makes it easy to architect resilient and loosely coupled systems. Application developers can connect systems using a variety of protocols and formats without worrying about interoperability, reliability, or scalability.
Based on open source RabbitMQ, Tanzu RabbitMQ provides additional levels of automation for Day 0, Day 1, and Day 2 operations, allowing your application teams to focus on delivering value to customers. Tanzu RabbitMQ for Kubernetes provides a secure messaging and streaming service. It offers low-touch, automated operation, and developer self-service across Kubernetes clusters. It can be deployed on premises and in the public cloud, using any type of Kubernetes.
VMware Tanzu Kubernetes Grid builds on trusted upstream and community projects and delivers a Kubernetes platform that is engineered and supported by VMware. You can deploy Tanzu Kubernetes Grid across several platforms including VMware vSphere®, Microsoft Azure, and Amazon EC2.
Dell EMC VxRail™, powered by next-generation Dell EMC PowerEdge server platforms and VxRail HCI System Software, features next-generation technology to future-proof your infrastructure and enables deep integration across the VMware ecosystem. Advanced VMware hybrid cloud integration and automation simplify the deployment of secure VxRail cloud infrastructure.
Technology Overview
The technology components in this solution are:
- VMware Tanzu Kubernetes Grid
- Dell EMC VxRail
- Tanzu RabbitMQ for Kubernetes
VMware Tanzu Kubernetes Grid
VMware Tanzu Kubernetes Grid provides organizations with a consistent, upstream-compatible, regional Kubernetes substrate that is ready for end-user workloads and ecosystem integrations. You can deploy Tanzu Kubernetes Grid across software-defined datacenters (SDDC) and public cloud environments, including vSphere, Microsoft Azure, and Amazon EC2.
Tanzu Kubernetes Grid provides the services that a production Kubernetes environment requires, such as networking, authentication, ingress control, and logging. It simplifies the operation of large-scale, multi-cluster Kubernetes environments, keeps your workloads properly isolated, and automates lifecycle management to reduce risk and let you focus on more strategic work.
Dell EMC VxRail
VxRail is the only fully integrated, pre-configured, and pre-tested VMware hyperconverged integrated system optimized for VMware vSAN™ and VMware Cloud Foundation™. It provides a simple, cost-effective hyperconverged solution that solves a wide range of operational and environmental challenges and supports almost any use case, including tier-one applications, cloud native workloads, and mixed workloads. Powered by next-generation Dell EMC PowerEdge server platforms and VxRail HCI System Software, VxRail features next-generation technology to future-proof your infrastructure and enables deep integration across the VMware ecosystem. Advanced VMware hybrid cloud integration and automation simplify the deployment of secure VxRail cloud infrastructure.
Tanzu RabbitMQ for Kubernetes
Tanzu RabbitMQ is available with a custom Kubernetes cluster operator that makes the widely popular messaging broker easy for developers to access and consistent for platform operators to manage with any certified Kubernetes runtime. Tanzu RabbitMQ on Kubernetes offers stable, predictable messaging and streaming and requires minimal configuration.
Tanzu RabbitMQ can automate provisioning and upgrades and manage messaging and streaming topologies even across datacenters. Developers can manage their messaging systems with the same low-code approach virtually anywhere, leveraging simple and fast deployment to Kubernetes, low levels of DevOps maintenance and access to the Kubernetes ecosystem for logging, monitoring, and tracking.
Tanzu RabbitMQ provides a Prometheus monitoring plugin and Grafana integration to expose metrics, improving the overall observability of RabbitMQ. Visualizing these metrics is as simple as importing the pre-built dashboards into Grafana, and the cluster alerting capability makes monitoring proactive.
After the RabbitMQ cluster is operational, backing up the data held within it becomes an important and ongoing administrative task. A backup and restore strategy is required for data security and disaster recovery planning, as well as for tasks such as off-site data analysis or application load testing. VMware provides the Velero plugin for vSphere to back up and restore RabbitMQ deployments on Kubernetes.
Test Tools
We used the following monitoring and benchmarking tools for the functional validation of Tanzu RabbitMQ on Tanzu Kubernetes Grid.
Monitoring Tools
vSAN Performance Service
vSAN Performance Service is used to monitor the performance of the vSAN environment through the vSphere Client. The performance service collects and analyzes performance statistics and displays the data in a graphical format. You can use the performance charts to manage your workload and determine the root cause of problems.
Prometheus and Grafana
Prometheus is an open-source system monitoring and alerting toolkit. It can collect metrics from target clusters at specified intervals, evaluate rule expressions, display the results, and trigger alerts if certain conditions arise.
Grafana is open-source visualization and analytics software. It allows you to query, visualize, alert on, and explore your metrics no matter where they are stored. In other words, Grafana provides you with tools to turn your time-series database (TSDB) data into high-quality graphs and visualizations.
Workload Generation and Testing Tools
RabbitMQ PerfTest
RabbitMQ provides a throughput testing tool, PerfTest, that is based on the Java client and can be configured to simulate basic as well as more advanced workloads. In addition, PerfTest has extra tools that produce HTML graphs of the output.
A RabbitMQ cluster can be limited by several factors, from infrastructure-level constraints (for example, network bandwidth) to RabbitMQ configuration and topology to applications that publish and consume. PerfTest can demonstrate the baseline performance of a node or a cluster of nodes.
Solution Configuration
This section introduces the resources and configurations:
- Architecture diagram
- Hardware resources
- Software resources
Architecture Diagram
In this solution, we deployed the Tanzu RabbitMQ test environment using Tanzu Kubernetes Grid on a 4-node VxRail P570F cluster. We used Tanzu Kubernetes Grid on vSphere, with Ubuntu 20.04 as the Tanzu Kubernetes Grid OVA base image. Figure 1 shows the solution architecture designed to run the Tanzu RabbitMQ cluster on Tanzu Kubernetes Grid.
Figure 1. Tanzu RabbitMQ Running on Tanzu Kubernetes Grid
Perform the following steps for deployment and configuration:
- Deploy the Tanzu Kubernetes Grid management cluster with the installer interface.[1]
- Create a configuration file for the workload cluster and deploy a Tanzu Kubernetes Grid workload cluster from the configuration file.
- Customize settings in the configuration file, such as ‘WORKER_MACHINE_COUNT’ and ‘CONTROL_PLANE_MACHINE_COUNT’ (see the configuration sketch below).
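A minimal sketch of such a workload cluster configuration file, sized to match Table 1 below, is shown here. The cluster name and the vSphere sizing values are illustrative; a real configuration also needs the vSphere placement, network, and credential variables for your environment.

```
# Sketch of a workload cluster configuration file sized to match Table 1.
# CLUSTER_NAME and all values are placeholders; additional VSPHERE_* variables
# (datacenter, network, datastore, credentials) are required in practice.
cat > rabbitmq-workload-cluster.yaml <<'EOF'
CLUSTER_NAME: rabbitmq-workload
CLUSTER_PLAN: prod
CONTROL_PLANE_MACHINE_COUNT: 3
WORKER_MACHINE_COUNT: 7
VSPHERE_CONTROL_PLANE_NUM_CPUS: 4
VSPHERE_CONTROL_PLANE_MEM_MIB: 32768
VSPHERE_CONTROL_PLANE_DISK_GIB: 100
VSPHERE_WORKER_NUM_CPUS: 16
VSPHERE_WORKER_MEM_MIB: 65536
VSPHERE_WORKER_DISK_GIB: 100
EOF

# Deploy the workload cluster from the configuration file with the Tanzu CLI.
tanzu cluster create --file rabbitmq-workload-cluster.yaml
```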
Table 1 shows the VM configuration of the Tanzu Kubernetes Grid management and workload clusters. In the configuration file, we defined the size and the number of Tanzu Kubernetes Grid workers.
Table 1. VMs Configuration
| VM Role | vCPU | Memory (GB) | VM Count |
|---|---|---|---|
| Tanzu Kubernetes Grid management cluster – control plane VM | 2 | 8 | 1 |
| Tanzu Kubernetes Grid management cluster – worker node VM | 2 | 8 | |
| Tanzu Kubernetes Grid cluster (workload cluster) – control plane VM | 4 | 32 | 3 |
| Tanzu Kubernetes Grid cluster (workload cluster) – worker node VM | 16 | 64 | 7 |
A RabbitMQ cluster can be deployed through the RabbitMQ Cluster Kubernetes Operator, which automates the provisioning, management, and operations of RabbitMQ clusters running on Kubernetes and allows specific RabbitMQ configuration options and commands to be managed via the Kubernetes API.
The Tanzu RabbitMQ operator specifies the desired deployment state. It consists of a Custom Resource Definition (CRD) and a custom controller. A custom resource extends the basic capabilities of Kubernetes and can be managed like any other Kubernetes object. A CRD defines these objects in YAML files for Kubernetes to create and watch. You can also install the Tanzu RabbitMQ cluster so that it fetches its image from a local registry, such as Harbor.
Figure 2. How the Tanzu RabbitMQ Operator Reconciles State
After the deployment succeeds, you can view the running RabbitMQ cluster as shown below.
Figure 3. Deploying RabbitMQ
For the RabbitMQ instance, we can customize the configuration for better performance. In this solution, we used a 3-node RabbitMQ cluster for validation. It is recommended to use clusters with an odd number of nodes so that when one node becomes unavailable, the service remains available and a clear majority of nodes can be identified. Each RabbitMQ node requests 4 vCPUs and 8 GB of RAM.
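A minimal sketch of the corresponding RabbitmqCluster custom resource is shown below. The instance name, storage class, and volume size are placeholders; the replica count and resource requests match the values described above.

```
# Sketch of a RabbitmqCluster custom resource matching the sizing above.
# The instance name, storageClassName, and storage size are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
spec:
  replicas: 3
  resources:
    requests:
      cpu: 4
      memory: 8Gi
    limits:
      cpu: 4
      memory: 8Gi
  persistence:
    storageClassName: vsan-rabbitmq   # placeholder vSAN-backed storage class
    storage: 100Gi
EOF
```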
Hardware Resources
In this solution, we used a total of four VxRail P570F nodes. Each server was configured with two disk groups, and each disk group consisted of one cache-tier write-intensive SAS SSD and four capacity-tier read-intensive SAS SSDs. Each VxRail node in the cluster had the configuration as shown in Table 2.
Table 2. Hardware Configuration for VxRail
| Property | Specification |
|---|---|
| VxRail node | VxRail P570F |
| CPU | 2 x Intel(R) Xeon(R) Platinum 8180M CPU @ 2.50GHz, 28 cores each |
| RAM | 512 GB |
| Network adapter | 2 x Broadcom BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller |
| Storage adapter | 1 x Dell HBA330 Adapter |
| Disks | Cache: 2 x 800GB write-intensive SAS SSDs; Capacity: 8 x 3.84TB read-intensive SAS SSDs |
Software Resources
Table 3 shows the software resources used in this solution.
Table 3. Software Resources
| Software | Version | Purpose |
|---|---|---|
| VMware vCenter Server and ESXi | 7.0 Update 2 | VMware vSphere is a suite of products: vCenter Server and ESXi. |
| VMware vSAN | 7.0 Update 2 | vSAN is the storage component that provides low-cost and high-performance next-generation HCI solutions. |
| VMware Tanzu CLI | 1.3.1 | Command line interface that allows deploying CNCF conformant Kubernetes clusters to vSphere and other cloud infrastructure. |
| Tanzu RabbitMQ for Kubernetes | 1.1.1 | Based on the widely popular open source RabbitMQ messaging system, Tanzu RabbitMQ for Kubernetes is designed to work seamlessly with Tanzu and also run on any certified Kubernetes distribution. |
| RabbitMQ PerfTest | v2.15.0 | A throughput testing tool that is based on the Java client and can be configured to simulate basic and more advanced workloads. |
| Cloud Native Runtimes for Tanzu | 1.0.2 | A serverless application runtime for Kubernetes that is based on Knative and runs on a single Kubernetes cluster. |
| Dell EMC VxRail HCI System Software | 7.0.200 | Turnkey hyperconverged infrastructure for hybrid cloud. |
| Kubernetes | v1.20.5 | Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. |
| Kubernetes OVA for Tanzu Kubernetes Grid | Ubuntu 20.04 | A base image template for the Kubernetes operating system of Tanzu Kubernetes Grid management and workload clusters. |
| Velero | 1.5.4 | An open source community standard tool for backing up and restoring Kubernetes cluster objects and persistent volumes. |
Solution Validation
This solution validation showcases Tanzu RabbitMQ running on Tanzu Kubernetes Grid across performance, backup and restore, observability, serverless applications, and resiliency.
The validation covers the following scenarios:
- Load testing: Demonstrates the baseline performance of a cluster of nodes with PerfTest. While PerfTest is running, monitor the various publisher and consumer metrics provided by the management UI and Grafana.
- Backup and restore: Validates how to back up and restore a RabbitMQ deployment on Tanzu Kubernetes Grid using Velero, an open-source Kubernetes backup and restore tool.
- Observability: Shows how to configure Prometheus to monitor the RabbitMQ cluster, including its alerting capabilities.
- Cloud Native Runtimes for Tanzu: Shows how to install Cloud Native Runtimes for Tanzu, create a broker, producer, and consumer that use RabbitMQ, and verify Knative eventing[2] with RabbitMQ.
- Resiliency: Improves the availability and disaster recovery capabilities of your system by implementing an active/passive topology.
Load Testing
The RabbitMQ PerfTest tool, which ships as part of the RabbitMQ Java client library, was used to generate load for the cluster. PerfTest was run simultaneously against a set of RabbitMQ cluster nodes. In this load testing, we provisioned a cluster of pods, each with 4 vCPUs and 8 GB of RAM.
We simulated high loads and then reduced the number of publishers to 60 with a 1 kB message size; 12 classic queues were created and spread across the cluster's nodes, each queue with 5 producers sending messages to it and 5 consumers. The average ingress rate was 202,092 messages per second and the average egress rate was 197,519 messages per second. With 1 quorum queue with 5 producers and 5 consumers, each with its own dedicated connection and channel, 3 queue replicas, and an 8 kB message size, the average rate reached 17,885 messages per second (about 1.5 billion messages per day). The RabbitMQ cluster remained stable after one hour of sustained load, and it stayed stable across repeated performance test runs.
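As a hedged illustration of how such a run can be launched, the sketch below starts PerfTest as a pod inside the workload cluster. The service and secret names follow the Cluster Operator's defaults for an instance named "rabbitmq" and are placeholders here, and the flag values illustrate one workload shape rather than the exact test commands used.

```
# Sketch: run PerfTest as a pod against the RabbitMQ cluster. Service and secret
# names are placeholders; flag values illustrate one workload shape.
username=$(kubectl get secret rabbitmq-default-user -o jsonpath='{.data.username}' | base64 --decode)
password=$(kubectl get secret rabbitmq-default-user -o jsonpath='{.data.password}' | base64 --decode)

kubectl run perf-test --image=pivotalrabbitmq/perf-test -- \
  --uri "amqp://${username}:${password}@rabbitmq" \
  --producers 5 --consumers 5 \
  --queue-pattern 'perf-test-%d' --queue-pattern-from 1 --queue-pattern-to 12 \
  --size 1000 \
  --time 3600
```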
Back up and Restore RabbitMQ Deployments on Tanzu Kubernetes Grid
Velero is an open source community standard tool for backing up and restoring Kubernetes cluster objects and persistent volumes, and it supports a variety of storage providers to store its backups. In this scenario, we used Velero to back up and restore the RabbitMQ deployment in the Tanzu Kubernetes Grid workload cluster:
- Install the object storage plugin. Backups and volume snapshots are saved to the same storage location[3] on the vSphere cluster object storage. This location must be S3-compatible external storage or an S3 provider such as MinIO, so we used MinIO as the object storage. Run the ‘velero install --provider aws --plugins "velero/velero-plugin-for-aws:v1.1.0" --bucket velero --secret-file ./credentials-velero --backup-location-config "region=minio,s3ForcePathStyle=true,s3Url=minio_server_url" --snapshot-location-config region="default"’ command.
- Install the Velero plugin for vSphere. Run the ‘velero plugin add vsphereveleroplugin/velero-plugin-for-vsphere:1.1.0’ command. You can check the status of the Velero plugin after you deploy it.
Figure 4. Installing the Velero Plugin for vSphere
- Back up the RabbitMQ cluster. The backup triggers a snapshot operation for each of the three volumes used by the RabbitMQ cluster.
Figure 5. Backing up the RabbitMQ Cluster
- After the backup is completed, we can restore the RabbitMQ cluster from the backup. When the restore finishes, the download phase changes to Completed; Velero marks the restore as completed when the volume snapshots and other Kubernetes metadata have been successfully restored. The three volume snapshots are downloaded and restored, and for each volume snapshot a CloneFromSnapshot object is created. All the CloneFromSnapshots can be listed in the same namespace as the PVC.
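A minimal sketch of the backup and restore commands is shown below; the backup name and namespace are placeholders for this environment.

```
# Sketch: back up the namespace that holds the RabbitMQ cluster, then restore
# from that backup. The backup name and namespace are placeholders.
velero backup create rabbitmq-backup --include-namespaces rabbitmq-ns
velero backup describe rabbitmq-backup --details

velero restore create --from-backup rabbitmq-backup
velero restore get
```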
Monitoring RabbitMQ with Prometheus Operator and Alerting
For production systems, it is critical to enable RabbitMQ cluster monitoring. The Cluster Operator deploys RabbitMQ clusters with the rabbitmq_prometheus plugin enabled. The plugin exposes a Prometheus-compatible metrics endpoint, and these metrics provide deep insights into the state of RabbitMQ nodes. Collected metrics are not very useful unless they are visualized, so RabbitMQ provides a prebuilt set of Grafana dashboards that visualize a large number of available RabbitMQ and runtime metrics in context-specific ways.
- Install Prometheus and Grafana via the Prometheus Operator, which provides Kubernetes-native deployment and management of Prometheus and related monitoring components. It simplifies and automates the configuration of a Prometheus-based monitoring stack for Kubernetes clusters.
Figure 6. Installing Prometheus and Grafana
- Navigate to http://localhost:3000/dashboards in a web browser. The RabbitMQ overview dashboard shows as follows:
Figure 7. RabbitMQ Overview
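The dashboard in Figure 7 is reached by port-forwarding Grafana locally first; a minimal sketch, assuming Grafana runs as a deployment in a "monitoring" namespace (both names are placeholders):

```
# Forward local port 3000 to the Grafana pod so the dashboards are reachable at
# http://localhost:3000; the namespace and deployment name are placeholders.
kubectl -n monitoring port-forward deployment/grafana 3000:3000
```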
- The Prometheus Operator detects ServiceMonitor and PodMonitor objects and automatically configures and reloads Prometheus’ scrape configuration. Open the Prometheus web UI and navigate to the ‘Status > Targets’ page, where you should see an entry for the Cluster Operator and an entry for each deployed RabbitMQ cluster.
Figure 8. Prometheus Operator
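A minimal, hedged sketch of a PodMonitor that makes the RabbitMQ metrics endpoint discoverable is shown below; the label selector and the port name follow Cluster Operator conventions and should be treated as assumptions for your deployment.

```
# Sketch of a PodMonitor for RabbitMQ pods. The label selector and the
# "prometheus" port name (15692, exposed by the rabbitmq_prometheus plugin)
# follow Cluster Operator conventions and are assumptions for this sketch.
kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: rabbitmq
spec:
  podMetricsEndpoints:
  - port: prometheus
  selector:
    matchLabels:
      app.kubernetes.io/component: rabbitmq
  namespaceSelector:
    any: true
EOF
```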
- To trigger the NoMajorityOfNodesReady alert, we stopped the RabbitMQ application on two out of three nodes. Within 2 minutes, two out of three RabbitMQ nodes are shown as not Ready.
Figure 9. Shutdown RabbitMQ Application
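A hedged sketch of stopping the RabbitMQ application on individual nodes is shown below; the pod names follow the Cluster Operator's `<instance>-server-N` convention for an instance named "rabbitmq" and are placeholders.

```
# Stop the RabbitMQ application (not the pod) on two of the three nodes to
# remove the quorum; pod names are placeholders for an instance named "rabbitmq".
kubectl exec rabbitmq-server-1 -- rabbitmqctl stop_app
kubectl exec rabbitmq-server-2 -- rabbitmqctl stop_app
```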
- To see the NoMajorityOfNodesReady alert triggered in Prometheus, we opened the Prometheus UI in a browser at http://localhost:9090/alerts. We forwarded local port 9090 to the Prometheus port 9090 running inside Kubernetes and checked the alert state (firing). Prometheus sends the alert to Alertmanager, and the alert also appears on the RabbitMQ-Alerts Grafana dashboard.
Figure 10. NoMajorityOfNodesReady Alert
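The port forwarding used above can be done as in the following sketch; the namespace is a placeholder, and the `prometheus-operated` service name assumes the default service created by the Prometheus Operator.

```
# Forward local port 9090 to the Prometheus service created by the Prometheus
# Operator; the namespace is a placeholder. Alerts are then visible at
# http://localhost:9090/alerts.
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090
```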
Cloud Native Runtimes for Tanzu
Cloud Native Runtimes (CNR) for Tanzu is a serverless application runtime for Kubernetes that is based on Knative and runs on a single Kubernetes cluster. You can integrate Cloud Native Runtimes with RabbitMQ as an event source to react to messages sent to a RabbitMQ exchange, or as an event broker to distribute events within your app. CNR supports event-driven apps by using Knative.
For example, you can configure a RabbitMQ event source on Cloud Native Runtimes to generate a ‘Hello world’ event once a minute and forward that event for trigger matching, and then see the event in the consumer logs.
To verify Knative eventing, create a broker, producer, trigger, and consumer in the new namespace and check that the event appears in the consumer logs:
Figure 11. Events are Shown on the Consumer Logs
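A hedged sketch of these objects is shown below. The broker class annotation and the Broker's config reference follow the eventing-rabbitmq integration, and all names and the consumer service are placeholders; the exact schema can differ between Cloud Native Runtimes releases.

```
# Sketch: a RabbitMQ-backed Broker, a PingSource that emits "Hello world" once a
# minute, and a Trigger that delivers the events to a consumer. All names are
# placeholders; the Broker's spec.config reference depends on the
# eventing-rabbitmq version shipped with Cloud Native Runtimes.
kubectl apply -f - <<'EOF'
apiVersion: eventing.knative.dev/v1
kind: Broker
metadata:
  name: default
  annotations:
    eventing.knative.dev/broker.class: RabbitMQBroker
spec:
  config:
    apiVersion: rabbitmq.com/v1beta1
    kind: RabbitmqCluster
    name: rabbitmq
---
apiVersion: sources.knative.dev/v1beta2
kind: PingSource
metadata:
  name: hello-ping
spec:
  schedule: "*/1 * * * *"
  contentType: "application/json"
  data: '{"message": "Hello world"}'
  sink:
    ref:
      apiVersion: eventing.knative.dev/v1
      kind: Broker
      name: default
---
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: hello-trigger
spec:
  broker: default
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: event-consumer
EOF
```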
Tanzu RabbitMQ Active/Passive
To improve the availability and disaster recovery capabilities of a system running in your data center, you can implement an active/passive topology across two data centers. Tanzu RabbitMQ for Kubernetes provides a schema synchronization plugin: every queue and exchange created on the active site is replicated to the standby site. Perform the following procedures to deploy the active/passive topology:
- A developer or operator configures which queues are replicated by creating the appropriate replication policy (see the sketch after Figure 12). Every queue matched by at least one replication policy is automatically replicated.
- Messages sent to a replicated queue and acknowledged are recorded.
- When the standby cluster connects to the active cluster, it downloads all the recorded messages and acknowledgments.
- When the standby cluster is promoted to active, all unacknowledged messages are restored on their corresponding queue and the Advanced Message Queuing Protocol (AMQP) traffic is restored.
Figure 12. Deploying Active/Passive Topology
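The first step, defining which queues are replicated, might look like the following hypothetical sketch. The pod name and queue pattern are placeholders, and the policy definition key is an assumption; consult the Tanzu RabbitMQ standby-replication documentation for the exact key expected by the plugin.

```
# Hypothetical sketch of a replication policy on the active site: every queue
# whose name starts with "orders." is marked for replication to the standby
# site. The "remote-dc-replicate" definition key is an assumption; check the
# Tanzu RabbitMQ standby-replication documentation for the exact key.
kubectl exec rabbitmq-active-server-0 -- \
  rabbitmqctl set_policy --apply-to queues \
  replicate-orders "^orders\." '{"remote-dc-replicate": true}'
```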
Production Criteria Recommendations
The following recommendations provide the best practices and sizing guidance to run Tanzu RabbitMQ clusters on Tanzu Kubernetes Grid.
- Tanzu Kubernetes Grid:
- Use the Ubuntu 20.04 image to create the Tanzu Kubernetes Grid management and workload clusters that run the Tanzu RabbitMQ cluster.
- Create the Tanzu Kubernetes cluster with a minimum of 100 GB disk size for each worker node.
- Customize and pre-allocate enough CPU and memory resources for the Tanzu Kubernetes Cluster. Refer to Performance Best Practices for Kubernetes with VMware Tanzu for sizing guidance for Tanzu Kubernetes Grid.
- vSAN Storage:
- vSAN supports dynamic persistent volume provisioning with Storage Policy Based Management (SPBM); see the StorageClass sketch after this list.
- Failures to Tolerate (FTT) is recommended to be set to 1 failure - RAID-1 (Mirroring).
- Enable vSAN TRIM/UNMAP to allow space reclamation for persistent volumes.
- Tanzu RabbitMQ for Kubernetes:
- It is recommended to use clusters with an odd number of nodes (3, 5, 7, and so on) so that when one node becomes unavailable, the service remains available and a clear majority of nodes can be identified.
- Some plug-ins might consume a lot of CPU or use a high amount of RAM, so it is not recommended to enable all plug-ins for a production server. Disable plug-ins that are not in use.
- Use the latest stable RabbitMQ and Erlang version.
- Classic queues are faster than replicated (mirrored or quorum) queues, but when you need data safety and high availability, choose quorum queues. The larger the replication factor, the slower the queue.
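The vSAN recommendation above can be expressed as a StorageClass bound to an SPBM policy; a minimal sketch follows, with the class and policy names as placeholders.

```
# Sketch: a StorageClass backed by a vSAN SPBM policy. Persistent volumes for
# RabbitMQ provisioned from this class inherit the policy's FTT=1 / RAID-1
# protection. The class and policy names are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vsan-rabbitmq
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: "RabbitMQ-FTT1-RAID1"
EOF
```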
Conclusion
Running Tanzu RabbitMQ on Tanzu Kubernetes Grid on VxRail is a simple and fast way to get started with modernized workloads running on Kubernetes. It allows organizations to run modern containerized workloads on existing IT infrastructure and processes, while developers innovate and build with the agility of Kubernetes and IT administrators manage secure workloads in their familiar vSphere environment.
In this solution, we deployed Tanzu RabbitMQ clusters on Tanzu Kubernetes Grid, which provides simplified operation of cloud native workloads and can scale without compromise. IT administrators can implement policies for namespaces and manage access and quota allocation for application-focused management. This helps build a developer-ready infrastructure based on enterprise-grade Kubernetes with advanced governance, reliability, and security. Tanzu RabbitMQ for Kubernetes provides monitoring and management plugins to observe and optimize clusters and messaging topologies with predefined metrics and dashboards. Running Tanzu RabbitMQ on Tanzu Kubernetes Grid provides automated deployment, full observability, and fast time to recovery; thus, the solution increases your business continuity and security in any environment.
References
About the Author
Yimeng Liu, Solutions Architect in the Solutions Architecture team of the Cloud Platform Business Unit, wrote the original version of this paper. The following reviewers also contributed to the paper contents:
- Vic Dery, Sr. Principal Engineer of VxRail Technical Marketing in Dell EMC
- Kathleen Cintorrino, Senior Advisor Product Marketing for HCI and CI in Dell EMC
- Jason Marques, Sr. Principal Engineer of VxRail Technical Marketing in Dell EMC
- Wayne Lund, Advisory Solutions Engineer in VMware
- Ed ByFord, Manager of Tanzu Product Management in VMware
- Yaron Parasol, Senior Product Manager of Tanzu Product Management in VMware
- Ka Kit Wong, Staff Solutions Architect of the Cloud Infrastructure Big Group in VMware
- Myles Gray, Staff Technical Marketing Architect of the Cloud Platform Business Unit in VMware
- Mark Xu, Solutions Architect of the Cloud Infrastructure Big Group in VMware
[1] If you select Development, the installer deploys a management cluster with a single control plane node. If you select Production, the installer deploys a highly available management cluster with three control plane nodes.
[2] Note: The Knative eventing functionality is in beta. VMware does not recommend using Knative eventing functionality in a production environment.
[3] We saved the backups and volume snapshots to the same storage location for the convenience of this function testing. In production, it is recommended to save the backup data elsewhere.