High Resolution Performance Monitoring in vSAN 8 U1
Time-based performance metrics are an important way to gauge the amount of activity occurring on a VM, a group of VMs, hosts, or an entire cluster, and to determine whether the environment is delivering the expected performance levels. vCenter Server, augmented by the vSAN Performance Service, has always offered the ideal monitoring plane for environments powered by vSphere and vSAN, as it measures the right data in the right way from the right location: the hypervisor.
vSAN 8 U1 introduces enhancements that give our customers higher-resolution time-based performance metrics for better monitoring and improved troubleshooting. These enhancements apply to both the Express Storage Architecture (ESA) and the Original Storage Architecture (OSA).
Increased Sampling Rate for Improved Performance Monitoring
The most common method of time-based measurement of systems is to use sampling intervals. Using counters, metric collection processes determine how much activity occurred over a given sampling interval, and display it as a single value that represents the average of the activity measured for that period. These contiguous samples are collected and rendered on a screen as a line, representing the activity over time.
Historically, the vSAN Performance Service collected and rendered performance metrics at a 5-minute rate, or interval. A 5-minute sampling rate was used so that vSAN could collect a broad range of performance metrics for easy viewing of larger time windows, such as 8 hours or 24 hours. This longer sampling interval also allowed vSAN to keep data processing and storage of those metrics to a minimum. While this approach worked well for general monitoring, it wasn’t ideal for viewing I/O activity in shorter time windows such as 1 hour, as it tended to mask extreme changes in activity within a single sampling period. In other words, it would sometimes smooth out bursts of I/O activity or latency.
The vSAN Performance Service in vSAN 8 U1 supplements the 5-minute sampling interval with a "real-time" time range option in the user interface. When this option is selected, performance metrics are collected and rendered at a much shorter 30-second sampling interval. This shorter interval produces time-based performance graphs that are more representative of system behavior, catching bursts of I/O or latency that would otherwise be smoothed out over a larger sampling period.
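To make the smoothing effect concrete, here is a minimal Python sketch. It does not reflect how the vSAN Performance Service is implemented; the workload values and the helper name `average_over_intervals` are hypothetical, chosen only to illustrate how averaging over a longer interval can hide a short burst.

```python
from statistics import mean


def average_over_intervals(per_second_values, interval_seconds):
    """Average per-second values over contiguous, fixed-length sampling intervals."""
    return [
        mean(per_second_values[start:start + interval_seconds])
        for start in range(0, len(per_second_values), interval_seconds)
    ]


# Hypothetical workload: 10 minutes at a steady 100 IOPS,
# with a single 30-second burst of 5,000 IOPS starting at the 2-minute mark.
iops = [100] * 600
iops[120:150] = [5000] * 30

five_minute = average_over_intervals(iops, 300)   # 2 samples
thirty_second = average_over_intervals(iops, 30)  # 20 samples

print(max(five_minute))    # 590
print(max(thirty_second))  # 5000
```

In this made-up trace, the 5-minute samples peak at 590 IOPS, while the 30-second samples show the full 5,000 IOPS burst. Both results are correct averages for their interval; the shorter interval simply leaves less room for the burst to be diluted.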
Let’s look at a simple example of a small test lab with activity generated from a single VM.
In the example shown in Figure 1, we see I/O activity viewed at a 1-hour time window and collected using the 5-minute sampling interval used in past versions of vSAN.
Figure 1. 5-minute sampling interval in vSAN 8 U1
In the example shown in Figure 2, we see the very same I/O activity from the VM used in the first example, at the same 1-hour time window, but collected and rendered at a 30-second sampling interval.
Figure 2. 30-second sampling interval in vSAN 8 U1
Comparing the two examples clearly shows how the 30-second sampling interval offers much more detail than the 5-minute sampling interval. It yields a “higher resolution” of performance metrics that allows administrators to see behavior that is not visible with the longer sampling interval. Both are equally accurate for how the measurement is defined, but the additional data points from a shorter sampling interval produce a graph that is much more representative of the rapidly changing behavior common with workloads. This can be especially useful for identifying and troubleshooting anomalies.
Accommodations and Limitations
Since this type of data collection will consume more resources, we've made a few changes to vSAN to help accommodate this capability.
- Different retention periods. The metrics collected at this much higher sampling rate will be retained for 7 days, and visible in the vSphere Client for the past hour. Metrics collected at the 5-minute interval will be handled the same as in past versions: retained for about 90 days, depending on circumstances.
- More capacity. We've increased the maximum amount of data that the statsDB object can store from 255GB to 512GB. The limit is adjusted automatically when upgrading to vSAN 8 U1.
- Separate, purpose-built processes. Each host will use a purpose-built collector to collect its high-resolution metrics, and store them in the same statsDB object as the 5-minute metrics collected by the primary collector. Only the host and device-level metrics will be stored at this higher sampling rate. Other high-resolution metrics, like VM-level performance data, will not be saved to the statsDB object.
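As a rough back-of-the-envelope check on the retention figures above, the sketch below counts how many samples each retention window holds for a single metric. The function name `samples_retained` is hypothetical, and 90 days is treated as exact here even though actual retention is approximate.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_retained(retention_days, interval_seconds):
    """Number of samples a single metric accumulates over the retention window."""
    return retention_days * SECONDS_PER_DAY // interval_seconds


# 30-second samples retained for 7 days
high_resolution = samples_retained(7, 30)

# 5-minute samples retained for roughly 90 days (assumed exact for this estimate)
standard = samples_retained(90, 300)

# Data points drawn in a 1-hour window at each interval
points_per_hour_30s = 3600 // 30
points_per_hour_5min = 3600 // 300

print(high_resolution, standard)                  # 20160 25920
print(points_per_hour_30s, points_per_hour_5min)  # 120 12
```

Under these assumptions, one week of 30-second samples amounts to nearly as many data points per metric as 90 days of 5-minute samples, which helps explain why the shorter retention period and the scoping to host and device-level metrics go together.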
The 30-second sampling rate will not be available when viewing the performance of a discrete VM or its VMDKs. Aggregate VM performance statistics using the new 30-second “real-time” sampling interval can be found at the host level and the cluster level. If you would like to capture highly precise I/O metrics for one or more VMs, VMware IOInsight offers a detailed collection of VM metrics. It can be found in vCenter Server by highlighting the cluster, clicking on Monitor > vSAN > Performance, and selecting IOInsight.
Summary
Performance metrics that use shorter sampling intervals provide a better representation of actual VM and system behavior. The new high-resolution, 30-second sampling interval in vSAN 8 U1 gives you better clarity on the activity in your environment for more informed decision making.