March 22, 2023

Enhanced Performance Diagnostics tools for vSAN 8 U1

Customer satisfaction is paramount for VMware™.  VMware vSAN™ 8 U1 introduces three improvements to help ensure important system data is readily available when troubleshooting issues.  First, we improved the usability of our “Performance for Support” dashboard, which helps show a wide variety of metrics quickly across hosts.  Next, we’ve improved the space allocation for trace objects, the location where we store trace files, so that VMware’s technical support teams can more accurately determine from the logs what occurred in a cluster.  And finally, we improved our network testing and health checks to provide more helpful information for both vSAN OSA and ESA cluster types.

 

Performance for Support

The “Performance for Support” view was designed primarily as a tool that exposes deeper-level metrics for troubleshooting the performance and stability of a vSAN cluster.  Since its debut in vSAN 6.6.1, where it was intended to replace vSAN Observer, it has been used by GS and customers alike to view performance statistics easily across hosts in a cluster, or to view helpful metrics like vSAN memory and CPU consumption.  Feedback from both customers and GS indicated that improvements to this capability would be helpful.


 

vSAN 8 U1 addresses the issues raised and brings the following improvements:

  • Improved performance of the UI.  The default display renders statistics for many hosts at the same time, and some of this information was previously loaded serially.
  • Easier navigation.  UI enhancements help users navigate the metrics and hosts in a more intuitive manner.

Trace Objects

Performance diagnostics are captured as traces in a trace file, and are used by our technical support teams (SREs in GS) to diagnose issues.  Previously, the space allocated for traces was relatively small and generally not resilient, as it depended on the host where the traces were collected.  This was made worse on hosts configured with only USB or SD cards for boot media, where the traces needed to be stored on a non-persistent RAM disk.  These limitations made trace files of little use if the file resided on a host that was offline, or if the host ran out of capacity to persist the relevant data.  The inability to store enough trace data in a resilient, durable way has had a negative impact on root-cause analysis efforts and prescriptive resolutions by GS, and on the overall satisfaction of customers experiencing issues.

In vSAN 8 U1, vSAN can store more trace log data, and do so resiliently.  While it still collects trace files in a local log location, it now periodically copies the trace file information into a dedicated object.  The new approach is only used if the “/vsantraces” log location is left unchanged (indicating that it is NOT using external storage).  If a custom path has already been configured, vSAN will respect it.

  • The trace data is replicated, or backed up, to a dedicated object on the vSAN datastore at 1-minute intervals.
  • The trace data replicated to the datastore is automatically purged after 6 days of storage, on a best-effort basis.
  • The limit for replicated trace log storage is now 512GB, which should be sufficient to store trace logs for a 32-host cluster for up to about 6 days.
  • This feature uses the new custom namespace object capability also introduced in vSAN 8 U1.
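As a rough sanity check on the sizing figures above, the per-host budget implied by a 512GB limit, a 32-host cluster and about 6 days of retention can be worked out directly.  This is a back-of-the-envelope sketch only: it assumes the capacity is shared evenly across hosts and over time, and that 1GB is 1024MB.  The function name is illustrative and is not part of any vSAN API.

```python
def per_host_trace_budget(limit_gb: float, hosts: int, retention_days: float) -> dict:
    """Split a cluster-wide trace limit evenly across hosts and retention days.

    Assumption: traces are generated at an even rate on every host, which
    real clusters will not do exactly.
    """
    if hosts <= 0 or retention_days <= 0:
        raise ValueError("hosts and retention_days must be positive")

    per_host_gb = limit_gb / hosts
    per_host_per_day_gb = per_host_gb / retention_days
    # Assumption: 1 GB = 1024 MB; there are 24 * 60 one-minute intervals per day.
    per_host_per_minute_mb = per_host_per_day_gb * 1024 / (24 * 60)

    return {
        "per_host_gb": per_host_gb,
        "per_host_per_day_gb": per_host_per_day_gb,
        "per_host_per_minute_mb": per_host_per_minute_mb,
    }


budget = per_host_trace_budget(limit_gb=512, hosts=32, retention_days=6)
for name, value in budget.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions each host gets 16GB in total, or a little under 3GB per day, which gives a feel for how much trace data each 1-minute replication interval can carry before the limit is reached.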

Network Testing and Health Check Improvements

The vSAN health UI now offers both a list and a tile view.


 

As before, the "Ask VMware" link will direct you to a relevant KB.  Remediation for a number of health checks can be automated from within the vSAN Skyline Health UI without the need for tedious manual steps.

Tile View

This view includes not only recommended fixes, but also alternative fixes that may be needed when operating in air-gapped environments.


The vSAN “Proactive Tests” (found in Monitor > vSAN > Proactive Tests) are often used to determine if there are systemic connectivity issues between hosts.  Given the introduction of the ESA and its different network requirements, when the test is run on an ESA cluster it no longer uses a target NIC speed, and simply attempts to reach, and then reports, its maximum throughput.  This helps eliminate potential confusion about what the network is capable of.

The test also has a new pop-up warning users that running the test may impact the current workloads, giving the administrator an opportunity to reconsider if the test is on a cluster running production workloads.


 

The test also gives much more information in the “Diagnose the issue” area of the screen if the test results in any warnings or errors (any state other than “passed”).  The receiving throughput of the networking test is now reported, allowing the operator to make an objective judgement as to whether the bandwidth will be sufficient, or is close enough to the wire speed of the link.
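One simple way to turn the reported receiving throughput into an objective judgement is to express it as a fraction of the link's wire speed.  The sketch below is illustrative only: the function names and the 90% threshold are assumptions made for this example, not values used by the vSAN proactive test itself.

```python
def wire_speed_ratio(received_gbps: float, link_gbps: float) -> float:
    """Return the received throughput as a fraction of the link's wire speed."""
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    return received_gbps / link_gbps


def close_to_wire_speed(received_gbps: float, link_gbps: float,
                        min_ratio: float = 0.9) -> bool:
    """Judge whether throughput is 'close enough' to wire speed.

    min_ratio is a hypothetical threshold chosen for illustration; pick a
    value that matches your own expectations for the environment.
    """
    return wire_speed_ratio(received_gbps, link_gbps) >= min_ratio


# Example: a host receiving 23.5 Gbps on a 25 Gbps link.
ratio = wire_speed_ratio(23.5, 25.0)
print(f"{ratio:.0%} of wire speed, acceptable: {close_to_wire_speed(23.5, 25.0)}")
```

Comparing a ratio rather than a raw number keeps the judgement consistent across clusters that use different link speeds.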


The Skyline Health for vSAN “Network latency check” simplifies the results of ping tests across the hosts comprising a cluster.  An overview is now reported unless an individual host is experiencing high latency, in which case that host is broken out as a discrete "failed networking checks" item.  In addition, the Skyline Health for vSAN “MTU check (Ping with large packet size)” will now render the actual packet size tested (e.g. 9,000).
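When reproducing a large-packet ping by hand, it helps to remember that the ping payload must be smaller than the MTU, because the IP and ICMP headers also count against it.  The sketch below shows that arithmetic for IPv4 without IP options (a 20-byte IP header plus an 8-byte ICMP header).  How the health check derives the size it displays is not described here, so treat this purely as a companion calculation; the function name is illustrative.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_icmp_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError("MTU is too small to carry an ICMP echo request")
    return payload


for mtu in (1500, 9000):
    print(f"MTU {mtu}: max ICMP payload {max_icmp_payload(mtu)} bytes")
```

For a 9,000-byte MTU this gives a payload of 8,972 bytes, and for a standard 1,500-byte MTU it gives 1,472 bytes.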


 

 

Low-level operation and support improvements are critical to preventing outages and delivering optimal performance from a vSAN cluster.  Combined with other monitoring improvements in vSAN 8 Update 1, this release improves upon the best-in-class HCI operational experience that vSAN delivers.
