vSAN Health Check Correlation
Skyline Health has been instrumental in helping customers triage vSphere/vSAN environments for configuration consistency, best practice information, vulnerability discovery, and proactive analytics. Over the years, VMware has demonstrated a commitment to simplifying operations by significantly enhancing the features and capabilities of health checks and guidance. The improvements to Skyline Health in vSAN 7 Update 3 allow administrators to find the root cause of an issue more quickly, shortening the time to resolution.
In vSAN 7 Update 2, VMware introduced Health Check History, which allows administrators to view health issues by date. The health check history can be viewed in a timeline to give a better understanding of the previous states of an alarm. The ability to specify a date or date range when viewing health events can be very helpful in understanding and resolving issues.
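To make the timeline idea concrete, here is a minimal sketch in Python of how a date-range view of one alarm's previous states could be derived from recorded samples. This is not the vSAN API: the `HealthSample` record, the `timeline` function, the check name, and the sample data are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


# Hypothetical record shape for illustration; not the vSAN data model.
@dataclass(frozen=True)
class HealthSample:
    check: str
    status: str          # "green", "yellow", or "red"
    timestamp: datetime


def timeline(samples, check, start, end):
    """Return the state changes of one check within [start, end], oldest first."""
    in_range = sorted(
        (s for s in samples if s.check == check and start <= s.timestamp <= end),
        key=lambda s: s.timestamp,
    )
    changes = []
    for sample in in_range:
        # Keep a sample only when its status differs from the previous one kept.
        if not changes or changes[-1].status != sample.status:
            changes.append(sample)
    return changes


# Made-up sample data, purely to exercise the function.
samples = [
    HealthSample("cluster partition", "green", datetime(2021, 10, 1, 8, 0)),
    HealthSample("cluster partition", "green", datetime(2021, 10, 1, 9, 0)),
    HealthSample("cluster partition", "red", datetime(2021, 10, 1, 10, 0)),
    HealthSample("cluster partition", "red", datetime(2021, 10, 1, 11, 0)),
    HealthSample("cluster partition", "green", datetime(2021, 10, 2, 8, 0)),
]

history = timeline(
    samples,
    "cluster partition",
    start=datetime(2021, 10, 1),
    end=datetime(2021, 10, 1, 23, 59),
)
for entry in history:
    print(entry.timestamp.isoformat(), entry.status)
```

Collapsing repeated samples into state changes is one possible design choice; it keeps the timeline focused on when an alarm changed state rather than on every polling interval.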
Skyline Health plays a critical role in daily SDDC operations: it monitors an entire cluster and can proactively trigger vCenter alarms once an issue is detected. Administrators use these alarms to troubleshoot and resolve potential issues. However, there are scenarios where one fundamental issue causes several health alarms to go red. Looking at the health page, it can be challenging to determine which alarm should be addressed first, and when multiple health warnings share a single root cause, troubleshooting becomes more complicated.
vSAN 7 U3 introduces a correlation between many of the health checks provided in Skyline Health. Health check correlation establishes a relationship or a dependency between some health checks and others. This relationship allows administrators to determine and address the root cause, thereby eliminating many of the other health check alerts, which may simply have been symptoms of the primary cause. This functionality is available via API as well, so it can be used in solutions such as Aria Operations.
There are two places to view correlated health checks: by clicking on an individual error, or in the overview section. In the overview section illustrated above, a given health check is labeled as the "Primary Issue", with a list of "Likely Impacts" showing the triggered health checks that have a relationship to this primary issue. In this example, there is a misconfigured vmknic causing several errors and warnings. The vSAN health check determined that the "Primary Issue" is related to the vSAN vmknic configuration, and that the "Likely Impacts" are a cluster partition and a failed vSAN health status. Since the primary issue is the misconfigured vmknic, addressing it will resolve the other issues.
Also new in vSAN 7 U3 is a notification of the "Primary Issue", shown at the top of the page when clicking on an individual error (see below).
Clicking on the link in the notification shows that vSAN has identified 10.198.17.63 as the host with the misconfigured vmknic.
In this example, several errors and warnings all resulted from one misconfigured vmknic. Identifying this as the root cause shortened the time needed to triage the issue, and making the necessary changes resolved all other issues in the cluster.
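The relationship in this example can be sketched in a few lines of Python. This is only an illustration of the idea of grouping triggered checks into a primary issue and its likely impacts; it is not how vSAN implements correlation, and it does not use the vSAN API. The `IMPACTS_OF` map, the `correlate` function, and the short check labels are assumptions made for the sketch, with the single dependency taken from the vmknic example above.

```python
# Hypothetical dependency map: each key is a check that can be a root cause,
# and its value lists checks that may fail as a consequence. The single entry
# mirrors the vmknic example; the real rule set is defined by vSAN itself.
IMPACTS_OF = {
    "vmknic configuration": ["cluster partition", "vSAN health status"],
}


def correlate(triggered, impacts_of):
    """Group triggered checks into primary issues and their likely impacts."""
    triggered = set(triggered)
    # A triggered check counts as a symptom when another triggered check
    # lists it as one of its direct impacts.
    symptoms = {
        impact
        for cause in triggered
        for impact in impacts_of.get(cause, [])
        if impact in triggered
    }
    report = {}
    # Whatever is triggered but not a symptom is treated as a primary issue.
    for check in sorted(triggered - symptoms):
        report[check] = [i for i in impacts_of.get(check, []) if i in triggered]
    return report


triggered = ["cluster partition", "vmknic configuration", "vSAN health status"]
report = correlate(triggered, IMPACTS_OF)
for primary, impacts in report.items():
    print(f"Primary issue: {primary}")
    for impact in impacts:
        print(f"  Likely impact: {impact}")
```

Note that the sketch follows only direct dependencies. A check that fires with no triggered cause behind it, such as a cluster partition on its own, is reported as a primary issue with no likely impacts.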
The Skyline Health checks for a vSAN cluster form an extensive list of detailed checks that catch the most common issues, such as misconfigurations and failures. New in vSAN 7 U3 is the ability to understand the relationship of one health check to another for fast and effective troubleshooting. Understanding the relationships between multiple triggered health checks allows administrators to address the root cause more quickly, which may correct most or all of the triggered alerts. For more information on Skyline Health and all aspects of vSAN, go to core.vmware.com.