
Solution

  • Lifecycle Management

Category

  • VCF Operational Guidance

Product

  • Cloud Foundation

Preparing for a VMware Cloud Foundation BOM upgrade

 

VMware Cloud Foundation upgrades can be quite large in terms of the Bill of Materials (BOM) that needs upgrading. A common question I get asked is what a VCF administrator can do to make sure the environment is healthy, so that an upgrade through SDDC Manager's LCM service has the best chance of completing successfully.

 

My normal recommendations are the following:

  • Collect a log support bundle for each Workload Domain prior to the upgrade attempt.
  • Run the SoS health check.
  • Run the LCM pre-check in the UI for each Workload Domain.
  • Run a password sanity check against all SDDC Manager managed components.
  • Use the vCenter Web Client to review any alerts.
  • Check the Release Notes.

 

Below I demonstrate these steps and outline the importance of each one in the health validation.

Disclaimer: This is a nested lab and some of the warnings and errors reported are expected in this type of environment. All of the information below pertains to VMware Cloud Foundation 3.10.

 

Collecting support bundles

 

Firstly, gathering an SoS bundle may sound like overkill given the sheer amount of logging that might need to be collected, depending on the size of the environment. It requires a judgement call on whether this is feasible within a specific maintenance window.

 

With that said, capturing data before a major upgrade can be crucial for root cause analysis. Some VMware appliance upgrades, such as vCenter Server, SDDC Manager and vRSLCM, are "migration upgrades" to a new appliance, and the logs on the older appliance may not be imported. Here's where logging prior to the upgrade comes in handy:

  • Unexpected behaviours / errors after a successful upgrade.
  • A failed upgrade attempt that leaves the environment in an "unrecoverable" state requiring GSS assistance.

Capturing logs from all of the managed components in VMware Cloud Foundation can be done through the SDDC Manager command line "Supportability and Serviceability (SoS)" tool.

 

Connect to the SDDC Manager VM through SSH and run the command below, changing the "Domain Name" and directory as appropriate:

sudo /opt/vmware/sddc-support/sos --domain-name {Domain Name} --log-dir {directory}
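
If you have several Workload Domains, the same command can be wrapped in a small loop. The following is only a sketch of how that might look: the function names, the DRY_RUN switch and the one-sub-directory-per-domain layout are my own assumptions, and only the sos path and its two flags come from the command above.

```shell
# Sketch: collect one SoS log bundle per workload domain.
# SOS_BIN and the two flags come from the article's command; the helper
# names and the per-domain sub-directory layout are hypothetical.
SOS_BIN=/opt/vmware/sddc-support/sos

# Print the SoS command for one domain without running it.
sos_bundle_cmd() {
    domain=$1
    log_root=$2
    printf 'sudo %s --domain-name %s --log-dir %s/%s\n' \
        "$SOS_BIN" "$domain" "$log_root" "$domain"
}

# collect_bundles LOG_ROOT DOMAIN...
# With DRY_RUN=1 (the default) the commands are only printed.
# Domain names containing spaces or shell metacharacters would need
# extra quoting before being run this way.
collect_bundles() {
    log_root=$1
    shift
    for domain in "$@"; do
        cmd=$(sos_bundle_cmd "$domain" "$log_root")
        if [ "${DRY_RUN:-1}" = "1" ]; then
            printf '%s\n' "$cmd"
        else
            sh -c "$cmd"
        fi
    done
}
```

For example, `collect_bundles /tmp/sos-logs MGMT WLD01` (both domain names are placeholders) prints the two commands for review, and setting DRY_RUN=0 runs them.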

Note: Once all the logs have been captured, I suggest copying them off the SDDC Manager filesystem. If SDDC Manager itself is upgraded as a migration upgrade as part of the BOM upgrade, these logs may not be imported.
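
One way to make that copy easier is to pack the log directory into a single archive first. This is a minimal sketch, assuming tar is available on the appliance; the function name and any paths passed to it are hypothetical.

```shell
# Sketch: pack a SoS log directory into one .tar.gz so it is easy to copy off.
# archive_logs LOG_DIR OUT_DIR -> prints the path of the archive it created.
archive_logs() {
    log_dir=${1%/}                      # strip one trailing slash, if any
    out_dir=$2
    name=$(basename "$log_dir")
    archive="$out_dir/$name.tar.gz"
    # -C keeps the paths inside the archive relative to the parent directory.
    tar -czf "$archive" -C "$(dirname "$log_dir")" "$name" &&
        printf '%s\n' "$archive"
}
```

The resulting file can then be copied to another machine with a tool such as scp.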

 

Refer to these logs if there is a failure, or provide them as part of your Support Request to GSS.

SoS Health Check

 

If the SSH session to the SDDC Manager VM is still open, now would be a good time to review the SoS health check output:

sudo /opt/vmware/sddc-support/sos --health-check

This can be crucial in finding environmental issues with the deployment and can assist you in remediating them before an upgrade. Below I have provided example outputs from my lab.

 

The DNS validation ensures the components are resolvable by SDDC Manager. One issue to note here is the potential for the case of an FQDN to be mismatched across the environment. While SDDC Manager itself is not case-sensitive, other VMware products in the deployment can be. vCenter, for example, sets all host entries to lower-case when they are added to the inventory. Best practice is therefore to have all FQDNs in lower-case.
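
A quick way to spot mixed-case names before they cause trouble is to run your list of FQDNs through a small filter. This is a sketch of my own and not part of the SoS tool: the function names are hypothetical, and it assumes you already have the FQDNs in a text file, one per line.

```shell
# Return 0 when the given FQDN contains no upper-case characters.
is_lowercase_fqdn() {
    lowered=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    [ "$1" = "$lowered" ]
}

# Read FQDNs from stdin (one per line) and print those that are not
# fully lower-case. Empty lines are skipped.
report_mixed_case() {
    while IFS= read -r fqdn; do
        [ -n "$fqdn" ] || continue
        is_lowercase_fqdn "$fqdn" || printf '%s\n' "$fqdn"
    done
}
```

For example, `report_mixed_case < fqdns.txt` (a hypothetical file name) lists only the entries that would need correcting.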

NTP connectivity and status are checked against the ESXi hosts and the management VMs. NTP is one of those configurations that gets forgotten about when troubleshooting failures. Time drift between any of the managed components can lead to communication and connectivity issues.

Certificate expiration is checked for the managed components. Similar to NTP, if any of the certificates have expired, communication will be affected.
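
For a component you want to spot-check by hand, openssl can report whether a certificate is close to expiry. This is a minimal sketch, assuming openssl is installed where you run it and that the component serves its certificate on the port you pass in; the function names and the 30-day threshold in the usage note are my own choices, not something SoS uses.

```shell
# Convert a number of days to seconds for openssl's -checkend option.
days_to_seconds() {
    echo $(( $1 * 86400 ))
}

# cert_valid_for_days HOST PORT DAYS
# Returns 0 if the certificate served at HOST:PORT will still be valid
# DAYS from now. A non-zero status means the certificate expires within
# that window OR the certificate could not be fetched at all, so check
# connectivity separately before trusting a failure.
cert_valid_for_days() {
    host=$1
    port=$2
    days=$3
    openssl s_client -connect "$host:$port" -servername "$host" \
        </dev/null 2>/dev/null |
        openssl x509 -noout -checkend "$(days_to_seconds "$days")" >/dev/null
}
```

For example, `cert_valid_for_days sddc-manager.vcf.corp.local 443 30 || echo "check this certificate"` flags a certificate that expires within 30 days or cannot be retrieved.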

Password expiry for the PSCs / vCenters is checked, along with the state of their services. With SDDC Manager automating so many different types of workflows, password sanity is a must. As SDDC Manager does not currently include an automated way to rotate passwords on expiry, administrators must manage this themselves, and in some cases they accidentally update a password manually through the product itself rather than through SDDC Manager.

Compute resource validation verifies licence expiry, disk health and PSC / vCenter appliance health. As workloads are added to the environment, resource usage issues can occur, so the health of the compute and management VMs needs to be considered prior to upgrading.

vSAN overall health includes disk-group health and Encryption / Deduplication / Compression health (if enabled; in my lab they are not).

General health reports any dumps that have been created on the ESXi hosts in that Workload Domain, along with the NSX Controller cluster status. This is really useful for finding intermittent issues in the workload domain that may indicate why an upgrade could go wrong.

All of the above checks can also be run individually. For more information on the SoS tool, see VMware's official documentation.

Password Sanity Check

While the SoS health check above does cover some password validity, some managed components may not be included or may not report a failure. With the connection to the SDDC Manager VM still open, use the get credentials API command:

Important: From VCF 3.8.1, you will need the privileged user account credentials to run the command.

curl 'https://sddc-manager.vcf.corp.local/v1/credentials' -i -u 'admin:VMwareInfra@1' -X GET -H 'privileged-username: apiuser@vsphere.local' -H 'privileged-password: VMwareInfra@1' -H 'Accept: application/json'

With the output, connect to each component and validate that the passwords log in successfully and are not expired or expiring soon.
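
To turn the raw response into a checklist of components to log in to, the JSON can be reduced to one line per credential. This is only a sketch under several assumptions: that jq is installed where you process the output, that the curl command is run without -i so the output is pure JSON, and that the response has an "elements" array whose entries carry "username", "credentialType" and a "resource" object with "resourceName" and "resourceType". Verify that layout against your own output before relying on it. The function name is hypothetical.

```shell
# Sketch: read a credentials JSON document on stdin and print one
# tab-separated line per entry: resource name, resource type,
# credential type, username. Passwords are deliberately left out.
# The field names below are assumptions about the response layout.
list_credential_targets() {
    jq -r '.elements[]
           | [.resource.resourceName, .resource.resourceType,
              .credentialType, .username]
           | @tsv'
}
```

For example, save the API response to a file and run `list_credential_targets < credentials.json` (a hypothetical file name). Remember that the saved response contains passwords in clear text, so delete it once the check is complete.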

Workload Domain Pre-Check

 

In the SDDC Manager UI, the LCM service has a built-in pre-check that provides in-depth information about each workload domain, building on the health details found in the SoS health check. Note: It's worth calling out that this does not do a dry-run pre-check of the patch bundles you may be applying.

 

In the images below, you can see the difference between a Management Domain pre-check and a Workload Domain pre-check (this one is an NSX-T Workload Domain).

 

The management domain will include the SDDC Manager VM and both PSCs as part of the pre-check.

Management Domain precheck image

 

The workload domain pre-check will include either the NSX-V or NSX-T Managers and the additional vCenter VM.

Workload Domain precheck images

 

If any issues are found during these workflows, the UI will report something like the image below.

untitled image

Make sure to take proper action to resolve these issues before proceeding with an upgrade. A failed pre-check will not stop you from attempting an upgrade, though depending on the failure the upgrade is unlikely to succeed and will probably fail early.

 

Expanding the failed component will provide more details and potential remediation steps.

untitled image

 

Re-run the pre-check once you have fixed the reported failure to ensure SDDC Manager validates the fix.

vCenter Alerts

While the above steps and tools will have caught most, if not all, of the environmental issues that SDDC Manager is aware of, one more check is to review the vCenter Web Client for any alerts / alarms that are unknown to SDDC Manager.

 

untitled image

 

These can arise from some sort of manual intervention or third-party linkage that could cause issues during an upgrade, so it's worth calling out as a last place to check. I'm not going to demonstrate where and how to check these alerts, though you can find out more in the official documentation for vCenter: https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.vsphere.monitoring.doc/GUID-9272E3B2-6A7F-427B-994C-B15FF8CADC25.html

Release Notes

 

The final, and probably the most important, step is to review the release notes for the BOM you plan to upgrade to, as they contain valuable information on Known Issues in these releases that you can also use in a pre-validation process.

At this point you know that the system is healthy and you are aware of the potential issues that could be faced in the upgrade to the next release. Just download the appropriate bundles from the depot (through the UI or offline) and schedule your upgrade!

About the Author:

Bryan O'Sullivan is a Staff Technical Support Engineer based in Cork, Ireland, supporting VMware Cloud Foundation in Global Support Services. This was originally posted here; it is being reposted with the author's consent.

 
