vSphere 7 Update 3 - What's New
vSphere 7 Update 3 is the ultimate update release to vSphere 7, making it the best vSphere ever. With every update there are hundreds of changes and improvements to add features, fix issues, improve user experience, and increase compatibility. We are announcing it today, and the software itself will be available soon.
Let’s go over the highlights!
(We have a video version of this, too, over on our YouTube channel. Check it out and subscribe: https://www.youtube.com/channel/UCN8FHFshMw-15AtFKWSLczA)
vSphere with Tanzu
One of the biggest features of vSphere 7 is the integration of Kubernetes, what we call vSphere with Tanzu. It allows organizations to easily run and support container-based modern applications on top of the infrastructure they already have.
Each update to vSphere improves performance and efficiency, and there’s a new study on the efficiency of vSphere and vSphere with Tanzu, showing that it can run 6.3x more workloads than other solutions: https://blogs.vmware.com/vsphere/2021/08/vsphere-with-tanzu-supports-more-container-pods-bare-metal.html
One of the best things about vSphere with Tanzu is that it fits into existing environments. To make that true for more networks we have added flexible DHCP support. If you choose DHCP you can automatically populate IP addresses, DNS, NTP, and other values, and override them if you want. It makes it a lot easier to configure and deploy.
This works for both the Management Network and the Workload Network. It’s not all or nothing, either -- you could set your Management Network to use Static values and your Workload to use DHCP values if you so choose.
If you do this, we recommend using DHCP client identifiers with DHCP reservations. You configure the client identifiers in Tanzu and use those on the DHCP server so that addresses don’t change even if the MAC addresses of the cluster VMs change. That might happen during an upgrade when new cluster VMs are deployed to replace downlevel ones.
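To illustrate why client identifiers help, here is a minimal Python sketch of a reservation table keyed by client identifier rather than by MAC address. The identifier strings, IP addresses, and function name are hypothetical, and this is not how Tanzu or any particular DHCP server is implemented. It only shows the lookup idea.

```python
from typing import Optional

# Hypothetical reservations: DHCP client identifier -> reserved IP address.
RESERVATIONS = {
    "supervisor-cp-1": "192.0.2.11",
    "supervisor-cp-2": "192.0.2.12",
    "supervisor-cp-3": "192.0.2.13",
}


def lease_for(client_id: str, mac: str) -> Optional[str]:
    """Return the reserved address for a client, or None if none exists.

    The MAC address is accepted but deliberately ignored: the reservation
    follows the client identifier, so a replacement VM with a new MAC
    still receives the same address.
    """
    return RESERVATIONS.get(client_id)


# The original VM and its replacement after an upgrade have different MACs.
before = lease_for("supervisor-cp-1", mac="00:50:56:aa:bb:01")
after = lease_for("supervisor-cp-1", mac="00:50:56:cc:dd:02")
```

Because the lookup never consults the MAC address, `before` and `after` resolve to the same reservation.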
There are many moving parts in a Kubernetes environment, and while we take care of a lot of that complexity automatically, sometimes we still need people to enter credentials or addresses. As such, having descriptive error messages is even more important. With Update 3 we have added better error messaging so you can more quickly address what may be misconfigured rather than hunting through logs to search for what’s broken. For example, in the slide image, you can see that the load balancer was configured to use an incorrect username. Fix that username and the Supervisor Cluster will retry (because K8s is declarative, e.g. I want 10 VMs, it will retry until the desired state is reached or times out).
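The "retry until the desired state is reached or times out" behavior can be sketched as a small reconcile loop. This is a toy model in Python, not Supervisor Cluster code. The function names, the timeout values, and the simulated credential fix are all assumptions made for illustration.

```python
import time
from typing import Callable


def reconcile(observe: Callable[[], int],
              act: Callable[[int], None],
              desired: int,
              timeout_s: float = 5.0,
              interval_s: float = 0.01) -> bool:
    """Retry until the observed state equals the desired state or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        current = observe()
        if current == desired:
            return True
        try:
            act(desired - current)
        except Exception:
            # A failed attempt (for example, a bad credential) is not fatal;
            # the loop simply tries again on the next pass.
            pass
        time.sleep(interval_s)
    return observe() == desired


# Toy system: creating VMs fails until the "credential" is corrected.
state = {"vms": 0, "credential_ok": False, "attempts": 0}


def observe_vms() -> int:
    return state["vms"]


def create_vms(missing: int) -> None:
    state["attempts"] += 1
    if state["attempts"] == 3:
        # Stands in for the admin fixing the username.
        state["credential_ok"] = True
    if not state["credential_ok"]:
        raise PermissionError("login failed")
    state["vms"] += missing


converged = reconcile(observe_vms, create_vms, desired=10)
```

The first two attempts fail, the third succeeds once the simulated credential is fixed, and the loop then sees that the desired count of 10 has been reached.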
Lifecycle, Upgrade, and Patching
Historically, SD cards or USB devices have been chosen to free up device bays and to lower the cost of installing ESXi hosts. Such devices, however, have lower endurance and exhibit reliability issues over time. SD cards and USB drives also experience performance issues and may not tolerate high-frequency read-write operations.
This is not a problem with flash, but it’s a design point because flash is different than other storage media. People don’t realize it, but NAND flash memory is a consumable part. The voltage needed to write the bits permanently to the memory wears it out over time, so every time you write to a flash storage device you’re wearing it out a little bit. Over time, this adds up, and if you keep writing in the same spot you wear it out completely. Like wearing a hole in your shoe by walking a lot.
A flash device that is designed for higher I/O has methods for dealing with this type of wear so that the device remains reliable. The SSD or NVMe drive in your server uses "wear leveling" methods to spread the writes out across the whole drive. These types of drives also have spare capacity, so if a memory cell wears out the drive can replace it seamlessly, without you ever knowing. In fact, most flash devices have considerable extra capacity for this. For example, a 480 GB Intel S3500 SSD actually has 528 GB of memory built in. This is also why flash storage is rated in “Drive Writes Per Day” or DWPD, to help match the workload to the flash storage. Different types of flash storage have different tolerances for writing, and you'll see descriptors like "read-intensive" or "write-intensive" for certain types of drives based on their endurance. Single-Level Cell (SLC) flash is most tolerant of writes, then Multi-Level Cell (MLC), then Triple-Level Cell (TLC). When we size the storage devices correctly for the amount of writing we will do, we get years of service from those flash devices.
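The arithmetic behind spare capacity and DWPD is simple enough to sketch. The Python below uses the 480 GB / 528 GB figures from the example above. The 1 DWPD over 5 years rating is a made-up input for illustration, not a published specification for any drive, and the function names are our own.

```python
def overprovisioning_pct(raw_gb: float, usable_gb: float) -> float:
    """Spare capacity as a percentage of usable capacity."""
    return (raw_gb - usable_gb) / usable_gb * 100


def total_writes_tb(dwpd: float, usable_gb: float, years: float) -> float:
    """Total data that can be written over the rating period, in terabytes.

    DWPD means the full usable capacity can be written that many times
    per day, every day, for the rating period.
    """
    return dwpd * usable_gb * 365 * years / 1000


spare = overprovisioning_pct(raw_gb=528, usable_gb=480)

# Hypothetical rating of 1 DWPD over 5 years, not a figure for any real drive.
lifetime_tb = total_writes_tb(dwpd=1.0, usable_gb=480, years=5)
```

With these inputs the spare capacity works out to 10% of the usable capacity, and the hypothetical rating corresponds to 876 TB written over five years. Comparing that kind of number to your expected daily writes is how you match a drive to a workload.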
SD cards and USB sticks don’t have all that fancy logic, though. They’re simple devices, and not designed for modern operating systems. Heck, even cameras that use SD cards mirror them, which tells you something. Our hardware partners have had mirroring devices for SD cards for a long time, and customers have had to replace cards over the lifespan of their servers, so we know there's a tradeoff there.
So what are we doing about it? We are deprecating use of SD and USB drives as boot media. If you boot from one, vSphere will warn you that the boot volume is in a “degraded” mode because we’re doing things on the back end to help limit the writes to the device.
You can find more information about all of this at: https://core.vmware.com/resource/esxi-system-storage-faq
vSphere Lifecycle Manager continues taking over all aspects of patching and deployment. First, we added depot editing, because once in a while you need to remove something, too. This can happen if a driver or other component needs to be recalled. It’ll send a notification that it happened and then allow you to take action.
Second, hardware compatibility has been extended from I/O controllers to also include drive firmware. This is really important for vSAN, as drive firmware can make a big difference. Remember, even hardware is really software and needs to be patched, too. We also continue adding partners to the list of vendors whose hardware support managers can work with Lifecycle Manager, so that customers can patch hardware and software together, as well as create a declarative image that specifies things like BIOS versions.
Last on the list is vSAN witness management. Lifecycle Manager already handles other appliances for NSX and Tanzu, and now it can manage the vSAN witness if it’s used in a standalone fashion (which is the most common use case). This move helps a large portion of our customers have one less thing to worry about.
At VMware we’re always looking forward to what positive impact we can have on operations. We’re thinking a lot about the edge lately, and what it’ll take to serve the unique needs of that space, as well as what we can do to help customers reduce risk around upgrades and patches. To that end we’re starting to move some technology from VMware Cloud into on-premises deployments, like what we call “Reduced Downtime Upgrades.”
If you’ve ever upgraded vCenter Server you will think this seems familiar, except it isn’t just for major upgrades anymore, it’ll also be for regular patching.
This is API-driven for now, and it doesn’t have an official use yet (everything works “normally” still in 7U3), though if you keep your eyes open at VMworld you might see some ways we’re considering using it. It is worth thinking about how this might affect your environment in the future, though. There’ll be an appliance download, and there will need to be enough resources to deploy a second vCenter Server Appliance (VCSA). A fresh VCSA will also mean that anything bad that’s happened to the VCSA gets undone, too, which underscores our policy that the insides of the appliances are part of the product and shouldn’t be messed with except for troubleshooting. It’s also not a replacement for configuring vCenter Server backups, though – those are still very important.
Artificial Intelligence & Machine Learning
VMware and NVIDIA worked together to create the NVIDIA AI Enterprise Suite. There are a lot of companies out there working on AI/ML projects to help increase the business value of their applications and data, and this helps them get to where they’re going much faster and more reliably, with less friction between traditional IT operations and all these new technologies.
VMware brings our deep knowledge of infrastructure, operations, and management, and NVIDIA supplies GPU expertise, GPU virtualization technology, and prebuilt sets of tools that data scientists can use to create applications and deploy them quickly. This includes frameworks and tools like PyTorch, TensorFlow, TensorRT, RAPIDS, Triton Inference Server, and more, all packaged up as containers that can be easily deployed and managed.
The other big way VMware works with GPUs is through Bitfusion. Bitfusion helps to virtualize the GPUs, allows for pooling and resource management, and decouples the GPUs from the workloads for management flexibility.
Bitfusion 4.0 is out and adds support for managing the Bitfusion auth tokens through the Kubernetes secrets mechanisms, so you don’t have to copy files around or hardcode the token in places. There are filters to narrow the GPU and server pool before allocation of resources, so if you only want T4 type GPUs on servers with RDMA you can get that. There’s a data retention policy for Bitfusion logs, and new monitoring support so that external monitoring tools can collect server statistics. There’s improved API support, keeping up with the popular protocols for interfacing with GPUs.
And, not listed here but probably the most visible improvement, the Bitfusion plugin supports dark mode!
Resource Management
Prior to vSphere 7 Update 3 memory statistics were not easily accessible without installing additional third-party command line tools. They’re now built in directly to vSphere and visible in the UI as well. This enables troubleshooting and monitoring of memory bottlenecks between normal system DRAM and Persistent Memory, Optane, and NVDIMMs, at the host and VM levels. This is supported on Cascade Lake and Ice Lake families of Intel CPUs, and there is a compatibility matrix for this feature.
Update 3 brings better logic for retrying maintenance mode if it didn’t succeed, as well as how DRS moves workloads during maintenance mode. DRS keeps track of workloads that are harder to move, like big VMs or VMs with high I/O, and then tries to move them as few times as possible. In most cases it can do it with only one move now, which makes updates and upgrades easier.
vSphere Cluster Services was introduced in vSphere 7 Update 1 as the new home for DRS inside a vSphere cluster, moving services off vCenter Server to reduce dependencies there. We’ve received a fair amount of feedback about it and the team working on it has done a great job incorporating that into this new release. It’s worth noting that if YOU want to submit feedback, we’d love it. Use the “smiley face” button in the upper-right of the vSphere Client!
You can now choose the datastore you want the vCLS agent VMs on, and control some of the affinity settings if you do or don’t want them on a particular host or near a particular workload. vSphere VM hardening settings have been pre-applied to the agent VMs, in conjunction with the Security Configuration Guide. vSphere has very secure defaults but there were some areas where we could do things more explicitly to help customers who are doing compliance audits. So we did!
The most noticeable change with Update 3 is that the agent VMs are no longer named with parentheses, or spaces for that matter. They’re named with a UUID instead.
Security & Compliance
Ransomware continues to be devastating to many organizations, and we’ve started creating and collecting information on tactics for resiliency with vSphere deployments. You can find this at our Ransomware Resource Center: https://core.vmware.com/ransomware
The vSphere Security Configuration Guide continues to be the best baseline guidance for securing vSphere. You can find it at: https://via.vmw.com/scg
Last, many of our customers are interested in regulatory compliance, things like NIST 800-53, PCI DSS, CMMC, GDPR, HIPAA, NIST 800-171, and so on. Compliance is a business requirement, and different than security, even though they often check the same things. One of the hardest parts about compliance is that it’s something we can’t help with directly. All compliance is assessed based on an implementation of a product, because the auditor wants to see if good decisions were made for the implementation.
It’s not unlike a building inspector inspecting a new house. They don’t want to see the pile of wires on the garage floor, they want to make sure you didn’t wire a short circuit. Even safe and secure products like vSphere can be implemented insecurely. Flexibility like that of vSphere is a strength, but can enable some insecure choices, too.
What we CAN help with, though, is helping to explain the security controls to the auditors, and if you go to https://core.vmware.com/compliance you’ll find an ever-growing collection of guides that do that. They’re designed to help auditors understand what a vSphere Admin might have done to secure the environment, and how that maps into the compliance frameworks.
Guest OS & Workloads
If you aren’t familiar with cloud-init, it’s a vendor-agnostic tool for Linux systems to be customized at deployment time. VMware’s efforts with cloud-init have now been merged into the main code base for the cloud-init project, meaning that as Linux distributions get updated they will have what they need to work seamlessly with VMware, by default, and won’t require additional tools to be installed.
The guest data publisher is a feature of VMware Tools. It enhances the mechanism guests use to send data to ESXi, so that they can send more data and different kinds of data. This is controllable through the same GuestInfo mechanisms that already exist, so if a guest or a vSphere Admin wants to limit this they can.
UEFI 2.4 is an update to the VM EFI boot firmware to support compatibility for upcoming guest OS releases.
Last, full AMD support for VBS. VBS, or virtualization-based security, is also known as Microsoft Device Guard and Credential Guard, which helps protect secrets inside Microsoft Windows guest OSes. The way it works is that it’s a little bit of Hyper-V running inside the guest, to create a secure enclave that Windows can use for its secrets. This is effectively nested virtualization, Hyper-V on ESXi, and VMware has the absolute best implementation of this anywhere in the industry. We’ve worked hard to make it perform very well and be identical to the experience a customer would have on physical hardware, but that means a lot of very low-level development work, and a lot of work with our partners to make changes to guest OSes and even hardware firmware if needed. Everything is aligned now, and while we’ve had customers running VBS on AMD for some time, all the pieces are available, from the hardware all the way through to the guest OS, to make it seamless for everybody.
The quote in the title is from the American jazz trumpeter Miles Davis. He was most likely talking about how life is short, which is exactly why we want to make configuring and maintaining time providers in vSphere easy and reliable. Accurate and consistent time is really important to computer systems, and lots of little problems and delays happen when it isn’t set up right.
You might remember that we added Precision Time Protocol support in vSphere 7, and over the course of vSphere 7 we’ve made it easy to add to guest OSes through VMware Tools. Precision Time Protocol provides high-accuracy time to systems, from a PTP server on the network. Virtualization systems have different layers to them, of course, and those layers can make a difference to latency, which in turn impacts accuracy. With vSphere 7 Update 3 there are now a few different ways that PTP can be brought into an environment, through a VMkernel adapter or through a dedicated passthrough interface, so that you can choose which method is right for your workloads.
Because of the way the PTP standards are designed there isn’t the same level of resilience that NTP has. As such, it’s possible that PTP sources might not be available. To help combat that we have added the option to have NTP as a fallback mechanism. This way you can continue having time sources, and when the high-resolution source comes back online it’ll be preferred.
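The fallback behavior described here amounts to a simple preference order. The Python sketch below is a toy model of that idea only. The function name and the availability flags are hypothetical and do not reflect how ESXi implements it.

```python
from typing import Optional


def pick_time_source(ptp_available: bool, ntp_available: bool) -> Optional[str]:
    """Prefer PTP whenever it is reachable, otherwise fall back to NTP."""
    if ptp_available:
        return "ptp"
    if ntp_available:
        return "ntp"
    return None


# PTP drops out and later returns; NTP stays reachable throughout.
timeline = [True, False, False, True]
chosen = [pick_time_source(ptp_up, ntp_available=True) for ptp_up in timeline]
```

In this timeline the selection moves from PTP to NTP while PTP is unavailable, then returns to PTP as soon as it is reachable again.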
You might also notice that there are new vmware.pool.ntp.org sources there, too. It’s important to use as many NTP sources as you can. The NTP project maintainers themselves have advised us directly that they’d like everyone to use at least four sources. Who are we to argue with the experts?
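If you keep your NTP server list in a script or configuration management tool, a quick check like the following can flag lists that fall short of the four-source guidance. This is a hedged Python sketch: the helper name is our own, and the numbered host names under vmware.pool.ntp.org are an assumption about how the pool is addressed, so confirm them against what the vSphere Client shows.

```python
from typing import List

MIN_SOURCES = 4  # the "at least four sources" guidance mentioned above


def has_enough_sources(servers: List[str]) -> bool:
    """True if the list has at least MIN_SOURCES distinct, non-empty entries."""
    distinct = {s.strip().lower() for s in servers if s.strip()}
    return len(distinct) >= MIN_SOURCES


# Assumed naming: numbered hosts under the vmware.pool.ntp.org zone.
pool = [f"{i}.vmware.pool.ntp.org" for i in range(4)]
enough = has_enough_sources(pool)
```

Duplicates and blank entries are ignored, so listing the same server four times does not count as four sources.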
Storage
NVMe over Fabrics is hot right now, and adding TCP/IP support means that access has opened up to commodity NICs. You certainly can still use Fibre Channel HBAs and RDMA-capable adapters, but you can also use standard Ethernet hardware.
As environments grow there is interest in having more ESXi hosts attach to a single datastore. With Update 3 you don’t need special approval to have up to 128 hosts connect to a single VMFS-6 or NFS datastore. This should also help avoid the need for Storage vMotion and make upgrades easier. It’s worth noting that this isn’t a cluster size increase, it’s referring specifically to the hosts attaching to the datastore.
VMware updated the VMFS-6 Affinity Manager which keeps track of available resources on disk, so that when a VM wants to write to disk it can do so more quickly. Update 3 now supports handling first-class disks and container-native storage in these same ways, making modern workloads faster.
Last, vVols continue getting better. In Update 3 there’s a better procedure for taking large volumes of snapshots. It batches up the work, which ultimately translates into less effect on the VMs and environment while the snapshot is happening.
These updates are courtesy of our Storage and Availability Technical Marketing Group, and you should check out their information on vSAN 7 Update 3 as well.
vSphere Management, APIs, and Developer
The vSphere Client is now fully compliant with the Web Content Accessibility Guidelines. This is a fulfillment of the commitment we’ve made as a company to inclusiveness of all types. It’s also led to a lot of improvements in the interfaces themselves for everyone. VMware’s dedicated team of User Experience designers continues to think about how humans use VMware products, and their work to remove friction from the lives of vSphere Admins is much appreciated.
PowerCLI 12.4 is out, and brings with it a lot of BIG improvements: https://blogs.vmware.com/PowerCLI/2021/09/powercli-12-4-whats-new.html
Wrap-Up
vSphere has a YouTube channel! Come see the stuff we put out. If you subscribe you’ll get notified when our live streams start and new videos are posted. We do a monthly stream called “vSphere LIVE” where we talk about different things, and have experts answer questions right then. It’s fun and informative!
https://www.youtube.com/channel/UCN8FHFshMw-15AtFKWSLczA
VMware runs a security advisory mailing list that we urge everyone to sign up for. That list is ONLY for security advisories, so that organizations can get a head start on protecting themselves if something comes up.
https://www.vmware.com/security/advisories.html
As always, we appreciate you as a customer. Thank you for the feedback and engagement. Please keep letting us know how we can improve vSphere for you. We hope you’re safe and healthy. Take care!