This article gives a technical preview of new features in vSphere DRS and vMotion that are being considered for vSphere 8 Update 2.
When a vMotion happens on a VM, the VM’s dirtied memory (RAM) pages are copied over to the destination ESXi machine in phases. One of the final phases in that process causes the VM to go through a period called the "stun time", which is ideally not noticed by the guest operating system or the application running on it. At that point, the guest operating system is stopped for a short period while the final memory page copying takes place. vSphere always works with the VM to minimize that stun time. We describe some further enhancements to that process here in the context of VMs that have vGPUs assigned to them.
When a virtualized GPU (vGPU) is assigned to a VM, that VM can also be subject to vMotion, for a variety of reasons. Remember that the GPU operates asynchronously from the CPU. The GPU offloads many math calculations and manages data of its own that is separate from the CPU’s memory. In the vSphere Client for vSphere 8, you can assign multiple vGPU profiles to one VM, each representing either a full physical GPU or part of one. It is not uncommon today to see VMs with four or more full-memory vGPU profiles assigned to them (i.e. fully occupying four physical GPUs).
A common reason for a vMotion of any VM is the need to place the host server into maintenance mode, for a system upgrade or a new VIB/driver installation, for example. This evacuation of VMs from a host, using vMotion, happens automatically when you place the host into maintenance mode, and that now includes VMs that have vGPU profiles attached to them. You can find more information on automated migration when entering maintenance mode in this Knowledge Base article. There can be other reasons for a vMotion to occur too, such as load balancing a cluster. More details on that are here.
The physical GPU device that backs the vGPU profile associated with the VM has separate memory hardware of its own, called “framebuffer” memory. Some GPU models today have 40 GB, 80 GB or more of framebuffer memory. Here is a view of a VM that has two full A100 40GB devices associated with it, through the appropriate vGPU profiles. These are time-sliced profiles, which allow multiple full GPUs to be assigned. A VM on vSphere 8 Update 2 can have up to 16 vGPU profiles (representing full physical GPUs) assigned to it. This is to accommodate very large training jobs, such as those done with large language models.
Figure 1: A VM with multiple full vGPUs, each mapping to a full physical GPU, assigned to it in the PCI Devices area
When a vMotion happens, the contents of the GPU’s framebuffer memory must be copied from the source GPU to the GPU on the destination host machine. We can therefore have a lot more active memory contents to deal with than just the VM’s own RAM. With this large amount of GPU framebuffer memory data to be copied over to a new host, the bandwidth of our network connection between the two hosts becomes critical.
As a first enhancement in vSphere 8 Update 2 to help with this data transfer, DRS now estimates, when it first places the VM on a host, the amount of stun time this VM would need. This estimate depends on the user-supplied vMotion network bandwidth (10 GigE or 100 GigE in the examples here) and on the size of the framebuffer memory that is allocated to the VM. The higher vMotion network bandwidth used in the right-hand example below can dramatically reduce the vMotion stun time for your vGPU-aware VM.
Figure 2: Expected vMotion Stun Times in the vSphere Client
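To get a feel for why bandwidth matters so much, here is a rough back-of-envelope sketch in Python. It is not the calculation that DRS performs. It simply assumes (our assumption) that the whole framebuffer is copied during the stun window at the line rate of the vMotion network, with an optional, hypothetical `efficiency` factor to account for protocol overhead. The function name and its parameters are illustrative only.

```python
def estimate_transfer_seconds(framebuffer_gb, link_gbps, efficiency=1.0):
    """Naive time to copy `framebuffer_gb` gigabytes over a `link_gbps` link.

    Assumption: the whole framebuffer is sent at line rate, scaled by a
    hypothetical efficiency factor in the range (0, 1].
    """
    if framebuffer_gb < 0:
        raise ValueError("framebuffer_gb must not be negative")
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    gigabits_to_send = framebuffer_gb * 8  # 8 bits per byte
    return gigabits_to_send / (link_gbps * efficiency)


# Two full 40 GB framebuffers, as in the VM shown in Figure 1.
total_framebuffer_gb = 2 * 40
for link_gbps in (10, 100):
    seconds = estimate_transfer_seconds(total_framebuffer_gb, link_gbps)
    print(f"{link_gbps} Gbit/s link: about {seconds:.1f} s")
```

Under these simplifying assumptions, 80 GB of framebuffer needs about 64 seconds on a 10 Gbit/s link and about 6.4 seconds on a 100 Gbit/s link. The numbers that vSphere shows you may differ, since this sketch ignores everything except raw transfer time.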
Adjusting the Stun Time for a VM
You can now also manually set the vMotion Stun Time Limit that your workload can tolerate on a VM of this type, i.e., a VM that has one or more vGPU profiles assigned to it. This setting appears in the lower left of “Edit Settings” for a VM in the vSphere Client, shown below. When the vSphere administrator raises this setting, the vSphere vMotion algorithms allow more than the default 100 seconds of vMotion stun time, which gives the vMotion a better chance of copying everything over to the destination machine. Getting this value right may take some experimentation in testing.
vSphere will also calculate its own estimate of the amount of vMotion stun time needed, based on the memory sizes of your vGPU profiles, and let you know if you have allocated too little time. This is seen in the yellow highlighted area on the right side below:
Figure 3: Adjusting the vMotion Stun Time Limit in the vSphere Client
Tuning Options for DRS Activity
We have also added a set of new, advanced DRS settings to give you much more control over the DRS behavior with your vGPU-aware VMs and clusters.
Due to the extended vMotion downtimes for vGPU VMs, vCenter has until now set Load Balancer recommendations for these VMs to Manual Remediation. However, when a cluster is properly configured and has a sufficient vMotion network, VMs with smaller vGPU profiles may migrate freely within the default 100 second timeout. If the stun time introduced by the vGPUs in your VM is acceptable, you can enable automatic DRS Load Balancing and Maintenance Mode evacuations with vSphere 8.0 Update 2. Note that DRS currently does not take the load on the GPU itself into account when it decides to move a VM from one host to another for load balancing. It considers only CPU and memory consumption.
1. Consolidating "Smaller" VMs to Hosts
VgpuVmConsolidation = 1
When set to 1, this advanced setting causes DRS to tightly pack VMs with fewer full-memory vGPU profiles (e.g. 1 vGPU) onto hosts that have extra compatible GPU capacity. This leaves larger remaining GPU capacity on other hosts and thus allows VMs with multiple full-memory vGPU devices to power on. This scenario is described in this blog article.
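The effect of consolidation can be illustrated with a toy placement model in Python. This is only a sketch under our own assumptions, not the DRS algorithm: every host has the same number of identical GPUs, each VM needs a whole number of full GPUs, and "consolidate" is modeled as a simple best-fit choice while the alternative spreads VMs out. The function and parameter names are hypothetical.

```python
def place_vms(vm_gpu_counts, gpus_per_host, num_hosts, consolidate):
    """Return the free GPU count per host after placing the VMs in order."""
    free = [gpus_per_host] * num_hosts
    for needed in vm_gpu_counts:
        candidates = [i for i, f in enumerate(free) if f >= needed]
        if not candidates:
            raise ValueError(f"no host has {needed} free GPUs")
        if consolidate:
            # Best fit: pick the fullest host that still has room.
            chosen = min(candidates, key=lambda i: free[i])
        else:
            # Spread: pick the emptiest host.
            chosen = max(candidates, key=lambda i: free[i])
        free[chosen] -= needed
    return free


# Two single-vGPU VMs on a cluster of two hosts with four GPUs each.
packed = place_vms([1, 1], gpus_per_host=4, num_hosts=2, consolidate=True)
spread = place_vms([1, 1], gpus_per_host=4, num_hosts=2, consolidate=False)
print("packed:", packed, "-> a 4-vGPU VM fits:", max(packed) >= 4)
print("spread:", spread, "-> a 4-vGPU VM fits:", max(spread) >= 4)
```

In this toy model, packing the two small VMs onto one host leaves the other host with all four GPUs free, so a VM that needs four full GPUs can still power on. Spreading them leaves three free GPUs on each host, and the larger VM has nowhere to go.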
2. DRS Load Balancer and vMotion
LBMaxVmotionPerHost = 1
When set to 1 on a DRS cluster, this advanced setting causes DRS to allow only one vMotion per host for consolidation of VMs on any one scan that DRS does. Its purpose is to reduce the number of vMotions the Load Balancer conducts to achieve consolidation goals.
3. Checks on vMotion Stun Time
PassthroughRequireDrs = 1
When set to 1, this advanced setting enables a series of checks to ensure that the VM can be moved using a vMotion, and that its stun time during vMotion fits within the estimate that DRS calculated at VM power-on time.
4. DRS Load Balancing and Maintenance Mode VM Evacuation
PassthroughDrsAutomation = 1
This option, when set to 1, enables automatic DRS Load Balancing and Maintenance Mode evacuations of VMs with vSphere 8.0 U2, *provided* the VM stun time introduced by the vGPUs is acceptable, according to DRS’ calculations of the required stun time.
5. Progressive De-fragmentation of GPU Resources
PassthroughForceDrsAutomation = 1
When this option is set to 1 on a DRS cluster, it will allow for Maintenance Mode Evacuation without consideration of the vMotion Stun Time Limit (100 seconds by default).
PassthroughDrsAutomation and PassthroughForceDrsAutomation can be used in conjunction with Host Maintenance Mode to progressively de-fragment GPU resources in a Cluster.
These new advanced options may be used together to give full DRS automation in vSphere 8 Update 2.
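For quick reference, the five options described above can be collected in one place. The Python sketch below only renders them as key/value pairs in the form used in this article. It does not talk to vCenter, and the helper name is hypothetical. The option names and values are the ones listed above. Whether you want all five enabled, particularly PassthroughForceDrsAutomation, which ignores the vMotion Stun Time Limit, depends on your workloads.

```python
# Option names and values as listed in this article.
VGPU_DRS_OPTIONS = {
    "VgpuVmConsolidation": "1",            # pack smaller vGPU VMs together
    "LBMaxVmotionPerHost": "1",            # one consolidation vMotion per host per scan
    "PassthroughRequireDrs": "1",          # vMotion and stun time checks
    "PassthroughDrsAutomation": "1",       # automatic load balancing and evacuation
    "PassthroughForceDrsAutomation": "1",  # evacuate regardless of the stun time limit
}


def render_options(options):
    """Render the options as 'Name = value' lines (hypothetical helper)."""
    return "\n".join(f"{name} = {value}" for name, value in options.items())


print(render_options(VGPU_DRS_OPTIONS))
```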
To show you our progress in handling vGPU-aware VMs with DRS over time, here is an overall view, in tabular form, of the enhancements made in successive versions of vSphere in the areas of maintenance mode evacuation, load balancing and stun time calculations. As you can see, our GPU-focused R&D engineers are continually improving the handling of the data-intensive and network-intensive aspects of a VM with one or more vGPU profiles attached to it. This work continues to enable larger machine learning and high performance computing workloads to fit well into the vSphere and VMware Cloud Foundation architecture.
Figure 4: Enhancements to vMotion and VM Placement Over Successive Versions of vSphere
With vSphere 8.0 Update 2, the system administrator can now see advice from DRS that longer vMotion stun times are needed for VMs that have vGPU profiles assigned to them – especially those profiles that capture larger GPU memory allocations.
With that calculated guidance from vSphere, the administrator can make an educated decision on how much vMotion stun time they are willing to tolerate for any particular vGPU-aware VM. This allows larger GPU-consuming jobs, such as a machine learning training job, to continue working without interruption even when a vMotion is needed. There is also an added set of advanced options to make DRS more accessible to you in its management of vGPU-aware virtual machines, helping to automate more processes around these GPU workloads on vSphere 8 Update 2.