The vMotion Process Under the Hood
The VMware vSphere vMotion feature is one of the most important capabilities in today’s virtual infrastructures. Since its inception in 2002 and the release in 2003, it allows us to migrate the active state of virtual machines from one physical ESXi host to another. Today, the ability to seamlessly migrate virtual machines is an integral part of nearly every virtualization deployment. The portability of workloads is the foundation for true hybrid cloud experience by being able to move them between on-premises and public clouds using VMware Hybrid Cloud Extension (HCX). vSphere vMotion still is and always will be one of the most momentous game-changers in the IT industry.
A lot has been developed on the vMotion internals over the years to support new technologies.
This post will focus on compute-only migrations, which is the standard vMotion that migrates the active compute state from a source to a destination ESXi host. We also have the possibility to perform a Storage vMotion that, when combined with compute vMotion, is considered to be a XvMotion. Other flavors are Long Distance vMotion and Cross vCenter vMotion which both are primarily vCenter Server operations on top of the ESXi side of the vMotion process.
Reading this article will give you a better understanding of the ‘magic’ that is happening under the hood when you initiate a virtual machine migration.
When a virtual machine migration is started, the vCenter Server instance will execute a so-called long-running migration task to process the migration. The first step is to perform a compatibility check. Is it possible to run the virtual machine on the destination host? Think about possible constraints that could prevent a live-migration. Next is to tell the source and destination ESXi hosts what is happening. A migration specification is created that contains the following information:
- The virtual machine that is being live-migrated
- Configuration of that virtual machine (virtual hardware, VM options, etc.)
- Source ESXi host
- Destination ESXi host
- vMotion network details
The migration specification is shared with the source and destination ESXi hosts by the vCenter Server instance, making sure that all necessary information is exchanged to start the migration process. The vCenter Server communicates with ESXi hosts using the Virtual Provisioning X Daemon (VPXD) that calls out the Virtual Provisioning X Agent (VPXA) that is running on the ESXi hosts. VPXA listens to messages from VPXD, it receives the migration spec and passes that on to the VMX process via hostd. The Host Daemon (hostd) maintains host-specific information and access for management including virtual machine telemetry like the VMstate. When a migration is started, hostd puts the virtual machine in an intermediate state so the virtual machine its configuration cannot be changed during the migration.
The Virtual Machine Monitor (VMM) process is in charge of managing the virtual machine memory and transfers virtual machine storage and network I/O requests to the VMkernel. All other, non-critical to performance, I/O requests are forwarded by VMM to VMX. The Virtual Machine Extension (VMX) process runs in the VMkernel and is responsible for handling I/O to devices that are not critical to performance. Note that VMM is only used at the source ESXi hosts during the migration, because that is where the active memory of the virtual machine resides.
After this is done, the VMkernel migrate module on the source ESXi will open sockets on the vMotion enabled network to set up communication with the destination ESXi host.
Prepare Phase to Pre-Copy Phase
By now, all processes and communication paths are ready for the live migration to start. The prepare phase is all about making sure that the destination ESXi host pre-allocates the compute resources for the to-be migrated virtual machine. Also, the virtual machine is created on the destination ESXi host, but it is masked. All the information about the virtual machine configuration is already know as that is included in the migration spec.
With the prepare phase done, the process moves to the pre-copy phase where the memory is transferred from the source to the destination ESXi host. There is a need to trace all the virtual machine memory pages on the source ESXi host. By doing that, the vMotion process knows what memory pages are overwritten during migration, referred to as dirty pages, as it needs to re-send these memory pages to the destination host.
During the pre-copy phase, the vCPU’s, in use by the virtual machine, are briefly stunned to install the page tracers. The VMkernel migration module now asks VMM to start the page tracing as VMM owns the memory page table state of the virtual machine. The following diagram shows what happens when the guest OS is writing data to memory during a vMotion:
Iterative Memory Pre-Copy
Page tracing is a continuous cycle. It will work towards memory pre-copy convergence by using multiple iterations. The first iteration (precopy phase-1) copies all of the virtual machine memory pages. The following iterations (precopy phase 0 to n) work on copying the dirty memory pages. To give you an example, this is what the iterations could look like as we live-migrate a virtual machine with 24GB of memory:
Phase -1: Copy the 24GB of virtual machine memory and trace pages. As we send the memory, it dirties 8GB.
Phase 0: Re-transmit the dirtied 8GB. In the process, the memory dirties another 3GB.
Phase 1: Send the 3GB. While that transfer is happening, the virtual machine dirties 1GB.
Phase 2: Send the remaining 1GB.
As the memory pages are copied from the source to the destination ESXi host, we need to determine when all memory is copied to its destination. VMM asks the VMkernel if the pre-copy process can be terminated after each iteration. This is only possible when all memory changes (dirty pages) are copied to the destination host. Part of the iterative memory pre-copy algorithm is to match all destination memory pages to its source. Starting at page zero all the way to the maximum or last memory page number, all memory pages are sequentially checked to see if the destination pages are in sync with the source pages.
To determine if we can terminate pre-copy, we need to verify whether we can complete the last memory page copy in a window of < 500ms. We can calculate this using information in the migration tax:
- The migration transmit rate; at what speed (GbE) are we copying memory data betweens the hosts?
- The dirty page rate (GB/s); how many memory pages are being over-written by the guest OS?
- How many pages do we have left to transmit to the destination host?
If no, the next iteration happens. If the outcome is yes, the VMkernel migrate module will terminate the pre-copy process.
Now what will happen if the dirty page rate is higher than the migration transmit rate? If that is the case, there is no point in doing another iteration because we can never reach memory pre-copy convergence and the migration would come to a halt. This is why we introduced Stun During Page Send (SDPS) with vSphere 5.0. Basically, SDPS is a way for the VMkernel to tell VMM to not run the scheduled instructions but to introduce a really short ‘sleep’. This may sound like an impact on workload performance, but this happens at a fine-grained level. We are talking microseconds here and it is because of these very small timeframes we can converge and the vMotion process will succeed.
SDPS is executed with each iteration if the dirty page rate > transmit rate. Subsequent iterations only copy the dirty memory pages that were modified during the previous iteration. A shorter duration in iteration gives the guest OS less opportunity to modify or dirty its memory pages, thereby shortening the next pre-copy iteration. Although there is a form of performance cost involved, typically SDPS is not noticeable to the workload. The goal is always to leave the guest OS un-aware of the migration happening.
With the memory pre-copy migration terminated by the VMM process, all memory pages reside on the destination ESXi host. VMM now sends a remote procedure call (RPC) to VMX that it can suspend the source virtual machine. VMX will enter the checkpoint phase where it suspends the virtual machine and sends the checkpoint data to the destination ESXi host.
In the process, the virtual machine on the destination ESXi host will be de-masked, and the state is restored using the checkpoint data. What basically happens is that the virtual machine on the destination is powered on, but the boot process is interrupted and pointed to the memory pages that are migrated from the source ESXi host. All this typically happens in 100-200ms. That is the stuntime in which the virtual machine is not in running state. The duration of the stun time depends on a variety of factors like host hardware and dynamic guest workloads.
The virtual machine is now live-migrated! Although I’ve tried to explain the vMotion process in-depth, there are far more details to share on what happens in the background, but this article gives you an idea what is happening on the background when a vMotion is initiated.