How to Enable and Configure the new vMotion Application Notification Feature for a VM
In vSphere 8, continuing its long-standing practice of improving the customer experience and retaining customers' trust and confidence, VMware added an enhancement to vMotion - arguably one of the most popular features in VMware vSphere. For a refresher on (or an introduction to) vSphere vMotion, please search for "vsphere vMotion site:vmware.com" on your favorite search engine - this post will neither discuss nor describe the standard vSphere vMotion feature in great detail.
Before vSphere 8.0, when a vMotion operation was initiated, it automatically kicked into high gear and proceeded to migrate the VM as instructed. No consideration was paid to the state or condition of the Guest Operating System, the applications or the processes running inside the VM at the time of the migration. This lack of consideration can lead to a brief application or service interruption. vMotion is a beautiful thing, no doubt. Among other things, it helps improve resource utilization, as well as VM uptime and availability. But... the brief interruption, though…
Yes, yes, we hear you... if a vMotion operation sometimes results in (however brief) service interruption, then how can it improve uptime?
Let's digress a little bit (please skip this paragraph if you already know the answer to the question above). First, let's be clear about the nature of the "interruption". At a high level, vMotion migrates a VM by iteratively copying the VM's state from its current Host to its Target Host. When a sufficient portion of the VM's state is transferred and it is determined that the rest can be transferred in one trip, vMotion calls on the Guest Operating System's native quiescing feature to stun the VM for a "last copy" operation before switching the VM over on its new Host. This "stun time" is the point at which the aforementioned service interruption usually happens. To avoid any untoward inferences, let's be clear that the "brief interruption" we mention here is not unique to VMware or vSphere. It's an Operating System response to specific administrative operations - for example, in the Microsoft Windows world, the same interruption happens when you take a quiesced (in-Guest) backup/snapshot of even a physical Server. See “Tuning Failover Cluster Network Thresholds” and Section 8.3 (vMotion Considerations) in “Microsoft SQL Server on VMware vSphere® Availability and Recovery Options”.
To recap... one of the historical challenges with moving a VM while it's running and providing services is that, at some point during the move, the VM is paused briefly and then resumed at its new home. Some applications are rather hyper-sensitive to this pause and do not respond well or recover gracefully from the brief interruption. While most applications can tolerate the brief interruption and have the ability to dynamically adjust and resume operations, some other applications cannot. Even Customers whose applications are interruption-tolerant prefer to have the option to intervene and prepare their workloads (and the associated processes, applications or services) for such an interruption. In the example described in the sample PowerShell Script provided in this post, we have a cluster of Microsoft SQL Servers configured in high availability mode, using Windows Server Failover Clustering (WSFC). In this configuration, multiple nodes in the WSFC cluster have the ability to host the primary copy of the clustered SQL Server instance or database. Performing a vMotion operation on any of the nodes in a vSphere infrastructure is trivial. As long as the necessary conditions (especially the requirement for available compute resources on the target Host) are met (https://kb.vmware.com/s/article/79616), vMotion will successfully relocate the SQL Server VM while it is running - again, with a brief interruption.
Unfortunately, if the vMotion operation conditions are not satisfied, it is possible for the vMotion operation to take a long time, in which case the VM becomes unavailable for a period long enough for WSFC to trigger a failover of the active clustered resources hosted on the VM. This unintended failover condition, while not destructive on its own, can become undesirable for the Customer, the end users or the trillion-dollar operations supported by the clustered resources. Prior to 8.0, a vMotion operation was an all-or-nothing proposition - if you didn't like how it worked, you had to disable vMotion for the VM.
NOTE: It is important for Readers to understand that the vMotion Application Notification feature discussed in this post addresses only vMotion operations directly or programmatically initiated in vSphere. It does not address other VM operations. For example, suspending or resuming a VM, creating or restoring a snapshot, or performing a quiesced backup of a VM will not trigger any response or intervention from the vMotion Application Notification functionality. However, some of these VM operations already have existing solutions like Custom Power Scripts and Custom Quiescing Scripts.
End of digression.
Introducing the vMotion Application Notification feature.
Welcome back. At VMware, we take reliability, uptime, and availability as seriously as we take performance, security and other desirable benefits of virtualization. What if, when a vMotion operation is invoked, the operation waits for an Administrator-configurable "clearance" from the Guest Operating System before it actually begins? With the new vMotion Application Notification feature in vSphere 8, this is now possible.
At a high level, Application Notification enables a vMotion operation to inform the VM that it has been slated for live migration. Administrators can leverage this notification feature to prepare the VM and its application for the operation. Once the preparation is completed, the Guest Operating System can then give the vMotion operation the "all clear" signal (acknowledgement) to proceed. This post is intended to present a detailed description and demonstration of the initial release of the new feature, using one of the most common workload configurations on vSphere (clustered Microsoft SQL Server VMs) as an example.
vMotion Application Notification is supported only on a VM whose hardware compatibility version is 20 or higher. It uses the VMware Tools in the Guest to perform its operation, so it is very important to ensure that the VM's VMware Tools version is always up to date.
Let's start with a visual representation of how vMotion Application Notification works in vSphere 8.0.
Source: vSphere vMotion Notifications
In this release, vMotion Application Notification is disabled by default. vSphere Administrators are required to work with their application administrators and other stakeholders to jointly enable and configure the feature based on how they expect their application to respond to vMotion operations. In the example described below, we have configured vMotion Application Notification to wait for up to 60 seconds before performing an actual vMotion operation, unless it receives a direct acknowledgement from the VM to proceed sooner. In our judgement, once Windows (our example) has been informed of a pending vMotion operation, we don't need more than 60 seconds to instruct WSFC to gracefully fail over any active clustered SQL Server resource to another node in the WSFC cluster, making the VM available to (and ready for) vMotion. To be technically accurate and complete, what we're doing in our example is just invoking the "Suspend-ClusterNode" command with the "Drain" switch. This is a native WSFC command instructing WSFC to gracefully offload all active clustered resources to the next most-available peer in the WSFC cluster.
Once we have drained the node, Windows uses the facilities of vMotion Application Notification to inform vMotion that it is ready for the operation. Once vMotion is notified that the VM is ready, it then proceeds accordingly. By leveraging vMotion Application Notification, we are able to take the necessary administrative initiatives and actions to make sure that our clustered resources stay continually available even when we have to perform administrative house cleaning on the VM. Essentially, we're rightly ceding the responsibility of ensuring the availability of SQL Server's clustered resources to the Windows Server Failover Clustering service. As long as there are no active clustered resources on the VM, the possibility of vMotion-induced application non-availability is eliminated.
Because the "Suspend-ClusterNode" command actually pauses the WSFC node on the VM after evacuating its active resources, the VM cannot act as a full-fledged member of the WSFC enclave if left in that state after vMotion completes - it will not be able to host a copy of the clustered resources in that state. Now, we can't have that, can we? Of course not. When properly configured and registered on the VM, vMotion Application Notification runs constantly and listens for event notifications from vMotion. This is how we received the pending vMotion operation notification in the first place. Once vMotion is completed, the notification state changes. In our example, once Windows receives notification that vMotion is over, we invoke the "Resume-ClusterNode" command, instructing WSFC to bring the node out of its "Pause" mode. After this, the VM is then able to fully participate in the WSFC cluster.
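To make the drain-then-resume flow above a little more concrete, here is a minimal Python sketch of how the two notification phases could map onto the two WSFC cmdlets named in this post. Please treat it as an illustration only: the event names "start" and "end" and the helper wsfc_command_line are our own hypothetical placeholders, not values or functions taken from the sample scripts.

```python
# Hypothetical event names for "vMotion is about to begin" and "vMotion is over".
# The values actually delivered by vMotion Application Notification may differ.
PRE_MIGRATION = "start"
POST_MIGRATION = "end"

# The native WSFC cmdlets described in this post.
WSFC_COMMANDS = {
    PRE_MIGRATION: "Suspend-ClusterNode -Drain",   # evacuate clustered resources, pause node
    POST_MIGRATION: "Resume-ClusterNode",          # bring the node out of "Pause" mode
}


def wsfc_command_line(event):
    """Return the argument list that would run the matching cmdlet.

    Returns None for an event we do not recognize, so the caller can ignore it.
    """
    cmdlet = WSFC_COMMANDS.get(event)
    if cmdlet is None:
        return None
    return ["powershell.exe", "-NoProfile", "-Command", cmdlet]


if __name__ == "__main__":
    for event in (PRE_MIGRATION, POST_MIGRATION):
        print(event, "->", " ".join(wsfc_command_line(event)))
```

The sketch only builds the command lines; actually running them (for example with subprocess.run) requires a Windows Guest that is a member of a WSFC cluster.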
MAKING THE SAUSAGE
So, how do you enable vMotion Application Notification on a Windows VM? Please see https://core.vmware.com/resource/vsphere-vmotion-notifications for a detailed answer to this question.
In general, the process to enable vMotion Application Notification for a VM is as follows:
- Enable the feature on the ESXi Host running the VM. Specify a timeout value (in seconds) for how long the vMotion operation should wait (once invoked) for the VM to acknowledge the notification. Because vMotion will automatically start once this timeout period is reached (regardless of whether the VM is ready or not), we strongly recommend that you set this value as high as reasonably tolerable.
- From the same ESXi Host, enable and configure a timeout for application notification on the VM. We recommend that you set this value as high as you consider sufficient for the VM to perform whatever pre-vMotion tasks it needs to make the VM ready for vMotion and send an acknowledgement back to the vMotion Application Notification process. To avoid a situation where vMotion will commence before the VM acknowledges the notification, it is very important that you set this value lower than what is defined for vMotion Application Notification on the VM's Host.
- From within the VM's Guest Operating System, create and register an "Application" to periodically listen and poll for vMotion Application Notification events. These events are communicated to the VM through the in-Guest VMware Tools. Because a vMotion operation can be triggered at any given time, we highly recommend that the polling interval be as low as possible (in our sample script, we poll every second).
- Application registration does NOT persist across VM reboots. This means that a registration must be performed every time the VM is rebooted or powered off and on. In Windows, this is easily achieved by adding the registration process to Windows Task Scheduler as a Machine Startup script.
- "Application" (as used in this registration context) does not refer to application(s) installed on the VM. It is called "Application" only within the vMotion Application Notification naming constructs.
- The effective Timeout period is whichever is lower between the Host's and the VM's Timeout values. For example, if you set it to 60 seconds on the Host and 120 seconds on the Guest, vMotion will wait for 60 seconds and begin immediately thereafter.
- Regardless of what it's set to on either side, vMotion will automatically start once it receives acknowledgement from the VM's registered "Application". For example, if you set it to 60 on the VM and 90 on the Host, but the VM sends its acknowledgement to the Host after 10 seconds, vMotion will commence immediately. It will not wait for the 60 seconds (the lower value) specified on the VM.
- If no "Application" is registered on the VM (for example, if you reboot the VM and don't have a process to automatically re-register with vMotion Application Notification post-reboot), vMotion will NOT wait at all.
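The three timeout rules above are easy to mix up, so here is a small Python sketch that restates them as a function. It is our own paraphrase of the rules as listed in this post (not code from the feature itself), and the function and parameter names are made up for illustration.

```python
def effective_wait(host_timeout, vm_timeout, app_registered=True, ack_after=None):
    """Return how many seconds vMotion waits before it starts.

    host_timeout   -- timeout configured on the ESXi Host (seconds)
    vm_timeout     -- timeout configured on the VM (seconds)
    app_registered -- whether an "Application" is registered in the Guest
    ack_after      -- seconds until the Guest acknowledges, or None if it never does
    """
    # No registered "Application": vMotion does not wait at all.
    if not app_registered:
        return 0
    # The effective timeout is the lower of the Host and VM values.
    limit = min(host_timeout, vm_timeout)
    # An acknowledgement that arrives before the limit releases vMotion early.
    if ack_after is not None and ack_after < limit:
        return ack_after
    return limit


if __name__ == "__main__":
    print(effective_wait(60, 120))                         # 60
    print(effective_wait(90, 60, ack_after=10))            # 10
    print(effective_wait(90, 60, app_registered=False))    # 0
```

The three calls mirror the three examples in the list: the lower timeout wins, an early acknowledgement wins over both, and an unregistered VM gets no wait.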
Now that the vMotion Application Notification feature is enabled and the VM has subscribed for notification, it is the Administrator's responsibility to:
- Determine what should happen when the "Application" receives notification that a vMotion operation is about to happen.
- Perform whatever preparation tasks need to be performed in the Guest Operating System.
- Send notification back to vMotion Application Notification that the requested vMotion can proceed.
- Receive notification of completion of the vMotion operation from vMotion Application Notification.
- Go back to monitoring/polling for the next notification of planned vMotion event.
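The responsibilities listed above amount to a simple loop. The Python sketch below shows one possible shape for it. It is deliberately generic: the four callables (poll, prepare, acknowledge, cleanup) and the "start"/"end" event values are hypothetical stand-ins that you would wire up to the real in-Guest mechanism, and the max_polls parameter exists only so that the demonstration terminates.

```python
import time


def run_notification_loop(poll, prepare, acknowledge, cleanup,
                          interval=1.0, max_polls=None):
    """Poll for vMotion notifications and react to them.

    poll()        -- returns "start", "end", or None (hypothetical values)
    prepare()     -- performs the pre-vMotion tasks in the Guest
    acknowledge() -- tells vMotion that it may proceed
    cleanup()     -- performs the post-vMotion tasks in the Guest
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        event = poll()
        polls += 1
        if event == "start":
            prepare()        # get the workload ready
            acknowledge()    # then give the "all clear"
        elif event == "end":
            cleanup()        # vMotion is over; undo the preparation
        if interval:
            time.sleep(interval)   # keep the polling interval short


if __name__ == "__main__":
    # A canned event feed standing in for the real notification channel.
    feed = iter([None, "start", None, "end"])
    log = []
    run_notification_loop(
        poll=lambda: next(feed, None),
        prepare=lambda: log.append("prepare"),
        acknowledge=lambda: log.append("acknowledge"),
        cleanup=lambda: log.append("cleanup"),
        interval=0,
        max_polls=4,
    )
    print(log)   # ['prepare', 'acknowledge', 'cleanup']
```

In a real deployment max_polls would be left at None so that the loop keeps listening for the next planned vMotion event.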
SAUSAGE-MAKING DONE. Let's eat.
To help you get started on testing the feature described above, we have made a zipped package of Sample Scripts available for download here. These Sample Scripts are intended to help you jumpstart your review and evaluation of the vMotion Application Notification feature in your environment. The package contains the following individual files:
- setHostTimeout.py - Run this Script on an ESXi Host (setHostTimeout.py <timeout_value>) to enable and set a timeout value for "vMotion Application Notification" on the Host. The Timeout value is how long you want the Host to wait for an acknowledgement from a VM before performing a vMotion operation when invoked.
- Example: setHostTimeout.py 90 (will cause a vMotion operation to wait for up to 90 seconds for an acknowledgement before starting)
- Enable-App-Notif-on-VM-from-Host.py - Run this Script on the Host (Enable-App-Notif-on-VM-from-Host.py <vm_name> timeout=<vm_side_timeout_value>) to enable and set a timeout value for vMotion Application Notification on a VM. This Script must be run on the Host on which the VM resides at the time you are enabling this feature on the VM. This Script needs to be run only once. In this initial release, you must inspect a VM's configuration (.vmx) file if you need to verify whether or not it has this feature enabled.
- Example: Enable-App-Notif-on-VM-from-Host.py MySpecialVM timeout=60
- This command will write the following two entries into MySpecialVM's .vmx file:
- vmx.vmOpNotificationToApp.Enabled = "TRUE"
- vmx.vmOpNotificationToApp.timeout = "60"
- Please note that, for vMotion Application Notification to work, it must be enabled on both a VM and the Host on which the VM is running at the time of the vMotion operation. So, if MySpecialVM is migrated from Host-A (which has the feature enabled) to Host-B (which does not), the feature will work. However, when the VM needs to be migrated from Host-B, the feature will fail because Host-B doesn't have the feature enabled. For this reason, it is recommended that the feature be enabled on all Hosts in a given vSphere Cluster.
- WSFC-vMotion-App-Notification-Script.ps1 - This is a Windows Guest-side PowerShell Script which we are using to:
- Register this VM with vMotion Application Notification
- Ensure that Windows is always listening for notifications
- Cause Windows to respond to the notification when received
- Cause Windows to perform the necessary pre-vMotion house-cleaning we desire
- Cause Windows to inform vMotion Application Notification that it is done with its house cleaning
- Cause Windows to listen for when the vMotion operation completes
- Cause Windows to perform any post-vMotion clean-up we desire
- Go back to listening for vMotion Application Notification events
- We configured this Script to run every time Windows starts by adding it to Task Scheduler as a Machine Startup script. This way, our VM will always register an application with vMotion Application Notification so it can always listen for any vMotion event notification.
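Since inspecting the .vmx file is currently the way to verify that the feature is enabled on a VM, a small helper can save some squinting. The Python sketch below looks for the two entries shown above in the text of a .vmx file. It is a convenience of our own (not part of the sample package), it assumes the usual key = "value" layout, and it matches the key names exactly as they are written in this post.

```python
ENABLED_KEY = "vmx.vmOpNotificationToApp.Enabled"
TIMEOUT_KEY = "vmx.vmOpNotificationToApp.timeout"


def read_vmx_settings(vmx_text):
    """Parse key = "value" lines from .vmx text into a dictionary."""
    settings = {}
    for line in vmx_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not key = value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def app_notification_status(vmx_text):
    """Return (enabled, timeout_seconds); timeout is None if absent or not numeric."""
    settings = read_vmx_settings(vmx_text)
    enabled = settings.get(ENABLED_KEY, "").upper() == "TRUE"
    raw_timeout = settings.get(TIMEOUT_KEY, "")
    timeout = int(raw_timeout) if raw_timeout.isdigit() else None
    return (enabled, timeout)


if __name__ == "__main__":
    sample = (
        'displayName = "MySpecialVM"\n'
        'vmx.vmOpNotificationToApp.Enabled = "TRUE"\n'
        'vmx.vmOpNotificationToApp.timeout = "60"\n'
    )
    print(app_notification_status(sample))   # (True, 60)
```

To use it against a real VM, read the contents of the VM's .vmx file into a string and pass that string to app_notification_status.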
WSFC-vMotion-App-Notification-Script.ps1 is intentionally (over)commented. We have described every important task completed by the script to ensure that you understand what is happening, and why. Our goal is to provide a reference-able template to show a real-world implementation of the new vMotion Application Notification feature for a real application and use case. The verbose comments are intended to fully describe the flow and (thereby) encourage the Reader to build a more comprehensive, elegant and suitable adaptation for their own specific situation and needs. While we have focused solely on clustered Microsoft SQL Server workloads in this demonstration, there is, of course, a vast universe of applicability and usage scenarios for this new iteration of continuous improvements to one of the most prominent and indispensable VMware vSphere features. We earnestly hope that, with this short demonstration, vSphere and Application Administrators will be able to take a closer look at this enhancement and adapt it for their own unique purposes.
LEGALESE: There is no warranty (implied or otherwise) whatsoever that the sample scripts, steps or actions provided in this post are fit for purpose. They are provided purely for demonstration purposes. Customers are highly encouraged to refrain from relying on their suitability for their own environments without a thorough review and understanding.
Special thanks to:
- William Lam (Senior Staff Solutions Architect, VMware by Broadcom) for the original PowerShell Script
- Oleg Zaydman (Software Engineer, VMware by Broadcom) for the Python Scripts and knowledge transfer
- Mark Xu (Senior Technical Marketing Architect, VMware by Broadcom), Ravindra Kumar (Software Engineer, VMware by Broadcom) and John Savanyo (Engineering Program Manager, VMware by Broadcom) for the testing, sanity checks and quality assurance
- Chen Wei (Product Marketing Manager, VMware by Broadcom) for the expert guidance and facilitation