Site Recovery Manager Best Practices
Introduction
VMware Site Recovery Manager (SRM) provides business continuity and disaster recovery protection for VMware virtual environments. Protection can range from virtual machines (VMs) residing on a single, replicated datastore to all the VMs in a datacenter and includes protection for the operating systems and applications running in the VM.
The goal of this white paper is to provide you with SRM performance data and recommendations so that you can architect an efficient recovery plan that minimizes the recovery time for your environment.
This white paper addresses various dimensions on which the recovery time depends:
- Number of virtual machines and protection groups associated with a recovery plan
- Improvements to the performance of recovering multiple protection groups within a single recovery plan
- Leveraging DPM and DRS for a better recovery
- Configuration of various recovery plan parameters
- Priority assignment of virtual machines in the recovery plan
Furthermore, we suggest best practices in applicable areas so that you can optimize the recovery time using SRM.
About Site Recovery Manager
SRM requires a protected site and a recovery site, and an SRM server must be installed at each site. Additionally, each site must be managed by its own vCenter Server.
On the protected site, you configure a protection group, which is a group of virtual machines that fail over together. On the recovery site, you add the protection group, which is a collection of VMs that can be recovered simultaneously.
SRM supports vVol, vSAN, local, NFS, iSCSI, and FC storage. It supports three forms of replication: array-based replication (ABR), in which the storage subsystem manages VM replication; vVol replication, which uses array-based replication; and host-based replication (vSphere Replication), in which ESXi manages VM replication.
SRM automatically discovers datastores set up for array-based replication between the protected and recovery sites.
vSphere Replication (VR) replicates only the most recent data in changed disk areas to increase network efficiency and eliminates the ABR requirement for having identical storage arrays across sites. SRM supports 2000 VMs for VR.
Performance Considerations
Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are the two most important performance metrics you need to keep in mind while designing and executing a disaster recovery plan.
- RPO defines the point in time at which data must be restored to meet service level agreements.
- RTO is the duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.
For array-based replication, RPO is fulfilled by the replication schedules configured on the storage array. For vSphere replication, you set the RPO using the SRM plugin in the vSphere Client. The minimum RPO you can set with VR is 15 minutes. The VR algorithm adjusts the replication schedule dynamically in order to fulfill the RPO. Site Recovery Manager helps you meet RTO by minimizing datacenter recovery time, which is crucial for any business continuity or disaster recovery solution.
Architecting Recovery Plans
This section presents Site Recovery Manager best practices in certain areas. These best practices will help you architect efficient recovery plans that minimize recovery time.
Virtual Machine to Protection Group Relation
With ABR, for each protection group included in a recovery plan, Site Recovery Manager needs to communicate with the underlying storage to create snapshots of replicated LUNs (or promote replicated LUNs in case of real recovery) in that protection group and present them to recovery site hosts.
We compared recovery time measurements with ABR for a test recovery for 300 virtual machines in a single protection group vs. 300 virtual machines in 30 protection groups (10 virtual machines per protection group).
The test recovery performance was nearly identical even though one recovery plan had more protection groups than the other.
- Key takeaway:
Adding protection groups to a recovery plan does not increase the recovery time by a large factor.
Recovery Time – iSCSI/FC vs. NFS
When working with NFS storage, SRM mounts the replicated NFS volumes/snapshots on the ESXi hosts during the recovery.
This means that if it takes X seconds to mount a single replicated NFS volume on a single recovery site host, and you have multiple volumes to mount across multiple hosts in the recovery site cluster, then mounting all the volumes across all the hosts during a recovery (both test and real) takes roughly (unique_volumes_to_be_mounted_across_all_hosts / 2) * X seconds. Similar behavior is expected for unmounting volumes.
For large scale recoveries with a large number of hosts and NFS volumes hosting the protected VMs, it might take more time to mount/unmount NFS volumes across all the recovery site hosts as compared to rescanning all host HBAs for iSCSI/FC.
When working with iSCSI/FC storage, SRM initiates a rescan on the HBAs on the hosts used for the recovery to make the storage available to all the hosts during a recovery (test and real). These rescan calls are issued in parallel across all recovery site hosts.
- Key takeaway:
It is a good practice to have fewer but larger NFS volumes so that the time taken to mount a large number of such volumes decreases during the recovery. This might also translate to fewer protection groups on your setup.
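As a rough illustration of why fewer volumes help, the estimate given earlier in this section can be written as a small function. This is only a sketch of that rule of thumb; the function name and the sample numbers are hypothetical and are not SRM measurements.

```python
def estimate_nfs_mount_seconds(unique_volumes: int, seconds_per_mount: float) -> float:
    """Rule of thumb from the text: (unique volumes / 2) * X seconds."""
    if unique_volumes < 0 or seconds_per_mount < 0:
        raise ValueError("inputs must be non-negative")
    return (unique_volumes / 2) * seconds_per_mount


# Hypothetical values for illustration only: X = 3 seconds per mount.
many_small = estimate_nfs_mount_seconds(unique_volumes=40, seconds_per_mount=3.0)
few_large = estimate_nfs_mount_seconds(unique_volumes=8, seconds_per_mount=3.0)
print(f"40 volumes: ~{many_small:.0f} s, 8 volumes: ~{few_large:.0f} s")
```

Under this estimate the mount time scales linearly with the number of unique volumes, so consolidating the same VMs onto fewer volumes shortens this phase proportionally.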
Placeholder VM Placement
When you create a protection group for a set of virtual machines on the protected site, SRM creates placeholder virtual machines at the recovery site for each protected virtual machine. During a recovery, SRM replaces each of these placeholder virtual machines with full versions of the virtual machines, which are recovered from the datastore. SRM automatically leverages DRS (independent of any cluster/DRS settings) to optimally place the virtual machines across various ESXi hosts in the cluster, even taking advantage of newly added hosts. This ensures that your cluster remains balanced after the recovery completes. This also helps cluster and VM performance during a boot storm—when many VMs are powered on at the same time after they are recovered.
SRM used to initiate a default of two “Recovery VM” operations per host. It could concurrently start up to a maximum of 18 such operations during recovery. SRM now initiates a default of unlimited “Recovery VM” operations across all hosts. This helps tremendously in decreasing the recovery time.
For details on limiting the number of concurrent Recovery VM operations to a specific value, see the next section, Configuring Job Throttling Parameters for VM Power Operations.
SRM also offers host failure resiliency—if an ESXi server hosting a placeholder VM does not respond during a recovery (for example, if the ESXi host does not have access to the recovered datastore), then SRM selects other available hosts.
- Key takeaway:
It is a good practice to have vSphere DRS enabled on the recovery site. SRM 5.0 leverages DRS to reserve sufficient resources during the recovery in order to successfully power on all VMs.
Configuring Job Throttling Parameters for VM Power Operations
SRM offers certain parameters to control the boot and shutdown operations per cluster and per host.
You might want to set these parameters to an appropriate value that would be in line with the capabilities of the underlying systems being used for the disaster recovery. This gives you some control over the boot storm executed by an SRM-initiated recovery.
The data in Figure 7 shows how much guest boot-up latency adds to the overall recovery time.
As seen in Figure 7, Steps 4 and 8—both of which involve starting up the guest and waiting for VMware Tools—take up a considerable chunk of time from the overall recovery time. With a boot storm, these latencies could increase as a result of I/O bottlenecks or any other resource bottlenecks your platform might encounter.
To configure parameters that will help control the boot storm effectively:
- Locate the SRM folder, and in it find the config folder.
- You should now be able to find the vmware-dr.xml file.
- Use a text editor like Notepad to edit this file.
- Look for the section in this file that is denoted by <Config>.
- Add the following lines between the <Config> and </Config> tags.
<defaultMaxBootAndShutdownOpsPerCluster>20</defaultMaxBootAndShutdownOpsPerCluster>
<defaultMaxBootAndShutdownOpsPerHost>20</defaultMaxBootAndShutdownOpsPerHost>
Note that 20 is just a sample value that we've used here. Consider the performance of your underlying platform when choosing an appropriate value. The following steps describe one way to do this:
1. Set these values to a specific number.
2. Run a test recovery.
3. If you notice any of the following error messages, run Cleanup, decrease the config values, and go to Step 2:
Error - Cannot complete customization, possibly due to a scripting runtime error or invalid script parameters.
Error - An error occurred when uploading files to the guest VM.
Error - Timed out waiting for VMware Tools after 600 seconds.
4. If you do not notice any errors and feel that your platform is undercommitted, run Cleanup, increase the config values, and go to Step 2.
Based on the above steps, try to find a sweet spot for these config values such that you can gain optimum RTO while not completely overcommitting the platform.
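The procedure above can be sketched in Python. This is a minimal illustration under stated assumptions, not an SRM tool: set_throttle edits an XML string that contains a <Config> element, and tune takes a hypothetical callback, run_test_recovery, which you would have to supply yourself (apply the value, run a test recovery and Cleanup, and return True if none of the errors listed above appeared). The step size and upper bound are arbitrary. Back up vmware-dr.xml before changing it; note that this parser drops XML comments when it rewrites the text.

```python
import xml.etree.ElementTree as ET

# Element names taken from the configuration lines shown above.
THROTTLE_TAGS = (
    "defaultMaxBootAndShutdownOpsPerCluster",
    "defaultMaxBootAndShutdownOpsPerHost",
)


def set_throttle(xml_text: str, value: int) -> str:
    """Return xml_text with both throttling elements set to value."""
    root = ET.fromstring(xml_text)
    config = root if root.tag == "Config" else root.find(".//Config")
    if config is None:
        raise ValueError("no <Config> element found")
    for tag in THROTTLE_TAGS:
        elem = config.find(tag)
        if elem is None:
            elem = ET.SubElement(config, tag)  # add the element if missing
        elem.text = str(value)
    return ET.tostring(root, encoding="unicode")


def tune(run_test_recovery, start: int, step: int = 5, max_value: int = 100) -> int:
    """Search for the highest error-free value, one test recovery at a time.

    run_test_recovery(value) is a hypothetical callback that returns True
    when the test recovery completed without errors.
    """
    value = start
    lowered = False
    # Errors seen: decrease the value and rerun the test recovery.
    while value > 0 and not run_test_recovery(value):
        value -= step
        lowered = True
    if value <= 0:
        raise RuntimeError("no error-free value found")
    if lowered:
        return value  # the next higher value is already known to fail
    # No errors: increase the value while test recoveries stay clean.
    while value + step <= max_value and run_test_recovery(value + step):
        value += step
    return value


# Stand-in for real test recoveries: pretend errors start above 30.
best = tune(lambda v: v <= 30, start=20)
print(set_throttle("<Config></Config>", best))
```

The loop only automates the error check. Whether the platform is undercommitted at a clean value is still a judgment you make from your own monitoring.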
If you want to increase SRM log retention, refer to:
http://blogs.vmware.com/uptime/2011/04/increasing-srm-log-retention.html
Standby Hosts on the Recovery Site and Enabling DPM
Site Recovery Manager works with vSphere to recover VMs even on standby hosts.
If the recovery site cluster has a couple of standby hosts, SRM works with DPM to power on any standby hosts in the cluster. SRM then works with DRS to use these hosts for recovering VMs during a test and real recovery. SRM brings DPM out of automatic mode while VMs are being powered on during recovery in order to prevent hosts from being placed in a standby state. It resets DPM cluster settings to their original state after the recovery has successfully completed. This process with DPM occurs whether or not you have DPM enabled for your cluster.
VM recovery starts only after SRM has finished bringing all ESXi hosts out of standby mode. These standby ESXi hosts are powered on concurrently. This power-on process creates an overhead, which is the maximum amount of time taken to bring any one host out of standby mode. In general, this overhead is relatively small compared to the gain in recovery time performance due to the availability of more ESXi hosts. Since this overhead does not vary as the number of VMs increases, you will reap more performance benefits if you have a larger number of protected VMs.
- Key takeaways:
- More hosts lead to more concurrency for recovering VMs, which results in a shorter recovery time.
- When protecting VMs (creating protection groups) ensure that the recovery site hosts mapped under the respective inventory are in a proper powered-on state; otherwise, SRM will not use those hosts to create placeholder VMs.
- SRM places DPM in a manual mode while VMs are being powered on during recovery in order to prevent hosts from being placed in a standby state. It resets DPM to its original state after the recovery has successfully completed.
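Because standby hosts are powered on concurrently, the overhead described in this section is the time of the slowest host, not the sum over all hosts. The following sketch shows that calculation; the host names and times are hypothetical.

```python
def standby_overhead_seconds(exit_standby_seconds) -> float:
    """Overhead of a concurrent power-on: the slowest host's time."""
    return max(exit_standby_seconds, default=0.0)


# Hypothetical per-host times (in seconds) to leave standby mode.
host_times = {"esx-a": 180.0, "esx-b": 240.0, "esx-c": 210.0}
overhead = standby_overhead_seconds(host_times.values())
print(f"standby overhead: {overhead:.0f} s")
```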
High Priority and Suspending Virtual Machines
In a recovery plan, the virtual machines being recovered can be assigned to five different priority groups. SRM also provides the functionality of setting dependencies across individual VMs.
The recovery time does increase with an increase in the dependency across individual VMs or groups of VMs. This applies to both real and test recoveries.
Chaining groups of VMs together is a better idea than chaining individual VMs. This means that "priority groups" should be used first to determine the dependency and startup order of the VMs instead of using the "VM dependencies" functionality directly because individual VMs starting up sequentially affect the RTO. Grouping VM dependencies in priority groups is usually the best and the safest idea because VMs within each priority group will be started in parallel.
You can also configure SRM to suspend VMs during a recovery. Note that suspending VMs is a resource-intensive operation that might take time to complete depending upon the configuration of the VMs being suspended. Your overall RTO might increase if a lot of VMs are being suspended. However, the upside to this is that suspending VMs frees up platform resources, which can then be used to recover other VMs that are part of your recovery plan.
- Key takeaways:
- It is important to chart out the dependencies and priorities between the virtual machines to be recovered so that individual dependencies are assigned only to the virtual machines that require them. Such dependencies impact recovery time.
- Configuring VM dependencies across priority groups instead of setting per VM dependencies is highly recommended because VMs within each priority group will be started in parallel.
- Suspending virtual machines on the recovery site will also impact recovery time.
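The difference between priority groups and chained per-VM dependencies can be illustrated with a deliberately simplified model. It assumes that VMs within a priority group start fully in parallel, that groups start one after another, and that a dependency chain starts VMs strictly one at a time. It ignores throttling, resource contention, and every other recovery step; the boot times are hypothetical.

```python
def priority_group_seconds(groups) -> float:
    """Groups run in sequence; VMs inside a group start in parallel."""
    return sum(max(group) for group in groups if group)


def chained_seconds(boot_seconds) -> float:
    """A per-VM dependency chain starts VMs one after another."""
    return sum(boot_seconds)


# Hypothetical boot times in seconds for five VMs in two priority groups.
groups = [[60, 90], [120, 80, 100]]
grouped = priority_group_seconds(groups)
chained = chained_seconds([t for group in groups for t in group])
print(f"priority groups: {grouped} s, fully chained: {chained} s")
```

In this model each priority group costs only as much as its slowest VM, while a chain costs the sum of all boot times, which is why individual dependencies are best kept to the VMs that need them.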
Advanced Settings/VMware Tools
SRM offers certain advanced settings that you can configure on each site. These settings can affect the performance of general SRM operations and critical operations like test and real recoveries.
To configure advanced settings:
1. From the vSphere Client, right-click the site on which you want to configure a particular setting.
Figure 9. Advanced Settings page
Here are some settings that you might want to be aware of:
- Whenever the protected site inventory changes, SRM will perform a new LUN Group computation. If you are adding multiple VMs to the datacenter or changing any inventory in general, you can get SRM to wait before doing another computation by setting storage.minDsGroupComputationInterval to the approximate time taken to make the change. For example, setting storage.minDsGroupComputationInterval to a number value tells SRM that there should be at least that many seconds in between any two consecutive LUN Group Computation tasks. This setting is intended to make LUN Group Computation tasks less frequent when there are a lot of inventory changes going on.
- VMware strongly recommends that VMware Tools be installed in all protected virtual machines. Many SRM recovery operations depend on the proper installation of VMware Tools in the protected virtual machines to carry out the following tasks:
- Wait for the OS heartbeat while powering on the virtual machine and wait for a network change while reconfiguring the recovered virtual machine. SRM depends on VMware Tools to report the OS heartbeat and completion of the network change. However, if you do not have VMware Tools installed on any of the protected virtual machines, you can choose to set the timeout values for recovery.powerOnTimeout and recovery.customizationTimeout to zero (0).
- Wait for virtual machines to shut down on the protected site.
- During a planned migration, Site Recovery Manager tries to gracefully shut down the virtual machines on the protected site. Before Site Recovery Manager forcibly powers a virtual machine off, it tries to shut down the guest OS. If your intention is to power off the virtual machines without gracefully shutting down the guest OS, you can set recovery.skipGuestShutdown to true in the Advanced Settings menu.
NOTE: If the VMs do not have VMware Tools installed and the guest shutdown timeout is set to a non-zero value, then your recovery will not proceed beyond the “Shutdown VMs at Protected Site” step. When your VMs do not have VMware Tools installed, you will have to set recovery.skipGuestShutdown to true if you want your recovery plan to make any progress.
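The VMware Tools guidance above can be summarized as a small lookup. This is only a sketch: the setting names come from this section, but the function, its single boolean input, and the idea of returning the overrides as a dictionary are illustrative assumptions. The settings themselves are changed on the Advanced Settings page.

```python
def advanced_setting_overrides(tools_installed: bool) -> dict:
    """Overrides this section suggests when VMware Tools is not installed."""
    if tools_installed:
        return {}  # keep the existing values; SRM can rely on VMware Tools
    return {
        # Do not wait for an OS heartbeat or a network change report.
        "recovery.powerOnTimeout": 0,
        "recovery.customizationTimeout": 0,
        # Power off protected-site VMs without a guest OS shutdown.
        "recovery.skipGuestShutdown": True,
    }


for name, value in advanced_setting_overrides(tools_installed=False).items():
    print(f"{name} = {value}")
```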
Specify a nonreplicated datastore for swap files
Every virtual machine requires a swap file, which is normally created in the same datastore as the other virtual machine files. When you use SRM, this datastore is replicated. To prevent swap files from being replicated, create them on a non-replicated datastore.
If you are using a non-replicated datastore for swap files, you must create a non-replicated datastore for all protected clusters at both the protected and recovery sites. To do so:
- In the vSphere Client, right-click an ESXi cluster and click Edit Settings.
- In the Settings window for the cluster, click Swapfile Location and select Store the swapfile in the datastore specified by the host, and then click OK.
- For each host in the cluster, select a nonreplicated datastore:
  - Click the Configuration tab.
  - On the Swapfile Location line, click Edit.
  - In the Virtual Machine Swapfile Location window, select a nonreplicated datastore and click OK.
Recovery Time Advantages
SRM makes remote calls to vCenter Server for deleting any swap files found on a replicated datastore during a recovery on the recovery site. If swap files reside on non-replicated datastores, then this step is skipped, which speeds up the recovery. This will also avoid wasting network bandwidth during replication between the two sites. This is especially true if you’re using NFS storage.
Recommendations
VMware vCenter Site Recovery Manager provides advanced capabilities for disaster recovery management, non-disruptive testing, and automated failover. The following performance recommendations have been made in this paper:
- Install the SRM database as close to the SRM server as possible to reduce the round-trip time between them. This way, recovery time performance will not degrade because of round trips to the database server.
- It is a good practice to have fewer but larger NFS volumes so that the time taken to mount a large number of such volumes decreases during the recovery. This might also translate to fewer protection groups on your setup leading to reduced recovery time.
- It is a good practice to have DRS enabled on a recovery site.
- More hosts lead to more concurrency for recovering VMs which results in shorter recovery time.
- Also, before protecting VMs, bring recovery site hosts out of standby mode so that they get leveraged for creating placeholder VMs.
- Configuring VM dependencies across priority groups instead of setting per VM dependencies is usually the best idea because VMs within each priority group will be started in parallel.
- It is strongly recommended that VMware Tools be installed in all protected virtual machines in order to accurately acquire their heartbeats and network change notification. Refer to Advanced Settings/VMware Tools for more information.
- Make sure any internal script or call-out prompt does not block recovery indefinitely.
- Specify a non-replicated datastore for swap files. This avoids wasting network bandwidth during replication between the two sites and reduces remote calls to vCenter Server during recovery to delete swap files for all VMs, which in turn speeds up the recovery.