Data Protection for VMware vSAN
Data protection is a key component of any business continuity plan. VMware provides a solid foundation for VMware vSAN data protection and disaster recovery in a wide variety of use cases and scenarios. The requirement could be as simple as protecting a few critical workloads in a single location, or as large as backup, restore, replication, and automated recovery for multiple sites around the world. VMware enables these solutions through built-in capabilities combined with mature, robust APIs and SDKs that vendors who specialize in data protection can build upon.
Features that are native to VMware vSphere and vSAN can be combined with third-party solutions to build a comprehensive data protection strategy based on your specific business needs and technical requirements. Solutions are available for on-premises and cloud-based deployments. For example, a third-party product can be used to back up and quickly restore individual virtual machines, while VMware vSphere Replication and VMware Site Recovery, the disaster recovery as a service (DRaaS) offering available with VMware Cloud on AWS, provide recovery from site outages when needed. The number of solutions and flexibility provided by VMware and its partners is unparalleled.
This document provides information on data protection features built into vSphere and vSAN. Details are also included on how VMware works with its partner ecosystem to provide data protection solutions for nearly any environment running workloads on HCI deployments powered by vSAN.
VMware vSphere, using VM snapshots, provides the ability to capture the point-in-time state and data of a virtual machine. This includes the virtual machine's storage, memory, and other devices such as virtual NICs. Snapshots are useful for capturing the state and data of a VM for backup or archival purposes and for creating test and rollback environments for applications. Snapshots can capture virtual machines that are powered on, powered off, or even suspended. When the virtual machine is powered on, there is an option to capture the virtual machine's memory state so that the virtual machine can be reverted to a powered-on point in time.
For further information about how to use virtual machine snapshots in vSphere environments, see Using Snapshots to Manage Virtual Machines.
vsanSparse is a snapshot on-disk format derived from VirstoFS technology. This always-sparse filesystem, which uses a 512-byte block size instead of the 1MB block size of VMFS-L, provides the basis for the new snapshot format. Using the underlying sparseness of the filesystem and a new in-memory metadata cache for lookups, vsanSparse offers greatly improved performance compared to previous virtual machine snapshot implementations.
Compared with redo log based snapshots as you would see on VMFS, vSAN uses vsanSparse delta objects. The vsanSparse snapshot format provides vSAN administrators with enterprise-class snapshots and clones. The goal is to improve snapshot performance by continuing to use the existing redo logs mechanism but now utilizing an “in-memory” metadata cache and a more efficient sparse filesystem layout.
Unlike VMFS Snapshots, creating a vsanSparse snapshot does not create redo logs. Instead, when a snapshot is taken, additional “delta” objects get created which are identical to the base objects in vSAN. There are metadata structures that are local to every snapshot level of a VMDK that contain information about blocks that are newly written at that particular level.
In-memory read cache
With vSAN, VMs are made up of objects. A delta disk (snapshot) object is made up of a set of grains, where each grain is a block of sectors containing virtual disk data. A VMDK object backs each delta. The deltas keep only changed grains, so they are space-efficient.
In the diagram below, the Base disk object is called Disk.vmdk and is at the bottom of the chain. There are three snapshot objects (Disk-001.vmdk, Disk-002.vmdk, and Disk-003.vmdk) taken at various intervals and guest OS writes are also occurring at various intervals, leading to changes in snapshot deltas.
- Base object – writes to grain 1,2,3 & 5
- Delta object Disk-001 – writes to grain 1 & 4
- Delta object Disk-002 – writes to grain 2 & 4
- Delta object Disk-003 – writes to grain 1 & 6
A read by the VM will now return the following:
- Grain 1 – retrieved from Delta object Disk-003
- Grain 2 – retrieved from Delta object Disk-002
- Grain 3 – retrieved from the Base object
- Grain 4 – retrieved from Delta object Disk-002
- Grain 5 – retrieved from the Base object - 0 returned as it was never written
- Grain 6 – retrieved from Delta object Disk-003
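The read resolution in the example above can be sketched in a few lines of Python. This is only an illustrative model, not vSAN code: the `chain` list, the `resolve_grain` helper, and the payload strings are all hypothetical, and the layers simply mirror the write list given earlier. For each grain, the newest layer that contains it wins.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical model of the chain described above, oldest layer first.
# Each layer maps a grain number to the data written at that level.
chain: List[Tuple[str, Dict[int, str]]] = [
    ("Disk.vmdk", {1: "base-1", 2: "base-2", 3: "base-3", 5: "base-5"}),
    ("Disk-001.vmdk", {1: "d1-1", 4: "d1-4"}),
    ("Disk-002.vmdk", {2: "d2-2", 4: "d2-4"}),
    ("Disk-003.vmdk", {1: "d3-1", 6: "d3-6"}),
]


def resolve_grain(layers: List[Tuple[str, Dict[int, str]]], grain: int) -> Optional[str]:
    """Return the name of the newest layer holding the grain, or None if unwritten."""
    for name, grains in reversed(layers):  # start at the top-most delta
        if grain in grains:
            return name
    return None  # no layer ever wrote this grain


for grain in range(1, 7):
    print(f"Grain {grain}: {resolve_grain(chain, grain)}")
```

Walking the chain from the top down reproduces the object named for each grain in the list above. A grain that no layer has written resolves to `None` in this model.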
Consider the case when a snapshot has been taken of a virtual machine. When a guest OS sends a write to disk, the vsanSparse driver receives the write. Writes always go to the top-most object in the snapshot chain. When the write is acknowledged, the vsanSparse driver updates its “in-memory” metadata cache and confirms the write back to the guest OS. On subsequent reads, the vsanSparse driver can reference its metadata cache and on a cache hit, immediately locate the data block.
Reads are serviced from one or more of the vsanSparse deltas in the snapshot tree. The vsanSparse driver checks the “in-memory” metadata cache to determine which delta or deltas to read. This depends on what parts of the data were written at a particular snapshot level. Therefore, to satisfy a read I/O request, the snapshot logic does not need to traverse every delta of the snapshot tree but can go directly to the necessary vsanSparse delta and retrieve the requested data. Reads are sent in parallel to all deltas that hold the necessary data.
On a cache miss, however, the vsanSparse driver must still traverse each layer to fetch the latest data. This is done in a similar way to read requests, in that the requests are sent to all layers in parallel.
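The write and read paths described above can be approximated with a toy model. The sketch below is an assumption-laden illustration, not the vsanSparse driver: the `SnapshotChain` class and its methods are hypothetical, and a cache miss walks the layers one by one for simplicity, whereas the text says the real requests are sent to all layers in parallel.

```python
from typing import Dict, List, Optional


class SnapshotChain:
    """Toy snapshot chain with an in-memory metadata cache (illustrative only)."""

    def __init__(self, layers: int) -> None:
        # layers[0] is the base object; layers[-1] is the top-most delta.
        self.layers: List[Dict[int, bytes]] = [{} for _ in range(layers)]
        # Metadata cache: grain number -> index of the layer with the newest copy.
        self.cache: Dict[int, int] = {}
        self.traversals = 0  # number of cache misses that had to search the chain

    def write(self, grain: int, data: bytes) -> None:
        top = len(self.layers) - 1
        self.layers[top][grain] = data  # writes always go to the top-most object
        self.cache[grain] = top         # update the cache before acknowledging

    def read(self, grain: int) -> Optional[bytes]:
        idx = self.cache.get(grain)
        if idx is None:                 # cache miss: search every layer
            self.traversals += 1
            idx = self._locate(grain)
            if idx is None:
                return None             # grain was never written at any level
            self.cache[grain] = idx     # remember the location for next time
        return self.layers[idx][grain]

    def _locate(self, grain: int) -> Optional[int]:
        # Sequential top-down walk; the real driver queries layers in parallel.
        for idx in range(len(self.layers) - 1, -1, -1):
            if grain in self.layers[idx]:
                return idx
        return None


snap = SnapshotChain(layers=3)
snap.layers[0][7] = b"old"   # data already in the base, not yet in the cache
snap.write(1, b"new")
first = snap.read(1)         # cache hit: located without searching the chain
second = snap.read(7)        # cache miss: chain is searched once
third = snap.read(7)         # cache hit: location was remembered
print(first, second, third, snap.traversals)
```

The point of the cache is visible in the `traversals` counter: only the first read of a grain that was written before the cache was populated pays the cost of searching the chain.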
Things to keep in mind while using snapshots:
- Typically, snapshots are used temporarily as a point-in-time copy of a virtual disk to provide a quick rollback during a change window. Snapshots are also used by backup tools to allow point-in-time backups without interrupting the normal operation of the VM.
- Snapshots will never grow larger than the size of the original base disk. The size of each delta depends on the number of changes made since the snapshot was taken.
- Proactively monitor the vSAN datastore capacity and read cache consumption on a regular basis when using snapshots intensively on vSAN.
- VMware supports the full maximum chain length of 32 snapshots when vsanSparse snapshots are used.
- Even with the improved snapshot capabilities of vSAN, the recommendation is to keep only a few snapshots, and only for a short time.
For more information on vsanSparse snapshots, see the vsanSparse Tech Note.
Backing up and restoring Kubernetes cluster resources
Velero (formerly Heptio Ark) gives you tools to back up and restore your Kubernetes cluster resources and persistent volumes. Velero lets you:
- Take backups of your cluster and restore in case of loss.
- Migrate cluster resources to other clusters.
- Replicate your production cluster to development and testing clusters.
Velero consists of:
- A server that runs on your cluster
- A command-line client that runs locally
You can run Velero in clusters on a cloud provider or on-premises. For detailed information, see Compatible Storage Providers.
Each Velero operation -- on-demand backup, scheduled backup, restore -- is a custom resource, defined with a Kubernetes Custom Resource Definition (CRD) and stored in etcd. Velero also includes controllers that process the custom resources to perform backups, restores, and all related operations. You can back up or restore all objects in your cluster, or you can filter objects by type, namespace, and/or label.
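As a rough illustration of what such a custom resource looks like, the Python sketch below builds a Backup object filtered by namespace and label and prints it as JSON, which `kubectl apply -f` also accepts. The field names reflect the Velero `velero.io/v1` Backup API as commonly documented; treat them as assumptions and check them against the documentation for your Velero version. The helper name `velero_backup_manifest`, the backup name `pre-upgrade`, the namespace `shop`, and the label `app=checkout` are made up for the example.

```python
import json
from typing import Dict, Iterable, Optional


def velero_backup_manifest(
    name: str,
    namespaces: Iterable[str],
    label_selector: Optional[Dict[str, str]] = None,
    ttl: str = "720h0m0s",
) -> dict:
    """Build a Backup custom resource as a plain dict (hypothetical helper)."""
    spec: dict = {
        "includedNamespaces": list(namespaces),  # filter by namespace
        "ttl": ttl,                              # how long to keep the backup
    }
    if label_selector:
        # filter by label
        spec["labelSelector"] = {"matchLabels": dict(label_selector)}
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Backup",
        # Assumes Velero is installed in the "velero" namespace.
        "metadata": {"name": name, "namespace": "velero"},
        "spec": spec,
    }


manifest = velero_backup_manifest("pre-upgrade", ["shop"], {"app": "checkout"})
print(json.dumps(manifest, indent=2))
```

Because the backup is just another Kubernetes object, the Velero controllers running in the cluster pick it up and act on it in the same way as a backup requested through the command-line client.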
Velero is ideal for the disaster recovery use case, as well as for snapshotting your application state, prior to performing system operations on your cluster (e.g. upgrades).
For more information, go to velero.io.
VMware vSphere® Replication™ (VR) is a hypervisor-based, asynchronous replication solution for vSphere virtual machines. It is fully integrated with VMware vCenter Server® and VMware vSAN. vSphere Replication delivers flexible, reliable, and cost-efficient replication to enable data protection and disaster recovery for virtual machines in your vSphere environment. Replication is configured per virtual machine, allowing precise control over which workloads are protected.
The vSphere HTML5 client is used to configure replication for a virtual machine. Replication for one or more virtual machines can be selected and configured via the same workflow. When configuring replication, an administrator specifies items such as the virtual machine storage policy, RPO, VSS or Linux file system quiescing, network compression, and encryption of replication traffic. Virtual machine snapshots are not used as part of the replication process unless VSS quiescing is enabled.
The target location for vSphere Replication can be within the same vCenter Server environment or in another vCenter Server environment with vSphere Replication deployed.
The same vSphere Replication deployment can replicate some virtual machines to a local vCenter Server environment and other virtual machines to a remote vCenter or to VMware Site Recovery for VMware Cloud on AWS.
NOTE: The same virtual machine cannot be replicated to multiple targets.
After replication has been configured for a virtual machine, vSphere Replication begins the initial full synchronization of the source virtual machine to the target location. The time required to complete this initial synchronization can vary considerably and depends primarily on how much data must be replicated and how much network bandwidth is available.
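A back-of-the-envelope estimate helps when planning the initial synchronization. The helper below is a simple sketch, not a VMware sizing tool: `initial_sync_hours` is a hypothetical function, and the `efficiency` factor (the share of the link actually usable for replication) is an assumed value that should be replaced with a figure measured in your own environment.

```python
def initial_sync_hours(data_gib: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Estimate hours needed to copy data_gib over a link_mbps link.

    efficiency is an assumed fraction of the link usable for replication.
    """
    if data_gib < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("data must be >= 0, bandwidth > 0, efficiency in (0, 1]")
    bits_to_send = data_gib * 1024**3 * 8                 # GiB -> bits
    usable_bits_per_second = link_mbps * 1_000_000 * efficiency
    return bits_to_send / usable_bits_per_second / 3600   # seconds -> hours


hours = initial_sync_hours(data_gib=500, link_mbps=100)
print(f"Estimated initial sync: {hours:.1f} hours")
```

With these assumed inputs, 500 GiB over a 100 Mbps link at 70% efficiency works out to roughly 17 hours, which shows why large virtual machines on slow links are good candidates for seeding.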
A copy of the VMDKs to be replicated can be created and shipped to the target location and used as “seeds,” reducing the time and network bandwidth consumed by the initial full synchronization. When replication begins, vSphere Replication compares the universally unique identifiers (UUIDs) of the source virtual disks with those of the target “seed” copies.
After the initial full synchronization, changes to the protected virtual machine are tracked and replicated regularly. The transmissions of these changes are referred to as “lightweight delta syncs.” Their frequency is determined by the RPO that was configured for the virtual machine. A lower RPO requires more-frequent replication.
The vSphere Replication user interface provides information such as status, last synchronization duration and size, configured RPO, and which vSphere Replication server is receiving the replicated data.
The components that transmit replicated data (the vSphere Replication agent and a vSCSI filter) are built into vSphere. They provide the plug-in interfaces for configuring and managing replication, track the changes to VMDKs, automatically schedule replication to achieve the RPO for each protected virtual machine, and transmit the changed data to one or more vSphere Replication virtual appliances.
Data is transmitted from the source vSphere host to either a vSphere Replication management server or a vSphere Replication server and is written to storage at the target location. The replication stream can be encrypted. As data is being replicated, the changes are first written to a file called a redo log, which is separate from the base disk. After all changes for the current replication cycle have been received and written to the redo log, the data in the redo log is consolidated into the base disk. This process helps ensure the consistency of each base disk so virtual machines can be recovered at any time, even if replication is in progress or network connectivity is lost during transmission.
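The value of staging changes in a redo log can be shown with a small model. This is a sketch under stated assumptions, not the vSphere Replication implementation: the `ReplicaDisk` class and its methods are hypothetical, and discarding a partial redo log on an interrupted cycle is an assumption made for the example, since the text does not say what happens to it.

```python
from typing import Dict


class ReplicaDisk:
    """Toy target-side disk: a base disk plus a redo log for the current cycle."""

    def __init__(self, blocks: Dict[int, bytes]) -> None:
        self.base: Dict[int, bytes] = dict(blocks)  # always a consistent image
        self.redo_log: Dict[int, bytes] = {}        # changes received so far

    def receive(self, block: int, data: bytes) -> None:
        # Incoming changes never touch the base disk directly.
        self.redo_log[block] = data

    def commit_cycle(self) -> None:
        # Called only after every change for the cycle has arrived.
        self.base.update(self.redo_log)
        self.redo_log.clear()

    def abort_cycle(self) -> None:
        # Assumption: an interrupted cycle simply drops the partial redo log.
        self.redo_log.clear()


replica = ReplicaDisk({0: b"A", 1: b"B"})

replica.receive(0, b"A2")         # first change arrives, then the link drops
replica.abort_cycle()
after_abort = dict(replica.base)  # base disk is still the last consistent image

replica.receive(0, b"A2")         # next cycle completes normally
replica.receive(1, b"B2")
replica.commit_cycle()
print(after_abort, replica.base)
```

Because the base disk changes only when a complete cycle is consolidated, a recovery started at any moment sees either the previous consistent image or the new one, never a mixture.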
The process of replicating to VMware Site Recovery for VMware Cloud on AWS is identical to normal vSphere Replication operation.
When configuring replication for a virtual machine, an administrator can enable the retention of multiple recovery points (point-in-time instances). This can be useful when an issue is discovered several hours or even a few days after it occurred. For example, a replicated virtual machine with a 4-hour RPO contracts a virus, but the virus is not discovered until 6 hours after infection. As a result, the virus has been replicated to the target location. With multiple recovery points, the virtual machine can be recovered and then reverted to a recovery point retained before the virus issue occurred.
The maximum number of recovery points that can be retained is 24. The following are some examples:
- Three recovery points per day over the last 5 days (15 recovery points)
- Five recovery points per day over the last 2 days (10 recovery points)
- Four recovery points per day over the last 6 days (24 recovery points)
The number of recovery points that can be retained per day depends on the RPO, specifically on the number of replication cycles that occur during the day. For example, retaining eight recovery points per day is impossible if the RPO is set to 4 hours, because only six replication cycles occur per day. Retaining multiple recovery points consumes additional storage at the target location and must be planned for accordingly. The additional storage requirements depend on the number of recovery points retained and the data change rates in the source virtual machines.
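The two constraints described above (at most 24 recovery points in total, and no more points per day than there are replication cycles) can be checked with simple arithmetic. The function below is a hypothetical helper written for illustration; it is not part of any vSphere Replication interface.

```python
MAX_RECOVERY_POINTS = 24  # overall limit stated above


def validate_retention(rpo_minutes: int, points_per_day: int, days: int) -> int:
    """Return the total recovery points retained, or raise if the plan is impossible."""
    if rpo_minutes <= 0 or points_per_day <= 0 or days <= 0:
        raise ValueError("all inputs must be positive")
    cycles_per_day = (24 * 60) // rpo_minutes  # replication cycles in one day
    if points_per_day > cycles_per_day:
        raise ValueError(
            f"RPO of {rpo_minutes} min allows only {cycles_per_day} points per day"
        )
    total = points_per_day * days
    if total > MAX_RECOVERY_POINTS:
        raise ValueError(f"{total} recovery points exceeds the limit of {MAX_RECOVERY_POINTS}")
    return total


print(validate_retention(rpo_minutes=240, points_per_day=3, days=5))  # 3 x 5
print(validate_retention(rpo_minutes=240, points_per_day=4, days=6))  # 4 x 6
```

Asking for eight points per day with a 240-minute RPO raises an error, because only six replication cycles fit into a day.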
This makes it possible to recover from damage or corruption that isn’t immediately apparent. vSphere Replication also supports traffic encryption, traffic compression, and guest OS quiescing. These capabilities make vSphere Replication more secure, flexible, and capable.
For additional information about vSphere Replication, see:
VMware APIs for Data Protection
APIs for Data Protection
VMware Storage APIs - Data Protection (VADP) uses the Virtual Disk API and a subset of vSphere APIs to create and manage snapshots of virtual machines running on ESXi hosts. Data protection vendors use VADP together with the Virtual Disk Development Kit (VDDK) to create and manage these snapshots. Virtual machine snapshots are a key component that enables backups of virtual disks with minimal disruption.
Virtual machine backups are commonly done by creating a snapshot of the virtual machine’s virtual disk(s) to obtain a static image of the virtual machine. The virtual machine continues to process IO with the snapshot in place. Writes to a virtual disk are redirected to a redo log. When the backup of the virtual disk is complete and the snapshot is removed, the changes captured in the redo log are consolidated into the virtual disk. This approach enables a quick and clean backup operation.
Figure 1. Virtual machine snapshot during a backup
An incremental backup mechanism called Changed Block Tracking (CBT) is also provided. CBT tracks disk sectors that change between snapshots. A Change ID is set and incremented each time a snapshot is taken. This provides data protection vendors the ability to perform full backups (all of the data) and incremental backups (changed data since the last backup).
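The idea behind changed block tracking can be illustrated with a toy disk that remembers which blocks were written after each snapshot. This is a conceptual sketch only: the `TrackedDisk` class, its methods, and the integer change IDs are hypothetical and do not represent the vSphere API or the real Change ID format.

```python
from typing import Dict, List, Set


class TrackedDisk:
    """Toy virtual disk that tracks blocks changed since each snapshot."""

    def __init__(self, num_blocks: int) -> None:
        self.blocks: List[bytes] = [b"\x00"] * num_blocks
        self.change_id = 0
        # change ID -> set of block indexes written since that snapshot
        self._changed_since: Dict[int, Set[int]] = {}

    def write(self, index: int, data: bytes) -> None:
        self.blocks[index] = data
        for changed in self._changed_since.values():
            changed.add(index)

    def snapshot(self) -> int:
        self.change_id += 1                      # incremented on every snapshot
        self._changed_since[self.change_id] = set()
        return self.change_id

    def changed_blocks(self, since_change_id: int) -> List[int]:
        return sorted(self._changed_since[since_change_id])


disk = TrackedDisk(8)
disk.write(0, b"a")
disk.write(3, b"b")

full_id = disk.snapshot()
full_backup = list(disk.blocks)                  # full backup: every block

disk.write(3, b"c")
disk.write(5, b"d")

disk.snapshot()
changed = disk.changed_blocks(full_id)           # blocks written since the full backup
incremental = {i: disk.blocks[i] for i in changed}

restored = list(full_backup)                     # restore = full + incremental
for i, data in incremental.items():
    restored[i] = data
print(changed, restored == disk.blocks)
```

The incremental backup copies only two of the eight blocks, yet applying it on top of the full backup reproduces the current disk contents.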
Recent versions of vSphere include the vSphere APIs for IO Filtering (VAIO) SDK. VAIO is a framework that enables the development of filters that run in ESXi to intercept IO requests from a guest operating system to a virtual disk. The intercepted IO is split off for use cases such as caching and replication. VAIO makes it possible to achieve very low RPOs (measured in seconds) with asynchronous replication without introducing latency to the IO path or "stuns" commonly associated with redo log-based virtual machine snapshots. Software vendors can use VAIO to build replication products for data protection and disaster recovery solutions. One example of this is Dell EMC RecoverPoint for Virtual Machines.
VMware Partner Ecosystem
VMware Ready for vSAN
The process and mechanisms used to back up virtual machines on vSAN are nearly identical to those used with other vSphere datastore types. Backup solutions use the same APIs and snapshots to back up and restore virtual machines that reside on vSAN. Agents installed in the guest operating system work the same regardless of the underlying hardware (virtual or physical) and storage.
It is best to choose a solution that has been validated and is fully supported with vSAN. VMware makes this simple through the VMware Ready for vSAN program. This program provides a set of tools, resources, and processes to VMware partners so that they can certify their products with vSAN. Customers can easily find what data protection solutions have been tested and certified for use with vSAN. The list of certified products is part of the VMware Compatibility Guide.
Data Protection Partners
There are many VMware partners that provide mature, robust data protection solutions for VMware virtual machines. These solutions use the VADP snapshot-based method for backing up virtual machines with minimal or no disruption. Some of these solutions also include agents that can be installed for file-level and application-level backups and restores. Agent backups do not require virtual machine snapshots.
Determining which vendor is right for your environment depends on your requirements. These solutions all have common capabilities, such as being able to back up and restore a VMware virtual machine: the virtual disks and the rest of the files that make up a virtual machine (VMX file, NVRAM file, and so on). Most solutions have unique offerings as well. These differences can be seen in areas such as the availability of backup agents, specific application compatibility, options for where backup data is stored, support for physical workloads (in addition to virtual machines), the backup methods that can be used, and the user interface, to name a few. It is important to gather technical and business requirements before selecting a data protection solution for your environment.
The other key consideration is support. Data protection vendors provide customer support for their solutions. VMware simply provides the APIs for these vendors to use in their solutions. A common question is “What backup software does VMware support?” Technically, the answer is “None.” Customer support is provided by the data protection vendor. Therefore, it is very important to verify the data protection solution you are considering is compatible and fully supported—by the backup vendor—with your specific environment.
As mentioned previously, there are many data protection solutions for VMware platforms. This provides you with a wide variety of solutions to choose from based on your environment, requirements, and budget.
Data Protection Partner Assets
General Data Protection Recommendations
VMware provides APIs for vendors to use when building data protection products. These are standard APIs available to all data protection vendors. How these APIs are implemented often varies from one data protection solution to another, which means the recommended best practices for one solution might differ from those for another. You should consult with your data protection vendors for specific recommendations for your environment and their solution. Below is a short list of general recommendations and observations.
- Document your business and technical requirements and use the VMware Compatibility Guide to help narrow the list of potential data protection solutions.
- Conduct a short proof-of-concept to validate that the solutions on your list meet your requirements and function as desired.
- Agent-less, snapshot-based virtual machine backups are enough in most cases. However, certain applications, such as mail servers and databases, can benefit from using in-guest backup agents. It is common for organizations to perform a combination of snapshot (agent-less) backups and agent backups.
- If possible, maintain two copies of your backup data—one on-site, another off-site. The on-site copy helps facilitate faster restores. The off-site copy is for disaster recovery.
- Perform frequent test restores to validate your backup solution is working properly.
- A well-rounded business continuity strategy includes a data protection (backup and restore) solution and a ransomware/disaster recovery plan.
Data Protection for vSAN Summary
When it comes to data protection and disaster recovery of vSAN, VMware provides a solid foundation of products and solutions. vsanSparse snapshots use in-memory caching to improve performance and stability. vSphere Replication enables efficient replication of virtual machines for disaster recovery with RPOs as low as five minutes. vSAN stretched clusters provide a resilient infrastructure with an RPO of zero across sites. A core set of VMware APIs makes it easy for a diverse ecosystem of hardware and software partners to develop data protection solutions.
These features and products provide a wide variety of backup and disaster recovery options. You can choose one or multiple products to build a comprehensive data protection strategy based on your specific business needs and technical requirements.