February 09, 2023

vVols with NVMe - A Perfect Match

Engineering article from Ashutosh Sarawat describing the benefits of running vVols on NVMe-FC.

vSphere 8.0 vVols supports the NVMe-FC (NVMe over Fibre Channel) protocol in the data path. This means that vVols now support all of the commonly used data protocols: SCSI, iSCSI, NFS, and now NVMe-FC.

This support was added through a new version of the VASA specification, VASA version 4, which adds the details of how the NVMe protocol is supported with vVols.

The vVols implementation of the NVMe protocol extracts the best from both the VASA control path and the NVMe IO path.
This means that all the details of the NVMe protocol are hidden from the storage/vSphere admin.
Further, the end-to-end configuration of NVMe is fully automated: from obtaining information about the discovery controllers running on the target array, to discovering the IO controllers and connecting to the namespaces behind them, everything happens without manual intervention. A vSphere admin does not need to know any NVMe details in order to configure and use NVMe with vVols.

 

Overview

NVMe namespaces represent distinct units of storage capacity that are accessed independently for IO, with persistent reservations and multi-pathing tracked on a per-namespace basis, and they have a flat structure with no hierarchy between namespaces. These properties map very well to vVols, which share much of the same definition on vSphere. Hence, on NVM subsystems that support vVols, we get a 1:1 mapping of vVols to namespaces. ANA Groups in NVMe provide a construct that maps to the PE construct in SCSI and hence were chosen to represent the vPE (virtual Protocol Endpoint).
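To make the mapping concrete, here is a minimal, purely illustrative Python sketch (not VMware code; all class and field names are hypothetical) of the 1:1 vVol-to-namespace relationship and the grouping of vVols under a vPE by ANA group:

```python
# Illustrative data model only -- names are hypothetical, not ESXi internals.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Namespace:
    nsid: int              # namespace ID within the NVM subsystem
    nguid: str             # globally unique namespace identifier
    ana_group_id: int      # ANA group this namespace currently belongs to


@dataclass
class VVol:
    vvol_uuid: str
    namespace: Namespace   # exactly one namespace backs one NVMe vVol


@dataclass
class VirtualPE:
    """Host-side construct: one vPE per ANA group seen in the NVM subsystem."""
    ana_group_id: int
    member_vvols: List[VVol] = field(default_factory=list)


def build_vpes(vvols: List[VVol]) -> Dict[int, VirtualPE]:
    """Group vVols by the ANA group of their backing namespace."""
    vpes: Dict[int, VirtualPE] = {}
    for vvol in vvols:
        gid = vvol.namespace.ana_group_id
        vpes.setdefault(gid, VirtualPE(gid)).member_vvols.append(vvol)
    return vpes
```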


 

vVols, NVMe and Namespaces

SCSI vVols solved the problem of scaling the number of vVols accessed on a host with the notion of a PE LUN and the corresponding need to bind subsidiary LUNs to PE LUNs. The primary driver for this is managing multi-pathing for the vVols in use on a host. The NVMe protocol, by design, has no notion of hierarchy between namespaces, which differs slightly from SCSI.

The design for NVMe vVols therefore leverages the NVMe-defined ANA Group construct, which supports a similar consolidation of multi-pathing and path-state management for groups of namespaces, in much the same fashion as PE LUNs. While the grouping construct is not advertised as a PE by the storage array, the ESXi host identifies such namespace groups and emulates a PE-like construct on the host, introduced in VASA 4.0 as the virtual Protocol Endpoint, or virtual PE. The vPE is a purely host-side entity and maps to the namespace groups defined on the NVMe array. The existing architecture for vVols on ESXi, which associates vVols with protocol endpoints on the host, works the same way it does with VASA 3.0 and earlier and the compatible vSphere versions.

 

vPE aka virtual Protocol Endpoint

Virtual Protocol Endpoints for NVMe serve the same purpose as PE LUNs do for SCSI vVols: supporting multi-pathing and tracking per-path access state for collections of vVols. Since NVMe has no concept of an array-defined, PE LUN-like construct, the approach is to use the ANA Groups defined in the NVMe specification to support multi-pathing and track per-path (controller) access state for collections of vVols.

The ANA Group construct fits this requirement because:

  1. The ANA state of an ANA Group represents the ANA state of all member namespaces.
  2. The array reports ANA state on a per path (controller) basis.
  3. The array notifies changes to ANA state to the host that has attached any member namespaces in an ANA Group.
  4. The array also notifies changes to namespace membership in ANA Groups, allowing the host to detect migration of namespaces across groups.

 

As far as the host is concerned, ANA Groups serve the exact same purpose as PE LUNs with regard to supporting multi-pathing for collections of vVols without having to track asymmetric access state on a per-vVol basis. Hence, the ESXi host identifies ANA Groups in NVM subsystems (used for vVol namespaces) and represents them as virtual Protocol Endpoints, or vPEs.
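As a rough illustration of why this scales, the following Python sketch (hypothetical, not ESXi code) shows how a host-side table keyed by (controller, ANA group) yields the path state for every member vVol in one lookup. The ANA state codes follow the NVMe specification; everything else is invented for the example:

```python
# Illustrative sketch: per-path access state is tracked per ANA group, not per vVol.
from enum import Enum
from typing import Dict, Tuple


class AnaState(Enum):
    OPTIMIZED = 0x01
    NON_OPTIMIZED = 0x02
    INACCESSIBLE = 0x03
    PERSISTENT_LOSS = 0x04
    CHANGE = 0x0F


# (controller_id, ana_group_id) -> ANA state reported by the array on that path
ana_state_table: Dict[Tuple[int, int], AnaState] = {
    (1, 100): AnaState.OPTIMIZED,
    (2, 100): AnaState.NON_OPTIMIZED,
}


def path_state_for_vvol(controller_id: int, ana_group_id: int) -> AnaState:
    """Path state of any vVol namespace in the group, on the given controller."""
    return ana_state_table.get((controller_id, ana_group_id),
                               AnaState.INACCESSIBLE)


# Every vVol whose namespace is in ANA group 100 shares these path states:
print(path_state_for_vvol(1, 100))   # AnaState.OPTIMIZED
print(path_state_for_vvol(2, 100))   # AnaState.NON_OPTIMIZED
```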


Further, since a namespace can move from one ANA group to another, dynamic rebind is also supported; the vPE a vVol is associated with is therefore not static, and a vVol can migrate inline (in-band) from one vPE to another.

 

Key Advantages of NVMe with vVols

One Click Deployment of vVol NVMe-FC Datastore

Mounting a vVols datastore backed by the NVMe protocol does not require any configuration steps on the vSphere side beyond the basic network configuration. All that is needed is a VASA Provider that supports VASA version 4.0: register the VASA Provider with vCenter and then mount the vVols container as usual. All of the NVMe-specific handling and information exchange happens over the VASA control path.

Here is a high-level overview of the workflow that is executed under the hood.

 

In this workflow, the ESXi host and the array/VASA Provider exchange NVMe-specific data over the VASA session and establish a working data path.

  1. First, the host shares its Host NQN and Host ID with the array so that the array can grant access to the relevant NVMe subsystems.
  2. The array then shares information with the host on how to connect to its NVMe discovery service controllers.
  3. The ESXi host connects to those discovery service controllers and fetches information about the available IO controllers.
  4. Finally, the host connects to the IO controllers and establishes a working data path over NVMe.
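The sketch below is a hypothetical Python rendering of these steps; the function names are stand-ins for the actual VASA 4.0 calls and NVMe-oF discovery/connect operations, not real APIs:

```python
# Illustrative sketch only -- host, vasa_provider, and all method names are hypothetical.


def establish_nvme_vvol_datapath(host, vasa_provider):
    # 1. Host shares its (vVol-specific) Host NQN and Host ID over the VASA
    #    session so the array can grant access to the relevant subsystems.
    vasa_provider.register_host_identity(host.nqn, host.host_id)

    # 2. Array returns transport details of its discovery controllers.
    discovery_endpoints = vasa_provider.get_discovery_controllers()

    # 3. Host connects to each discovery controller and fetches the records
    #    describing the IO controllers it should use.
    io_controller_records = []
    for endpoint in discovery_endpoints:
        dc = host.connect_discovery_controller(endpoint)
        io_controller_records.extend(dc.get_discovery_log_page())

    # 4. Host connects to the IO controllers; the namespaces (vVols) behind
    #    them become reachable and the data path is ready.
    for record in io_controller_records:
        host.connect_io_controller(record)
```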

Note: The ESXi host uses a separate Host NQN and Host ID for NVMe vVols.

vVols with NVMe completely avoids manual configuration on the ESXi host. All the ESXi host needs is a compatible adapter and networking set up so that it can talk to the array.

 

Automated Lifecycle Management of NVMe Controllers

Another key advantage of vVols is the lifecycle management of the NVMe controllers that a host connects to on a target.
With traditional VMFS storage, the vSphere admin has to keep track of the NVMe subsystems that the ESXi host needs to connect to, or remain connected to, for as long as the datastore is in use.
Once the datastore is no longer in use, the vSphere admin must then disconnect those controllers.

This becomes a huge problem over time, as someone has to maintain a mapping of NVMe subsystems to VMFS datastores, and any change in the configuration has to be tracked and updated manually. The ESXi host can connect to only a limited number of controllers, so if unused controllers are left connected, the host will eventually be unable to connect to new NVMe IO controllers. This problem does not exist with vVols and NVMe: with the help of the VASA APIs, vSphere does the bookkeeping and manages the lifecycle of the connections to the target NVMe controllers. As new vVol-NVMe datastores are mounted on an ESXi host, it discovers and connects to the required NVMe controllers, and when those datastores are unmounted, it disconnects from them. This becomes an absolute must in large-scale environments.
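Conceptually, the bookkeeping resembles reference counting of subsystem connections per datastore. The following Python sketch (hypothetical names, not ESXi code) illustrates the idea:

```python
# Illustrative sketch: connect when the first datastore needs a subsystem,
# disconnect when the last datastore using it is unmounted.
from collections import defaultdict
from typing import Dict, Set


class ControllerLifecycleManager:
    def __init__(self):
        # subsystem NQN -> set of datastores that require it
        self._users: Dict[str, Set[str]] = defaultdict(set)

    def on_datastore_mounted(self, datastore: str, subsystem_nqns: Set[str]):
        for nqn in subsystem_nqns:
            if not self._users[nqn]:
                self._connect(nqn)          # first user: open IO controllers
            self._users[nqn].add(datastore)

    def on_datastore_unmounted(self, datastore: str):
        for nqn, users in list(self._users.items()):
            users.discard(datastore)
            if not users:
                self._disconnect(nqn)       # last user gone: free the controller slot
                del self._users[nqn]

    def _connect(self, nqn: str):
        print(f"connect IO controllers for {nqn}")

    def _disconnect(self, nqn: str):
        print(f"disconnect IO controllers for {nqn}")
```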

 

Allows multiple NVMe subsystems and IO controllers to back ONE single vVols datastore

A traditional VMFS datastore is mapped to a single namespace, which means one VMFS datastore can be backed by only one NVMe subsystem. With vVols there is no such limitation: multiple NVMe subsystems can back one vVols NVMe datastore, which allows for better scaling and load distribution. It also gives array vendors more flexibility in supporting vVols, for example an array that scales by adding NVM subsystems, or a vendor that builds a 'distributed array' across multiple physical arrays with a separate NVM subsystem per array.
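A tiny, hypothetical illustration of the difference in backing topology (all identifiers are made up):

```python
# Illustrative only: a VMFS datastore maps to one namespace in one subsystem,
# while a vVols datastore can be served by several NVM subsystems at once.
vmfs_datastore = {
    "datastore": "vmfs-ds01",
    "backing": {"subsystem": "nqn.2023-02.com.example:subsys-a", "nsid": 1},
}

vvols_datastore = {
    "datastore": "vvol-ds01",
    "backing_subsystems": [
        "nqn.2023-02.com.example:subsys-a",
        "nqn.2023-02.com.example:subsys-b",   # scale-out / distributed array case
    ],
}
```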

 

Automated Discovery of New NVMe Subsystems & Removal of Stale NVMe Subsystems

vVols with NVMe fully utilize the VASA control path to learn about the addition and removal of NVMe subsystems. Whenever the set of NVMe subsystems changes, the array can inform the ESXi host via a VASA event; on receiving this event, the host either connects to the additional subsystems or disconnects from stale ones. This ensures that ESXi is always connected to exactly the right set of NVMe subsystems.
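The effect is a simple reconciliation of the host's connections against the set of subsystems the array currently reports. A hypothetical Python sketch (not the actual VASA API) of that reconciliation:

```python
# Illustrative sketch: reconcile connected subsystems with the array-reported set.
from typing import Set


def reconcile_subsystems(connected: Set[str], reported: Set[str], host) -> Set[str]:
    """Connect to newly reported subsystems, drop stale ones."""
    for nqn in reported - connected:
        host.connect_subsystem(nqn)       # new subsystem advertised by the array
    for nqn in connected - reported:
        host.disconnect_subsystem(nqn)    # stale subsystem no longer needed
    return set(reported)
```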

 

Dynamic Load Balancing

The array can reassign namespaces between ANA Groups while they are still attached to controllers. Use cases include, for example, load balancing between components (nodes) on the array or recovering namespaces after a component failure on the array. This behavior is vendor-specific, but the change is notified to the host in-band via the Asymmetric Namespace Access Change AEN (03h) on the controller(s) where the change was detected.

This is somewhat similar to the rebind workflow in VASA 2.0 and 3.0 for SCSI vVols. For SCSI vVols, the VASA Provider notifies the host with a REBIND event to rebind a bound vVol, migrating a subsidiary LUN from one PE LUN to another. This entails re-acquiring persistent reservations on the target PE LUN, plus a handshake with the target PE LUN to verify that it is ready to accept IOs on the subsidiary LUN.

With NVMe, the workflow is entirely in-band and hence the VASA event-based rebind is not used for NVMe vVols. 
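To illustrate the in-band flow, here is a hypothetical Python sketch (not ESXi code) of how a host might react to the ANA change AEN: re-read the ANA information, detect that a namespace now belongs to a different ANA group, and move the corresponding vVol under the vPE for the new group:

```python
# Illustrative sketch: controller and the record shapes are hypothetical.


def handle_ana_change_aen(controller, vvol_by_nsid, vvols_by_group):
    """vvol_by_nsid: nsid -> vVol record (a dict with an 'ana_group_id' key).
    vvols_by_group: ana_group_id -> list of vVol records (one list per vPE)."""
    # ANA log page, simplified here as nsid -> (ana_group_id, ana_state)
    # as seen on the controller that raised the AEN.
    ana_log = controller.read_ana_log_page()
    for nsid, (new_group, _state) in ana_log.items():
        vvol = vvol_by_nsid.get(nsid)
        if vvol and vvol["ana_group_id"] != new_group:
            # Namespace migrated to another ANA group: move the vVol from the
            # old vPE's membership to the vPE representing the new group.
            vvols_by_group[vvol["ana_group_id"]].remove(vvol)
            vvol["ana_group_id"] = new_group
            vvols_by_group.setdefault(new_group, []).append(vvol)
```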

 

Faster & More Efficient Cloning

Unlike SCSI, the NVMe protocol does not yet have an equivalent of XCOPY that allows blocks to be copied across namespaces. While the NVM spec is adding this capability, it will still be limited to namespaces within the same controller. As a result, whenever data needs to be copied or cloned, the ESXi host has to READ and WRITE the data itself to create the copy. When a clone of a disk is needed, ESXi therefore has to spend compute (CPU), storage (IO bandwidth), and network (network bandwidth) resources, leaving fewer resources for the running VMs.

With NVMe vVols, the VASA control path is used to offload clone and copy operations to the array, so VM clones are much quicker and more efficient. vVols over NVMe shine here: clones are not only faster but also have far less impact on the ESXi host.
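The contrast can be sketched as follows; this is purely illustrative Python with hypothetical names, not the actual VASA interfaces:

```python
# Illustrative sketch: host-driven copy vs. array-offloaded clone.

CHUNK = 1 << 20  # 1 MiB per host-side IO


def host_side_clone(src_ns, dst_ns, size_bytes):
    """Host reads from the source namespace and writes to the destination:
    consumes ESXi CPU, IO bandwidth, and fabric bandwidth."""
    for offset in range(0, size_bytes, CHUNK):
        length = min(CHUNK, size_bytes - offset)
        data = src_ns.read(offset, length)
        dst_ns.write(offset, data)


def offloaded_clone(vasa_provider, src_vvol_id):
    """vVols path: ask the array (via the VASA provider) to clone the vVol
    internally; the host moves no data."""
    return vasa_provider.clone_virtual_volume(src_vvol_id)
```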

 

 

 

Author: Ashutosh Sarawat is one of VMware's vVols Staff Engineers and has been working on vVols capabilities and performance.

 
