PVRDMA and vSphere Bitfusion — The Unpriceable Lightness of RoCE Road
The world is getting better! Just read Steven Pinker's "Enlightenment Now", eschew the newspapers and television, and you will become optimistic. But the very best "Everything is Awesome" metric is the 1992 William Nordhaus paper working out the proper economics for how technology improves (lowers!) prices. He detailed how modern lighting (1992 CFL bulbs) had become 500,000 times cheaper than the campfires cavemen sat around in antiquity. (https://lucept.files.wordpress.com/2014/11/william-nordhaus-the-cost-of-light.pdf)
The awesomeness we want to talk about in this blog, however, is RoCE (RDMA over Converged Ethernet). More specifically, we will explain how to use this modern network protocol between Bitfusion nodes.
But neither will we forget Nordhaus. As we advance through the steps, we will also look back at a few of his labor-for-lumens results. For reference, a wax candle emits about 13 lumens and a 100-watt incandescent bulb about 1200. Nordhaus calculated the price in hours of labor to produce 1000 lumens of light for one hour.
So adjust the brightness of your screen and read on.
The bright dawn of Rocky
Introduction — I hold with those who favor fire
Let wood fires or a Babylonian sesame oil lamp be our metaphor for the pre-RoCE baseline. It took about 58 hours of labor (with a stone axe, presumably) to produce the wood for 1 hour of 1000-lumen light. On the other hand, growing, gathering, and pressing sesame seeds to produce the oil for the same amount of light in 1750 BC only took 41.5 hours. (Welcome to the future, Mesopotamia! The Fertile Crescent shines like a full moon.)
The reason for discussing PVRDMA and Bitfusion together is that Bitfusion connects clients and servers over the network, and PVRDMA is VMware's virtual implementation of efficient RDMA networking. The objective of this blog is to explain how to set up a PVRDMA network (the particulars are sometimes difficult to find), and then how to use it with Bitfusion nodes.
Bitfusion is a software product that allows AI/ML applications to run on client machines or containers (including containers in TKG or K8S), but to access GPU acceleration from remote servers. Bitfusion clients intercept an application's CUDA calls and send them to Bitfusion servers that have physical GPUs. A pretty sensible question is how well an application performs given the network latency to reach a remote GPU. Much of Bitfusion's "secret sauce" is keeping a pipeline of GPU work filled and thus hiding the network latency. Nevertheless, since this pipeline cannot always be kept full, Bitfusion recommends and benefits from a low-latency network connection (the specific recommendation is 50 microseconds or less).
RoCE (RDMA over Converged Ethernet) is a network protocol that allows applications running on different servers to access each other's memory over the network, without operating system context switches or intervention (and without copying data into and out of kernel memory buffers). Also, larger MTUs are commonly used in RDMA networking (say 9000 bytes per frame, rather than the 1500 bytes typical of TCP/IP set-ups). Both the direct access and the (optional, but common) large frame sizes lower network overhead and latency, and improve performance under Bitfusion.
PVRDMA is VMware's virtual RoCE (or RDMA) adapter and distributed network. It stands for Paravirtualized RDMA. It is the software that brings RDMA to VMs without having to dedicate entire physical adapters to a VM (with DirectPath I/O). PVRDMA works not only between VMs on a single host, but also between multiple hosts. When using virtual adapters (and not DirectPath I/O) and if RDMA-suitable physical adapters are available, Bitfusion recommends PVRDMA over vmxnet3.
In this blog we first create a PVRDMA network that our nodes can use. Then we set up PVRDMA on Bitfusion servers and on Bitfusion client VMs. Finally, we cover the extra steps needed when the Bitfusion client is an NGC container.
Creating a PVRDMA distributed Network — My candle burns at both ends
Tallow candles were a major innovation in lighting technology. You could now get your hour of 1000 lumens for 5.4 hours of labor at boiling down, filtering, and molding animal fat.
We are setting up a network on two ESXi hosts with suitable physical adapters and the necessary physical switches connecting the hosts. All the work for the PVRDMA network is done in vCenter. We do not cover the setup of the physical equipment. This simplified picture shows the physical equipment in the top half, and the view the virtual machines have of the RoCE connections via a distributed virtual switch. It does not show management ports or many other details of the virtual distributed network.
NOTE: the label, RoCE adapter, refers to the Ethernet NICs chosen for RoCE, which can be any Ethernet-capable NIC. It does not refer, necessarily, to any RDMA-special hardware.
Figure 1: PVRDMA distributed network for VMs on two ESXi hosts
1. Create a new Distributed Virtual Switch
In vCenter in the Networking tree, right-click the datacenter (named SC2-DC in the screenshot below) and create a New Distributed Switch.
Figure 2: Create a new virtual distributed switch
We show only the first dialog of the New Distributed Switch UI. The rest should be obvious as you fill in the values below and click through with the NEXT button.
Figure 3: New distributed switch dialog
Name: vSwitch1 (choose any name you want)
Location: SC2-DC (should match the data center you right-clicked)
Version: 7.0.2 - ESXi 7.0.2 and later (choose the version matching your configuration; the minimum is 7.0.0)
Number of uplinks: 4 (this number should be greater than or equal to the number of VMs you'll connect to the network)
Network I/O control: Disabled
Default port group: Create
Port Group Name: BF-PVRDMA-DPORTGROUP (choose any name you want)
Figure 4: Port group created
2. Figure out which physical NICs are the right NICs
Figure 5: Identifying physical adapters
Go to Hosts and Clusters
for each host:
- Select the host
- Click the Configure tab, then Networking > Physical adapters
- Note which NIC is the RoCE or RDMA NIC for each host
In the screenshot above, the RoCE NIC was easy to identify because it was also the only 100 GbE NIC. In your environment, you may have to identify it by its MAC address.
3. Add Hosts to the Distributed Virtual Switch
Figure 6: Adding physical hosts
- Go to Hosts and Clusters
- Click the DataCenter icon
- Select the Networks top tab and the Distributed Switches sub-tab
- Right-click "vSwitch1" (use the name of your own switch)
- Click Add and Manage Hosts
- Go through the Add and Manage Hosts dialog steps, clicking NEXT as you go
Select task: "Add Hosts"
Select hosts: Use the New hosts... button and select your hosts
Manage physical adapters: Select the physical adapters identified above. Then click Assign uplink, choose (Auto-assign) and click OK.
Manage VMkernel adapters: No changes, click through
Migrate VM networking: No changes, click through
Ready to complete: Review and complete the changes
4. Configure hosts for PVRDMA
Tag a vmknic for PVRDMA
Go to Hosts and Clusters
for each host:
- Select the host and go to the Configure tab
- Go to System > Advanced System Settings
- Click Edit
- Filter on "PVRDMA"
- Set Net.PVRDMAVmknic to vmk0, and include the quotes
Net.PVRDMAVmknic = "vmk0"
Figure 7: Setting for Net.PVRDMAVmknic
Set up the firewall for PVRDMA
NOTE: Don't be tempted to skip this step; it is critical.
Go to Hosts and Clusters
for each host:
- Select the host and go to the Configure tab
- Go to System > Firewall
- Click Edit
Figure 8: Selecting firewall
- Scroll down to find pvrdma and check the box to allow PVRDMA traffic through the firewall
Figure 9: Firewall checkbox
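If you would rather not click through vCenter for every host, the two settings in this step can also be applied with esxcli. The following is a minimal sketch, not a tested procedure from the Bitfusion guide. It is a dry run by default and only prints the commands it would send. The host names, the root login, and the assumption that SSH is enabled on the hosts are ours and purely illustrative; vmk0 matches the value used in the vCenter setting above.

```shell
#!/bin/sh
# Sketch: tag a vmknic for PVRDMA and enable the pvrdma firewall ruleset per host.
# HOSTS and the root@ login are hypothetical placeholders; replace with your own.
HOSTS="${HOSTS:-esxi-01.example.com esxi-02.example.com}"
VMKNIC="${VMKNIC:-vmk0}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them over SSH

# Print the esxcli commands for one host, one per line.
pvrdma_host_cmds() {
    echo "esxcli system settings advanced set -o /Net/PVRDMAVmknic -s $1"
    echo "esxcli network firewall ruleset set -e true -r pvrdma"
    echo "esxcli system settings advanced list -o /Net/PVRDMAVmknic"
}

# Walk the host list and either print or execute each command.
configure_hosts() {
    for host in $HOSTS; do
        pvrdma_host_cmds "$VMKNIC" | while IFS= read -r cmd; do
            if [ "$DRY_RUN" = "1" ]; then
                echo "[$host] $cmd"
            else
                # -n keeps ssh from swallowing the remaining commands on stdin
                ssh -n "root@$host" "$cmd"
            fi
        done
    done
}

configure_hosts
```

Review the printed commands first, then rerun with DRY_RUN=0 once they match your environment. The last command of each group lists the setting so you can confirm the value took effect.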
5. Set the MTU for PVRDMA network
NOTE: This assumes you've configured the switches and physical adapters already for large MTUs and jumbo frames.
- Click the Data Center icon
- Click the Distributed Virtual Switch that you want to set up ("vSwitch1" in our example)
- Go to the Configure tab
- Select Settings > Properties
- Look at Properties > Advanced > MTU. This should be set to 9000. If it's not, click Edit.
- Click Advanced
- Set MTU (e.g. 9000)
- Click OK
Figure 10: Setting the MTU of a distributed network
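Once VMs are attached to this network, one common way to check that jumbo frames pass end to end is a ping that forbids fragmentation. The sketch below only works out the payload size and prints the command. It assumes IPv4 without options and assumes every hop, including the guest interface, is set to an MTU of 9000 (guest MTU configuration is not covered in this blog). The address is the Bitfusion server address used in the bandwidth test later in this blog.

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

# Print (do not run) a Linux ping command that forbids fragmentation (-M do).
jumbo_ping_cmd() {
    mtu="$1"
    peer="$2"
    echo "ping -M do -c 3 -s $(icmp_payload_for_mtu "$mtu") $peer"
}

jumbo_ping_cmd 9000 192.168.30.231
# prints: ping -M do -c 3 -s 8972 192.168.30.231
```

If the printed ping succeeds from a client VM, 9000-byte frames are crossing the path intact. If it fails while a default-size ping works, some hop is probably still at a smaller MTU.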
Creating a Bitfusion Server with a PVRDMA adapter — The lamp burns sure
Not only were whales delighted to be replaced by kerosene, but humans rejoiced that the labor cost for an hour of 1000-lumen light dropped from 6.332 hours to 3.344 hours.
Bitfusion server appliances (delivered as OVAs) are PVRDMA-ready. The software, the filesystem, and the configuration are ready for PVRDMA adapters. All you need to do is assign the adapter to the machine.
You create a Bitfusion appliance by deploying the Bitfusion server OVA and walking through the steps in the deployment GUI, as described in the Bitfusion Installation Guide. Once the first server is running, you can optionally deploy the subsequent servers with a different, streamlined GUI, called the installer, as described on this page (for version 3) of the guide. Once a server is running, you can add up to four network adapters to it. The first of these should NOT be the PVRDMA adapter. The second, third, and fourth, if you use that many, can each use PVRDMA (though they must be on separate networks according to Linux constraints).
If using regular OVA/OVF deployment and not the streamlined installer, then before you power up the VM, you need to add all the adapters in the VM's Edit Settings interface in vCenter.
These three screen shots show the networking and PVRDMA steps for the initial, or "regular", deployment method. In this example we also assume that the IP addresses are statically assigned, not DHCP-assigned, so the IP field is filled in.
Figure 11: Network selection for first adapter (non-PVRDMA) of first Bitfusion server OVF deployment
Figure 12: Adapter 2 settings (PVRDMA) of first Bitfusion server OVF deployment
Figure 13: Editing settings for Adapter 2 — PVRDMA — of first Bitfusion server after deployment
This screen shot shows the PVRDMA step in the streamlined installer for second and later server deployments. This example also assumes the PVRDMA network is using static IPs.
Figure 14: Adapter 2 settings for PVRDMA in the Bitfusion Installer for subsequent Bitfusion servers
Creating a Bitfusion Client with a PVRDMA adapter —
Underneath a gas lamp
Where the air's cold and damp
The earliest natural gas lamps were not so impressive, but by the time of the Welsbach mantle in the 1890s, you finally spent more time enjoying the light than producing it: eight minutes of work for an hour of 1000-lumen light.
Client machines must have a "regular" adapter (such as vmxnet3) in addition to a PVRDMA adapter.
For convenience, we repeat here the PVRDMA client instructions given in the Bitfusion Installation Guide.
- In the vSphere Client, locate the virtual machines hosting the vSphere Bitfusion servers and clients.
- Right-click a virtual machine in the inventory and select Edit Settings.
- From the Add new device drop-down menu, select Network Adapter.
The New Network section is added to the list in the Virtual Hardware tab.
- Select a PVRDMA network.
- Expand the New Network section and connect the virtual machine to your PVRDMA distributed port group.
- Change the Status setting to Connect at power on.
- From the Adapter type drop-down menu, select PVRDMA.
- Power on the virtual machine.
- If you powered on a virtual machine that is hosting the vSphere Bitfusion clients, install the RDMA drivers.
In addition to the RDMA drivers, diagnostic tools are installed.
○ For CentOS and Red Hat Linux, run the following command.
sudo yum install -y open-vm-tools rdma-core libibverbs libibverbs-utils infiniband-diags automake autoconf rdma-core-devel pciutils-devel libtool
○ For Ubuntu Linux, run the following command.
sudo apt-get install -y rdma-core libmlx4-1 infiniband-diags ibutils ibverbs-utils rdmacm-utils perftest
Unfortunately, additional steps are required on typical Linux machines before you can use the RDMA drivers. We list them here.
First, if you are using static IP addresses, you need to configure the interface.
Configure the interface on Ubuntu 20.04
    ip addr
    # Examine the output and determine which interface is the RDMA interface.
    # In our example, it is called ens224f0
    cd /etc/netplan
    ls
    # Examine the files and determine which you want to edit to assign the IP
    # For example, 00-installer-config.yaml
    sudo netplan try
    # sets up the network as specified by files, but only for 120 seconds, then reverts back.
    # Use `--timeout N` for something other than the 120 default seconds
    sudo netplan apply
    # permanently sets up according to the yaml files.
    # At this point reboot and test the interface: run 'ip addr' and then use
    # 'ping' to make sure the address is set and working.
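The steps above do not show what the edited netplan file looks like, so here is a minimal sketch of a static-address stanza. It writes the file to a scratch directory for review and touches nothing under /etc. The file name 60-pvrdma.yaml is a hypothetical choice, the interface name is the one from our example, and the address is the client address used in the bandwidth test later in this blog; substitute your own values.

```shell
#!/bin/sh
# Sketch: write a netplan file giving the RDMA interface a static IPv4 address.
# Arguments: interface name, address in CIDR form, output file.
write_pvrdma_netplan() {
    iface="$1"
    cidr="$2"
    out="$3"
    cat > "$out" <<EOF
network:
  version: 2
  ethernets:
    ${iface}:
      dhcp4: false
      addresses:
        - ${cidr}
EOF
}

# Write to a scratch directory so the result can be reviewed first.
scratch="$(mktemp -d)"
write_pvrdma_netplan ens224f0 192.168.30.11/24 "$scratch/60-pvrdma.yaml"
cat "$scratch/60-pvrdma.yaml"
```

After reviewing the output, you could copy the file into /etc/netplan (or merge the stanza into your existing file, such as 00-installer-config.yaml) and continue with the netplan try and netplan apply commands shown above.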
Configure the interface on CentOS 8
    ip addr
    # Examine the output and determine which interface is the RDMA interface.
    # In our example, it is called ens224f0
    cd /etc/sysconfig/network-scripts
    ls
    # In our case, there was no file configuring ens224f0, so we copied an
    # existing config file and edited it as follows:
    cat ifcfg-ens224f0
    TYPE=Ethernet
    PROXY_METHOD=none
    BROWSER_ONLY=no
    BOOTPROTO=none
    IPADDR=192.168.30.201
    NETMASK=255.255.255.0
    DEFROUTE=yes
    IPV4_FAILURE_FATAL=no
    IPV6INIT=no
    NAME=ens224f0
    UUID=61f63fd2-e5ff-4158-94b2-5db494b0e11e
    DEVICE=ens224f0
    ONBOOT=yes
    # At this point reboot and test the interface: run 'ip addr' and then use
    # 'ping' to make sure the address is set and working.
Load kernel modules and load the vmw_pvrdma module last
Now, you need to address possible issues with the installation of the kernel modules for RDMA. Some differences may be seen between CentOS and Ubuntu, but the following should get both systems working.
Above, the RDMA packages were installed, but you need to run modprobe to ensure they are all actually loaded as kernel modules. The order can be important: the interface port(s) will not come up unless module vmw_pvrdma is loaded after all the others. For Ubuntu systems, you will have to remove and reload this module each time the VM is booted (this is a known issue with these packages, but as of this writing, the world awaits a fix).
First, we run ibv_devinfo and see that the RDMA port(s) is(are) down (see the state field in the output).
    ibv_devinfo
    hca_id: rocep19s0f1
        transport: InfiniBand (0)
        ...
        phys_port_cnt: 1
            port: 1
                state: PORT_DOWN (1)
                ...
                link_layer: Ethernet
Now, load the modules.
    # Modules that need to be loaded for PVRDMA-based InfiniBand to work.
    # Note: CentOS systems may already have the port in an active state (not PORT_DOWN),
    # but not all of the modules may be loaded, so run these commands just in case
    sudo /sbin/modprobe mlx4_ib
    sudo /sbin/modprobe ib_umad
    sudo /sbin/modprobe rdma_cm
    sudo /sbin/modprobe rdma_ucm

    # Once those are loaded, reload the vmw_pvrdma module
    sudo /sbin/modprobe -r vmw_pvrdma
    sudo /sbin/modprobe vmw_pvrdma

    # On Ubuntu systems save this initramfs state; skip the next line on CentOS
    sudo update-initramfs -k all -u

    # Every time you reboot Ubuntu (CentOS systems do not seem to require this),
    # run these two commands again:
    sudo /sbin/modprobe -r vmw_pvrdma
    sudo /sbin/modprobe vmw_pvrdma
You can re-run ibv_devinfo to verify the port(s) is(are) now active.
    ibv_devinfo
    hca_id: rocep19s0f1
        transport: InfiniBand (0)
        ...
        phys_port_cnt: 1
            port: 1
                state: PORT_ACTIVE (4)
                ...
                link_layer: Ethernet
You have successfully enabled vSphere Bitfusion to use PVRDMA network adapters.
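Because Ubuntu clients need the vmw_pvrdma reload after every boot, it can help to script the check. The sketch below parses the ibv_devinfo state lines shown above and reloads the module only when a port is not active. The function names and the --reload switch are our own, and hooking the script into boot (for example from a startup script) is left to you; treat it as a starting point rather than a supported fix.

```shell
#!/bin/sh
# Read ibv_devinfo output on stdin and print each port state (e.g. PORT_ACTIVE).
port_states() {
    awk '$1 == "state:" { print $2 }'
}

# Succeed only if at least one port is listed and every port is PORT_ACTIVE.
all_ports_active() {
    states="$(port_states)"
    [ -n "$states" ] || return 1
    ! printf '%s\n' "$states" | grep -qv '^PORT_ACTIVE$'
}

# Reload vmw_pvrdma only when some port is not active (needs sudo rights).
reload_pvrdma_if_down() {
    if ibv_devinfo | all_ports_active; then
        echo "PVRDMA ports already active"
    else
        sudo /sbin/modprobe -r vmw_pvrdma
        sudo /sbin/modprobe vmw_pvrdma
        ibv_devinfo | port_states
    fi
}

# Nothing privileged happens unless the script is called with --reload.
if [ "${1:-}" = "--reload" ]; then
    reload_pvrdma_if_down
fi
```

Run the script with --reload on the client VM after a reboot. Without the argument it only defines the helper functions, which makes them easy to try against saved ibv_devinfo output.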
You can test the connection between the vSphere Bitfusion server and client by using the ib_send_bw (InfiniBand send bandwidth) command. We already installed the perftest package on the Ubuntu systems (above), but for CentOS 8 we need to build it by hand in order to get a version compatible with Bitfusion servers.
Build perftest on CentOS 8
    git clone https://github.com/linux-rdma/perftest.git
    cd perftest
    git checkout v4.4-0.11
    # You should get the old v4.4-0.11 tag of perftest
    # (which gives you protocol version 5.60 compatibility)
    ./autogen.sh
    ./configure
    make clean && make V=1
    sudo make install
Now we are ready to run ib_send_bw. We assume the IP addresses of the vSphere Bitfusion server and client are 192.168.30.231 and 192.168.30.11, respectively.
    # From the server 192.168.30.231
    ib_send_bw --report_gbits

    # From the client 192.168.30.11 - connects to the server
    ib_send_bw --report_gbits 192.168.30.231
Both the server and the client write a bandwidth report, in Gbps, to stdout.
Using PVRDMA in an NGC Container — With flame of incandescent terror
First came carbon filaments, then tungsten, then gradual improvements through the decades. By 1990 an hour of incandescent 1000-lumen light required 2.16 seconds of labor.
This section is for those using NGC containers. If you are not using NGC containers, you can skip ahead to the exciting conclusion of this blog. NGC AI/ML images are prepared with the Mellanox OFED ibverbs RDMA libraries (MOFED), but not the open source ibverbs libraries supported by PVRDMA. Worse, the one set of libraries is incompatible with the other. Fortunately, the cure is not completely terrible: just delete the MOFED libraries and install the open source libraries.
Ubuntu 20 Example
For example purposes, we work here with the nvcr.io/nvidia/tensorrt:20.12-py3 Docker image, which can run a TensorRT application on Ubuntu 20.04.
We learn the UBUNTU_CODENAME (from /etc/os-release in the container), which we'll need later, and we find the list of MOFED DEB packages for later removal. The package files are the last four entries in the listing below.
    bfuser@bf-client-rho-ub20:~$ sudo docker run -it --rm -u root --ipc=host --privileged --net=host --cap-add=IPC_LOCK --pid=host nvcr.io/nvidia/tensorrt:20.12-py3
    ...
    root@bf-client-rho-ub20:/workspace# ls -l /opt/mellanox/DEBS/*
    lrwxrwxrwx 1 root root   9 Nov 21 2020 /opt/mellanox/DEBS/4.9-0.1.7 -> 5.1-2.4.6
    lrwxrwxrwx 1 root root   9 Nov 21 2020 /opt/mellanox/DEBS/5.0-0 -> 5.1-2.4.6
    lrwxrwxrwx 1 root root   9 Nov 21 2020 /opt/mellanox/DEBS/5.0-1.1.8 -> 5.1-2.4.6
    lrwxrwxrwx 1 root root   9 Nov 21 2020 /opt/mellanox/DEBS/5.0-2.1.8 -> 5.1-2.4.6
    lrwxrwxrwx 1 root root   9 Nov 21 2020 /opt/mellanox/DEBS/5.1-0.6.6 -> 5.1-2.4.6
    lrwxrwxrwx 1 root root   9 Nov 21 2020 /opt/mellanox/DEBS/5.1-2.3.7 -> 5.1-2.4.6
    -rwxrwxr-x 1 root root 856 Nov 21 2020 /opt/mellanox/DEBS/add_mofed_version.sh

    /opt/mellanox/DEBS/5.1-2.4.6:
    total 552
    -rw-rw-r-- 1 root root 120736 Nov 21 2020 ibverbs-providers_51mlnx1-1.51246_amd64.deb
    -rw-rw-r-- 1 root root  55224 Nov 21 2020 ibverbs-utils_51mlnx1-1.51246_amd64.deb
    -rw-rw-r-- 1 root root 325948 Nov 21 2020 libibverbs-dev_51mlnx1-1.51246_amd64.deb
    -rw-rw-r-- 1 root root  53860 Nov 21 2020 libibverbs1_51mlnx1-1.51246_amd64.deb
    root@bf-client-rho-ub20:/workspace# exit
We derive the names of the DEB packages from those last four lines: ibverbs-providers, ibverbs-utils, libibverbs-dev, libibverbs1
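Deriving the names by eye is fine for four files, but other NGC images may ship a different set. As a small sketch, the helper below relies on the Debian file-naming convention of package, version, and architecture separated by underscores, and keeps only the package part. The function name deb_pkg_names is our own.

```shell
#!/bin/sh
# Read .deb file names (or full paths) on stdin and print the package names.
# Debian package files are named <package>_<version>_<arch>.deb, so the
# package name is everything before the first underscore of the file name.
deb_pkg_names() {
    sed -e 's|.*/||' -e 's|_.*||'
}

# File names copied from the listing above.
printf '%s\n' \
    ibverbs-providers_51mlnx1-1.51246_amd64.deb \
    ibverbs-utils_51mlnx1-1.51246_amd64.deb \
    libibverbs-dev_51mlnx1-1.51246_amd64.deb \
    libibverbs1_51mlnx1-1.51246_amd64.deb | deb_pkg_names
```

Inside the base container, you could feed it the real directory instead, for example by piping the output of ls /opt/mellanox/DEBS/5.1-2.4.6/*.deb into deb_pkg_names.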
Now we can create a Dockerfile to remove the MOFED packages, install the Ubuntu RDMA packages, then install Bitfusion.
NOTE: The Dockerfile does not install a Bitfusion license token, so this is meant to run on a VM that vCenter already authorized as a Bitfusion client. See the bare metal section of this blog if you need an alternate way to authorize the client.
    # Base this container on the NGC container you want to use
    FROM nvcr.io/nvidia/tensorrt:20.12-py3

    # Remove the Mellanox OFED packages that are installed, for Ubuntu 20.04,
    # determined previously by running "ls -l /opt/mellanox/DEBS/*" in the base container
    RUN apt-get purge -y ibverbs-providers ibverbs-utils libibverbs-dev libibverbs1

    # Install the Ubuntu RDMA packages using the
    # UBUNTU_CODENAME from /etc/os-release
    # as the -t argument.
    RUN apt-get update && apt-get install -y --reinstall -t focal rdma-core libibverbs1 ibverbs-providers infiniband-diags ibverbs-utils libcapstone3 perftest

    # Build the TensorRT sample projects
    WORKDIR /workspace/tensorrt/samples
    RUN make -j4

    # Install needed Python deps
    RUN /opt/tensorrt/python/python_setup.sh

    # Download the mnist dataset, etc.
    WORKDIR /workspace/tensorrt/data/mnist
    RUN python download_pgms.py

    # Install the Bitfusion 3.0.1-4 client software for Ubuntu 20.04
    # Be sure to replace the Time Zone (TZ) value of "America/Los_Angeles" if you need to
    WORKDIR /workspace
    RUN wget https://packages.vmware.com/bitfusion/ubuntu/20.04/bitfusion-client-ubuntu2004_3.0.1-4_amd64.deb
    RUN apt-get update && DEBIAN_FRONTEND=noninteractive TZ=America/Los_Angeles apt-get install -y ./bitfusion-client-ubuntu2004_3.0.1-4_amd64.deb

    # If you want to run rather than go interactive, add the following
    # (we are hardcoding the IP address, presumably, of a Bitfusion server that possesses a PVRDMA adapter)
    #WORKDIR /workspace/tensorrt/bin
    #RUN bitfusion run -n 1 -l 192.168.30.231 -- ./sample_mnist
To build the image (the -t option takes the image name before the colon and the tag after it):
sudo docker build -t ubuntu2004-tensorrt:20.12-py3-bf-pvrdma .
bfuser@bf-client-rho-ub20:~$ sudo docker images
To run, and assuming we are targeting a Bitfusion server possessing a PVRDMA adapter at address 192.168.30.231:
    bfuser@bf-client-rho-ub20:~$ sudo docker run -it --rm -u root --ipc=host --privileged --net=host --cap-add=IPC_LOCK --pid=host ubuntu2004-tensorrt:20.12-py3-bf-pvrdma
    ...
    root@bf-client-rho-ub20:/workspace# cd /workspace/tensorrt/bin
    root@bf-client-rho-ub20:/workspace/tensorrt/bin# bitfusion run -n 1 -l 192.168.30.231 -- ./sample_mnist
Conclusion — Shall I compare thee to a summer's day?
Modern LED bulbs came out after Nordhaus's study, but my scratchpad estimates that 0.35 seconds of labor provides you an hour of 1000-lumen light.
Shakespeare is as mute as any subsequent poet on LED lighting, so I scribbled out some doggerel more pointed than the line officially quoted.
I'm sorry for old Edison,
His fil'ment's dark and dead.
You can lead a horse to water.
But a light bulb must be LED
(with apologies to my father who joked with a less-ambiguously pronounced last word that you can lead a horse to water, but a pencil must be lead.)
The trajectory of moving from TCP/IP to RoCE via PVRDMA won't be quite as dramatic as moving from campfire light to LED bulbs, but it does give noticeable, measurable improvement. Setting up a PVRDMA Distributed Virtual Switch is the longest part of the procedure. By contrast, attaching the Bitfusion servers is very simple. Bringing up PVRDMA ports on Linux clients requires the right packages and some work with the kernel modules. The optional last part, using PVRDMA within an NGC container, takes a bit of understanding, but in the end is just a few lines in a Dockerfile. Thus, you have Kundera's contrast between heaviness and lightness, but a PVRDMA triumph regardless.
Rocky bests all contenders