The new release of VMware vSphere Bitfusion, version 3.0.0, concentrates on usability improvements.
Science discovers knowledge by observation, hypothesis, and elimination of falsehood. A "fill-in-the-topic" history month does so by appreciation—so we do not take for granted that which may continue to help us in the future.
This blog reveals Bitfusion's latest secrets. And its seven science-themed headlines are clues for which history month is March.
Multi-Network Support — Barbara McClintock Tracks Meiosis, Crossovers A-maize
As germ cells are produced, genes may swap places between homologous chromosomes. These crossovers require a network of physical links between the chromosomes. With multiple links you get better genetic recombination and chances that someone in your species will survive changes in the environment. With a significant overburdening of the metaphor, Bitfusion has improved multi-networking support and is ready for new environments customers are requesting.
Let's recap the Bitfusion client/server architecture. Clients run AI/ML apps using the servers for GPU acceleration. A Bitfusion plug-in in vCenter Server lets you manage and monitor the clients and servers. Clients, servers, and vCenter talk to each other over the network.
There are two big reasons you might ask for multiple networks.
- Bandwidth. If you expect two or three clients to be using GPUs on some Bitfusion server, you would like to keep them from bottlenecking each other.
- Security or isolation. You may not want your clients on the same network as vCenter. You may want to keep the GPUs isolated from non-client machines. You may have clients running under a separate vCenter instance.
This picture shows one possible configuration.
One Bitfusion server offers its GPUs over two networks
Bitfusion from its first GA appearance allowed you to create servers with up to four networks, but, sadly, it did not make use of all of them. Release 2.5.1 provided a way (via two new guest variables) to bring two of them into use: one for management and one for data.
With Release 3.0.0 you can use all four without any special variables. The management network interface supports both vCenter traffic and client communication. Up to three other network interfaces support client communication.
To set up multiple networks:
- Specify the network interfaces (up to four)—you must specify the first as you deploy the OVA. You can specify the other three at the same time, or later in the usual VM Edit Settings dialog.
- Connect each network (up to four, if you use that many) to its own unique subnet.
Note the new format of ~/.bitfusion/servers.conf on a 3.0.0 client. It lists all the IP addresses of all the servers but specifies one which is known to be reachable by the client. The reachable IP will be used for the initial handshake of a Bitfusion session prior to choosing the address for transport during the remainder of the session. In this simple configuration immediately below, there is one network total, and all servers can be reached on it. Later in this blog you will see some more complex examples.
servers:
- reachable: 172.16.31.210:56001
  addresses:
  - 172.16.31.210:56001
- reachable: 172.16.31.207:56001
  addresses:
  - 172.16.31.207:56001
The old format of servers.conf is still supported for down-version clients.
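For a quick look at which address a client will try first for each server, the handshake addresses can be pulled out of the file with a few lines of shell. This is only a sketch: the function name list_reachable is hypothetical (it is not a Bitfusion command), and it assumes the file on disk holds one YAML key per line.

```shell
# Hypothetical helper, not a Bitfusion command: print the "reachable"
# address of every server entry in a servers.conf-style file.
# Assumes one YAML key per line, e.g. "- reachable: 172.16.31.210:56001".
list_reachable() {
    conf_file="${1:-$HOME/.bitfusion/servers.conf}"
    # The key is the first field, or the second when the line opens a
    # list item with "-"; the address is always the last field.
    awk '$1 == "reachable:" || $2 == "reachable:" { print $NF }' "$conf_file"
}
```

Calling list_reachable with no argument reads ~/.bitfusion/servers.conf; pass a path to inspect a different copy.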
Multiple Networks on the Mind
Ping — Henrietta Swan Leavitt Makes Standard Candle, Sees Stars
If you know how much light a particular type of star emits, you have a standard star that can tell you its distance along with its neighbors’. With the standard ping utility, users can explore the nodes of their networks.
By default, Photon OS systems do not respond to ping. Since Bitfusion servers are Photon OS machines, you may have been missing the simplest means of verifying connectivity between nodes.
With the 3.0.0 release, ping is enabled on Bitfusion servers out of the box. A small, but very nice change.
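With ping answering, a connectivity check across several servers can be scripted. The helper below is a hypothetical sketch, not a Bitfusion tool: it pings each address once (using the Linux-style -c 1 option) and reports the outcome. The PING_CMD override is an assumption added here so the loop can be exercised without a live network.

```shell
# Hypothetical helper: ping each host once and report whether it answered.
# PING_CMD may be set to another command for dry runs; it defaults to ping.
check_hosts() {
    for host in "$@"; do
        # Strip an optional ":port" suffix such as ":56001".
        ip="${host%%:*}"
        if ${PING_CMD:-ping} -c 1 "$ip" >/dev/null 2>&1; then
            echo "$ip reachable"
        else
            echo "$ip unreachable"
        fi
    done
}
```

For example, check_hosts 172.16.31.210:56001 172.16.31.207:56001 would test the two servers from the simple configuration shown earlier.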
Pinging a young Alec Guinness in a white suit was easier than in a later, most desperate hour
Nvidia-smi — Marie Curie Isolates Elements, Positively Radiant
If your parents gifted you a radioluminescent watch in the 1970s, you were much safer than with earlier ones that contained radium. But tritium's half-life of 12.32 years, in retrospect, meant these watches, too, were not a bright idea. Thanks, Mom and Dad. On the other hand, it is sure nice to tell our elements apart and know what their properties are. The tool that this fifth and final topic sentence at last names, nvidia-smi, tells you details about individual GPUs installed on your system.
This utility gets installed with the NVIDIA driver. When you run it from the command line, it reports the status of your system's NVIDIA GPUs and offers some management of them. This is handy when you want to know if and how your applications are using the GPUs.
Under Bitfusion, you can use it to query remote GPUs. For example, if one of your Bitfusion servers had the IP address 192.168.10.222, you could run this command.
bitfusion run -n 2 -l 192.168.10.222 nvidia-smi
But on a Bitfusion client, you have not installed the NVIDIA driver (well, probably not) and may not have a copy of nvidia-smi.
With release 3.0.0 Bitfusion ensures that clients have a copy of nvidia-smi. When you allocate GPUs from some server, you’ll get a copy of its nvidia-smi in your ~/.bitfusion/bin (but only if you don't already have a copy elsewhere—typically in /usr/bin). This is done with the 'run' and 'request_gpus' commands.
bitfusion run -n 1 -- /usr/local/cuda/samples/0_Simple/matrixMul/matrixMul
bitfusion request_gpus -n 1
bitfusion run -n 1 -- nvidia-smi
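To see which copy of nvidia-smi a client would pick up, something like the following could be used. It is a sketch under assumptions: find_nvidia_smi is a hypothetical helper that checks the PATH first and then the ~/.bitfusion/bin location mentioned above. It does not reproduce Bitfusion's own lookup logic.

```shell
# Hypothetical helper: print the path of the nvidia-smi a client can use.
find_nvidia_smi() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # A copy is already on the PATH (typically /usr/bin/nvidia-smi).
        command -v nvidia-smi
    elif [ -x "$HOME/.bitfusion/bin/nvidia-smi" ]; then
        # Otherwise look for the copy placed in the per-user directory.
        echo "$HOME/.bitfusion/bin/nvidia-smi"
    else
        echo "nvidia-smi not found" >&2
        return 1
    fi
}
```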
servers.conf — Lise Meitner Serves Neutrons, Splits
Two is so often better than one, that even splitting something up can have benefits. Release 3.0.0 now allows private copies of servers.conf alongside the maintained copy.
A long time ago, Bitfusion took responsibility for the creation and automatic updating of a configuration file called servers.conf. The file sits on a client and lists the IP addresses of all the Bitfusion servers that a client is authorized to use. This automation eliminated opportunities for human error, but at the same time eliminated one method of restricting a client's access to a subset of the available servers. Users could still use the '-l' option to provide a restricted list, but they could not define a subset in a file once, and then use it many times.
Release 3.0.0 removes the "overwrite" behavior when and only when the servers.conf file is explicitly pointed to with the '-s' option.
Let's walk through an example. Our client can reach three servers as shown by the default file.
cat ~/.bitfusion/servers.conf
servers:
- reachable: 172.16.31.210:56001
  addresses:
  - 172.16.31.210:56001
- reachable: 172.16.31.207:56001
  addresses:
  - 172.16.31.207:56001
- reachable: 172.16.8.216:56001
  addresses:
  - 172.16.31.216:56001
  - 172.16.8.216:56001
But if we want to have a private copy of the file that limits us to the one server on the 172.16.8.x subnet, we can do it like this:
cat ~/dont_tread_on_me/servers.conf
servers:
- reachable: 172.16.8.216:56001
  addresses:
  - 172.16.31.216:56001
  - 172.16.8.216:56001
Lastly, we use that file in a Bitfusion run.
bitfusion run -n 1 -s ~/dont_tread_on_me/servers.conf -- /usr/local/cuda/samples/0_Simple/matrixMul/matrixMul
Requested resources:
Server List: 172.16.8.216:56001
Client idle timeout: 0 min
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Tesla T4" with compute capability 7.5
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 406.77 GFlop/s, Time= 0.322 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
The Server List line in the command output shows that the desired server and subnet were used.
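The private file in this walkthrough could also be derived from the maintained copy rather than typed by hand. The sketch below keeps only the server entries whose reachable address begins with a given prefix. The function name make_private_conf is hypothetical, and the script assumes the file holds one YAML key per line, with each entry starting at a "- reachable:" line.

```shell
# Hypothetical helper: print a servers.conf restricted to the entries whose
# "reachable" address starts with the given prefix.
make_private_conf() {
    prefix="$1"   # e.g. "172.16.8." (keep the trailing dot so 172.16.80.x is not matched)
    src="${2:-$HOME/.bitfusion/servers.conf}"
    awk -v prefix="$prefix" '
        # Copy the top-level key unchanged.
        $1 == "servers:" { print; next }
        # A new server entry begins: decide whether to keep it.
        $1 == "-" && $2 == "reachable:" { keep = (index($3, prefix) == 1) }
        # Print every line that belongs to a kept entry.
        keep { print }
    ' "$src"
}
```

A possible use would be make_private_conf 172.16.8. > ~/dont_tread_on_me/servers.conf, followed by the same bitfusion run -s invocation as before.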
Versions — Katherine Johnson Runs the Numbers, Is in Orbit
Not only do planets keep moving as we maneuver our astronauts about them, but every piece of software seems to fret and strut its own hour upon the marketplace, necessitating constant adjustments.
With each release, various frameworks, drivers, and APIs advance. Bitfusion 3.0.0 is keeping up and adds support for the following:
- NVIDIA Driver 460
- NVIDIA CUDA 11.1
- Tested with TensorFlow 2.4
- Tested with PyTorch 1.6
- Tested with TensorRT 7.1.3
- Tested with PaddlePaddle 2.0
NTP — Mary Leakey Unearths Footprints, Says Old News
This last item actually walks over a past release but is one we haven't pointed out before.
Bitfusion servers need time synchronization to keep the database of GPU usage coherent. They rely on NTP for this synchronization. Release 2.5.1 made changes to the Bitfusion installer and to the Bitfusion service so only the most determined or doomed user can avoid supplying NTP. A quick field sketch would be:
- If you supply the installer with DHCP, Bitfusion will first choose any manually specified NTP servers, then fall back to DHCP-configured servers, and lastly, default to Google NTP servers on the internet.
- If you supply a static IP address instead of DHCP, the installer dialog will force you to enter an NTP server address and will fill in the field with the Google NTP as a default.
- If somehow no NTP servers are found and working, there is a new health check to alert you that NTP is unavailable.
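The DHCP-case fallback order in the first bullet can be pictured as a simple chain of checks. The function below is a hypothetical illustration, not the installer's code. The name choose_ntp and its two arguments are invented for the sketch, and while time.google.com is Google's public NTP name, whether Bitfusion uses that exact hostname is an assumption.

```shell
# Hypothetical illustration of the fallback order: manual, then DHCP,
# then a public default.
choose_ntp() {
    manual="$1"   # manually specified NTP servers, may be empty
    dhcp="$2"     # NTP servers learned from DHCP, may be empty
    if [ -n "$manual" ]; then
        echo "$manual"
    elif [ -n "$dhcp" ]; then
        echo "$dhcp"
    else
        # Assumed default; the source only says "Google NTP servers".
        echo "time.google.com"
    fi
}
```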
Conclusion — Jane Goodall discovers relationships, familiar story
Jane Goodall observed great individuality in chimpanzees. They had personalities and differentiated relationships with family and friends. She convincingly argued against charges of anthropomorphizing her subjects. We see in Bitfusion what we see in other software products: to thrive, its core value must be brought to larger environments, it must stay current, and it must be easier to use. Bitfusion 3.0.0 has taken some big steps forward with the multi-network support and with access to ping and nvidia-smi. Stay tuned because I hope to be blogging about another ease-of-use step in the near future.