VMware vSphere Bitfusion Release 2.5.0 Delivers the Feature Mix You Wanted – Romeo and Juliet Sing Simon and Garfunkel
Introduction – Are you going to bite your thumb at Scarborough Fair?
We are happy to announce the availability of VMware vSphere Bitfusion version 2.5.0 today, November 5, 2020. This release focuses on new support and features. The support for bare metal clients alone is a fair cause for celebration and a reason there will be no thumb biting at Bitfusion 2.5.0. This blog explains the bare metal support, multi-version support, and the updated health checks, along with other improvements.
Bare Metal Client Support – Looking for fun and feeling a rose by any other name would smell as groovy.
Support for bare metal clients has been one of the most frequent requests we have heard from customers. Well, it has arrived.
Anxious for bare metal release
Bare metal client support is super groovy because it is more than just bare metal. The client VMs inside the dashed lines, in Figure 1 below, are the only clients Bitfusion previously supported. The dark purple arrows indicate how they were enabled. Client machines had to run under the same vCenter Server as the Bitfusion servers. The Bitfusion plug-in enabled client VMs and containers, except that you had to follow a manual process for TKG containers.
Release 2.5.0 now has a way to create tokens that enable the clients that were out of scope before. Tokens also replace the more cumbersome manual TKG process. Now, you can:
- More easily enable Bitfusion clients on TKG containers
- Enable Bitfusion clients (and any containers) running under a different vCenter Server
- Enable bare metal clients (and their containers)
The light purple arrows in Figure 1 show these new capabilities. The Bitfusion plug-in generates authorization tokens and creates a tarfile that can be unpacked on any Bitfusion client.
Figure 1 – Bare metal Bitfusion clients (and others) are enabled by vCenter server tokens
Let’s look at how that is done in vCenter and then we’ll discuss how you can distribute the tokens narrowly or widely and what you can do to manage them.
When you have a machine or container you cannot enable as a Bitfusion client directly from vCenter, go to the new “Tokens” tab on the plug-in instead, as shown in Figure 2. Then, click on “NEW TOKEN”.
Figure 2 – Tokens tab in vCenter server
In the following dialog box, make up a brief description for the new token. In Figure 3 the description is, “Blog Token”. Hit the “CREATE” button.
Figure 3 – Creating a token
In Figure 4, you can see the newly created token. Hit the “DOWNLOAD” action and save the tar file under the name and path of your choice.
Figure 4 – Downloading the token for deployment
Copy the tar file into the filesystem of your client machine. Next, extract its three files and copy them to their proper destinations. In the example command-line snippets below, the tarfile is named blogtoken.tar.
[bf_user@bf_user-client-centos7 Tokens]$ ls
blogtoken.tar
[bf_user@bf_user-client-centos7 Tokens]$ tar xvf blogtoken.tar
client.yaml
ca.crt
servers.conf
[bf_user@bf_user-client-centos7 Tokens]$ ls
blogtoken.tar  ca.crt  client.yaml  servers.conf
[bf_user@bf_user-client-centos7 Tokens]$ sudo cp ca.crt /etc/bitfusion/tls/.
[sudo] password for bf_user:
[bf_user@bf_user-client-centos7 Tokens]$ cp client.yaml ~/.bitfusion/.
[bf_user@bf_user-client-centos7 Tokens]$ cp servers.conf ~/.bitfusion/.
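If you expect to repeat these steps on more than one client, they can be wrapped in a small shell function. The following is a minimal sketch, not a Bitfusion tool: the name install_token and its optional directory arguments are hypothetical, and the default destinations are simply the ones used in the manual steps above.

```shell
#!/usr/bin/env bash
# Sketch only: install_token is a hypothetical helper, not a Bitfusion command.
# Usage: install_token TARFILE [TLS_DIR] [USER_DIR]
install_token() {
    local tarfile="$1"
    local tls_dir="${2:-/etc/bitfusion/tls}"   # where ca.crt goes
    local user_dir="${3:-$HOME/.bitfusion}"    # where client.yaml and servers.conf go
    local workdir f rc=0

    workdir="$(mktemp -d)" || return 1
    tar -xf "$tarfile" -C "$workdir" || rc=1

    # Stop early if any of the three expected files is missing.
    for f in ca.crt client.yaml servers.conf; do
        if [ "$rc" -eq 0 ] && [ ! -f "$workdir/$f" ]; then
            echo "install_token: $f not found in $tarfile" >&2
            rc=1
        fi
    done

    if [ "$rc" -eq 0 ]; then
        mkdir -p "$user_dir" &&
            cp "$workdir/client.yaml" "$workdir/servers.conf" "$user_dir/" || rc=1
    fi

    if [ "$rc" -eq 0 ]; then
        # The TLS directory is normally root-owned, so fall back to sudo.
        if [ -w "$tls_dir" ]; then
            cp "$workdir/ca.crt" "$tls_dir/" || rc=1
        else
            sudo cp "$workdir/ca.crt" "$tls_dir/" || rc=1
        fi
    fi

    rm -rf "$workdir"
    return "$rc"
}
```

With the function loaded into your shell, `install_token blogtoken.tar` would perform the same three copies as the transcript above.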
That’s it! The bare metal client (or container, etc.) is ready to go. Below we quickly test the authorization with a bitfusion list_gpus command and by running a sample CUDA application.
[bf_user@bf_user-client-centos7 Tokens]$ bitfusion list_gpus
- server 0[172.16.31.224:56001]: running 0 tasks
  |- GPU 0: free memory 15109 MiB / 15109 MiB
  |- GPU 1: free memory 15109 MiB / 15109 MiB
- server 1[172.16.31.234:56001]: running 0 tasks
  |- GPU 0: free memory 15109 MiB / 15109 MiB
- server 2 (leader) [172.16.31.214:56001]: running 0 tasks
  |- GPU 0: free memory 15109 MiB / 15109 MiB
[bf_user@bf_user-client-centos7 Tokens]$ cd /usr/local/cuda/samples/0_Simple/matrixMul
[bf_user@bf_user-client-centos7 matrixMul]$ bitfusion run -n 1 -- ./matrixMul
Requested resources:
Server List: 172.16.31.224:56001
Client idle timeout: 0 min
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Tesla T4" with compute capability 7.5
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 406.81 GFlop/s, Time= 0.322 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
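If you want a one-line summary of what a freshly enabled client can see, the list_gpus output can be piped through a small filter. This is a rough sketch that assumes the output keeps the layout shown in the transcript above (one "- server" line per server, one "|- GPU" line per GPU, with the free memory value right after the word "memory"). The function name summarize_gpus is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: summarize_gpus is a hypothetical helper that reads
# `bitfusion list_gpus` output on stdin and prints server and GPU totals.
# It assumes the text layout shown in the transcript above.
summarize_gpus() {
    awk '
        /^[ \t]*- server /   { servers++ }
        /[|]- GPU [0-9]+:/   {
            gpus++
            # The free memory value is the field right after the word "memory".
            for (i = 1; i < NF; i++) {
                if ($i == "memory") { free += $(i + 1); break }
            }
        }
        END { printf "servers=%d gpus=%d free_mib=%d\n", servers, gpus, free }
    '
}
```

You would run it as `bitfusion list_gpus | summarize_gpus`. If a later release changes the output layout, the patterns would need to be adjusted.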
One token can enable multiple clients
In the example above, we created a single token and used it to enable a single client. But you can copy the token’s tar file to as many clients as you wish and enable all of them with the same token. Or you can create multiple tokens and deploy the tar files to subsets of the clients.
The “Tokens” screenshots should make it obvious that you can disable/enable and delete tokens, too. Since a token supports multiple clients, you have the ability to disable groups of clients with a single click of the mouse. Or you can disable one set of clients while leaving other sets up and running, as long as they use different tokens.
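Copying one tar file to a group of clients is easy to script. The sketch below assumes the clients are reachable over SSH and that scp is installed. The function name distribute_token and the host names in the usage note are hypothetical. It only copies the file into each remote home directory, so the tar file still has to be unpacked on every client as described earlier.

```shell
#!/usr/bin/env bash
# Sketch only: distribute_token is a hypothetical helper, not a Bitfusion command.
# Usage: distribute_token TARFILE HOST [HOST...]
# Copies TARFILE to the remote user's home directory on every HOST.
distribute_token() {
    local tarfile="$1" host failed=0
    shift
    for host in "$@"; do
        if scp -q "$tarfile" "$host:"; then
            echo "copied $tarfile to $host"
        else
            echo "failed to copy $tarfile to $host" >&2
            failed=1
        fi
    done
    # Non-zero if any copy failed, so callers can detect partial rollouts.
    return "$failed"
}
```

For example, `distribute_token blogtoken.tar client-a client-b` would place the same token on two clients. Keeping one tar file per group of clients gives you the per-group disable switch described above.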
Multi-Version Support – Hello, darkness, what light through yonder window breaks my old friend?
This release sees the dawn of backwards compatibility for Bitfusion. Clients running Bitfusion v2.0.x will work with v2.5.0 Bitfusion servers.
There are not a lot of details to discuss. Once all the Bitfusion servers are running v2.5.0, you can upgrade your clients, individually, at your leisure.
The Clients tab in the plug-in has added a column showing which clients are running the current version of Bitfusion (green icon), which are running a deprecated version (yellow icon), and which are running an older, non-supported version (red icon). It also has an “Uninitialized” icon (gray) for clients whose version hasn’t been seen by the servers yet—these are clients that have not run any application yet. See Figures 5 & 6.
Figure 5 – New Client tab with version column. Green is current, yellow is deprecated, red is unsupported (not shown), gray is uninitialized
Figure 6 – Tooltip showing the meaning of version icon
It’s important to note, however, what multi-version does not support. All Bitfusion servers must still run the same version of Bitfusion; server versions cannot be mixed. Upgrading servers still requires that all servers be powered down, and then upgraded and brought online one at a time. Nevertheless, there are some improvements to this upgrade procedure, which is documented in the Installation Guide. Updates may deserve their own blog.
Supported Versions – Put it in your pantry of more woe than this of Juliet with your cupcakes
New features are good. But updates have tentacles to many other software components. Bitfusion 2.5.0 has cleared enough shelf space in your version compatibility nightmare storage closet to ensure moving to recent AI/ML frameworks and newer GPUs will be a piece of cake.
Version 2.5.0 offers support for newer drivers, libraries, and frameworks. Here’s the list.
- NVIDIA driver 450 (note: required for NVIDIA Ampere A100 support)
- CUDA 11
- Validated support for TensorFlow 2.3
- Validated support for PyTorch 1.5
- Validated support for TensorRT 7.1.3
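Before planning on an A100, it may be worth confirming that the installed NVIDIA driver is at least version 450. The sketch below compares the major version number of a driver version string against a minimum. The function name driver_meets_minimum is hypothetical, and the usage note assumes nvidia-smi is available on the machine that hosts the GPUs.

```shell
#!/usr/bin/env bash
# Sketch only: driver_meets_minimum is a hypothetical helper.
# Usage: driver_meets_minimum VERSION [MINIMUM_MAJOR]
# Returns 0 if the major version is at least MINIMUM_MAJOR (default 450),
# 1 if it is lower, and 2 if VERSION does not start with a number.
driver_meets_minimum() {
    local version="$1" minimum="${2:-450}"
    local major="${version%%.*}"    # "450.80.02" -> "450"
    case "$major" in
        ''|*[!0-9]*) return 2 ;;    # empty or not purely numeric
    esac
    [ "$major" -ge "$minimum" ]
}
```

A typical call would be `driver_meets_minimum "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"`, followed by a check of the exit status.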
Health Checks – You're shaking my confidence, thus with a kiss I die daily
A hypochondriac may love the way health checks can set their worries in concrete, but the rest of us use them to fix problems and confidently proceed to our real work. Bitfusion version 2.5.0 makes significant improvements to its health checks. These include a few new health checks, a major stylistic change, and a way to suppress reports you wish to ignore.
We do not want to fully document each check in this blog. But here are four new health checks:
- All nodes running—cluster healthy
- Database and Bitfusion agree on the node count
- Enough free space on disk
- vCenter connection is working
In the old days (previous releases), the health check was strictly a command-line tool. Even in the GUI (the vCenter Server Bitfusion plug-in) all that was done was to echo the command-line output into a dialog box as text. Figure 7 shows the old health check in the GUI.
Figure 7 – Old GUI health check displayed command-line stdout
Now, the checks produce a proper data structure, allowing the GUI to treat each check as an individual item. Each check displays its controls, status icons, name, etc. in different columns for easier parsing. It also provides a brief explanation of each check. You can navigate to the health display as before, by clicking the “Health” action from the “Servers” tab as shown in Figure 8. The new health check dialog is shown in Figure 9.
Figure 8 – Navigation to the GUI health check
Figure 9 – New GUI health check has individual items and is easier to parse
The word “suppression” has mostly negative connotations. Whenever you are caught with a suppressor on your pistol, for example, sneaking into the neighbor’s house, everyone assumes you are up to no good. However, there are times when papering over a flaw so you can pretend it doesn’t exist is a perfectly helpful self-delusion.
On review, the previous paragraph does not appear to offer a strong argument for suppression. So, let’s look at a specific example.
In Figure 9, we see all the health checks pass except for the first two, which have the status MARGINAL. When the marginal and passing checks are combined into a single field in the highlighted line of Figure 8, that field reflects the highest severity of all checks and displays as “Marginal”. Now, assume you have examined these two MARGINAL checks and have decided they are not something you need to address. [In this case, the first check merely indicates there was a network error or drop that occurred at some point in time. You cannot clear this error count except by rebooting. If no new network errors are accumulating, you can ignore the warning. The second check recommends you use a larger MTU, which may not be possible in your deployment.] You would likely want to suppress these checks. You would want to be alerted only if some other check was MARGINAL or FATAL.
Now, the good news is that suppressing a health check does not actually stop the check from being performed. It only stops it from being considered when combining all health checks into the single health field for the server.
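The combining rule is simple enough to sketch in a few lines of shell. This is an illustration of the behavior described above, not Bitfusion's implementation. The input format (one "CONTROL STATUS" pair per line), the control value "Enabled", the status name PASS, and the display name "Fatal" are all assumptions made for the example. Only MARGINAL, FATAL, "Ignored", "Marginal", and "Healthy" appear in this blog.

```shell
#!/usr/bin/env bash
# Sketch only: combine_health is a hypothetical helper that mimics the rule
# "the server status is the highest severity among the checks not ignored".
# Input on stdin: one "CONTROL STATUS" pair per line, e.g. "Ignored MARGINAL".
combine_health() {
    local control status rank worst=0
    while read -r control status; do
        # Ignored checks still ran, but they do not count toward the result.
        if [ "$control" = "Ignored" ]; then
            continue
        fi
        case "$status" in
            PASS)     rank=0 ;;
            MARGINAL) rank=1 ;;
            FATAL)    rank=2 ;;
            *)        continue ;;   # skip lines this sketch does not understand
        esac
        if [ "$rank" -gt "$worst" ]; then
            worst="$rank"
        fi
    done
    case "$worst" in
        0) echo "Healthy" ;;
        1) echo "Marginal" ;;
        2) echo "Fatal" ;;
    esac
}
```

Feeding it two "Enabled MARGINAL" lines and any number of "Enabled PASS" lines yields "Marginal". Changing the two marginal lines to "Ignored MARGINAL" yields "Healthy", which is the effect suppression has on the combined field.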
Set the checks' control in the first column to “Ignored”, as in Figure 10.
Figure 10 – Setting two health checks to "Ignored"
Click on the “SAVE AND EXIT” button, wait for a refresh (10 seconds by default), and see that the server now has a health status of “Healthy”, as in Figure 11.
Figure 11 – Server passes health check after suppressing known acceptable condition
Verbosity Consoles – Like a bridge where civil blood makes civil hands unclean over troubled water
To learn your product’s common user errors takes time—or if you are in a hurry, you can bridge that time barrier by letting loose a large number of common users. Now, since no one wants to be “in error” or “common”, developers ameliorate their users’ plight with preventative and diagnostic steps. The Bitfusion server start-up script has increased its error messaging. The Linux console (stderr) now reports and makes it easier to diagnose the following common errors:
- Hostname contains invalid characters (Checks on the Linux naming restrictions)
- NVIDIA driver issues
- GPUs not detectable (pass-through may not have been set up correctly)
- DNS IP address issues
- Misconfigured network adaptors (IP, MTU, netmask, gateway)
- Network configurations without a corresponding adaptor
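To give a feel for the first item in the list, here is a rough sketch of a hostname check you could run yourself before deploying a server. It is not the check the Bitfusion start-up script performs. It applies the common RFC 1123 style rules (letters, digits, hyphens, and dots; labels of 1 to 63 characters that do not start or end with a hyphen; at most 253 characters overall), and the function name is_valid_hostname is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: is_valid_hostname is a hypothetical helper, not the check
# used by the Bitfusion start-up script.
# Usage: is_valid_hostname NAME   (returns 0 if NAME looks valid, 1 otherwise)
is_valid_hostname() {
    local name="$1" label old_ifs
    if [ -z "$name" ] || [ "${#name}" -gt 253 ]; then
        return 1
    fi
    case "$name" in
        *[!A-Za-z0-9.-]*) return 1 ;;   # character other than letter, digit, dot, hyphen
        .*|*.|*..*)       return 1 ;;   # empty label
        -*|*-|*.-*|*-.*)  return 1 ;;   # label starts or ends with a hyphen
    esac
    # Every dot-separated label must be at most 63 characters long.
    old_ifs="$IFS"
    IFS=.
    for label in $name; do
        if [ "${#label}" -gt 63 ]; then
            IFS="$old_ifs"
            return 1
        fi
    done
    IFS="$old_ifs"
    return 0
}
```

For example, `is_valid_hostname "$(hostname)" || echo "hostname needs attention"` flags names that contain an underscore or other characters outside the allowed set.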
But Wait, There’s More – His bow tie is as boundless as the sea is really a camera
Releases include a lot more work than you will typically perceive. There are “quiet” improvements, bug fixes, and minor features. Sometimes work is done just to improve code, or to make it easier to maintain, or to lay groundwork for future features. Here, we keep that work concealed, except for a few brief notes.
- Bitfusion Plug-in “About” tab now displays version information
- Incremental performance improvements
- Support script to gather logging information if you need to work with VMware support
  - Log in to any Bitfusion server and run: sudo bitfusion-supportbundle.sh
  - Look for the output in /tmp/bitfusion-supportbundle.tar.gz
Conclusion – For a poet, these violent delights have violent ends and a one-man band
So, there it is: vSphere Bitfusion version 2.5.0 offers bare metal client support, client backwards compatibility, compatibility with newer drivers and frameworks, improved health and diagnostic checking, and more. If you think these are sufficiently important and numerous, feel free to strike up some celebratory music.
One final announcement: two new hands-on labs for Bitfusion are available for you to try out.
- HOL-2147-91-ISM - Using Bitfusion GPU virtualization in vSphere - Lightning Lab
- HOL-2147-02-ISM - Using Bitfusion GPU virtualization in vSphere
Visit the Bitfusion landing page for documentation and software if you would like to conduct an evaluation.
If you have Bitfusion questions you can write to us at AskBitfusion@vmware.com.