Fast Start Script for Bitfusion Clients — Quickdraw in the Wild, Wild West
There's a new script in town. Everyone can get their first Bitfusion client running a machine learning application in record time.
And as we tell you all about it, we tease the classic Western, Once Upon a Time in the West. The movie uses every meme known to the genre, but the drama is human and tragic, and the plot still surprises. So we cast this blog in all the major roles.
Problem
You need a great villain. Frank is the chief enforcer for a railroad tycoon. He is asked why he killed folks in the railroad's way instead of just scaring them off.
“People scare better when they're dying.” — Frank
I often hear requests for a Bitfusion client OVA. The Bitfusion servers are delivered as OVAs, so why does VMware not deliver an example client OVA that already has everything installed to run something like a TensorFlow application under Bitfusion? It would be a good way for machine learning newbies to get started, wouldn't it? Well, yes, it would. However, because of various licensing concerns, we cannot package up everything you need.
In conjunction with the vSphere Bitfusion 4.5 release, we have done the next best thing. We have created a startup script, client_vm_starter.sh, that installs everything you need to turn a VM into a Bitfusion client and run a TensorFlow benchmark. It should work for bare metal machines, as well.
Solution
You need a mysterious hero. A drifter dubbed Harmonica appears for a meeting with Frank, but is met by three threatening henchmen on three horses instead. They say they're shy one horse when Harmonica asks if they brought one for him.
“You brought two too many.” — Harmonica
A script can be powerful without a lot of dialog.
Many machine learning applications have a dependency on the CuDNN library from NVIDIA. While anyone can create an NVIDIA developer's account and download the library, you can't make a script that will download it without the login information. CuDNN is, however, available in the Ubuntu 20.04 repo, so, for now anyway, this script is only for systems running Ubuntu 20.04. For those running other distributions and releases, the script can serve as a recipe. You can download the script from https://packages.vmware.com/bitfusion/scripts/client_vm_starter.sh, and its use is described in the Bitfusion 4.5 User Guide.
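Since the script targets Ubuntu 20.04 only, it may be worth checking the distribution before you run it. The following is a minimal sketch, not part of client_vm_starter.sh: the function name is_ubuntu_2004 is made up for illustration, and it assumes the standard /etc/os-release file is present.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of client_vm_starter.sh):
# succeeds only when the os-release file describes Ubuntu 20.04.
# An optional argument names an alternate os-release file to inspect.
is_ubuntu_2004() {
    local os_release="${1:-/etc/os-release}"
    [ -r "$os_release" ] || return 1
    # Source the file in a subshell so ID and VERSION_ID do not leak out.
    (
        . "$os_release"
        [ "${ID:-}" = "ubuntu" ] && [ "${VERSION_ID:-}" = "20.04" ]
    )
}

if is_ubuntu_2004; then
    echo "Ubuntu 20.04 detected: the starter script should apply as-is."
else
    echo "Not Ubuntu 20.04: treat the starter script as a recipe instead."
fi
```

On any other distribution or release, the second message is the cue to read the script rather than run it.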
The Four Paths
Complications stack up and options close. You need unexpected allies. Cheyenne is an escaped outlaw who takes up sides with Harmonica.
“Do you just play or can you shoot too?” — Cheyenne
There are four main pathways through this script. We'll show them graphically, describe them briefly, and then point you to the help option which describes them in detail (and unlike this blog, will stay up-to-date as the script evolves).
Figure 1: Four installation paths of Bitfusion Client VM Starter script: just Bitfusion, an ML app and dependencies, Bitfusion plus app and dependencies, Docker.
1. Bitfusion
Downloads and installs:
- Bitfusion client software (and nothing else)
2. Bitfusion and the application and dependency panorama
Downloads and installs:
- Bitfusion client software
- CUDA runtime libraries
- CuDNN
- TF
- TF benchmarks (the application)
3. The application and dependency panorama, but not Bitfusion
Downloads and installs:
- CUDA runtime libraries
- CuDNN
- TF
- TF benchmarks (the application)
4. Docker (so you can run Bitfusion and the application and dependency panorama in a container)
Downloads and installs:
- Docker
This path can be done stand-alone, or it can be tacked on to any of the other three paths with the `-d` option (indicated in the diagram by the gray line).
Help
Once you have downloaded the script —
wget https://packages.vmware.com/bitfusion/scripts/client_vm_starter.sh
— you can simply run —
bash client_vm_starter.sh -h
— to get help with the options and parameters of the script.
Uninstall
Above we discussed the four pathways through the script, but we should mention that it also has parameters to uninstall components. It errs on the side of caution when removing packages, however, and may not remove items that could have been installed independently of and prior to running the script.
Activate the Client
Installing Bitfusion on a client is not enough to allow you to allocate GPUs from the Bitfusion server.
Be sure to follow the steps in the Bitfusion installation guide to activate the client or generate an authorization token for it.
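Once the client is activated, a quick sanity check can confirm that it really can see GPUs on the server. The sketch below wraps the `bitfusion list_gpus` command in a helper. The function name check_bitfusion_client is hypothetical, and the sketch assumes that list_gpus exits with a non-zero status when the client has not been authorized.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether this client can reach Bitfusion GPUs.
check_bitfusion_client() {
    # The client software has to be installed before anything else can work.
    if ! command -v bitfusion >/dev/null 2>&1; then
        echo "bitfusion client is not installed" >&2
        return 1
    fi
    # Assumption: list_gpus fails (non-zero exit) on an unauthorized client.
    if bitfusion list_gpus; then
        echo "client can see remote GPUs"
    else
        echo "list_gpus failed: activate the client or install a token" >&2
        return 2
    fi
}
```

Run check_bitfusion_client after activation. A return code of 1 points at the installation, while 2 points at activation or the token.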
Datasets
You need someone to rescue. Jill was promised a new start out west, but ran into those who would take away even that.
“The last man who told me that... is buried out there.” — Jill
One of the things this script does NOT do is download a dataset. The application, TensorFlow benchmarks, will synthesize data for you if you do not specify a dataset, so in that way, this client starter script is self-sufficient. On the other hand, if you wish to test Bitfusion with real data, these lines will grab the CIFAR-10 dataset and run the benchmark with it.
cd ~
mkdir -p datasets
cd datasets
wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
# Extract the archive just downloaded into ~/datasets
tar -xvzf cifar-10-python.tar.gz
cd ~ # or to the directory where the TensorFlow benchmarks were installed when you ran the script
bitfusion run -n 1 -- python3 benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_batches=100 --num_gpus=1 --batch_size=64 --model=alexnet --data_dir="$HOME/datasets/cifar-10-batches-py/" --data_name=cifar10
A future release of this script will likely have an option to download a dataset.
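Until then, it can help to confirm that the archive was extracted correctly before pointing the benchmark at it. Here is a minimal sketch. The function name cifar10_ready is made up, and the default directory assumes the dataset was unpacked under ~/datasets as described above. The file names are the batch files shipped in the CIFAR-10 "python version" archive.

```shell
#!/usr/bin/env bash
# Hypothetical helper: confirm the CIFAR-10 batch files are where the
# benchmark's --data_dir will look for them.
cifar10_ready() {
    local dir="${1:-$HOME/datasets/cifar-10-batches-py}"
    local f
    for f in data_batch_1 data_batch_2 data_batch_3 data_batch_4 data_batch_5 test_batch; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            return 1
        fi
    done
    echo "CIFAR-10 batches found in $dir"
}
```

Call cifar10_ready with no argument for the default location, or pass the directory you plan to hand to --data_dir.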
Docker
The climax. You want all mysteries explained, the plot packaged up, and the bad guys to see their end. But you don't want spoilers, so I am only tossing out the merest gossamer of hint.
“Make your ever lovin' brother happy.” — Frank
Since this starter script will also install Docker for you, I thought it would be good to supply a Dockerfile to create an image that will run the same benchmark with the same dataset as above.
FROM nvcr.io/nvidia/tensorflow:20.11-tf2-py3
MAINTAINER James Brogan <someone@somewhere.com>

# Set initial working directory
WORKDIR /home/bitfusion/downloads/

# Update package list
RUN apt-get -y update

# Install Bitfusion for Ubuntu 18.04 (the release the base image is built on)
RUN wget https://packages.vmware.com/bitfusion/ubuntu/18.04/bitfusion-client-ubuntu1804_4.0.1-5_amd64.deb
RUN apt-get install -y ./bitfusion-client-ubuntu1804_4.0.1-5_amd64.deb

# Must run list_gpus to pull in env and tokens
RUN bitfusion list_gpus

# Get the CIFAR-10 dataset tar file
RUN wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz

# TF benchmarks
WORKDIR /home/bitfusion/
RUN git clone https://github.com/tensorflow/benchmarks.git

# Set working directory to TensorFlow Benchmarks
WORKDIR /home/bitfusion/benchmarks/
RUN git checkout cnn_tf_v2.1_compatible

# Set working directory and extract CIFAR-10
WORKDIR /home/bitfusion/
RUN tar -xvzf /home/bitfusion/downloads/cifar-10-python.tar.gz

# Save an example benchmark command for use inside the container
RUN echo "bitfusion run -n 1 -- python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_batches=100 --num_gpus=1 --batch_size=64 --model=alexnet --data_dir=./cifar-10-batches-py/ --data_name=cifar10" > README_ExampleCmd
Note: this Dockerfile uses an underlying NGC image based on Ubuntu 18.04, so it installs the Ubuntu 18.04 Bitfusion client package. This still works.
Note: this Dockerfile assumes an underlying VM that has been activated (in vCenter Bitfusion plug-in) as a Bitfusion client. If you run Docker on a different machine, you will need to modify it to use a Bitfusion token (which can also be generated in the vCenter Bitfusion plug-in).
To create and run the image and then obtain a command line to run the benchmark, perform the following (and you can make up an image name other than bf4-cifar10-tfbench-ub18, if you wish):
sudo docker build -t bf4-cifar10-tfbench-ub18 .
sudo docker run -it --rm --privileged --pid=host --ipc=host --net=host bf4-cifar10-tfbench-ub18
# [Now the command-line prompt changes to the container's prompt]
cat README_ExampleCmd
# [Cut and paste, then run the example command just cat'd out]
bitfusion run -n 1 -- python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_batches=100 --num_gpus=1 --batch_size=64 --model=alexnet --data_dir=./cifar-10-batches-py/ --data_name=cifar10
Denouement
Those capable of dealing with the bad are not quite capable of dwelling with the good. They ride off into the sunset.
“Will you come back someday?” — Jill
“Someday.” — Harmonica
While we cannot supply a Bitfusion client OVA, the new client_vm_starter.sh should make it quite easy to set up a viable Bitfusion client and quickly run a machine learning application on remote GPUs. This should be a significant step forward for those evaluating Bitfusion and for new users.