November 26, 2021

Fast Start Script for Bitfusion Clients — Quickdraw in the Wild, Wild West

vSphere Bitfusion is making available a public script to quickly set up clients with a machine learning application and its dependencies, as well as the Bitfusion software so you can test and learn in a working environment.

There's a new script in town. Everyone can get their first Bitfusion client running a machine learning application in record time.

And as we tell you all about it, we tease the classic Western, Once Upon a Time in the West. The movie uses every meme known to the genre, but the drama is human and tragic, and the plot still surprises. So we cast this blog in all the major roles.

Problem

You need a great villain. Frank is the chief enforcer for a railroad tycoon.  He is asked why he killed folks in the railroad's way instead of just scaring them off.
“People scare better when they're dying.” — Frank

I often hear requests for a Bitfusion client OVA.  The Bitfusion servers are delivered as OVAs, so why does VMware not deliver an example client OVA that already has everything installed to run something like a TensorFlow application under Bitfusion?  It would be a good way for machine learning newbies to get started, wouldn't it? Well, yes, it would.  However, because of various licensing concerns, we cannot package up everything you need.

In conjunction with the vSphere Bitfusion 4.5 release, we have done the next best thing.  We have created a startup script, client_vm_starter.sh, that installs everything you need to turn a VM into a Bitfusion client and run a TensorFlow benchmark. It should work for bare metal machines, as well.

Solution

You need a mysterious hero. A drifter dubbed Harmonica appears for a meeting with Frank, but is met by three threatening henchmen on three horses instead. They say they're shy one horse when Harmonica asks if they brought one for him.
“You brought two too many.” — Harmonica

Shoot-out scene from early in Once Upon a Time in the West

A script can be powerful without a lot of dialog.

Many machine learning applications have a dependency on the CuDNN library from NVIDIA. While anyone can create an NVIDIA developer's account and download the library, you can't make a script that will download it without the login information. CuDNN is, however, available in the Ubuntu 20.04 repo, so, for now anyway, this script is only for systems running Ubuntu 20.04. For those running other distributions and releases, the script can serve as a recipe.  You can download the script from https://packages.vmware.com/bitfusion/scripts/client_vm_starter.sh, and its use is described in the Bitfusion 4.5 User Guide.
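
Because the script currently targets only Ubuntu 20.04, it can be handy to check the release before running it. The following is a minimal sketch and is not part of client_vm_starter.sh. The function name is_ubuntu_2004 and its optional file argument are illustrative assumptions. It reads the standard os-release file, which defines ID and VERSION_ID.

```shell
#!/usr/bin/env bash
# Sketch: check that this machine runs Ubuntu 20.04 before using the starter script.
# is_ubuntu_2004 is a hypothetical helper, not part of client_vm_starter.sh.

is_ubuntu_2004() {
    # Optional argument: path to an os-release file (defaults to the system one).
    local release_file="${1:-/etc/os-release}"
    [ -r "$release_file" ] || return 1
    # Source the file in a subshell so its variables do not leak into our shell.
    (
        . "$release_file"
        [ "${ID:-}" = "ubuntu" ] && [ "${VERSION_ID:-}" = "20.04" ]
    )
}

if is_ubuntu_2004; then
    echo "Ubuntu 20.04 detected: the starter script should apply as-is."
else
    echo "Not Ubuntu 20.04: treat the starter script as a recipe instead."
fi
```

On any other distribution or release the check fails, which is the cue to read the script as a recipe rather than run it directly.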

The Four Paths

Complications stack up and options close. You need unexpected allies. Cheyenne is an escaped outlaw who takes up sides with Harmonica.
“Do you just play or can you shoot too?” — Cheyenne

There are four main pathways through this script. We'll show them graphically, describe them briefly, and then point you to the help option which describes them in detail (and unlike this blog, will stay up-to-date as the script evolves).


Figure 1: Four installation paths of Bitfusion Client VM Starter script: just Bitfusion, an ML app and dependencies, Bitfusion plus app and dependencies, Docker.

1. Bitfusion
Downloads and installs:

  • Bitfusion client software (and nothing else)

 

2. Bitfusion and the application and dependency panorama
Downloads and installs:

  • Bitfusion client software
  • CUDA runtime libraries
  • CuDNN
  • TF
  • TF benchmarks (the application)

 

3. The application and dependency panorama, but not Bitfusion
Downloads and installs:

  • CUDA runtime libraries
  • CuDNN
  • TF
  • TF benchmarks (the application)

 

4. Docker (so you can run Bitfusion and the application and dependency panorama in a container)
Downloads and installs:

  • Docker

This path can be run stand-alone, or it can be tacked onto any of the other three paths with the `-d` option (indicated in the diagram by the gray line).

Help

Once you have downloaded the script:

    wget https://packages.vmware.com/bitfusion/scripts/client_vm_starter.sh

you can run it with the -h option to get help with its options and parameters:

    bash client_vm_starter.sh -h

Uninstall

Above we discussed the four pathways through the script, but we should mention that it also has parameters to uninstall components. It errs on the side of caution when removing packages, however, and may not remove items that could have been installed independently of, and prior to, running the script.

Activate the Client

Installing Bitfusion on a client is not enough to allow you to allocate GPUs from the Bitfusion server. Be sure to follow the steps in the Bitfusion installation guide to activate the client or generate an authorization token for it.
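
One way to confirm that the client is both installed and authorized is to run bitfusion list_gpus, the same command the Dockerfile later in this post uses. The following is a minimal sketch. The helper name and its messages are illustrative assumptions, and it presumes that list_gpus exits with a non-zero status when the client cannot authenticate to a server.

```shell
#!/usr/bin/env bash
# Sketch: verify that the Bitfusion client is installed and can see GPUs.
# check_bitfusion_client is a hypothetical helper; the first argument lets
# you point it at a different executable name (defaults to "bitfusion").

check_bitfusion_client() {
    local bf_cmd="${1:-bitfusion}"
    if ! command -v "$bf_cmd" >/dev/null 2>&1; then
        echo "Bitfusion client not found on PATH" >&2
        return 2
    fi
    # Assumption: list_gpus fails (non-zero exit) if the client is not authorized.
    if ! "$bf_cmd" list_gpus; then
        echo "Client installed, but list_gpus failed: activate the client or install a token" >&2
        return 1
    fi
    echo "Bitfusion client can list GPUs"
}

check_bitfusion_client || echo "See the Bitfusion installation guide for activation steps."
```

The two return codes separate "not installed" from "installed but not activated", which are fixed in different places: the first by the starter script, the second in the vCenter Bitfusion plug-in.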

Datasets

You need someone to rescue. Jill was promised a new start out west, but ran into those who would take away even that.
“The last man who told me that... is buried out there.” — Jill

One of the things this script does NOT do is download a dataset. The application, TensorFlow benchmarks, will synthesize data for you if you do not specify a dataset, so in that way, this client starter script is self-sufficient. On the other hand, if you wish to test Bitfusion with real data, these lines will grab the CIFAR-10 dataset and run the benchmark with it.

    cd ~
    mkdir -p datasets
    cd datasets
    wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
    # Extract the archive just downloaded into ~/datasets
    tar -xvzf cifar-10-python.tar.gz
    cd ~ # or to the directory where the TensorFlow benchmarks were installed when you ran the script
    # $HOME is used because the shell does not expand ~ inside --data_dir=...
    bitfusion run -n 1 -- python3 benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_batches=100 --num_gpus=1 --batch_size=64 --model=alexnet --data_dir="$HOME/datasets/cifar-10-batches-py/" --data_name=cifar10

A future release of this script will likely have an option to download a dataset.
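
Until then, it can help to confirm that the archive unpacked where --data_dir points before launching the benchmark. The following is a minimal sketch. It assumes the standard layout of the CIFAR-10 Python archive (five data_batch_N files plus test_batch and batches.meta inside cifar-10-batches-py), and the helper name is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: confirm the extracted CIFAR-10 directory has the expected files.
# check_cifar10_dir is a hypothetical helper, not part of the starter script.

check_cifar10_dir() {
    # Optional argument: dataset directory (defaults to the location used above).
    local dir="${1:-$HOME/datasets/cifar-10-batches-py}"
    local missing=0 name
    for name in data_batch_1 data_batch_2 data_batch_3 data_batch_4 data_batch_5 test_batch batches.meta; do
        if [ ! -f "$dir/$name" ]; then
            echo "missing: $dir/$name" >&2
            missing=1
        fi
    done
    # Returns 0 when every file is present, 1 otherwise.
    return "$missing"
}

if check_cifar10_dir; then
    echo "CIFAR-10 dataset looks complete."
else
    echo "CIFAR-10 dataset is incomplete or not yet downloaded."
fi
```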

Docker

The climax. You want all mysteries explained, the plot packaged up, and the bad guys to see their end. But you don't want spoilers, so I am only tossing out the merest gossamer of a hint.
“Make your ever lovin' brother happy.” — Frank

Since this starter script will also install Docker for you, I thought it would be good to supply a Dockerfile to create an image that will run the same benchmark with the same dataset as above.

    FROM nvcr.io/nvidia/tensorflow:20.11-tf2-py3
    
    LABEL maintainer="James Brogan <someone@somewhere.com>"
     
    #  Set initial working directory
    WORKDIR /home/bitfusion/downloads/
     
    # Update package list
    RUN apt-get -y update
     
    # Install the Bitfusion client for Ubuntu 18.04 (matches the base image)
    RUN wget https://packages.vmware.com/bitfusion/ubuntu/18.04/bitfusion-client-ubuntu1804_4.0.1-5_amd64.deb
    RUN apt-get install -y ./bitfusion-client-ubuntu1804_4.0.1-5_amd64.deb
    # Must run list_gpus to pull in env and tokens
    RUN bitfusion list_gpus
    
    # Get the CIFAR-10 dataset tar file
    RUN wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
    
    # TF benchmarks
    WORKDIR /home/bitfusion/
    RUN git clone https://github.com/tensorflow/benchmarks.git
    #  Set working directory to TensorFlow Benchmarks
    WORKDIR /home/bitfusion/benchmarks/
    RUN git checkout cnn_tf_v2.1_compatible
    
    #  Set working directory and extract CIFAR-10
    WORKDIR /home/bitfusion/
    RUN tar -xvzf /home/bitfusion/downloads/cifar-10-python.tar.gz
    
    RUN echo "bitfusion run -n 1 -- python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_batches=100 --num_gpus=1 --batch_size=64 --model=alexnet --data_dir=./cifar-10-batches-py/ --data_name=cifar10" > README_ExampleCmd
    

Note: this Dockerfile builds on an NGC image based on Ubuntu 18.04, which is why it installs the Ubuntu 18.04 Bitfusion client package. This combination still works.

Note: this Dockerfile assumes an underlying VM that has been activated (in the vCenter Bitfusion plug-in) as a Bitfusion client. If you run Docker on a different machine, you will need to modify the Dockerfile to use a Bitfusion token (which can also be generated in the vCenter Bitfusion plug-in).

To build and run the image and then obtain a command line to run the benchmark, perform the following (you can make up an image name other than bf4-cifar10-tfbench-ub18 if you wish):

    sudo docker build -t bf4-cifar10-tfbench-ub18 .
    sudo docker run -it --rm --privileged --pid=host --ipc=host --net=host bf4-cifar10-tfbench-ub18
    
    # [Now the command-line prompt changes to the container's prompt]
    
    cat README_ExampleCmd
    # [Cut and paste, then run the example command just cat'd out]
    bitfusion run -n 1 -- python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_batches=100 --num_gpus=1 --batch_size=64 --model=alexnet --data_dir=./cifar-10-batches-py/ --data_name=cifar10

Denouement

Those capable of dealing with the bad are not quite capable of dwelling with the good. They ride off into the sunset.
“Will you come back someday?” — Jill
“Someday.” — Harmonica

While we cannot supply a Bitfusion client OVA, the new client_vm_starter.sh should make it quite easy to set up a viable Bitfusion client and quickly run a machine learning application on remote GPUs. This should be a significant step forward for those evaluating Bitfusion and for new users.
