Artificial-Intelligence/Machine-Learning (AI/ML) Market
- What is the relationship between AI/ML and GPUs?
AI/ML applications are expanding rapidly in every vertical: finance, medical, automotive, security, retail, manufacturing, and on and on. Machine learning is used for image recognition, natural language processing, identifying patterns in tabular data, and finding structure in almost any other dataset.
AI/ML requires a tremendous amount of computation, but that computation is easy to perform in parallel. Hardware acceleration, principally GPUs today, is practically a requirement to perform the computation quickly enough within an acceptably low power envelope.
- What are the problems Bitfusion solves?
First is cost: GPUs are both expensive and underutilized. Some studies show they are used only 15% of the time.
Second is the "bare metal barrier": AI/ML started in bare metal silos. To scale, AI/ML needs the flexibility of virtualization and the orchestration of modern applications.
Virtualization of GPUs increases utilization and brings flexibility, agility, monitoring, management, and many other benefits. Plus, Bitfusion runs as an application and thus automatically runs in containerized environments.
- What is Bitfusion?
Bitfusion is a GPU virtualization solution for AI/ML workloads. It is a feature of vSphere. It pools the GPU servers and allows clients to allocate whole or partitioned GPUs from the pool across the network.
- What is a Bitfusion server?
A Bitfusion server is an appliance (a VM under vSphere made from an OVA) that runs the Bitfusion server software. Each Bitfusion server requires physical GPUs passed through from the host (with DirectPathIO). All Bitfusion servers must be pooled under a single vCenter instance.
- What is a Bitfusion client?
A Bitfusion client is any VM, bare-metal machine, container, or K8S pod that installs the Bitfusion client software – a DEB or RPM package – and that has been issued a token to authorize its requests to the Bitfusion servers.
- What are the benefits of Bitfusion?
Three of the top benefits are:
- Better utilization of expensive GPUs – Bitfusion dynamically allocates and frees GPUs among multiple clients, and lets clients share a GPU concurrently with GPU partitioning
- Running AI/ML applications on virtual machines and K8S/TKG environments (yielding better agility, flexibility, reliability, automation, cloning, etc.)
- Monitoring and management of GPU resources
- What monitoring and management features does Bitfusion have?
Bitfusion tracks and charts GPU allocation, compute & memory utilization and network traffic on a GPU-server and on a CPU-client basis. The data is kept in a database and is available for export.
Bitfusion tracks and presents cluster health.
Bitfusion lets you apply client allocation limits, idle timeouts, and more. It allows you to back up cluster configurations and set database storage policies.
- Where should I run Bitfusion?
Bitfusion is targeted and tested against a variety of AI/ML applications and frameworks, including TensorFlow, PyTorch, and TensorRT. It is not specific to any industry. Bitfusion runs CUDA applications.
Bitfusion is suitable for training and for inference.
Bitfusion works in production and development environments. Development environments may particularly value Bitfusion’s ability to dynamically allocate and free GPU resources.
Bitfusion works in data centers, the cloud, and at the edge. It may be especially difficult to populate the edge with large numbers of GPUs, so sharing there may be critical.
- What applications does Bitfusion support?
Bitfusion is intended for AI/ML workloads. Bitfusion only runs CUDA applications.
- Why only CUDA applications?
Bitfusion works by running an existing application and intercepting its calls to CUDA, NVIDIA’s computational API. Intercepted calls are sent to remote resources (GPUs) on a Bitfusion server that were allocated when the application was launched.
- Can you run Horizon, or graphical applications?
Bitfusion is NOT for use with graphical applications – the non-CUDA, graphical API calls are not intercepted.
- Can you run real-time applications?
Because of the intermittent and unpredictable variation in network performance/latency, Bitfusion is not suitable for real-time applications (including VDI). If your application has a relatively large “real-time” envelope (say, human-level, conscious reaction time) then Bitfusion can still be considered.
- Do I need to refactor my applications to run under Bitfusion?
No, applications run as is, mostly on the client machine, allocating and accessing memory and storage as before. The only difference is their CUDA calls are virtualized.
- Do Bitfusion servers need to access the storage used by the applications?
No, storage is only accessed directly by the clients. The virtualization of the CUDA calls will transfer necessary data to the devices.
Prerequisites and Compatibility
- What operating systems and what processors does Bitfusion support?
Bitfusion runs on Linux only, and for the x86 instruction set. MS Windows is not on the roadmap. Recent distributions of Ubuntu, RHEL, and CentOS are tested and supported. See the VMware Interoperability Matrix for details.
- What are the network requirements?
You will need 10 Gbps networking or higher (not a functional requirement, but strongly advised for performance). Consider higher bandwidth especially for clients allocating more than two GPUs.
You should target a client-server latency of 50 microseconds or less (again, for performance).
- What network transports are supported?
TCP/IP, RoCE, and InfiniBand (IB). Note: this includes the VMXNET3 and PVRDMA virtual adapters, along with NICs or HBAs you may pass through.
- What GPUs are supported?
Technically, Bitfusion is made to be compatible with certain versions of the NVIDIA driver, not the GPUs directly. But the GPUs we test against are also listed in the VMware Interoperability Matrix.
- Can my Bitfusion servers have more than one type of GPU?
Yes, even within a single server. We encourage variety within a cluster. However, we recommend caution when installing different GPU types and sizes on a single server. Applications using multiple GPUs may not be written to deal with heterogeneous types, nor with the differences in their capabilities. Bitfusion will only allocate the memory of the smallest GPU. These intra-server quirks can be avoided by filtering (see the FAQ question on filtering), but will not arise in the first place if all the GPUs on a single server are the same.
- Is there a limit on the number of Bitfusion servers and clients in a cluster?
There are no hard-coded limits to the number of clients nor the number of servers in a Bitfusion cluster. However, we do not expect that the number of servers will scale forever without performance implications. We advise you to perform empirical testing if you wish to scale beyond 20 or 25 servers. We do not see such limits on the number of clients.
- Can I have more than one Bitfusion cluster on a single vCenter instance?
No. At this time, all Bitfusion servers under a vCenter instance form a single Bitfusion pool. The Bitfusion servers may be instantiated on hosts in different vSphere clusters, but they will act as a single cluster from a Bitfusion point of view.
Allocation and Partitioning
- How does a client request GPU resources?
Clients issue a separate Bitfusion command for each application they run (or for a session of runs). The command has a mandatory switch specifying the number of physical GPUs requested and an optional switch for the partition size from each physical GPU.
Clients may have multiple, concurrent runs and sessions.
- How many GPUs can a client request?
Any request for some number of GPUs must be satisfied by separate physical GPUs from a single Bitfusion server. However, since a client can run concurrent Bitfusion processes, the aggregate GPU count from a client could run as high as the total GPU count in the pool (though it would be an unusual use case).
- How does Bitfusion implement GPU partitioning?
Bitfusion partitions and isolates the GPU framebuffer memory for each client request. Scheduling of the computation resources is left to the CUDA hardware and software.
- Can I request multiple partitions from a single physical GPU?
No. For any single request, each partition must come from a separate physical GPU (and all on one server).
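The placement rule in the last few answers (every GPU or partition in one request comes from a different physical GPU, all on one server) can be sketched as follows. This is only an illustration of the rule, not Bitfusion's implementation: the function name `pick_server`, the server names, and the representation of the pool as a map from server to free fraction per GPU are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def pick_server(
    free: Dict[str, List[float]],
    num_gpus: int,
    partition: float = 1.0,
) -> Optional[Tuple[str, List[int]]]:
    """Return (server, gpu_indices) for the first server that fits, else None.

    `free` maps a hypothetical server name to the free fraction of each of
    its physical GPUs (1.0 means the whole GPU is free).
    """
    for server, gpus in free.items():
        # Each requested partition must sit on a different physical GPU,
        # so count the GPUs on this one server with enough free capacity.
        candidates = [i for i, avail in enumerate(gpus) if avail >= partition]
        if len(candidates) >= num_gpus:
            return server, candidates[:num_gpus]
    # No single server can satisfy the request; servers are never combined.
    return None


pool = {"server-a": [1.0, 0.5], "server-b": [1.0, 1.0, 0.25, 1.0]}
print(pick_server(pool, 2, 1.0))  # two whole GPUs
print(pick_server(pool, 2, 0.5))  # two half-GPU partitions
print(pick_server(pool, 5, 0.1))  # more GPUs than any one server has
```

In this toy pool the first request lands on server-b, the second fits on server-a, and the third returns `None` because no single server has five physical GPUs, even though the pool has six in total.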
- Can I filter the resources I request?
Yes. For example, you can request resources after filtering for server address ranges, for GPU type, for RDMA connection, frame buffer size and more.
See the Bitfusion User Guide for more information.
- Can I limit the resources a client may request?
You can set GPU maximums, for all clients and for individual clients. Partial GPUs count toward the limit too. If a client has a limit of 2.5 GPUs, for example, it could run three simultaneous jobs allocating, respectively, 1 GPU, 0.75 of a GPU, and 0.75 of a GPU. That reaches the limit, so a fourth request would fail.
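The arithmetic of that example can be checked with a small sketch. It assumes the limit is charged as the number of GPUs times the partition size of each, which matches the numbers above; the class `ClientLimit` and its method are hypothetical names, not part of Bitfusion.

```python
from fractions import Fraction


class ClientLimit:
    """Toy accounting of a per-client GPU maximum (hypothetical helper)."""

    def __init__(self, max_gpus):
        # Fractions avoid floating-point rounding when summing partitions.
        self.max_gpus = Fraction(str(max_gpus))
        self.in_use = Fraction(0)

    def try_allocate(self, num_gpus, partition=1):
        """Charge num_gpus * partition against the limit if it still fits."""
        requested = num_gpus * Fraction(str(partition))
        if self.in_use + requested > self.max_gpus:
            return False  # request would exceed the client's limit
        self.in_use += requested
        return True


limit = ClientLimit(2.5)
results = [
    limit.try_allocate(1),        # 1 whole GPU
    limit.try_allocate(1, 0.75),  # 0.75 of a GPU
    limit.try_allocate(1, 0.75),  # 0.75 of a GPU, total is now 2.5
    limit.try_allocate(1, 0.25),  # any further request exceeds 2.5
]
print(results)
```

The first three requests succeed and bring the total to exactly 2.5; the fourth is refused.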
- What happens to allocated GPUs that sit idle?
You may set an idle time limit on a client-by-client basis in the Bitfusion management window (of vCenter). GPUs that sit idle longer than the limit will be deallocated.
- Can the GPU allocation expand or shrink at run time?
No, not for a given application. But applications do not expect this of physical devices either.
Containers, Kubernetes, and TKG
- Does Bitfusion work in containers, Kubernetes, and TKG?
Yes.
- Do I believe you?
The underlying principle is that Bitfusion software is just another application – and containers were made to contain applications. Since Bitfusion does not have kernel or ESXi components, there is nothing to integrate – just containerize it. Kubernetes and TKG just sit at the layer above, orchestrating multiple containers.
Further, we do a large amount of Bitfusion quality assurance within containers.
Further still, we support the authorization of Bitfusion client pods via the Kubernetes secrets facility.
- How does scheduling work?
Bitfusion does not have a formal scheduler. All servers respond to a request and the client will select among the responses that can satisfy the request. If no server has sufficient, currently-free resources, the client will spin reissuing the request with a limited, exponential back-off.
- Will a client spin forever making requests?
No, a time limit can be specified.
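The retry behavior described above (reissue the request with a limited, exponential back-off until a time limit expires) follows a common pattern that can be sketched as below. The function name, the delay values, and the `try_request` callback are assumptions for illustration; the actual delays and limits used by the Bitfusion client are not stated here.

```python
import time
from typing import Callable, Optional


def request_with_backoff(
    try_request: Callable[[], Optional[str]],
    timeout_s: float,
    initial_delay_s: float = 0.5,   # assumed starting delay
    max_delay_s: float = 8.0,       # assumed cap ("limited" back-off)
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Retry try_request until it returns a result or the time limit passes."""
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        result = try_request()
        if result is not None:
            return result
        remaining = deadline - clock()
        if remaining <= 0:
            return None  # time limit reached, give up
        # Never sleep past the deadline; double the delay up to the cap.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)


# Stub request that finds free resources on the third attempt.
responses = iter([None, None, "server-b"])
print(request_with_backoff(lambda: next(responses), timeout_s=30,
                           sleep=lambda s: None))
```

The `sleep` and `clock` parameters exist only so the sketch can be exercised without real waiting.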
- What if I need scheduling capabilities such as prioritization or QOS?
We recommend using a third-party scheduler such as Slurm or LSF. As Bitfusion is just another application, integration is simple.
- Is there a blog on Bitfusion scheduling with more detail, with rhyming section headers, and with spurious quotes from auto racing personalities like Mario Andretti?
- How does Bitfusion affect the performance of the application?
It depends (when doesn’t it?). Each application is different. Large batch sizes and large models are conducive to good performance (low overhead from Bitfusion). Bitfusion uses pipelining, command re-ordering, and other techniques to keep the GPU computation resources busy while data is transferred in the background; this can hide much of the network latency. Applications with a high ratio of compute-to-communication perform the best.
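To see why the compute-to-communication ratio matters, consider a deliberately simple model in which the transfer for the next batch overlaps with the computation of the current one. This is a toy model with made-up numbers, not a description of Bitfusion's internals, and both function names are hypothetical.

```python
def serial_time(batches, compute_ms, transfer_ms):
    """No overlap: every batch waits for its transfer, then computes."""
    return batches * (compute_ms + transfer_ms)


def pipelined_time(batches, compute_ms, transfer_ms):
    """Overlap: the next batch is transferred while the current one computes."""
    if batches == 0:
        return 0
    # Only the first transfer and the last compute cannot be overlapped;
    # in between, each step costs whichever of the two is slower.
    steady = (batches - 1) * max(compute_ms, transfer_ms)
    return transfer_ms + steady + compute_ms


for compute_ms, transfer_ms in [(10, 2), (2, 10)]:
    local = 100 * compute_ms  # same work with no network in the path
    remote = pipelined_time(100, compute_ms, transfer_ms)
    print(f"compute={compute_ms} ms, transfer={transfer_ms} ms: "
          f"{remote / local:.2f}x local time")
```

In this model, a job that computes for 10 ms per 2 ms of transfer runs at about 1.00x its local time, while a job with the ratio reversed runs at about 5.01x, because the network rather than the GPU sets the pace.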
- What steps can I take to optimize performance?
The ones with the largest effect are typically:
- Use larger batch sizes
- Use a large MTU
- Configure for the lowest network latency possible
- Use an RDMA transport
- What happens if I request more GPUs than exist on any one server?
The command fails.
- What happens if a Bitfusion server fails when a client is using it?
The application will perceive the disappearance of the device from the bus, typically ending the application. Many applications allow you to take checkpoints, so you may restart it with Bitfusion allocating a GPU from a different server.
- How does the Bitfusion cluster handle a failed server?
The remaining Bitfusion servers will continue to function as long as a majority of the original servers are healthy.
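As a quick illustration of the majority rule, the sketch below assumes "majority" means strictly more than half of the original servers; that reading, and both function names, are assumptions rather than documented behavior.

```python
def cluster_is_functional(total_servers: int, healthy_servers: int) -> bool:
    """True while strictly more than half of the original servers are healthy."""
    return healthy_servers > total_servers / 2


def max_tolerated_failures(total_servers: int) -> int:
    """Largest number of failed servers that still leaves a strict majority."""
    return (total_servers - 1) // 2


for total in (3, 4, 5):
    print(total, "servers tolerate", max_tolerated_failures(total), "failure(s)")
```

Under this assumption, an odd server count gives the most fault tolerance per server: four servers tolerate no more failures than three.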
- Does live migration work?
That’s not an “error” question, but okay, why make a whole separate section for this.
Live migration will work on Bitfusion clients.
Live migration is not allowed on Bitfusion servers (their GPUs use DirectPathIO, which prevents it).
Moving Forward with Bitfusion
- How do I test Bitfusion?
All you need is a vSphere 7 Plus evaluation license. Get the downloads and documentation at our landing page.
- Does VMware have a Bitfusion web page with demos and blogs?
- What is the future of Bitfusion?
All forward-looking statements represent no commitment from VMware to deliver features or products. Everything is subject to change, whether due to technical feasibility, demand, or other factors.
VMware looks to bring greater generality to the virtualization of acceleration technologies—not just hardware, but software too. This includes support for multiple vendors.
For one approach, read about Project Radium: https://core.vmware.com/blog/application-remoting-energize-radium-not-dilithium.