Bitfusion Scheduling — A Day at the Races

November 04, 2021

Opening — Start Your Engines

If everything seems under control, you're not going fast enough.
― Mario Andretti

One of the questions I hear from almost everyone new to Bitfusion is “How does the scheduling work?” A second question is “What happens when a Bitfusion server fails?”

Bitfusion, in as few words as possible, is a client/server architecture wherein clients, running AI/ML applications, dynamically allocate and access GPUs from a pool of remote servers. The GPU access is accomplished by intercepting the applications’ CUDA calls. Both questions above, however, fall into the allocate realm.

Here are the short answers. I’ll give fuller explanations after that.

Q. How does the scheduling work?
A. There is no formal scheduling.  Each Bitfusion client will spin for a time, if necessary, retrying GPU requests that haven’t succeeded yet.

Q. What happens when a Bitfusion server fails?
A. The same thing that happens when your application is using a local GPU which fails.

Scheduling — Will Pole Position Work for You?

Auto racing began five minutes after the second car was built.
— Henry Ford

Bitfusion allocates GPUs on a first-come-first-served basis. Each client competes against the others with no central coordination. This has the advantage of simplicity. Also, if you want more, since Bitfusion runs as a regular application, it is easy to have sophisticated, proven scheduling software, such as Tanzu, K8S, or even Slurm or LSF, launch Bitfusion jobs for you.

Four examples suffice to delineate the behavior of Bitfusion allocation. In rhyming fashion, we’ll call them “Everybody Wins”, “Somebody Spins”, “Kicked in the Shins”, and “Ain’t no Pins”. In these examples, we have a pool of two Bitfusion servers. Each server has four GPUs (except in the last example). And we have three clients, violet, blue, and green (with different striping), which try to allocate various numbers of GPUs.

Everybody Wins

Aerodynamics are for people who can’t build engines.
— Enzo Ferrari

This is the happy example. There are enough GPUs, suitably distributed across the servers, to satisfy the requests of all three clients.

Allocation when every client succeeds

Figure 1: Sufficient GPUs and distribution for all clients to allocate successfully.

Given today’s landscape—GPUs are vastly underutilized—this should be the typical situation for most users. We do not know, here, the order in which the clients made their requests, but each sought two GPUs. Each client received responses from both servers with the number of currently available GPUs, GPUs which have been temporarily locked for that client. Clients respond to each server, accepting or rejecting the offered GPUs. Servers unlock the rejected GPUs. In this example, each client succeeded in allocating their desired GPUs.
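The offer, accept, and reject exchange described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not Bitfusion's implementation: the `Server` class, the `allocate` function, and the rule of accepting from the first server whose offer is large enough are all hypothetical names and choices of mine.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Server:
    """Hypothetical stand-in for a Bitfusion GPU server."""
    name: str
    total: int
    allocated: int = 0
    locked: Dict[str, int] = field(default_factory=dict)  # client -> GPUs held as an offer

    def free(self) -> int:
        return self.total - self.allocated - sum(self.locked.values())

    def offer(self, client: str) -> int:
        """Temporarily lock every currently free GPU for this client."""
        count = self.free()
        self.locked[client] = count
        return count

    def accept(self, client: str, wanted: int) -> None:
        """Turn part of the offer into an allocation and unlock the rest."""
        offered = self.locked.pop(client)
        if wanted > offered:
            raise ValueError("cannot accept more GPUs than were offered")
        self.allocated += wanted

    def reject(self, client: str) -> None:
        """Unlock every GPU that was offered to this client."""
        self.locked.pop(client, None)


def allocate(client: str, wanted: int, servers: List[Server]) -> Optional[str]:
    """Return the name of the server that satisfied the request, or None."""
    offers = {s.name: s.offer(client) for s in servers}
    chosen = None
    for s in servers:
        # Assumption: the client takes the first offer that is large enough.
        if chosen is None and offers[s.name] >= wanted:
            s.accept(client, wanted)
            chosen = s.name
        else:
            s.reject(client)
    return chosen


pool = [Server("server-1", 4), Server("server-2", 4)]
for client in ("violet", "blue", "green"):
    print(client, "->", allocate(client, 2, pool))
```

With two four-GPU servers and three requests for two GPUs each, every call returns a server name, which is the "Everybody Wins" outcome. The particular distribution the sketch produces is just one of several that would work.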

Somebody Spins

We broke something, I think it was traction…
— Carl Edwards after getting spun out by Dale Jr. at Michigan

Calling upon my years of experience, I froze at the controls.
— Stirling Moss

This example illustrates reasons why a client will enter a request spin.

Client must spin until GPUs become available for allocation

Figure 2: A client enters request spin until GPUs available.

Two clients succeed in allocating three GPUs each. This leaves a single GPU free on each Bitfusion server. The third client (green, diagonal striping) requests two GPUs. While there are a total of two GPUs available in the pool, no single server has two free GPUs. The third client, then, is informed that the request cannot be satisfied immediately. The client enters into a spin, re-issuing the request after delays with a limited, exponential back-off. The client will spin for a configurable number of minutes and then give up. The default limit for retrying requests is 30 minutes.

Assuming the first or second client job completes soon, three GPUs will be freed. Then the third, green client will allocate two GPUs and begin running its application.
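For readers who think in code, here is a minimal sketch of what a request spin with a limited, exponential back-off might look like. Only the 30-minute default comes from this article. The function name `spin_for_gpus`, the one-second first delay, and the 60-second cap are assumptions I picked for illustration, and `try_allocate` is a placeholder for whatever actually issues the request.

```python
import time
from typing import Callable


def spin_for_gpus(
    try_allocate: Callable[[], bool],
    limit_s: float = 30 * 60,       # from the article: default limit is 30 minutes
    first_delay_s: float = 1.0,     # assumption: the real first delay is not stated
    max_delay_s: float = 60.0,      # assumption: the cap that keeps the back-off "limited"
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Retry try_allocate until it succeeds or the time limit runs out."""
    deadline = clock() + limit_s
    delay = first_delay_s
    while True:
        if try_allocate():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False            # give up: the request was never satisfied
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)  # double the delay, up to the cap
```

Passing `limit_s=0` models a client configured to give up immediately: it makes one attempt and returns. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.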

Kicked in the Shins

He ran out of talent about halfway through the corner.
— Buddy Baker

This example shows a request that fails immediately, because it cannot be satisfied ever, not even if all GPUs were free.

Insufficient GPUs on any server to satisfy client request

Figure 3: A client request fails immediately, as it can never be satisfied.

The third client is requesting six GPUs, but no server possesses six GPUs. The client receives a complete denial from all servers. The client prints an appropriate error message and exits.
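The first three examples boil down to one decision with three outcomes, which the following sketch makes explicit. It assumes, as the examples imply, that all GPUs for a request must come from a single server. The `Outcome` enum and the `classify` function are hypothetical names, not part of any Bitfusion interface.

```python
from enum import Enum
from typing import List, Tuple


class Outcome(Enum):
    ALLOCATE = "allocate now"
    SPIN = "spin and retry"
    FAIL = "fail immediately"


def classify(wanted: int, servers: List[Tuple[int, int]]) -> Outcome:
    """Classify a request against a pool given as (total_gpus, free_gpus) pairs."""
    if any(free >= wanted for _, free in servers):
        return Outcome.ALLOCATE   # some server can satisfy the request right now
    if any(total >= wanted for total, _ in servers):
        return Outcome.SPIN       # satisfiable later, once GPUs are freed
    return Outcome.FAIL           # no server is big enough, even when idle


print(classify(2, [(4, 4), (4, 4)]))  # Everybody Wins
print(classify(2, [(4, 1), (4, 1)]))  # Somebody Spins
print(classify(6, [(4, 4), (4, 4)]))  # Kicked in the Shins
```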

Ain’t no Pins

What’s behind you doesn’t matter.
— Enzo Ferrari

This last example is the case showing Bitfusion has no formal scheduling.

Clients allocate resources on a first-come, first-served basis

Figure 4: Clients compete on a first-come-first-served basis with no scheduling or ordering.

We think the situation shown here occurs infrequently, given that GPUs are typically underutilized. If, however, your experience is different, then you should consider using Bitfusion under the scheduling capabilities of Tanzu or some other product. Here, we have one server with four GPUs and one server with two GPUs. The first client allocated two GPUs from the server with four, leaving only two GPUs free on each server. Now, as the second and third clients each request four GPUs, neither request can be immediately fulfilled. Ideally, the first client would have taken its GPUs from the other server; then one of the two larger requests could have proceeded. But there is no scheduling functionality that plans for future requests.

We also note that there is no formal queuing. As the blue and green clients (vertical and diagonal striping) wait for the four GPUs to free up, we cannot predict which will succeed first. It is a race condition.
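A few lines of Python show both halves of this example: how the first client's choice of server decides whether a four-GPU request can proceed, and why the eventual winner is unpredictable. The server names and both functions are hypothetical, and modeling the race as a uniform random pick is my simplification of "we cannot predict which will succeed first".

```python
import random
from typing import Dict, List


def can_fit(free_by_server: Dict[str, int], wanted: int) -> bool:
    """True if any single server has enough free GPUs for the request."""
    return any(free >= wanted for free in free_by_server.values())


def next_winner(spinning_clients: List[str], rng: random.Random) -> str:
    """Model the race among spinning clients as a uniform random choice."""
    return rng.choice(spinning_clients)


# What happened: the first client took two GPUs from the four-GPU server.
as_it_happened = {"four-gpu-server": 4 - 2, "two-gpu-server": 2}
# What a scheduler might have done: take them from the two-GPU server instead.
ideal_placement = {"four-gpu-server": 4, "two-gpu-server": 2 - 2}

print(can_fit(as_it_happened, 4))   # False: both four-GPU requests must spin
print(can_fit(ideal_placement, 4))  # True: one four-GPU request could start

print(next_winner(["blue", "green"], random.Random()))  # either one; no queue order
```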

Disney Movie: Herbie the Love Bug splits in half and still wins the race

My delighted, 5-year-old self learned that, even in a Disney movie, race conditions do not lead to endings that are entirely predictable.

Bullet point summary of scheduling:

  • Clients search list of servers for those that can satisfy requests
  • If insufficient GPUs exist, command will fail
  • If insufficient GPUs are available (unallocated), client will spin doing retries (exponential, limited back-off)
  • The length of spin is configurable: give up immediately, give up after 10 minutes, etc. Default is 30 minutes.
  • No formal queue—when multiple clients spin, they all have an equal chance of getting the next available GPUs
  • There is no prioritization nor preemption.
  • Sufficient command line options exist for good integration into scheduling products (vSphere with Tanzu, or even third-party schedulers, such as Slurm)

Failure May Not Be an Option, But Success Is Not a Guarantee

After the third flip, I lost control…
— Don Roberts

No one likes to dwell on failure, but fortunately this section can deal with the topic pretty quickly. Bitfusion allocates GPUs for client applications and Bitfusion monitors applications for completion, so it can free the GPUs. But Bitfusion takes no further measures to ensure that GPUs remain available. When a GPU fails, or a Bitfusion GPU server fails, that failure will be seen by the application as if the GPU ceased functioning.

  • It is up to the application to save what state it can in order to preserve any work already completed.
  • It is up to the user or client (or client-side scripts or schedulers) to re-launch the application.

Applications run as if they had local GPUs, subject to GPU failures.

On the plus side, Bitfusion creates no additional burdens for applications and clients when it comes to failure and recovery. And more significantly, since Bitfusion is likely providing a pool of GPUs, clients can instantly launch the application again allocating different GPUs.
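To make the division of labor concrete, here is a small sketch of both responsibilities: the application saves its progress, and a client-side wrapper re-launches it after a failure. Everything here is hypothetical and independent of Bitfusion: `GpuLost` stands in for whatever error your framework raises when a GPU disappears, and the JSON checkpoint holding a step counter stands in for real model state.

```python
import json
import os
from typing import Callable, Optional


class GpuLost(RuntimeError):
    """Stand-in for the error an application sees when its GPU fails."""


def load_step(path: str) -> int:
    """Return the last checkpointed step, or 0 if there is no checkpoint."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["step"]
    return 0


def save_step(path: str, step: int) -> None:
    """Write the checkpoint to a temp file, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)  # a crash mid-write leaves the old checkpoint intact


def train(path: str, total_steps: int, fail_at: Optional[int] = None) -> int:
    """Application side: resume from the checkpoint and save after each step."""
    step = load_step(path)
    while step < total_steps:
        if step == fail_at:
            raise GpuLost(f"simulated GPU failure at step {step}")
        step += 1              # one unit of real GPU work would go here
        save_step(path, step)
    return step


def run_with_relaunch(job: Callable[[int], int], max_attempts: int = 3) -> int:
    """Client side: re-launch the job after a GPU failure, up to a limit."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job(attempt)
        except GpuLost:
            if attempt == max_attempts:
                raise
    raise ValueError("max_attempts must be at least 1")
```

In a real deployment each re-launch would be a fresh Bitfusion run, so the new attempt would go through allocation again and could land on different GPUs from the pool.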

On the minus side, we acknowledge that Bitfusion does not add features for recovery, migration, and so on.

Conclusion — And the Winner Is…

Finishing races is important, but racing is more important. 
— Dale Earnhardt

Bitfusion was engineered to raise the sharing and utilization of expensive acceleration resources, GPUs. Its approach to scheduling and failure handling has been kept simple to concentrate engineering effort on the main goal.

Clients currently choose among the resources available at the moment, rather than relying on a centralized decision-maker. We may see expanded capabilities in later releases.

Hardware failures are reflected as such to the clients, in keeping with the current assumptions of AI/ML applications. 
