Introduction — To boldly go where no app has gone before

If you liked Bitfusion, we think you will love what we are calling Project Radium. It is a generalized approach to splitting applications and running the segments that need acceleration on remote machines. But I'm going to spill a lot of pixelated ink before I get to the point. The impatient can skip the economics analogy and go straight to the Today or even the Next section.

Specialization and Trade — Live long, and prosper

The law of comparative advantage was explained to me in fun, story form and seemed magic. Wealth appeared from thin air. On the other hand, the article I just read in an online encyclopedia managed to strip the comparative fun it has over other economic laws down to a dismal common.

So, let's review the story version.

Robinson and Friday are stranded on a desert island. The only two things worth getting are coconuts and palm leaves (food and shelter).
Assume you can survive on one coconut a day.
Robinson working all day can pick two coconuts. Alternatively, working all day he can get 4 palm leaves.
- So, every day he gets 1 coconut and 2 palm leaves. Poor man.
Friday working all day can pick three coconuts. Alternatively, working all day he can get 12 palm leaves.
- So, every day he gets 1 coconut and 8 palm leaves. Rich man.
Interesting point: interesting because, counterintuitively, it is not important — Friday is better at BOTH coconuts and palm leaves.
Important point: actually important is that while Friday is merely better at coconuts, he is super-duper better at palm leaves. Because of this important point (and arbitrarily treating palm leaves as money):
- It costs Robinson 2 palm leaves for his daily coconut.
- It costs Friday 4 palm leaves for his daily coconut.
Dismal note: we could treat coconuts as money, but then we'd have to use fractions and lose our younger readers and C-suite occupants.

Since coconuts are cheap to Robinson and dear to Friday, Robinson should specialize in coconuts and Friday in palm leaves. They should trade at a price in between their personal costs.

Every day, Robinson picks two coconuts
Every day, Friday picks 12 palm leaves
They trade 1 coconut for three leaves
- Robinson ends richer - 1 coconut and 3 palm leaves (not 2 leaves). No more subsistence!
- Friday ends richer - 1 coconut and 9 palm leaves (not 8 leaves). He accumulates capital — builds mansion — writes editorial explaining how he is enriching the poor, not oppressing them — suffers death at the hand of revolutionary with sharpened frond stem.
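For readers who would rather check the island arithmetic than trust it, here is a minimal sketch in Python. The daily outputs and the price of three leaves come from the story; the function names and the use of exact fractions are illustrative choices made here.

```python
from fractions import Fraction

# Daily output if a person spends the whole day on one good (from the story).
OUTPUT = {
    "Robinson": {"coconuts": 2, "leaves": 4},
    "Friday": {"coconuts": 3, "leaves": 12},
}


def coconut_cost(person):
    """Opportunity cost of one coconut, measured in palm leaves."""
    rate = OUTPUT[person]
    return Fraction(rate["leaves"], rate["coconuts"])


def self_sufficient(person, coconuts_needed=1):
    """Leaves left after spending just enough of the day on coconuts."""
    rate = OUTPUT[person]
    day_on_coconuts = Fraction(coconuts_needed, rate["coconuts"])
    leaves = (1 - day_on_coconuts) * rate["leaves"]
    return {"coconuts": coconuts_needed, "leaves": leaves}


def with_trade(price_in_leaves=3):
    """Robinson picks only coconuts, Friday only leaves; one coconut is traded."""
    robinson = {
        "coconuts": OUTPUT["Robinson"]["coconuts"] - 1,
        "leaves": price_in_leaves,
    }
    friday = {
        "coconuts": 1,
        "leaves": OUTPUT["Friday"]["leaves"] - price_in_leaves,
    }
    return {"Robinson": robinson, "Friday": friday}


if __name__ == "__main__":
    traded = with_trade()
    for name in OUTPUT:
        alone = self_sufficient(name)
        print(f"{name}: a coconut costs {coconut_cost(name)} leaves; "
              f"alone keeps {alone['leaves']} leaves, "
              f"with trade keeps {traded[name]['leaves']} leaves")
```

Any price strictly between the two personal costs (2 and 4 leaves) leaves both islanders better off; 3 simply splits the gain evenly.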

Specialization Applies to Computing, Too

We should easily be able to see that the specialization of microservices and hardware accelerators benefits applications with greater efficiency. For example:

vSphere and Kubernetes (virtualization and orchestration) keep resources busy. Other benefits accrue as elasticity, reliability, recovery, scalability, and more are realized.
Many machine learning (ML) applications use GPUs to run the data-intensive sections of their algorithms (say, matrix multiplication), saving time, money, and power.

But one can always ask, have we wrung as much as possible from our specialization? Can we adjust our specialization to greater effect? Initially, you may not get everything perfect. So, let us first set a comparison point by looking at the specialization and monitoring of Bitfusion.

Kirk demonstrates that specialization can have a downside, too

Bitfusion Today — I canna' change the laws of physics

Bitfusion exists because we want our servers with accelerator resources to specialize in acceleration. We neither want them idle nor running code that a general-purpose machine should run. Bitfusion provides virtualization for GPUs and delivers services to orchestrated containers. Bitfusion pools servers with GPUs and provides their acceleration resources to many clients across the network. Plus, it only provides resources for the time they are needed. And Bitfusion can partition GPUs so multiple clients can consume them at the same time. Bitfusion enters into the land of specialization and trade by splitting your ML app software stack as shown here.

Figure 1: Bitfusion splits the application stack, allowing for remote GPU access from a pool of servers

You have two servers running your application. The bottom of the stack runs on a server with GPUs. The top of the stack runs on a general-purpose machine. The split occurs exactly at the CUDA driver API. To monitor these accesses, Bitfusion interposes a new layer in the stack. Calls to CUDA on the client are intercepted by Bitfusion, which sends them to the remote Bitfusion GPU server. There, the CUDA calls are made in earnest. The interposing layer has many responsibilities, but we can summarize them as keeping the data, code, and execution environment in agreement.
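The interposition idea can be sketched in a few lines of Python. This is a toy model only, not Bitfusion code and not the CUDA API: `FakeGpuServer`, `InterposedApi`, and the operations `alloc_and_copy`, `scale`, and `copy_back` are hypothetical names, and the "network" is a direct function call that passes bytes.

```python
import json


class FakeGpuServer:
    """Stand-in for the remote GPU server, where calls run 'in earnest'."""

    def __init__(self):
        self.buffers = {}      # handle -> list of floats (pretend device memory)
        self.next_handle = 1
        self._operations = {   # the only calls this toy server accepts
            "alloc_and_copy": self._alloc_and_copy,
            "scale": self._scale,
            "copy_back": self._copy_back,
        }

    def handle(self, request_bytes):
        """Decode one request, execute it, and encode the reply."""
        request = json.loads(request_bytes)
        operation = self._operations[request["call"]]
        result = operation(*request["args"])
        return json.dumps({"result": result}).encode()

    def _alloc_and_copy(self, values):
        handle = self.next_handle
        self.next_handle += 1
        self.buffers[handle] = list(values)
        return handle

    def _scale(self, handle, factor):
        self.buffers[handle] = [v * factor for v in self.buffers[handle]]

    def _copy_back(self, handle):
        return self.buffers[handle]


class InterposedApi:
    """Client-side layer: the application calls it like a local API, but
    every call is serialized and forwarded instead of touching hardware."""

    def __init__(self, transport):
        self._transport = transport   # any callable: bytes -> bytes

    def __getattr__(self, call_name):
        def forward(*args):
            request = json.dumps({"call": call_name, "args": args}).encode()
            reply = json.loads(self._transport(request))
            return reply["result"]
        return forward


if __name__ == "__main__":
    server = FakeGpuServer()
    api = InterposedApi(server.handle)   # a real system would use a network link
    buffer_handle = api.alloc_and_copy([1.0, 2.0])
    api.scale(buffer_handle, 3)
    print(api.copy_back(buffer_handle))
```

The client only ever holds an opaque handle; the data itself lives on the server side. Keeping such handles, data, and state consistent across the two machines is the kind of bookkeeping the interposing layer is responsible for.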

Here are some of the benefits (or wealth, in our economic analogy) we gain by this specialization.

GPUs can be instantly, dynamically allocated and freed.
- Thus, GPUs can be shared, serially, by multiple clients.
Implementation also allows instant, dynamic GPU partitioning.
- Thus, GPUs can be shared, concurrently, by multiple clients.
With these two ways of sharing, utilization of GPUs, a scarce resource, rises.
Users running client applications do not have to start and stop their VMs to acquire and release resources.
Users have access to a wider variety of accelerators.
Users have access to a greater number of accelerators.
Accelerators are in a common pool making them easier to manage and administer.
A large pool averages out the bursts of demand from multiple groups (group A users are not likely to burst at the same time as group B users).
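The two kinds of sharing in the list above, serial and concurrent, can be illustrated with a toy allocator. This is a sketch under simplifying assumptions: `GpuPool` and its methods are hypothetical names, every GPU is treated as identical with a capacity of 1, and "utilization" here means allocated share, not measured load.

```python
from fractions import Fraction


class GpuPool:
    """Toy pool: each GPU has capacity 1 and can be handed out in fractions."""

    def __init__(self, gpu_count):
        self.free = [Fraction(1)] * gpu_count   # unallocated share of each GPU
        self.leases = {}                        # client -> (gpu index, share)

    def allocate(self, client, share=Fraction(1)):
        """Give the client a share of the first GPU with enough room."""
        if client in self.leases:
            raise ValueError(f"{client} already holds a lease")
        for index, available in enumerate(self.free):
            if available >= share:
                self.free[index] -= share
                self.leases[client] = (index, share)
                return index
        raise RuntimeError("no GPU has enough free capacity")

    def release(self, client):
        """Return the client's share to the pool."""
        index, share = self.leases.pop(client)
        self.free[index] += share

    def utilization(self):
        """Allocated fraction of the whole pool."""
        return 1 - sum(self.free) / len(self.free)


if __name__ == "__main__":
    pool = GpuPool(gpu_count=2)
    pool.allocate("alice", Fraction(1, 2))   # concurrent sharing: half of GPU 0
    pool.allocate("bob", Fraction(1, 2))     # the other half of GPU 0
    pool.allocate("carol")                   # all of GPU 1
    print("utilization:", pool.utilization())
    pool.release("carol")                    # serial sharing: carol finishes...
    pool.allocate("dave")                    # ...and dave reuses GPU 1
    print("utilization:", pool.utilization())
```

No client starts or stops a VM in this model; a lease is taken when the work begins and handed back when it ends, which is why the pool stays busy.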

Bitfusion has all this today.

Project Radium — Remote Trek: The Next Generation?

After Bitfusion had been introduced to the market, we began to look at its approach to specialization and virtualization. We are conducting research and making prototypes that split applications at different places and by different methods. We call this work Project Radium (after the city in Kansas).

The benefits we are seeking are:

Greater generality
- Avoid splitting at vendor specific APIs (e.g. CUDA)
- Create immediate compatibility with new releases of vendor software or hardware
Easy inclusion of new accelerators, versions, vendors, libraries, applications, and frameworks
Inclusion of software backends, not just hardware accelerators
The basis to include entirely different devices (move beyond accelerators)
Giving users certain flexibility on how applications are split

Where and if possible, we are also seeking:

Lower overhead and higher performance
Greater reliability generally, and less vulnerability, especially to the changes of evolving APIs

We also hope to ease the job for application writers. Scheduling and distributing microservices is difficult. Radium allows you to write in a more monolithic style and to leave some of the partitioning and distribution to the tool.

Most of these benefits should come with a first release; others would materialize with subsequent releases.

Rather than just GPU/accelerator remoting, Project Radium can deliver generalized application disaggregation including remote disaggregation.

Figure 2: Project Radium splits applications and monitors both halves on physically separate machines

Project Radium doesn't have to split applications at a single, fixed place in the stack. A user-configurable script allows you to select exported library functions or scripted module functions of your application for execution on a remote virtual appliance. Project Radium monitors both halves of the application. As it observes relevant events, it keeps code, data, and execution environment coherent. This is a more generalized approach than Bitfusion uses.

More detail will be coming from OCTO (Office of the CTO), but one abstraction is to think of Radium as moving the split higher in the software stack. For example, instead of interposing at CUDA calls, Radium could set monitoring points at the level of TensorFlow calls or at a subset of TensorFlow calls. One machine, an initiator, runs normal application code, the top half; another machine, an appliance with accelerators, acts as an acceptor and runs the high-performance code of the bottom half. Project Radium is being engineered for remote acceleration of AI/ML applications. We may see other uses in the future. Out of the box, it will be ready for AMD, Graphcore, Intel, NVIDIA, and other vendor hardware.
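To make the idea of a configurable split concrete, here is a small Python sketch. It is not Radium's script format or mechanism, which this post does not show: `SPLIT_CONFIG`, `apply_split`, `Acceptor`, and the `toy_framework` module are all hypothetical, and the "remote" side is an ordinary object in the same process.

```python
import types

# Hypothetical split configuration: module name -> functions to run remotely.
SPLIT_CONFIG = {"toy_framework": ["matmul"]}


def make_toy_framework():
    """Stand-in for an imported framework module such as TensorFlow."""
    module = types.ModuleType("toy_framework")

    def matmul(a, b):
        # Plain nested-list matrix product; the 'heavy' function in this toy.
        return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
                for row in a]

    def shape(a):
        return (len(a), len(a[0]))

    module.matmul = matmul
    module.shape = shape
    return module


class Acceptor:
    """Stand-in for the appliance: it has its own copy of the framework
    and records which calls it was asked to execute."""

    def __init__(self, module):
        self._module = module
        self.executed = []

    def run(self, function_name, args):
        self.executed.append(function_name)
        return getattr(self._module, function_name)(*args)


def apply_split(module, config, acceptor):
    """On the initiator, swap the configured functions for forwarding stubs."""
    for function_name in config.get(module.__name__, []):
        def stub(*args, _name=function_name):
            return acceptor.run(_name, args)
        setattr(module, function_name, stub)
    return module


if __name__ == "__main__":
    acceptor = Acceptor(make_toy_framework())   # the 'remote' copy
    framework = apply_split(make_toy_framework(), SPLIT_CONFIG, acceptor)
    product = framework.matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(product, framework.shape(product), acceptor.executed)
```

Changing the list in `SPLIT_CONFIG` moves the split without touching application code, which is the property the paragraph above describes; listing every function of the module would correspond to remoting the whole module.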

Figure 3: Configuring Radium to remote the entire TensorFlow imported module

In another use case, Project Radium offers support for software backends. You can create a pool of servers that run special software: software you can't run everywhere, whether because of expense or because of a tight coupling with particular hardware. For instance, one backend might be ThirdAI, software that runs efficient mathematics for deep learning on traditional CPUs and can sometimes even surpass GPU performance.

Some Q and A

Q. Will Radium handle applications using multiple processes/threads?
A. Yes, of course. For example, exec, fork, pthread_create are supported.

Q. Do I have to refactor my application?
A. No. Project Radium requires no changes to code or workflows. The application split and the allocation of remote resources occur dynamically when you run the application.

Q. If this is just generalized remoting of applications, why not just ssh the entire application to the accelerator machine?
A. This is not complicated…and it is what we've already said, but the question occurs to everyone anyway. So it is good to review.

Disaggregation — This is the aspect of specialization we are pursuing. We do not want to use up the scarce, specialized resources running code that could run anywhere else.
Environment — We don't want to set up multiple application environments (potentially conflicting environments) on one target machine. As with today's Bitfusion, users only have to set up the application environment on the client machine.
Scheduling — We have to have some way to keep people from stepping on each other.

Q. Where can this run?
A. Linux machines using a common ISA (x86 to start) attached by datacenter-class switched networks

Q. What are the software compatibility requirements of client-server (initiator-acceptor) machines?
A. Testing so far indicates that running recent versions of the Linux kernel on both sides suffices. The versions and distributions do not have to be exact matches. In general, libraries, drivers, and frameworks need only be installed on one side or the other, but we'll leave the details to later discussions when we can walk through specific examples.

Conclusion — … a dream that became a reality and spread throughout the stars

We've discussed the goals and workings of Project Radium. Aspects of the value proposition naturally emerged along the way. But let's conclude with an explicit examination. Project Radium lets you decompose, manage, and deploy accelerated applications. It is a step toward disaggregated computation.

Isolate 'regular' application code from specialized vendor hardware and libraries
Manage pools of specialized hardware (and software back-ends) separately from generic computing hardware
Significantly improve multi-tenant sharing: pooling and partitioning
Live updates and upgrades of backend services and hardware
Share hardware across users, usage, SLA
Generalized to support multiple architectures, vendors, frameworks, and applications

So, this blog is wrapping up, but Radium's long half-life will yield ongoing and happy prospects.

VMware's Office of the CTO will be publishing more about Project Radium in the weeks to come. Start here!

Disaggregated Application Remoting, Energize! — Radium, not Dilithium