Sizing Guidance for AI/ML in VMware Environments

Abstract

Machine learning (ML) computation differs significantly from that of classic algorithms, and users and administrators are often unsure how to size and configure systems for these new workloads. The most concrete requirement is the GPUs used to accelerate the parallel computations that would otherwise take a long time to run. But many other questions seem unanswered. How much compute, storage, and networking are needed to support the applications and GPUs? Can the machines still be virtualized? Do the application stacks make other demands? What are the performance and cost constraints?

In this document we look at some representative ML applications and see that, aside from GPUs, their requirements largely mimic the needs of general-purpose applications. Once the GPU types and counts are settled, we find that standard datacenter-class CPUs, standard high-bandwidth networking, and standard to higher-end storage configurations all suffice.

This examination was done under the umbrella of VMware’s vSphere virtualization. Virtualizing ML systems brings ML out of the separate silos that make systems difficult to manage and maintain, increases system utilization, and delivers the other benefits users expect from virtualization generally (reliability, recovery, backup and cloning, automation, and so on). And we see that ML applications run quite well on virtual systems.

The Sizing Ecosystem

This VMware document examines sizing as a “from the bottom up” question: what do the ML applications running on vSphere need? But we also advise you to look at the question from a variety of perspectives and sources.

As NVIDIA is the dominant player in AI/ML acceleration today, we also point to the NVIDIA AI Enterprise (NVAIE) stack, which includes components and technologies to enable, deploy, manage, and scale AI workloads: vSphere Enterprise Plus, the NVAIE software suite, and multi-node reference systems with NVIDIA Ampere GPUs and NVIDIA networking.

For further information we can suggest the following sites and documents from our partners:

https://infohub.delltechnologies.com/t/design-guide-virtualizing-gpus-for-ai-with-vmware-and-nvidia-based-on-dell-infrastructure-1/

https://developer.nvidia.com/tao-early-access

https://www.nvidia.com/en-us/data-center/products/certified-systems/

Introduction

Multiple industries are looking at Artificial Intelligence (AI) and Machine Learning (ML) applications to add value and maintain competitiveness. They want to extract more knowledge, more easily, from the warehouses of data they already keep. Concrete examples include X-ray diagnosis in medicine, risk analysis in banking, image recognition in automotive, and inventory control in retail.

Hardware acceleration enables the applications to run with acceptable performance and energy consumption. The accelerators of choice today are NVIDIA GPUs using the NVIDIA compute API called CUDA. In addition, NVIDIA supplies tools, platforms, libraries, and other layers in the AI/ML stack.
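
As a quick, hedged illustration (not part of the measured workloads), the following sketch checks that the CUDA stack and a GPU are visible to a framework inside the VM or container; PyTorch is used here simply because it is one of the frameworks covered later, and the calls shown are its standard CUDA queries.

    # Minimal sketch: verify that a framework can reach the GPU through CUDA.
    # Assumes a CUDA-enabled PyTorch build; other frameworks offer similar checks.
    import torch

    if torch.cuda.is_available():
        print("CUDA device:", torch.cuda.get_device_name(0))
    else:
        print("No CUDA-capable GPU visible to this VM/container")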

This paper considers GPUs a given requirement but does not examine in detail how to choose among the GPU models; that choice depends on the runtime envelopes of particular applications. For our purposes, sizing begins after the choice of particular GPUs.

This document sizes computation, networking, and storage capacities for a few representative applications and GPUs. Readers can use these representatives as reasonable starting points while they experiment with their own applications in their own labs and empirically determine their specific needs.

The Representative Applications

The representative applications are few, but chosen to span multiple frameworks, models, data types, and both training and inference.

  • Image classification – TensorFlow benchmark (v 1.9)
    • Training
    • Resnet50
    • FP32
    • Batch_size = 1024 (CIFAR-10 dataset; 10 categories of thumbnail images); batch size refers to the number of images processed at a time; larger batch sizes tend to improve performance, but need to fit into the GPU frame buffer
    • Observations: image classification can imply large datasets, large models, large demand for acceleration
  • Natural Language Processing – Word Language RNN
    • Training
    • PyTorch
    • Observations: datasets of any size, but may be smaller than image datasets; model sizes range from small to very large (choosing one of moderate size here)
  • Tabular data – XGBoost
    • Training
    • Dataset: Scikit-learn California housing (20,640 instances x 9 features (columns))
    • Observations: the datasets and models may be smaller than image-related ML
  • Imaging – MLPerf Inference benchmark
    • Inference
    • Model: ssd-small
    • Observations: emphasis for inference may be high throughput and low cost

Where possible, the applications were run in Docker containers, reflecting the trend of running AI/ML workloads in containers as modern applications.
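
To give a concrete flavor of the tabular workload above, the following is a minimal sketch (not the exact benchmark code used for the measurements; hyperparameters are illustrative) of GPU-accelerated XGBoost training on the Scikit-learn California housing dataset:

    # Minimal sketch of GPU-accelerated XGBoost on the California housing dataset.
    # Not the exact benchmark script used for Table 1.
    import xgboost as xgb
    from sklearn.datasets import fetch_california_housing
    from sklearn.model_selection import train_test_split

    data = fetch_california_housing()          # 20,640 instances x 9 columns (8 features + target)
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, test_size=0.2, random_state=42)

    # tree_method="gpu_hist" offloads histogram building to the GPU
    # (in XGBoost 2.x the equivalent is tree_method="hist" with device="cuda")
    model = xgb.XGBRegressor(tree_method="gpu_hist", n_estimators=500)
    model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
    print("R^2 on held-out data:", model.score(X_test, y_test))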

 

Representative GPUs

We also looked at a spectrum of GPUs.

  • T4 – 16 GB
  • A100 – 40 GB
  • V100 – 16 GB
  • P100 – 16 GB

Note: the V100 and the older P100 are CUDA-capable GPUs that are still used for AI/ML workloads; however, they are not supported by the NVIDIA AI Enterprise suite.

 

Measurements

The following table shows configurations that achieved acceptable levels of performance for each representative application. This could, of course, be a much larger matrix, but it supplies a set of points you can extrapolate from for your own use cases.

Table 1: Measurements of Representative Apps

 

  • Image – TensorFlow benchmark; batch size: 1024; model: resnet56_v2; FP32; CIFAR-10 dataset
    • Model Size: Large
    • VM: AWS type p4d.24xlarge; 2-socket Intel Cascade Lake 8275CL
    • Measured Performance: 10211.58 images/sec
    • CPU Utilization: 237% (2.37 cores) / 96 cores
    • System Memory Used/Available: 10.2 GB / 1.10 TB
    • GPU: A100
    • GPU Utilization: 97%
    • GPU Memory Used/Available: 37.95 GiB (38864 MiB / 40536 MiB)
    • Network BW (from nload inside Docker): incoming avg 5.6 kbit/s, min 1.02 kbit/s, max 18.20 kbit/s; outgoing avg 34.98 kbit/s, min 6.70 kbit/s, max 158.34 kbit/s
    • Storage Configuration: NFS
    • GPU Power: 306 W (idle: 54 W) / 400 W
  • NLP – PyTorch; Word Language RNN
    • Model Size: Medium
    • VM: Dell PowerEdge C4140; 20 cores of 96 total cores on 2x 8260 processors
    • Measured Performance: 15.23 sec per epoch
    • CPU Utilization: 100% (1 core) / 20 cores
    • System Memory Used/Available: 2.6 GB / 126 GB
    • GPU: V100
    • GPU Utilization: 97%
    • GPU Memory Used/Available: 12.55 GiB (12851 MiB / 16160 MiB)
    • Network BW (from nload inside Docker): incoming avg 4.12 kbit/s, min 1.49 kbit/s, max 8.60 kbit/s; outgoing avg 38.05 kbit/s, min 5.77 kbit/s, max 380.77 kbit/s
    • Storage Configuration: NFS4
    • GPU Power: 135 W – 269 W (idle: 35 W) / 250 W
  • Tabular – XGBoost; dataset from sklearn.datasets: fetch_california_housing
    • Model Size: Medium
    • VM: legacy host; Intel Xeon Gold 6130 CPU @ 2.10 GHz
    • Measured Performance: 28 seconds per run (train and eval converge after 1926 iterations)
    • CPU Utilization: 100% (1 core) / 64 cores
    • System Memory Used/Available: 493 MB / 188 GB
    • GPU: P100
    • GPU Utilization: 22%
    • GPU Memory Used/Available: 493 MB / 188 GB
    • Network BW (from nload inside Docker): incoming avg 32.88 MB/s (including package installation at the beginning of execution), min 0.00 bit/s, max 454.19 Mbit/s, total 17.68 GB
    • Storage Configuration: local disk
    • GPU Power: 48 W (idle: 35 W) / 250 W
  • MLPerf Inference – SSD Small, object detection
    • Model Size: Small
    • VM: Dell PowerEdge R640; 20 cores of 96 on 2x 8260 processors
    • Measured Performance: 6154.7 samples per second
    • CPU Utilization: 100% (1 core) / 20 cores
    • System Memory Used/Available: 2.135 GB / 126 GB
    • GPU: T4
    • GPU Utilization: 100%
    • GPU Memory Used/Available: 2.748 GiB (2814 MiB / 15109 MiB)
    • Network BW (from nload inside Docker): incoming avg 9.63 kbit/s, min 1.49 kbit/s, max 14.78 kbit/s; outgoing avg 27.73 kbit/s, min 5.77 kbit/s, max 396.80 kbit/s
    • Storage Configuration: NFS4
    • GPU Power: 70 W (idle: 19 W) / 70 W

Recommendations, Requirements – CPUs and Memory

One of the metrics from the tests in the table above is CPU Utilization. This measurement was collected with the Linux utility top. Utilization was not high in any case, matching our expectations: the intensive portions of the algorithms were successfully off-loaded to the GPUs. In no case were more than three CPU cores utilized.

This data should be tempered, however, by two factors: 1) our set of applications is small, and 2) we have seen cases with higher CPU needs. These higher-count cases used many threads to decode or process streaming images before placing them in the GPU frame buffer; up to 20 cores have been seen to be well utilized.

The goal, in any case, is to use just enough CPU power to maximize GPU utilization.

So, while avoiding sweeping conclusions, we believe that the CPU requirements for ML are not extraordinary: core counts range from a few to around 20, and a typical enterprise-class CPU suffices. The XGBoost application ran perfectly well on a four-year-old server and required only a single CPU core.

For proper sizing and future-proofing, we recommend initially allocating six physical cores per application per GPU per VM, followed by empirical testing to see whether fewer cores suffice or whether additional cores yield more performance.

We recommend that VM system memory be larger than the GPU frame buffer memory, perhaps 150% of the frame buffer size, and that VM memory be fully reserved.
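
As a worked example of these two rules of thumb, the small helper below (purely illustrative, not a VMware tool) computes a starting point for cores and reserved memory:

    # Illustrative helper applying the rules of thumb above:
    # 6 physical cores per GPU and system memory at ~150% of total frame buffer, fully reserved.
    def vm_sizing(num_gpus: int, fb_gib_per_gpu: float,
                  cores_per_gpu: int = 6, mem_factor: float = 1.5) -> dict:
        return {
            "physical_cores": num_gpus * cores_per_gpu,
            "reserved_memory_gib": round(num_gpus * fb_gib_per_gpu * mem_factor),
        }

    # Example: a VM backed by two 40 GiB A100s -> 12 physical cores, 120 GiB reserved memory
    print(vm_sizing(num_gpus=2, fb_gib_per_gpu=40))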

You might also look at this guide for specific host configuration recommendations: https://docs.nvidia.com/ngc/ngc-deploy-on-premises/nvidia-certified-configuration-guide/index.html

Recommendations, Requirements – Networking

The applications, in combination with the datasets we chose, placed only small demands on the network. For example, the 60,000 images of the CIFAR-10 dataset reside in about six Python “pickled” files. They are small color images (32 x 32 pixels) that collectively take about 175 MB of memory. Once the six files are read from NFS, they stay in system memory, and little else crosses the network.

On the other hand, we can estimate the bandwidth required for other scenarios. We make generous assumptions to build in a large safety margin.

Table 2: Network Bandwidth Estimates

  • Streaming images
    • Assumptions: 11,000 CIFAR-sized images/sec
    • Calculation: 32 px x 32 px x 3 bytes/pixel x 8 bits/byte x 11,000 images/sec
    • Estimate: 258 Mb/sec
  • Distributed applications
    • Assumptions: requires the sharing of weights among the multiple GPU nodes
    • Calculation: ResNet-50 model with 26 million weights; 32-bit FP weights; weights shared after each batch of 4K images; 2 nodes send data to each other; 11,000 images/sec x 26M weights x 32 bits x 2 nodes / 4K batch
    • Estimate: 4.36 Gb/sec
  • Internode backend vSAN RAID image access
    • Assumptions: to be safe, the full image stream rate is used
    • Calculation: N/A
    • Estimate: 258 Mb/sec
  • Total
    • Calculation: sum of the above, rounded up to the next half Gb/sec
    • Estimate: 5 Gb/sec
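
The arithmetic behind Table 2 can be reproduced in a few lines; depending on whether decimal or binary prefixes are used, the individual figures land close to the table values, and the total still rounds up to the same 5 Gb/sec.

    # Reproducing the Table 2 estimates (assumptions as stated in the table).
    image_bits = 32 * 32 * 3 * 8          # one CIFAR-sized image: 3 bytes/pixel, 8 bits/byte
    images_per_sec = 11_000

    streaming = image_bits * images_per_sec                                 # streaming images, bits/sec

    weights, weight_bits, batch, nodes = 26_000_000, 32, 4096, 2
    distributed = images_per_sec / batch * weights * weight_bits * nodes    # weight exchange, bits/sec

    vsan_backend = streaming              # conservatively reuse the full image stream rate

    total = streaming + distributed + vsan_backend
    print(f"streaming   ~ {streaming / 2**20:.0f} Mbit/s")
    print(f"distributed ~ {distributed / 2**30:.2f} Gbit/s")
    print(f"total       ~ {total / 2**30:.2f} Gbit/s (rounds up to 5 Gb/sec)")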

 

For single-node applications, expect networking demand to scale linearly when the server hosts multiple VMs or otherwise runs multiple application instances. Scaling for distributed applications may be more complicated; we recommend running empirical experiments.

Network latency is another factor to optimize, particularly for distributed applications, where batches of work sent to the GPUs may have to synchronize with the arrival of data.

We recommend:

  • 10 Gbps network adapters as a minimum, but 25 Gbps or higher is preferred. Account for:
    • Future needs
    • Running multiple applications
    • Distributed processing
  • Planning for intelligence in the network (smart NICs or DPUs)
  • RDMA, RoCE, PVRDMA, etc. (most NFS storage boxes will only use TCP/IP, but streaming datasets or distributed applications may benefit)
  • Given the above, consider the more recent NVIDIA (Mellanox) ConnectX and BlueField adapters
  • Dedicated networking resources per VM (SR-IOV or DirectPath I/O)
  • Switched networks (no routers)
  • High MTU settings – this should be done with careful attention to manufacturer capabilities, tools, and defaults. Mellanox (now NVIDIA) RDMA may be set to 4096, for example.
  • NUMA affinity – keep network, memory, processors, and GPUs in the same NUMA domain. Refer to VMware’s AI/ML deployment guide for more information (a reference will be given at the end of this document).
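
For the NUMA affinity point above, a quick way to see which NUMA node each GPU and NIC sits on is to read the Linux sysfs entries from inside the guest or on a test host; this is only a sketch and assumes the standard sysfs PCI layout.

    # Sketch: report the NUMA node of NVIDIA GPUs (vendor 0x10de) and Mellanox NICs (0x15b3)
    # via sysfs. A value of -1 means the platform did not report a NUMA node.
    import glob
    from pathlib import Path

    for dev in sorted(glob.glob("/sys/bus/pci/devices/*")):
        p = Path(dev)
        vendor = (p / "vendor").read_text().strip()
        if vendor in ("0x10de", "0x15b3"):
            numa = (p / "numa_node").read_text().strip()
            print(f"{p.name}  vendor={vendor}  numa_node={numa}")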

Recommendations, Requirements – Storage

Aside from capacity requirements, which users can work out on their own, the other storage concerns are bandwidth, latency, reliability, and sharing. Together, these lead to our recommendation of vSAN or NAS devices rather than local disks.

ML training is typically done through repeated reads of subsets of the overall data, with randomized access order (to avoid overfitting). Write speed is not a typical concern.
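
The random-read pattern is easy to see in framework code; the shuffled PyTorch DataLoader below (a generic sketch with synthetic data, not one of the measured workloads) re-reads the same samples in a new random order every epoch.

    # Sketch: a shuffled DataLoader illustrates the randomized, repeated-read pattern
    # that training places on storage (synthetic CIFAR-shaped data for illustration).
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.randn(60_000, 3, 32, 32),
                            torch.randint(0, 10, (60_000,)))
    loader = DataLoader(dataset, batch_size=1024, shuffle=True,
                        num_workers=4, pin_memory=True)

    for epoch in range(3):
        for images, labels in loader:   # each epoch visits every sample in a new random order
            pass                        # training step would go here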

If the training subset data is not too large, we encourage device caching to improve performance. In the cases tested above, the data fit in system memory, a particularly close and fast cache. For inference, or for training with large or non-repeated data accesses, do not skimp on storage: it must be high-bandwidth and low-latency.

Another concern is sharing. Often a dataset will be used by many applications and users prefer to share a single copy rather than duplicating it for each application instance.

Below we recommend either vSAN or NAS storage. NAS should already be familiar to readers of this document. The vSAN product combines the drives on multiple ESXi hosts into multiple RAID file stores accessible by the VMs, with each file store using one of the hosts as a front end. We cannot offer a complete description here, but some factors that make vSAN a good storage option are: it can scale to petabyte capacities, front-end bandwidth can reach 20 Gbps, it supports file sharing, and it supports NFS.

We recommend either:

  • vSAN
    • ESXi hosts
      • Minimum of three; we recommend four or more
      • Minimum two disk groups per ESXi host, up to four
    • Storage devices recommendation
      (applies to stores for vSAN file service as well)
      • Cache Disks: SAS solid-state disk (SSD) or PCIe flash device.
      • Capacity Disks: SAS solid-state disk (SSD), or PCIe flash device.
      • Storage Controller: One SAS or SATA host bus adapter (HBA), or a RAID controller that is in passthrough mode or RAID 0 mode.
    • Networking recommendation
      • Each host must have 10 Gb of network bandwidth dedicated to vSAN; we recommend 25 Gb or more.
      • All ESXi hosts in the vSAN cluster must be connected to a vSAN Layer 2 or Layer 3 network
      • Maximum of 1 ms latency between all ESXi hosts in the cluster
  • NAS devices (especially when greater capacity is required)
    • Consider local SSD with NFS caching

Recommendations, Requirements – GPUs

All of the GPUs tested can provide CUDA acceleration to the applications, albeit at differing performance, power, and price points. The decision as to which GPU model is best for you is beyond the scope of this document, but some typical tipping points we see with customers are:

  • Lower cost GPUs such as the T4 – where a high number of GPUs are required (at the edge, for example), or for inference, or where result latency is not critical, or for education.
  • Middle Tier GPUs – GPU pools in data centers, training, development, latency important but not critical, satisfied to increase throughput by scaling out. The A40 is an example.
  • Latest and greatest GPUs – low-latency critical, high throughput in a small footprint. The A100, and A30, for example.

If you are purchasing new GPUs, rather than re-purposing existing models, we recommend the GPUs qualified for NVIDIA AI Enterprise: A100, A30, and T4.

Once the GPU decision (or decisions—there can be multiple use cases) has been made, then the supporting infrastructure can be selected: the number of hosts, the networking and storage, etc.

One aspect of GPUs that VI admins need to know is that GPU RAM (also known as the frame buffer) cannot be overcommitted. If an application requires 16 GB of frame buffer, for example, that 16 GB must physically exist on the GPU (or slice of a GPU) allocated to the VM.
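
To check how much frame buffer an application actually consumes before settling on a GPU or vGPU profile, nvidia-smi is the usual tool; the equivalent query through the NVML Python bindings looks roughly like the sketch below (assumes the nvidia-ml-py / pynvml package is installed).

    # Sketch: query frame-buffer usage through NVML (pip package nvidia-ml-py, imported as pynvml).
    import pynvml

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"frame buffer: {mem.used / 2**30:.2f} GiB used of {mem.total / 2**30:.2f} GiB")
    finally:
        pynvml.nvmlShutdown()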

Recommendations, Requirements – VMs

This section may repeat recommendations given in other sections, but it gathers VM configuration advice in one place. The recommendations for VMs form a short list of basic settings: size, NUMA, and memory, though in some cases you may take a more advanced approach to tune for networking.

We recommend:

  • Allocate only physical cores to the VMs on the ESXi hosts. The total core count across the VMs should not exceed the available physical cores; do not allocate a count that requires logical cores or hyperthreading. The “extra” logical cores can still be used by ESXi helper threads, which is fine.
  • Reserve memory. This is required for use with NVIDIA GPUs. You will have to conduct experiments with your particular applications and GPUs to ensure you have allocated sufficient memory for your use cases.
  • If you have multiple VMs on your host, adhere strictly to NUMA boundaries.
  • If you are using the network heavily, you may wish to tune for bandwidth and/or latency.
  • Because it is needed for many GPU use cases (e.g., distributed apps, vGPU), enable peer-to-peer and ATS with the advanced configuration parameters. Refer to VMware’s AI/ML deployment guide for more information (a reference is given at the end of this document).

Cluster Recommendations

We recommend that clusters of AI/ML GPU systems not be used for multiple purposes. The use cases will grow, and it is simpler to scale up resources along one dimension (AI/ML requirements) than to account for changing ratios of other, classical applications. In other words, we do not recommend running non-ML workloads on the same ESXi hosts.

Implementing the Recommendations

This document does not supply implementation details for most of its recommendations. We suggest our AI deployment guide as a good place to start: https://core.vmware.com/resource/deploy-ai-ready-enterprise-platform-vsphere-7-update-2.

Future Concerns for this Document

Over time we may wish to expand the list of representative applications. We may wish to update statistics as new GPUs and servers replace old ones. And we may calibrate more broadly to match trends in AI and ML.

Future revisions may also examine distributed applications in greater depth.
