May 08, 2024

Configuration Guidance for VMware Private AI with Intel

Authors: Patryk Wolsza (Intel), Dave Morera (VMware)

 

At VMware Explore 2023, VMware announced a collaboration with Intel to help customers accelerate the adoption of artificial intelligence (AI) and enable private AI everywhere – across data centers, public clouds, and edge environments. VMware Private AI with Intel combines Intel's AI software suite, Intel® Xeon® processors with built-in AI acceleration, and VMware Cloud Foundation to enable customers to build and deploy private AI models on the infrastructure they already have, thereby reducing total cost of ownership and addressing environmental sustainability concerns. VMware and Intel now make it possible to fine-tune smaller, economical, state-of-the-art models that are easier to update and maintain on shared virtual systems, which can then be returned to the IT resource pool when the batch AI jobs are complete. Use cases such as AI-assisted code generation, experiential customer service centers, natural language processing, and classical machine learning / statistical analytics can now be co-located on the same general-purpose servers running other applications.

Intel and VMware have jointly conducted thorough performance testing of AI workloads (Llama-2-Chat) on VMware Cloud Foundation (VCF) and 4th Gen Intel Xeon processors with Intel AMX (Advanced Matrix Extensions). It is important to highlight that this AI workload testing was conducted solely on Intel CPUs, without leveraging any additional hardware such as GPUs. This document is intended to guide those looking to adopt this technology on existing or new infrastructure, based on a series of tests of a real-world use case: inferencing with a GenAI chatbot.

Intel AMX

Organizations can optimize AI workloads with 4th and 5th Generation Intel Xeon processors featuring Intel Advanced Matrix Extensions (Intel AMX). Intel AMX is an on-chip AI accelerator built into the CPU that handles AI use cases such as inference and deep learning (DL), delivering performance for AI workloads in a cost-effective way on VMware Cloud Foundation.
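As a quick sanity check before deploying AI workloads, administrators can verify that a guest VM actually exposes the AMX instruction set. The snippet below is a minimal sketch, assuming a Linux guest such as the Ubuntu 22.04 VMs used in these tests; the function name and messages are illustrative, and the check simply looks for the amx_tile, amx_bf16, and amx_int8 flags in /proc/cpuinfo. (The test VMs described in the configuration notes at the end of this document used virtual hardware version 21.)

```python
# Minimal sketch: confirm Intel AMX CPU flags are visible inside a Linux guest VM.
# Illustrative helper only; not part of the tested software stack.
from pathlib import Path


def visible_amx_flags() -> set:
    """Return the AMX-related CPU flags exposed to this VM (amx_tile, amx_bf16, amx_int8)."""
    cpuinfo = Path("/proc/cpuinfo").read_text()
    flags_line = next((line for line in cpuinfo.splitlines() if line.startswith("flags")), "")
    return {flag for flag in flags_line.split() if flag.startswith("amx")}


if __name__ == "__main__":
    flags = visible_amx_flags()
    if flags:
        print("Intel AMX available:", ", ".join(sorted(flags)))
    else:
        print("Intel AMX not exposed to this VM; verify the host CPU generation and the VM's virtual hardware version.")
```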

 

Software Considerations

VMware Cloud Foundation (VCF)

VCF is the leader in the Gartner Magic Quadrant for Distributed Hybrid Infrastructure. It is a multi-cloud platform that enables organizations to modernize data centers and deploy modern apps across a spectrum of use cases, including cloud infrastructure, IT automation, hybrid cloud, end-to-end security, and virtual desktops.

vSphere, a core component of VCF, is trusted and deployed by more than 300K customers globally. It is everywhere data is created, processed, or consumed, and it is the fastest means to launch and deploy AI models on Deep Learning VMs. VMware Cloud Foundation includes features such as the Distributed Resource Scheduler (DRS), which improves workload management by grouping VMware ESXi hosts into resource clusters to segregate the computing needs of different applications. DRS helps efficiently use spare compute capacity for AI model training and inferencing, leveraging the same clusters other AI applications use, thereby maximizing capacity and reducing TCO. Additionally, VCF components such as the Aria Suite and vSAN ESA deliver enhanced performance while contributing scalability, resilience, and simplicity to the infrastructure.

Testing Overview

Running AI workloads on Intel Xeon CPUs with AMX opens the door to new capabilities and solutions for existing VCF and Intel customers who already have this key infrastructure in place in their organizations.

The objective of this exercise is to evaluate the processing efficiency of Llama-2-Chat when executed on Intel Xeon processors. Scalability is examined across various processor configurations and algorithmic approaches for generating output, as well as responsiveness and latency under varying workload conditions. Input data sizes and types are analyzed for their impact on performance. Finally, generating performance profiles aids in comparing Llama 2 with other LLMs and hardware platforms for informed deployment decisions.

Meta released the Llama 2 family of large language models (LLMs), ranging from 7 to 70 billion parameters. The fine-tuned LLM, Llama-2-Chat, excels in dialogue applications, outperforming open-source chat models on various benchmarks and matching the performance of popular closed-source models like ChatGPT and PaLM in human evaluations for helpfulness and safety.

Tests were conducted on the VMware Cloud Foundation virtualization platform using Llama-2-Chat models with 7 billion and 13 billion parameters. Both model sizes were tested with the same variables: batch size and input token size. Both greedy search and beam search decoding algorithms, common in natural language processing (NLP) tasks, were evaluated (a code sketch after the list below illustrates how these variables map onto a generation call).

  • Batch Size: Batch size refers to the number of concurrent users utilizing the chatbot. For all tests, batch sizes of 1, 2, 4, and 8 concurrent users were used. Additionally, for tests using the greedy search algorithm, batch sizes of 16 and 32 concurrent users were also tested.
  • Input Token: Input token size refers to the number of tokens that the user places in the chatbot window as input text. Tokens are the basic units of data processed by LLMs. In the context of text, a token can be a word, part of a word, or even a character, depending on the tokenization process. For this exercise, input token sizes of 32, 256, 1024, and 2048 were used. An input token size of 32 simulates a basic question delivered in a single sentence, while an input token size of 2048 is equivalent to a more complex question or a few short questions.
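The following is a minimal sketch, not the actual test harness, showing how these variables map onto a Hugging Face Transformers generation call: the length of the prompt list plays the role of batch size, the padded prompt length is the input token size, and the num_beams argument switches between greedy search and the four-beam search ("beam search4") listed in the test configurations. The checkpoint name, prompt text, and sizes shown are illustrative assumptions.

```python
# Illustrative sketch of the test variables mapped to a Transformers generation call.
# Checkpoint, prompts, and sizes are assumptions; bfloat16 lets 4th Gen Xeon CPUs
# execute the matrix math with AMX.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"         # assumed checkpoint (gated; requires access)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token          # Llama 2 defines no pad token by default
tokenizer.padding_side = "left"                    # recommended for decoder-only generation

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

prompts = ["Summarize the benefits of CPU-only LLM inference."] * 4   # batch size = 4 concurrent users
inputs = tokenizer(prompts, return_tensors="pt", padding="max_length",
                   truncation=True, max_length=256)                   # input token size = 256

with torch.inference_mode():
    # Greedy search: a single candidate sequence, lowest cost per decoding step.
    greedy_out = model.generate(**inputs, max_new_tokens=256, do_sample=False, num_beams=1)
    # Beam search with 4 beams, matching the "beam search4" setting in the test configurations.
    beam_out = model.generate(**inputs, max_new_tokens=256, do_sample=False, num_beams=4)

print(tokenizer.batch_decode(greedy_out, skip_special_tokens=True)[0])
```

The tested software stack also included Intel Extension for PyTorch (IPEX) 2.2; its model optimization step is omitted from this sketch for brevity.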

 

Environment Configuration

Hardware Configuration

The three 4th Gen Intel Xeon server configurations used for testing (Intel Xeon Gold 6448H, Intel Xeon Platinum 8462Y, and Intel Xeon Platinum 8490H) are detailed in the per-CPU configuration notes at the end of this document.

 

Software Configuration

All tests ran VMware vSphere 8.0 U2 (build 22380479) as part of VMware Cloud Foundation, Ubuntu 22.04 LTS guest VMs (vHW 21, vmxnet3), PyTorch 2.2 with Intel Extension for PyTorch (IPEX) 2.2, and the Llama-2 7B and 13B Hugging Face models; full details are listed in the configuration notes at the end of this document.

Test results were measured in tokens per second, while aiming for latency under 100 ms, per industry standards.
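The exact measurement harness is not reproduced here, but the hedged sketch below shows one common way such metrics can be derived from a timed generation call: total generated tokens divided by wall-clock time gives throughput in tokens per second, and wall-clock time divided by the number of decoding steps approximates the per-token latency each user observes. The helper function and the dummy workload are illustrative placeholders.

```python
# Illustrative sketch: derive tokens/s and per-token latency from a timed generation call.
# In practice generate_fn would wrap a real call such as model.generate(...);
# time.sleep stands in for it here so the example runs standalone.
import time


def report_metrics(generate_fn, batch_size: int, max_new_tokens: int) -> None:
    """Time one batched generation call and print throughput and per-token latency."""
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start

    total_tokens = batch_size * max_new_tokens        # tokens produced across all concurrent users
    print(f"throughput: {total_tokens / elapsed:7.1f} tokens/s")
    print(f"latency:    {1000 * elapsed / max_new_tokens:7.1f} ms per generated token")


# Placeholder workload: 4 concurrent users, 256 output tokens, roughly 20 s of generation.
report_metrics(lambda: time.sleep(20.0), batch_size=4, max_new_tokens=256)
```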

 

Test Results

Testing was based on six key use cases. Three of the use cases were based on the Llama-2 7B model with input token sizes of 256, 1024, and 2048. The same input sizes were tested for the Llama-2 13B use cases.

The findings clearly show distinct performance profiles depending on the Intel CPU used.

Llama-2 7B Use Case

For the Llama-2 7B use case, the Intel Gold 6448H CPU, as well as both Intel Platinum CPUs, easily handled up to sixteen concurrent users with an input size of 256. With an input size of 1024, all three CPUs sustained up to eight concurrent users without breaking the established 100 ms latency SLA. For an input size of 2048, a clear difference in performance emerged: the Intel Gold 6448H supported only two concurrent users, the Intel Platinum 8462Y performed better and supported up to four concurrent users, and the Intel Platinum 8490H performed best, supporting up to eight concurrent users within the established SLA.

Figure: Llama-2 7B test results

 

Llama-2 13B Use Case

For the Llama-2 13B use case, the Intel Gold 6448H CPU was sufficient to handle up to four concurrent users with an input size of 256, two users with an input size of 1024, but only one user at an input size of 2048. The Intel Platinum 8462Y CPU handled up to eight users with an input size of 256, up to four users with an input size of 1024, but only two users at an input size of 2048. The Intel Platinum 8490H CPU showed increased performance capabilities, supporting up to eight users for both the 256 and 1024 input sizes while still supporting up to four users at an input size of 2048 in this 13-billion-parameter AI use case.

 

Figure: Llama-2 13B test results

 

High Density Use Case

Customers who require more concurrent users may want to consider a greedy search configuration. Greedy search optimizes response time by selecting the single most probable token at each decoding step, making it computationally more efficient, without regard to whether the overall sequence is optimal. Beam search, in contrast, considers multiple alternative sequences at each step, aiming for higher-quality output. While more computationally expensive, it typically provides better results.
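As a toy illustration of that trade-off (not the decoding implementation used in these tests), the snippet below contrasts one decoding step under each strategy over an invented next-token distribution: greedy search commits to the single most probable token, while beam search keeps several scored hypotheses so that a later step can still recover a better overall sequence.

```python
# Toy example: one decoding step, greedy search vs. beam search (beam width 2).
# The vocabulary and probabilities below are invented for illustration only.
import math

next_token_probs = {"cloud": 0.40, "cluster": 0.35, "GPU": 0.15, "edge": 0.10}

# Greedy search: commit to the single most probable token and move on (cheapest per step).
greedy_pick = max(next_token_probs, key=next_token_probs.get)

# Beam search: keep the best `beam_width` hypotheses, scored by cumulative log-probability,
# so alternatives that look slightly worse now can still win later.
beam_width = 2
beams = sorted(next_token_probs.items(), key=lambda kv: kv[1], reverse=True)[:beam_width]
beams = [(token, math.log(prob)) for token, prob in beams]

print("greedy keeps:", greedy_pick)   # 'cloud'
print("beam keeps:  ", beams)         # [('cloud', -0.92...), ('cluster', -1.05...)]
```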

Both the greedy search and beam search algorithms were tested for the Llama-2 7B model with an input size of 1024 on the Intel Platinum 8490H CPU. With beam search, the Intel Platinum 8490H CPU with AMX was able to handle eight concurrent users without exceeding the 100 ms latency SLA.

Using the same CPU, Llama-2 model, and input size, our tests confirmed that switching to the greedy search algorithm achieved a 4x improvement in user density on the Intel Platinum 8490H CPU for this 7-billion-parameter AI use case, all while adhering to the service level agreement (SLA).

 

Figure: Beam search vs. greedy search test results

 

Considerations

  • Some AI workloads are well suited to CPU architectures
  • Widely available Intel Xeon Scalable CPUs deliver flexible, cost-effective solutions for running AI workloads
  • VMware Cloud Foundation provides an architecture abstraction that simplifies running AI workloads on Intel Xeon CPUs with AMX
  • Proper workload sizing should be considered when determining the appropriate processor type

Intel Xeon Scalable CPUs with AMX on VMware Cloud Foundation deliver a complete set of software and hardware capable of running AI workloads across the infrastructure without specialized hardware accelerators. This solution strategically expands access to the benefits of AI by economically leveraging existing infrastructure, capitalizing on the robust capabilities of Intel Xeon processors with AMX technology. Additionally, the seamless integration with VMware Cloud Foundation environments not only amplifies affordability and accessibility but also streamlines operations, driving down total cost of ownership (TCO) and accelerating time to value (TTV).

During the planning phase of AI workload deployments, it is crucial to assess the hardware and software stack needs comprehensively. The test results offer valuable insights into the capacity of CPUs to manage AI workloads independently. VMware Cloud Foundation, a leading private cloud solution, in conjunction with Intel AI Tool Selector, efficiently handles AI workloads, ensuring seamless integration and optimized performance.   

Overall, it can be inferred that the Intel Xeon Gold 6448H CPU provides a cost-effective way to start with AI workloads that have lower performance requirements. For medium-size AI workloads, the Intel Xeon Platinum 8462Y provides a balanced performance level, while the Intel Xeon Platinum 8490H delivers better performance for the most demanding AI workloads. All of this runs on top of VMware Cloud Foundation as a key component of the solution.

 

Contributors: 

Chris Gully (VMware), Earl Ruby (VMware), Luke Huckaba (VMware), Susan Yeager (VMware), Amit Bodas (Intel)

 

Find Out More

For more information on accelerating AI/ML workloads on vSphere 8 and Intel Xeon CPUs, please read:

Intel and VMware Press release for AI/ML Solutions -  https://news.vmware.com/releases/vmware-explore-2023-barcelona-intel-private-ai

Intel AI Software suite - https://www.intel.com/content/www/us/en/developer/topic-technology/artificial-intelligence/overview.html

AI without GPUs: Accessing Sapphire Rapids AMX instructions on vSphere - https://core.vmware.com/blog/ai-without-gpus-accessing-sapphire-rapids-amx-instructions-vsphere

How to create a Local TKR Content Library - https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-with-tanzu-tkg/GUID-19E8E034-5256-4EFC-BEBF-D4F17A8ED021.html

Accelerate AI Workloads on VMware vSphere / vSAN Using 4th Gen Intel® Xeon® Scalable Processors with Intel® AMX — Solution Design Brief - https://www.intel.com/content/www/us/en/content-details/780611/accelerate-ai-workloads-on-vmware-vsphere-vsan-using-4th-gen-intel-xeon-scalable-processors-with-intel-amx-solution-design-brief.html

 

4th Gen Intel Xeon Gold 6448H Configuration: 4-node cluster, tests conducted on 1 node, 2x Intel(R) Xeon(R) Gold 6448H, 32 cores, HT On, Turbo On, NUMA 2, Integrated Accelerators Available [used]: DLB 4 [0], DSA 2 [0], IAA 2 [0], QAT 4 [0], Total Memory 1024GB (16x64GB DDR5 4800 MT/s [4800 MT/s]), BIOS ESE114R-2.14, microcode 0x2b0001b0, 2x MT2894 Family [ConnectX-6 Lx], 1x Ethernet interface, 2x MT2892 Family [ConnectX-6 Dx], 1x 2G Virtual Media, 8x 5.8T Micron_7450_MTFDKCC6T4TFS, 1x 894.2G M.2 NVMe 2-Bay RAID Kit, VMware vSphere 8.0U2, build 22380479, Ubuntu 22.04.3 LTS VM (vHW=21, vmxnet3), Kernel 5.15, LLM Llama2-HF 7B,13B, IPEX2.2, Pytorch 2.2, Batch size=1,2,4,8, input token size=256,1024, 2048, output token size=256, algorithm=beam search4, VM=60vCPU(reservation)+400GB RAM(reservation), Latency sensitivity mode:high, multi socket scenario (30 cores per AI instance), Test by Intel as of 03/18/24.

4th Gen Intel Xeon Platinum 8462Y Configuration: 4-node cluster, tests conducted on 1 node, 2x Intel(R) Xeon(R) Platinum 8462Y+, 32 cores, HT On, Turbo On, NUMA 2, Integrated Accelerators Available [used]: DLB 2 [0], DSA 2 [0], IAA 2 [0], QAT 2 [0], Total Memory 512GB (16x32GB DDR5 4800 MT/s [4800 MT/s]), BIOS 2.1.5, microcode 0x2b000571, 2x BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller, 2x MT2892 Family [ConnectX-6 Dx], 1x Ethernet interface, 1x 0B Virtual Floppy, 1x 2G Virtual CD/DVD, 1x 447.1G M.2 Device, 6x 2.9T NVMe P5620 MU 3.2TB, VMware vSphere 8.0U2, build 22380479, Ubuntu 22.04.4 LTS VM (vHW=21, vmxnet3), LLM Llama2-HF 7B,13B, IPEX2.2, Pytorch 2.2, Batch size=1,2,4,8, input token size=256,1024, 2048, output token size=256, algorithm=beam search4, VM=60vCPU(reservation)+400GB RAM(reservation), Latency sensitivity mode:high, multi socket scenario (30 cores per AI instance), Test by VMware as of 03/25/24.

4th Gen Intel Xeon Platinum 8490H Configuration: 4-node cluster, tests conducted on 1 node, 2x Intel(R) Xeon(R) Platinum 8490H, 60 cores, HT On, Turbo On, NUMA 2, SNC=2, Integrated Accelerators Available [used]: DLB 8 [0], DSA 8 [0], IAA 8 [0], QAT 8 [0], Total Memory 512GB (16x32GB DDR5 4800 MT/s [4800 MT/s]), BIOS 05.01.00, microcode 0x2b000461, 2x Ethernet Controller E810-C for QSFP, 1x 894.3G INTEL SSDSC2KG960G8, 8x 3.5T INTEL SSDPF2KX038TZ, VMware vSphere 8.0U2, build 22380479, Ubuntu 22.04.3 LTS VM (vHW=21, vmxnet3), Kernel 5.15, LLM Llama2-HF 7B,13B, IPEX2.2, Pytorch 2.2, Batch size=1,2,4,8,16,32,64,128, input token size=256,1024, 2048, output token size=256, algorithm=beam search4 and greedy, VM=116vCPU(reservation)+400GB RAM(reservation), Latency sensitivity mode:high, multi socket scenario (19 cores per AI instance), Test by Intel as of 03/18/24.

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more on the Performance Index site.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates.  See backup for configuration details.  No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation.  Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.  Other names and brands may be claimed as the property of others. 
