VMware Private AI Starter Pack for Retrieval-Augmented Generation (RAG) Systems

Introduction


Retrieval-augmented generation (RAG) systems combine the capabilities of large language models and information retrieval systems to improve the quality of generated responses. In a RAG system, the process generally involves two main steps:

1.     Retrieval: Given a query, the system retrieves relevant documents or passages from a pre-existing knowledge base or corpus. This retrieval step is usually performed using traditional information retrieval techniques or learned embeddings.

2.    Generation: The retrieved documents are then used to augment the input to a generative model, like Llama 2. The model uses this additional context to generate more informed, relevant, and accurate responses.

By combining retrieval and generation, RAG systems aim to provide answers that are not only contextually coherent but also factually accurate, even when the information is not explicitly present in the model's pre-training data.
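At a high level, the retrieve-then-generate pattern can be expressed in a few lines of code. The sketch below is only an illustration of the pattern that the notebooks implement with LangChain components; the retriever and llm objects are assumed to already exist, and the method and attribute names follow LangChain conventions.

# Illustrative sketch of the RAG pattern (not Starter Pack code).
# "retriever" and "llm" are assumed to be pre-built objects.
def answer_with_rag(query, retriever, llm):
    # 1. Retrieval: fetch passages semantically related to the query.
    context_docs = retriever.get_relevant_documents(query)
    context = "\n\n".join(doc.page_content for doc in context_docs)

    # 2. Generation: augment the prompt with the retrieved context so the
    #    LLM can ground its answer in that material.
    prompt = (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )
    return llm(prompt)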

This document provides an overview of the Starter Pack code repository, which includes three Jupyter Notebook examples that gradually introduce the elements required to build a RAG system.

To get started, we create a basic conversational application (chatbot) built on the Llama-2-13b-chat LLM running on a vLLM server instance. Next, we use Gradio to give the chatbot conversation memory and to provide users with a web UI offering a standard chat interface. Finally, we introduce the LangChain tool set to build a complete RAG chain.

Here is an overview of the open-source technologies used to implement the examples:

  • Llama 2 13B Chat. Llama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. We use the 13B fine-tuned model, which is optimized for dialogue use cases and converted to the Hugging Face Transformers format.
  • vLLM is a fast and easy-to-use library for LLM inference and serving. In this case, we use the VLLMOpenAI class from LangChain to interact with the vLLM service through its OpenAI-compatible API (a minimal query sketch follows this list).
  • Gradio provides a quick way to showcase machine learning models through an accessible web interface.
  • LangChain is a framework for fast prototyping of applications powered by language models. LangChain links and coordinates actions among LLMs, data retrievers, and memory systems to expand the knowledge domain of LLMs beyond the datasets used to pre-train and fine-tune them.
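
Because vLLM exposes an OpenAI-compatible REST API, the served model can be queried with any HTTP client before it is wired into LangChain. The following is a minimal sketch, assuming the vLLM server is already running locally at http://localhost:8000 with Llama-2-13b-chat loaded; the URL, prompt text, and generation parameters are illustrative.

import requests

# Assumed local endpoint of the vLLM OpenAI-compatible server; adjust as needed.
base_url = "http://localhost:8000/v1"

response = requests.post(
    f"{base_url}/completions",
    json={
        "model": "meta-llama/Llama-2-13b-chat-hf",
        "prompt": "<s>[INST] What is retrieval-augmented generation? [/INST]",
        "max_tokens": 200,
        "temperature": 0.1,
    },
)
print(response.json()["choices"][0]["text"])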

Running the Starter Pack for RAG

Hardware Resources

Table 1 describes the hardware configuration of the VM used to run the Llama-2-13b-chat LLM on vLLM and to run the notebooks that form the Starter Pack for RAG.

Table 1. VM Setup to Run the Starter Pack for RAG

Component | Description
vCPU | 32 vCPUs (32 logical processors) provided by 2 x Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
RAM | 256 GB (around 192 GB active)
GPU | 1 x NVIDIA A100 40GB PCIe
Network | 2 x Intel(R) Ethernet Controller X710 for 10GbE SFP+ (management), 1 x NVIDIA ConnectX-6 Dx (workload)
Storage | 2 x disk groups; each disk group has 1 x cache disk (800GB write-intensive SAS SSD) and 6 x capacity disks (960GB Class E SAS SSD)

Software Configuration

Installation of NVIDIA Drivers and CUDA

All our tests were run on Ubuntu 22.04 LTS or 20.04 LTS. We provide options for setting up an Ubuntu VM with the NVIDIA drivers and CUDA installed:

  • Option 1. Deploy a service VM and install NVIDIA GRID licensing for vGPU.

     Visit the web page that contains the Creating VM Service for Single Node Learning section and follow all the instructions.

Installation of (mini) Conda and Virtual Environment Creation

Now follow the Miniconda Installation Steps section to get Miniconda installed.

After that, follow the Python Virtual Environment Setup procedure to create and activate a Python virtual environment containing all the libraries required to run the Starter Pack notebooks and the vLLM server.

Starter Pack for RAG Notebooks

Inside the Chatbot Directory

e1) Implementing a chatbot with Llama-2-13b-chat served by vLLM, through the Gradio UI

The e1-Llama2-chatbot-on-vLLM-with-Gradio.ipynb notebook implements a chatbot using Llama-2-13b-chat served by vLLM. The Gradio UI handles the user's interactions with the LLM from a web browser. Throughout the notebook, we discuss how to properly format prompts for Llama 2 and how Gradio's memory mechanisms keep track of the conversation between the chatbot and the user. Finally, we configure the Gradio UI elements that let users control the LLM generation parameters and submit prompts for completion. Figure 1 shows a dialog between a user and the chatbot via the Gradio web UI.


Figure 1. Gradio Chatbot Interface
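
To give a sense of the overall structure of example 1, here is a minimal sketch of a Gradio chat UI wrapping a text-generation callback. It is an illustration rather than the notebook's code: the notebook adds proper Llama 2 prompt formatting and UI controls for the generation parameters, and the two helper functions used below are hypothetical placeholders.

import gradio as gr

def respond(message, history):
    # Hypothetical callback: build a Llama 2 prompt from the conversation so far
    # (history) plus the new user message, then call the vLLM service.
    prompt = build_llama2_prompt(history, message)   # assumed helper
    return query_vllm_service(prompt)                # assumed helper

# gr.ChatInterface keeps track of the conversation and renders a chatbot widget.
gr.ChatInterface(respond).launch()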

e2) Implementing a chatbot using Llama-2-13b-chat, vLLM, LangChain, and the Gradio UI

The e2-Llama2-chatbot-vLLM-Langchain-Gradio.ipynb notebook introduces the fundamental LangChain constructs required to implement a conversation chain (LLMChain) with memory (via the ConversationBufferMemory class), along with the different prompt template types that are combined to prompt Llama 2 in proper and diverse ways. The following are some highlights of the code.

Here is a view of the initialization of the client session to the vLLM instance serving the LLM.

from langchain.llms import VLLMOpenAI

def setup_chat_llm(max_tokens=500, temperature=0.1, top_p=.9):
    """Initializes the llm chat object that allows language chains to get
    access to the Llama 2 LLM service.
    :param max_tokens: Max num. of tokens to generate
    :param temperature: Determines how creative the model should be.
    :param top_p: Cumulative probability threshold for selecting the next word
    :return: the llm service callable object"""
    # INFERENCE_SRV_URL (the vLLM server's OpenAI-compatible base URL) is
    # defined elsewhere in the notebook.
    llm = VLLMOpenAI(
        openai_api_key = "EMPTY",
        openai_api_base = INFERENCE_SRV_URL,
        model_name = "meta-llama/Llama-2-13b-chat-hf",
        max_tokens = max_tokens,
        temperature = temperature,
        top_p = top_p,
    )
    return llm
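
Once initialized, the returned object behaves like any other LangChain LLM and can be called directly with a prompt string (the prompt should follow the Llama 2 chat format discussed next). The call below is illustrative:

# Illustrative call to the vLLM-served model through the LangChain wrapper.
llm = setup_chat_llm()
print(llm("<s>[INST] Briefly explain what vLLM is. [/INST]"))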

Here is a view of the prompt templates required to interact with Llama-2-chat via LangChain.

from langchain.prompts import (ChatPromptTemplate, SystemMessagePromptTemplate,
                               HumanMessagePromptTemplate, MessagesPlaceholder)

# System prompt template
sys_prompt_template = (
"""<s>[INST] <<SYS>>
You are a helpful, respectful, and honest assistant.
Always provide helpful, truthful, and safe answers. Safety must be your highest priority.
Your answers must not contain harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Please ensure that your responses are socially unbiased and positive in tone and nature.
If an instruction doesn't make sense or is not coherent, inform the user instead of trying to answer.
Don't provide false information if you don't know the answer to a question.
<</SYS>>
""")
# Initialize the system (partial) prompt object for LangChain
system_message_prompt = SystemMessagePromptTemplate.from_template(sys_prompt_template)
# Human (user) prompt section template definition
human_template = "{delimiter}{user_message} [/INST]"
# LangChain's human (partial) prompt message object initialization.
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

# >> Prompt template initialization. Notice the placeholder for the "chat_history" key
# that LangChain will use to give the LLM context about its dialog with the user.
prompt = ChatPromptTemplate(
    messages=[
        system_message_prompt,
        MessagesPlaceholder(variable_name="chat_history"),
        human_message_prompt,
    ]
)
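
To inspect the fully rendered prompt that will be sent to Llama 2, the template can be formatted directly. The sample values below are illustrative; chat_history starts as an empty list, and delimiter fills the {delimiter} placeholder in the human template.

# Illustrative rendering of the combined prompt; "chat_history" is empty at the
# start of a conversation and "delimiter" fills the {delimiter} placeholder.
messages = prompt.format_messages(
    chat_history=[],
    delimiter="",
    user_message="What is VMware Private AI?",
)
for message in messages:
    print(message.content)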

Finally, here are the main LangChain components that generate responses using the LLM’s pre-trained data and the conversation context (memory):

from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory

# >> Initialize LangChain's ConversationBufferMemory used for conversation bookkeeping.
# The "memory" and "input" keys should match keys from the prompt.
memory = ConversationBufferMemory(memory_key="chat_history",
                                  input_key="user_message",
                                  return_messages=True)
# Initialize the LLM service object
llm = setup_chat_llm()

# >>> Create an LLMChain out of the llm, memory, and prompt objects previously created.
# This chain is designed to run queries against LLMs.
conversation = LLMChain(
    llm=llm,
    prompt=prompt,
    verbose=False,
    memory=memory,
)
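
With the chain assembled, each user turn can be submitted through it; the memory object injects the accumulated chat_history into the prompt automatically. An illustrative turn (the input keys must match the prompt's variables):

# Illustrative conversation turn; "chat_history" is supplied by the memory object.
reply = conversation.predict(user_message="Who developed Llama 2?", delimiter="")
print(reply)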

Under the chatbot-with-RAG Directory

e3) Implementing a chatbot with Retrieval-Augmented Generation (RAG) using Llama-2-13b-chat, vLLM, LangChain, and the Gradio UI

The e3-RAG-Llama2-chatbot-vLLM-Langchain-Gradio.ipynb notebook introduces the components required to build a RAG system; these are:

  • Document loaders. In this case, we load three PDF documents containing NY Times articles about Hurricane Otis, which battered the Pacific coast of Mexico in late October 2023. This data is several months more recent than the datasets used to pre-train Llama 2 (released in July 2023), so we use it to expand the LLM's knowledge and answer questions about fresh data.
from langchain.document_loaders import PyPDFDirectoryLoader

# >>> Load all PDF documents containing the NY Times articles
# about Hurricane Otis from the "files" directory.
loader = PyPDFDirectoryLoader(path="./files")
  • Embedding encoders. These encode the loaded documents into numeric vectors for semantic similarity search.
from langchain.embeddings import SentenceTransformerEmbeddings

# >>> Set up the embedding encoder (Sentence Transformers).
# Notice that CPUs can be used to encode text from knowledge bases.
model_name = "all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embeddings = SentenceTransformerEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
  • Splitters divide documents into multiple chunks that get loaded into a vector database.
# Split the documents into chunks.
doc_splits = loader.load_and_split()
  • Retrievers are vector databases (Chroma in this case) containing the embeddings produced by the document encoding process. When a user enters a query, the retriever pulls the document chunks that are semantically related to the question.
from langchain.vectorstores import Chroma

# >>> Encode the document splits using the embeddings encoder.
# The encoded splits get stored in a Chroma vector database,
# which serves as a retriever of text chunks related to a user's query.
retriever = Chroma.from_documents(documents=doc_splits, embedding=embeddings).as_retriever()
  • The LLM service will use the context provided by the retriever to answer users' questions. This is the same code as in example 2 (already displayed).
  • A custom prompt template for Llama-2-chat to interact with RetrievalQA chains:
# >>> Define a Llama 2 RAG prompt that instructs the LLM to generate answers to users' queries
# using the context provided by the retriever and the chat history (kept in the chain's memory).
prompt_template = (
"""[INST]<<SYS>> 
You are an assistant for question-answering tasks. 
If you don't know the answer, just say that you don't know. Keep the answer concise.
Use the following context delimited by <CTX></CTX>, and the chat history delimited by <HS></HS> to answer the question.<</SYS>>  

<CTX>
{context} 
</CTX>

<HS>
{history}
</HS>

Question: {question}

Answer: [/INST]""")

from langchain.prompts import PromptTemplate

# >>> Create the prompt template object, including the explicit declaration of
# input variables that get dynamically filled in as the conversation evolves.
prompt = PromptTemplate(
    input_variables=["history", "context", "question"],
    template=prompt_template,
)
  • The RetrievalQA chain is a LangChain class that brings all the previous components together into a single entity (object) capable of accepting users' queries and feeding the LLM with the data required to generate the answer.
from langchain.chains import RetrievalQA

# >>> RetrievalQA chain initialization. This object coordinates actions among the LLM service, the context retriever,
# and the conversation history (memory) to generate responses to users' queries about a knowledge base.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=retriever,
    verbose=True,
    chain_type_kwargs={
        "verbose": True,
        "prompt": prompt,
        "memory": memory,
    }
)
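
As an illustrative usage sketch (the question text is made up), the assembled chain takes the user's question under the "query" key and returns the generated answer under "result":

# Illustrative query against the RAG chain.
result = qa_chain({"query": "When and where did Hurricane Otis make landfall?"})
print(result["result"])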

The chatbot interface is similar to the one shown in Figure 1.

About the Author

Enrique Corro has been with VMware for 17 years and holds a master's degree in data science from the University of Illinois. He currently works as a staff engineer at VMware's Office of the CTO. Enrique focuses on helping VMware customers run their ML workloads on VMware technologies. He also works on different initiatives for ML adoption within VMware products, services, and operations.

The following reviewers also contributed to the paper content:

  • Chen Wei, Director of Workload Technical Marketing at VMware
  • Catherine Xu, Senior Manager of Workload Technical Marketing at VMware

Feedback

Your feedback is valuable.

To comment on this paper, contact VMware Office of the CTO at genai_tech_content_feedback@vmware.com.
