May 09, 2024

Improved RAG Starter Pack: Path to Building Production-grade Apps on VMware Private AI Foundation

The recent release of VMware Private AI Foundation with NVIDIA (PAIF-N) includes multiple demos that implement essential Retrieval-Augmented Generation (RAG) pipelines. This article gives customers and partners guidance on improving those essential pipelines to take their Large Language Model (LLM) applications to the next level.

PAIF-N includes Private AI Automation Services powered by VMware Aria Automation, which a data scientist can use to deploy a RAG application for testing. Under the covers, the deployment uses multiple NVIDIA microservices, such as the NVIDIA NeMo Retriever and the NVIDIA Inference Microservice (NIM) [1].

For customers who want to take their RAG application to the next level, we have developed a starter pack of Python notebooks that provide step-by-step examples of more advanced RAG methods. These notebooks implement enhanced retrieval techniques that give LLMs more relevant context, which helps them produce more accurate and reliable responses to questions about specialized knowledge that might not be part of the LLM’s pre-training corpus. This reduces LLM hallucinations and improves the reliability of your AI-driven applications.

Note: Although the RAG approaches presented in this article can help improve the performance of essential RAG pipelines, additional risk management measures, such as content moderation and monitoring, need to be put in place before a RAG system is ready for production.

The Anatomy of the Improved RAG Starter Pack

The GitHub repository directory hosting this starter pack offers a gradual approach to implementing the different elements of a basic RAG system. It is powered by popular technologies such as LlamaIndex (LLM-based application development framework), vLLM (LLM inference service), and PostgreSQL with PGVector (vector database). Once the basic RAG system is in place, we improve it by adding more sophisticated retrieval techniques, which we explain further in this document. Finally, the different RAG approaches are evaluated with DeepEval and compared to identify the pros and cons of each approach.

The directory structure is organized as follows.


├── 01-PGVector (START HERE)  
├── 02-KB-Documents  
│   └── NASA  
├── 03-Document_ingestion  
├── 04-RAG_Variants  
│   ├── 01-Simple_Retrieval  
│   ├── 02-Sentence_Window_Retrieval  
│   └── 03-Auto_Merging_Retrieval  
├── 05-RAG_Dataset_Generation  
│   └── LlamaIndex_generation  
│       └── qa_datasets  
└── 06-RAG_System_Evaluation  

Now, let’s discuss the content for each section.

PGVector Instantiation (01)

Section 01 pulls and launches a Docker container with PGVector deployed on top of PostgreSQL. The process uses a docker-compose.yaml file to define the configuration parameters required to set up and launch PGVector and PostgreSQL. PGVector is the vector store that LlamaIndex uses to store the knowledge base (text, embeddings, and metadata), which augments the LLM's knowledge so it can produce more accurate responses to users’ queries.
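Once the container is running, client code needs a connection string that matches the settings in the compose file. The following is a minimal sketch of one way to assemble it in Python. The function name `build_pg_url` and every credential shown are hypothetical placeholders, not values taken from the starter pack.

```python
from urllib.parse import quote


def build_pg_url(user, password, host="localhost", port=5432, database="postgres"):
    """Compose a PostgreSQL connection URL, percent-encoding the credentials."""
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"postgresql://{safe_user}:{safe_password}@{host}:{port}/{database}"


# Hypothetical values: replace them with the ones defined in your compose file.
url = build_pg_url("demouser", "p@ss word", database="vectordb")
print(url)
```

Percent-encoding the credentials keeps characters such as `@` or spaces in a password from being misread as URL delimiters.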

KB Documents Download (02)

Every RAG demo and introduction uses a knowledge base to augment the generation capabilities of LLMs when they are asked about knowledge domains that might not be part of their pre-training data. For this starter pack, we decided to use ten documents from NASA’s history e-books collection, which offers an alternative to the typical (and repetitive) documents seen in RAG tutorials. This folder provides a Linux shell script that pulls the NASA e-books using wget.

Document Ingestion (03)

This section contains the initial Jupyter Notebook, where LlamaIndex is used to parse the e-books (PDF format), split them into chunks (LlamaIndex nodes), encode each node as a long vector (embedding), and store those vectors in PostgreSQL with PGVector, which acts as our vector index and query engine. The following picture illustrates how the document ingestion process flows:

Once PGVector has ingested the nodes containing metadata, text chunks, and their corresponding embeddings, it can provide a knowledge base for an LLM to generate responses about data from NASA history books.
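The ingestion flow described above (chunk, embed, store, query) can be sketched in plain Python. This is an illustration of the idea only, not the notebook's code: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, an in-memory list stands in for PGVector, and all function names and the sample text are assumptions made for this sketch.

```python
import hashlib
import math


def chunk_text(text, chunk_size=8, overlap=2):
    """Split text into chunks of `chunk_size` words that overlap by `overlap` words."""
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, last_start, step)]


def toy_embed(text, dim=64):
    """Toy embedding (assumption): hash each token into a bucket, then L2-normalize."""
    vec = [0.0] * dim
    for token in text.lower().split():
        token = token.strip(".,;:!?()\"'")
        if token:
            bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
            vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def ingest(doc_name, text, store):
    """Append one row per chunk: text, embedding, and metadata."""
    for position, chunk in enumerate(chunk_text(text)):
        store.append({
            "id": f"{doc_name}-{position}",
            "text": chunk,
            "embedding": toy_embed(chunk),
            "metadata": {"source": doc_name, "position": position},
        })


def search(query, store, top_k=2):
    """Return the top_k rows ranked by cosine similarity to the query."""
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, row["embedding"])), row) for row in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored[:top_k]]


# Placeholder text standing in for a parsed e-book page.
sample_text = (
    "This is placeholder text standing in for a parsed e-book page. "
    "Each chunk of words becomes one node with its own embedding and metadata."
)
store = []
ingest("sample-doc", sample_text, store)
for hit in search("embedding and metadata", store):
    print(hit["id"], "|", hit["text"])
```

Because the vectors are normalized at ingestion time, the dot product in `search` equals cosine similarity.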

RAG Variants Implementation (04)

This directory includes three subdirectories, each with one Jupyter Notebook that explores one of the following RAG pipeline variants powered by LlamaIndex and open-source LLMs served by vLLM:

Standard RAG Pipeline + re-ranker: This notebook implements a standard RAG pipeline using LlamaIndex, incorporating a final re-ranking step powered by a re-ranking language model. Unlike an embedding model, a re-ranker takes the query and a passage together as input and directly outputs a relevance score instead of an embedding. The RAG pipelines implemented in this notebook use the BAAI/bge-reranker-base model. The following picture illustrates how the re-ranking process works.
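The two-stage idea can be sketched without any model. In the sketch below, `toy_pair_score` is a deliberately simple stand-in for a re-ranking model: like a re-ranker, it scores the query and the passage jointly, but it uses plain term overlap rather than a neural network. The function names and sample passages are assumptions made for illustration.

```python
def toy_pair_score(query, passage):
    """Stand-in for a re-ranker: fraction of query terms that appear in the passage."""
    query_terms = set(query.lower().split())
    passage_terms = set(passage.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & passage_terms) / len(query_terms)


def rerank(query, candidates, top_n=2, score_fn=toy_pair_score):
    """Re-score first-stage candidates with a pairwise scorer and keep the best top_n."""
    scored = [(score_fn(query, passage), passage) for passage in candidates]
    # The sort is stable, so ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Candidates as they might come back from the first-stage vector search.
candidates = [
    "the shuttle program began in 1972",
    "the saturn v first launch was in 1967",
    "saturn is a planet",
]
for score, passage in rerank("saturn v first launch", candidates):
    print(f"{score:.2f}  {passage}")
```

In a real pipeline, `score_fn` would call the re-ranking model, and only the re-ranked top passages would be passed to the LLM.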

Sentence Window Retrieval: The Sentence Window Retrieval (SWR) method improves the accuracy and relevance of information extraction in RAG pipelines by focusing on a specific window of sentences surrounding a target sentence. This focused approach increases precision by filtering out irrelevant information and enhances efficiency by reducing the volume of text processed during retrieval. Developers can adjust the size of this window to tailor their searches to the needs of their specific use cases. However, the method has a potential drawback: concentrating on a narrow window risks missing critical information in adjacent text, so selecting an appropriate window size is crucial to balance the precision and completeness of the retrieval process. The Jupyter Notebook in this directory uses LlamaIndex's implementation of SWR via the Sentence Window Node Parser, which splits a document into nodes, each holding a single sentence. Each node also stores a window of the surrounding sentences in its metadata. The retrieved nodes are re-ranked before being passed to the LLM, which generates the query response from their data.
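The node structure behind SWR can be illustrated with a short, dependency-free sketch. It is not LlamaIndex's implementation: the regex-based sentence splitter is a simplification, and the dictionary layout and the key names `window` and `original_text` are assumptions chosen for this example.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_sentence_window_nodes(text, window_size=1):
    """Create one node per sentence; keep the neighboring sentences in its metadata."""
    sentences = split_sentences(text)
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append({
            "text": sentence,  # the single sentence is what gets embedded and matched
            "metadata": {
                "window": " ".join(sentences[start:end]),  # context handed to the LLM
                "original_text": sentence,
            },
        })
    return nodes


nodes = build_sentence_window_nodes("A one. B two. C three. D four.", window_size=1)
for node in nodes:
    print(repr(node["text"]), "->", repr(node["metadata"]["window"]))
```

Retrieval matches against the short `text`, while generation receives the wider `window`, which is how the method trades precision at match time against completeness at answer time. Raising `window_size` widens the context at the cost of more tokens per retrieved node.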

Auto Merging Retrieval: Auto-merging retrieval is a RAG method designed to address context fragmentation, which occurs mainly when traditional retrieval processes produce disjointed text snippets. This method introduces a hierarchical structure in which smaller text chunks are linked to larger parent chunks. During retrieval, if the number of retrieved chunks that share the same parent exceeds a certain threshold, they are automatically merged into that parent. This hierarchical merging approach ensures that the system gathers larger, coherent parent chunks instead of fragmented snippets. The notebook in this directory uses LlamaIndex’s AutoMergingRetriever to implement this RAG variant.
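The merging rule can be sketched as follows. This is a simplified illustration, not the AutoMergingRetriever code: it handles a single parent level, assumes the retrieved ids are unique, and uses hypothetical names (`auto_merge`, `parent_of`, `children_of`) and a hypothetical ratio threshold of 0.5.

```python
from collections import defaultdict


def auto_merge(retrieved_ids, parent_of, children_of, threshold=0.5):
    """Swap retrieved leaf chunks for their parent when enough siblings were retrieved."""
    hits_per_parent = defaultdict(list)
    for leaf_id in retrieved_ids:
        hits_per_parent[parent_of[leaf_id]].append(leaf_id)

    merged, emitted = [], set()
    for leaf_id in retrieved_ids:
        parent = parent_of[leaf_id]
        ratio = len(hits_per_parent[parent]) / len(children_of[parent])
        # Merge only when the retrieved share of the parent's children exceeds the threshold.
        target = parent if ratio > threshold else leaf_id
        if target not in emitted:
            emitted.add(target)
            merged.append(target)
    return merged


# Hypothetical hierarchy: two parent chunks and their leaf chunks.
children_of = {"P1": ["a", "b", "c"], "P2": ["d", "e", "f", "g"]}
parent_of = {leaf: parent for parent, leaves in children_of.items() for leaf in leaves}

# "a" and "b" cover 2/3 of P1, so they merge; "d" covers only 1/4 of P2 and stays a leaf.
print(auto_merge(["a", "d", "b"], parent_of, children_of))
```

A lower threshold merges more aggressively and returns longer, more coherent context; a higher one keeps results closer to the original fine-grained hits.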

Evaluation Dataset Generation (05)

The Jupyter Notebook in this folder shows how to use LlamaIndex's RagDatasetGenerator to generate a question/answer dataset. Evaluation frameworks such as DeepEval can use this dataset to assess the quality of RAG pipelines and to determine how changes in critical components, such as LLMs, embedding models, re-ranking models, vector stores, and retrieval algorithms, affect the quality of the generation process.

In this example, we generated the Q&A test set using mistralai/Mixtral-8x7B-Instruct-v0.1, one of the most capable open-source models available as of Q2 2024. Because this test set is the reference against which our RAG pipelines are evaluated, it should be generated with the most powerful model available.
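A generated test set is essentially a list of question/answer records that can be saved and reloaded for later evaluation runs. The sketch below shows one possible shape for such records using only the standard library. The class name `QAExample` and its field names are assumptions for illustration and do not describe the file format the notebook produces.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class QAExample:
    """One evaluation record: a question, its reference answer, and the source contexts."""
    query: str
    reference_answer: str
    reference_contexts: list = field(default_factory=list)


def dump_dataset(examples):
    """Serialize the records to a JSON string."""
    return json.dumps([asdict(example) for example in examples], indent=2)


def load_dataset(payload):
    """Rebuild the records from a JSON string."""
    return [QAExample(**item) for item in json.loads(payload)]


# Placeholder record; real records would be produced by the generator LLM.
examples = [
    QAExample(
        query="<generated question>",
        reference_answer="<generated answer>",
        reference_contexts=["<text chunk the question was generated from>"],
    )
]
payload = dump_dataset(examples)
print(payload)
```

Keeping the source contexts next to each question makes it possible to check later whether a retriever surfaced the passages the question was written from.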

RAG Pipeline Evaluation (06)

This folder contains the Jupyter Notebook that uses DeepEval to evaluate the previously implemented RAG pipelines. For this purpose, DeepEval uses the evaluation dataset generated in the previous step. Here is a brief description of the DeepEval metrics used to compare the different RAG pipeline implementations:

Contextual Precision measures the quality of your RAG pipeline's retriever by evaluating whether nodes in your retrieval context that are relevant to the given input are ranked higher than irrelevant ones. DeepEval's contextual precision metric is a self-explaining LLM-Eval, which means it outputs a reason for its metric score.

Faithfulness measures the quality of your RAG pipeline's generator by evaluating whether the actual output factually aligns with the contents of your retrieval context. DeepEval's faithfulness metric also outputs a reason for its metric score.

Contextual Recall measures the quality of your RAG pipeline's retriever by evaluating the extent to which the retrieval context aligns with the expected output. DeepEval's contextual recall metric also outputs a reason for its metric score.

Answer Relevancy measures how relevant the actual output of your RAG pipeline is to the provided input. DeepEval's answer relevancy metric also outputs a reason for its metric score.
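To make the ranking-sensitive nature of Contextual Precision concrete, here is a small sketch of a weighted cumulative precision, which is how we understand DeepEval's documentation to define the score. In DeepEval the per-node relevance verdicts come from the judge LLM; in this sketch they are supplied by hand as booleans, and the function name is an assumption.

```python
def contextual_precision(relevance):
    """Weighted cumulative precision over a ranked list of relevance verdicts.

    `relevance` holds one boolean per retrieved node, in ranked order
    (True means the node was judged relevant to the input).
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    relevant_so_far = 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            # Precision at rank k only counts at positions that hold a relevant node.
            score += relevant_so_far / k
    return score / total_relevant


# The same two relevant nodes score higher when they are ranked earlier.
ranked_early = contextual_precision([True, True, False])
ranked_late = contextual_precision([False, True, True])
print(ranked_early, ranked_late)
```

The comparison shows why a re-ranker can raise this metric without retrieving any new nodes: moving relevant nodes toward the top is enough.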

The DeepEval evaluations were executed using the following setup:

RAG generator LLM: HuggingFaceH4/zephyr-7b-alpha

RAG retriever model: BAAI/bge-base-en

RAG re-ranker: BAAI/bge-reranker-large

Judge LLM to apply DeepEval’s metrics: GPT-3.5 Turbo

| RAG Implementation | Contextual Precision Score | Contextual Recall Score | Answer Relevancy Score | Faithfulness Score |
| --- | --- | --- | --- | --- |
| Standard RAG + Re-ranker | 0.76 | 0.80 | 0.76 | 0.67 |
| Sentence Window Retrieval | 0.88 | 0.68 | 0.90 | 0.68 |
| Auto Merging | 0.67 | 0.74 | 0.93 | 0.77 |

As the table demonstrates, each RAG implementation performs better on some metrics than on others, which indicates the use cases each one suits best. Additionally, the evaluation metrics help identify which components of your RAG pipeline need adjustment to raise the performance of the whole pipeline.
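The comparison can be made explicit with a few lines of Python that pick the best-scoring implementation per metric. The scores are copied from the results reported above; the dictionary layout and the function name `best_per_metric` are assumptions for illustration.

```python
scores = {
    "Standard RAG + Re-ranker": {
        "contextual_precision": 0.76, "contextual_recall": 0.80,
        "answer_relevancy": 0.76, "faithfulness": 0.67,
    },
    "Sentence Window Retrieval": {
        "contextual_precision": 0.88, "contextual_recall": 0.68,
        "answer_relevancy": 0.90, "faithfulness": 0.68,
    },
    "Auto Merging": {
        "contextual_precision": 0.67, "contextual_recall": 0.74,
        "answer_relevancy": 0.93, "faithfulness": 0.77,
    },
}


def best_per_metric(results):
    """Map each metric name to the implementation with the highest score."""
    metrics = next(iter(results.values())).keys()
    return {
        metric: max(results, key=lambda name: results[name][metric])
        for metric in metrics
    }


for metric, winner in best_per_metric(scores).items():
    print(f"{metric}: {winner} ({scores[winner][metric]:.2f})")
```

No single implementation leads on every metric, so the choice depends on whether ranking quality, coverage, or answer quality matters most for the use case.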


The Improved RAG Starter Pack provides a valuable toolkit for those implementing RAG systems, featuring a series of well-documented Python notebooks designed to enhance LLMs by deepening contextual understanding. This pack includes advanced retrieval techniques and tools like DeepEval for system evaluation, which help reduce issues such as LLM hallucinations and improve the reliability of AI responses. The GitHub repository is well-structured, offering users clear step-by-step guidance that is easy to follow, even for non-data scientists. We hope our PAIF-N customers and partners find it helpful for getting started with running GenAI applications on top of VMware Cloud Foundation infrastructure. Stay tuned for upcoming articles that discuss critical aspects of production RAG pipelines related to safety and security.

