In our previous post we provided a high-level overview of Foundation Models, including Large Language Models and Diffusion Models. We also discussed how Large Language Models can be applied in practice through three primary approaches: In-Context Learning, Fine-Tuning, and Pre-Training.
In this article, we take a closer look at the Retrieval Augmented Generation (RAG) framework, which takes advantage of an LLM's In-Context Learning capability to connect the model to an external data source that was not present in its training corpus.
What is RAG and Why is it important?
Large Language Models have limitations, such as a limited context window and a knowledge cut-off date. To overcome these limitations, we can use Retrieval-Augmented Generation (RAG): the idea of giving LLMs extra data from an external information source, allowing them to produce more precise and contextual answers while reducing hallucinations.
LLMs are trained on large amounts of data to acquire a wide range of general knowledge, which is stored in the neural network's weights, also known as parametric knowledge. However, if we ask an LLM for an output or answer that requires information absent from its training data because it is recent, proprietary, or domain-specific, it may simply hallucinate factual inaccuracies.
Therefore, it is critical to fill the gap between the LLM's general knowledge and this additional knowledge, so that the LLM can produce more accurate and contextual results while reducing hallucinations.
We can adapt the model to fill this knowledge gap by fine-tuning it. Although this is a traditional and effective technique, it requires significant computing power, technical expertise, and cost, and it is comparatively less agile in adapting to new knowledge.
This is where Retrieval-Augmented Generation (RAG) comes into play:
Retrieval: Finding references in newer data, proprietary data, or domain-specific data.
Augmented: Adding references to the prompt along with the user query.
Generation: Feeding the retrieval-augmented prompt to the LLM to improve its output.
RAG can be broken down into two primary steps:
- Dense Vector Retrieval (R): Using vector embeddings to perform the retrieval process.
- In-Context Learning (AG): Augmenting the prompt by placing the retrieved information into the context window of the LLM, enabling it to generate a relevant output or answer. This constitutes the in-context learning component.
In the previous post, we discussed In-Context Learning. In the context of RAG, we leverage this In-Context Learning ability of LLMs using the retrieval process. After retrieval, the augmentation occurs within the prompt template, which is then fed into the LLM. The LLM is engaged only in the final stage of the RAG system. Retrieval and augmentation occur prior to generation, which ultimately constitutes the question-answering phase.
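Conceptually, the whole flow fits in a few lines. Below is a minimal sketch of the retrieve, augment, and generate steps; `retriever.search` and `llm.generate` are hypothetical placeholders for whichever retriever and LLM client your stack provides.

```python
# Minimal sketch of the RAG flow: retrieve -> augment -> generate.
# `retriever.search` and `llm.generate` are hypothetical placeholders.

PROMPT_TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {question}
Answer:"""

def rag_answer(question: str, retriever, llm, top_k: int = 3) -> str:
    # 1. Retrieval: fetch the chunks most similar to the question.
    chunks = retriever.search(question, top_k=top_k)
    # 2. Augmentation: place the retrieved chunks into the prompt template.
    prompt = PROMPT_TEMPLATE.format(context="\n\n".join(chunks), question=question)
    # 3. Generation: the LLM sees the augmented prompt only at this final step.
    return llm.generate(prompt)
```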
Let’s delve deeper into the Retrieval process.
Dense Vector Retrieval:
The retrieval process consists of three steps:
- Ask a query.
- Search the database(s) for information similar to the query.
- Return the retrieved information.
Vector Database:
To achieve this, we often use a Vector database, typically referred to as an index. While there are many types of indexes, the vector database is just one possible type.
Vector Embedding:
Vector embedding is the process of representing words, sentences, or entire documents as dense, low-dimensional vectors in a mathematical space. These vector embeddings capture the semantic relationships between words, enabling algorithms to understand their contextual meaning. They provide a compact and meaningful representation of textual data.
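To make this concrete, here is a small sketch of computing embeddings and comparing them, assuming the `sentence-transformers` package and the `all-MiniLM-L6-v2` model (any embedding model would work the same way).

```python
# A small sketch of vector embeddings, assuming the sentence-transformers
# package and the "all-MiniLM-L6-v2" model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "A kitten rested on the rug.",
    "Quarterly revenue grew by 12%.",
]
# Each sentence becomes a dense vector; for this model the shape is (3, 384).
embeddings = model.encode(sentences, normalize_embeddings=True)

# With normalized vectors, the dot product is the cosine similarity: the two
# semantically similar sentences score much higher than the unrelated one.
print(embeddings @ embeddings.T)
```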
Embedding model:
The embedding model transforms raw text into dense vector representations (vector embeddings) that capture semantic meaning. When building a vector database:
- Split documents into chunks.
- Create embeddings for each chunk by processing them through an embedding model.
- Store those embeddings in the vector database.
This process allows us to query the vector database. When we ask a question, we convert it into a vector, search for similar vectors within the database, and return the relevant context. This is the retrieval aspect of RAG, where retrievers in frameworks like LangChain and LlamaIndex come into play. The retrieved context is then incorporated into the prompt for the final output.
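The sketch below ties these ingestion and query steps together with a toy in-memory index. It again assumes `sentence-transformers`; the Python list standing in for the store, the fixed-size `chunk` helper, and the placeholder documents are illustrative only, and a real system would use a vector store such as FAISS, Chroma, or a managed service.

```python
# Toy in-memory "vector database": chunk documents, embed each chunk, store the
# embeddings, then answer queries by cosine similarity. Illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def chunk(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    # Naive fixed-size chunking with a small overlap between neighbouring chunks.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

documents = ["<document 1 text>", "<document 2 text>"]          # placeholder corpus
chunks = [c for doc in documents for c in chunk(doc)]
index = model.encode(chunks, normalize_embeddings=True)         # ingestion: store embeddings

def retrieve(query: str, top_k: int = 3) -> list[str]:
    q = model.encode([query], normalize_embeddings=True)        # embed the query
    scores = (index @ q.T).ravel()                              # cosine similarity
    return [chunks[i] for i in np.argsort(-scores)[:top_k]]     # most similar chunks first
```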
In simple terms, retrieval involves:
- Ask a query: Convert the text query into a vector embedding using an embedding model.
- Search a database: Use a vector database.
- Look for similarity: Perform vector similarity searches.
Finally, the retrieved information needs to be returned as natural language again: the retrieval process finds the relevant data and returns it in natural-language form so that it can be passed to the LLM.
How can RAG be improved?
There are various techniques to transition from simple to advanced RAG. We will begin with basic techniques and then explore more advanced options.
Basic Techniques:
- Prompt Engineering – Analyze and customize the prompts used in the RAG system using advanced techniques.
- Choosing the right embedding model – Experiment with different embedding models, including those tailored for specific domains or languages.
- Customize the chunk size – Adjust the size of document chunks and their overlap during ingestion, impacting the calculated embeddings.
- Hybrid search – Combine results from both vector embedding similarity and keyword search methods in the retrieval process (see the sketch below).
- Metadata filters – Attach metadata to documents when sending them to the vector store. This metadata is useful later for tracking the sources of answers, and it can also be used at query time to filter data before performing similarity searches.
These techniques can be implemented with frameworks such as LangChain, LlamaIndex, and others, which often include built-in features to support them.
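As an illustration of the hybrid search idea, here is a minimal sketch that blends a dense similarity score with a simple keyword-overlap score. Real systems typically use BM25 for the keyword side; the plain overlap score, the `embed` placeholder, and the `alpha` weight are assumptions made to keep the example self-contained.

```python
# Minimal hybrid-search sketch: combine dense (vector) similarity with a crude
# keyword-overlap score. `embed` is a placeholder for any embedding function
# that returns normalized vectors; `chunk_vecs` holds the chunk embeddings.
import numpy as np

def keyword_score(query: str, chunk: str) -> float:
    q_terms, c_terms = set(query.lower().split()), set(chunk.lower().split())
    return len(q_terms & c_terms) / max(len(q_terms), 1)

def hybrid_search(query, chunks, chunk_vecs, embed, alpha=0.5, top_k=3):
    dense = chunk_vecs @ embed(query)                             # cosine similarity per chunk
    sparse = np.array([keyword_score(query, c) for c in chunks])  # keyword overlap per chunk
    scores = alpha * dense + (1 - alpha) * sparse                 # weighted combination
    return [chunks[i] for i in np.argsort(-scores)[:top_k]]
```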
Advanced Techniques:
Fine-Tuning Embeddings Model
This technique enhances the RAG system's ability to retrieve the most relevant documents, thereby improving its overall performance. Fine-tuning the embedding model is especially important when the RAG system deals with specialized vocabulary: ensuring that the system can effectively handle complex or domain-specific terms is paramount.
To fine-tune, we first need to build training, validation, and test sets of question-and-retrieved-context pairs. We can then use a loss function that takes all the positive pairs and automatically augments the dataset with negative pairs. Simply put, when a question is posed, retrieving an incorrect or irrelevant context serves as a negative pair.
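A hedged sketch of this idea using the `sentence-transformers` library is shown below; the training pairs are hypothetical, and `MultipleNegativesRankingLoss` is one common choice that treats the other in-batch contexts as automatic negatives, matching the positive-pairs-plus-negatives setup described above.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Hypothetical training pairs: (question, context that answers it).
train_pairs = [
    ("What is the refund window?", "Refunds are accepted within 30 days of purchase."),
    ("How do I reset my password?", "Use the 'Forgot password' link on the login page."),
]

model = SentenceTransformer("all-MiniLM-L6-v2")
train_examples = [InputExample(texts=[q, ctx]) for q, ctx in train_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# MultipleNegativesRankingLoss treats every other context in the batch as a
# negative for a given question, i.e. the negatives are generated automatically.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("fine-tuned-embedding-model")
```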
Once the embedding model is fine-tuned, we can evaluate it against a baseline model using metrics such as Mean Average Precision at K (MAP@K), Normalized Discounted Cumulative Gain at K (NDCG@K), Precision, and Accuracy to assess the effectiveness of the retrieval process.
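For the retrieval metrics, a small pure-Python sketch of MAP@K is given below; `results` and `relevant` are hypothetical mappings from each query to, respectively, the ranked chunk ids the retriever returned and the set of chunk ids that actually answer it.

```python
# MAP@K sketch for comparing a fine-tuned embedding model against a baseline.
def average_precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    hits, score = 0, 0.0
    for i, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            hits += 1
            score += hits / i                      # precision at this cut-off
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(results: dict, relevant: dict, k: int = 5) -> float:
    # Mean of the per-query average precision values.
    return sum(average_precision_at_k(results[q], relevant[q], k) for q in results) / len(results)

# Hypothetical example: for query "q1" the correct chunk is ranked second.
print(map_at_k({"q1": ["c7", "c2", "c9"]}, {"q1": {"c2"}}))      # 0.5
```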
Fine-Tuning LLM
Fine-tuning the LLM is another advanced technique that can be used to improve the RAG system. It involves taking the base model from initial training (which could be a general chat model or an instruction-tuned model) and training it on specific examples tailored to the domain, product, language, etc., where our RAG system will be implemented. This process updates the language model to excel at generating outputs that are relevant to its specific use case, thereby enhancing the overall performance of the RAG system.
A brief overview of fine-tuning was provided in the previous post within the context of LLM capabilities. We will delve deeper into this topic in an upcoming post, covering techniques such as RLHF, RLAIF, PEFT, LoRA, etc. For the completeness of this discussion, it is worth mentioning here that fine-tuning is another advanced technique used to enhance RAG.
Advanced Retrieval Methods
There are a few advanced retrieval methods outlined below, which can be implemented using frameworks such as LangChain and LlamaIndex:
Re-ranking: This approach reorders and filters the retrieved documents so that the most relevant ones appear at the top, prioritizing contexts that are likely to provide accurate and relevant answers.
Multi-query: This method is an improved solution to mitigate strong query dependency and enhance result consistency. It retrieves multiple sets of documents based on varied interpretations of the original query. This is particularly advantageous when dealing with vague or imprecisely formulated queries.
RAG-Fusion: This is an advanced iteration of the above multi-query method. It works by expanding a single user query into multiple related queries, then using each query to perform a vector search and retrieve a variety of documents. The documents are then re-ranked using a ranking algorithm (Reciprocal Rank Fusion) to ensure the most relevant information is prioritized.
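The fusion step itself is simple. Here is a sketch of Reciprocal Rank Fusion: each document's fused score is the sum of 1 / (k + rank) over the ranked lists produced by the different query variants, with k = 60 as a commonly used constant (the document ids in the example are hypothetical).

```python
# Reciprocal Rank Fusion (RRF), the re-ranking step used in RAG-Fusion.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60, top_n: int = 5) -> list[str]:
    scores = defaultdict(float)
    for ranking in ranked_lists:                       # one ranked list per query variant
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)         # documents ranked high in many lists win
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical rankings from three reformulations of the same question:
print(reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d4"], ["d2", "d1"]]))
```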
Small-to-Big Retrieval: This method uses smaller text chunks during the retrieval process and then provides the larger text chunk to which the retrieved text belongs, so the returned context is expanded with the surrounding area before generating the final output.
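A minimal sketch of the small-to-big idea is shown below; `retrieve_small` stands in for any dense retriever over the small chunks, and `parent_of` is an assumed mapping from each small-chunk id to the larger parent chunk it came from.

```python
# Small-to-Big sketch: retrieve over small chunks, but pass the LLM the larger
# parent chunk each small chunk belongs to.
def small_to_big(query: str, retrieve_small, parent_of: dict, top_k: int = 3) -> list[str]:
    contexts = []
    for chunk_id in retrieve_small(query, top_k=top_k):   # ids of the best-matching small chunks
        parent = parent_of[chunk_id]                       # expand to the surrounding larger chunk
        if parent not in contexts:                         # de-duplicate shared parents
            contexts.append(parent)
    return contexts
```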
Evaluating RAG through Retrieval Analysis:
When evaluating RAG, several key metrics come into play, including Correctness, Semantic Similarity, Faithfulness, and Context Relevancy.
Faithfulness and Context Relevancy:
In retrieval analysis, faithfulness and context relevancy are the two key metrics. Faithfulness measures whether the answer is grounded in and consistent with the retrieved context. Context relevancy measures whether the retrieved context and the answer are relevant to the question; it is essential that the question guides this process. Both metrics are crucial for assessing retrieval quality: when we optimize retrieval, generation should ideally improve as well, and any evaluation tools we use should be able to demonstrate improvements in these metrics.
Correctness and Semantic Similarity:
Correctness and semantic similarity are measured on the generation side. Here, we compare the generated answer against a reference answer, which serves as the standard of acceptability. Utilizing powerful models such as GPT-4 allows us to generate reference answers when human-created ground truth datasets are insufficient.
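As a hedged illustration of the semantic similarity check, the sketch below embeds the generated answer and the reference answer and compares them with cosine similarity, again assuming `sentence-transformers`; in practice, evaluation frameworks provide their own evaluators for this.

```python
# Semantic similarity on the generation side: compare the generated answer to a
# reference answer via embedding cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(generated: str, reference: str) -> float:
    emb = model.encode([generated, reference], normalize_embeddings=True)
    return float(util.cos_sim(emb[0], emb[1]))

print(semantic_similarity("Paris is the capital of France.",
                          "The capital of France is Paris."))
```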
To conclude
We explored Foundation Models previously, highlighting their practical applications like In-Context Learning, Fine-Tuning, and Pre-Training. This article focused on Retrieval-Augmented Generation (RAG), which enhances Large Language Models by incorporating external data for more accurate outputs. RAG mitigates issues such as limited context and outdated information by integrating current, domain-specific data. Techniques such as fine-tuning and advanced retrieval methods are pivotal in optimizing RAG’s performance, assessed through metrics like Correctness, Semantic Similarity, Faithfulness, and Context Relevancy across retrieval and generation phases.