Retrieval Augmented Generation (RAG) for Everyone
What is RAG
RAG Components
Advantages of Retrieval Augmented Generation
Systematic RAG Workflow
RAG Retrieval Sources
RAG Tutorial
Evolution of RAG Over Time
RAG Design Choices
Chunking Strategies in RAG
RAG Using LangChain
Advanced RAG
Reranking in RAG
Types of Embedding Models for RAG
Semantic Chunking in RAG Applications
Retrieval Pain Points in RAG
RAG Enhancement Techniques
New to the world of Retrieval Augmented Generation (RAG)? We've got you covered with this
in-depth guide.
Large language models (LLMs) are becoming the backbone of many organizations as the whole world makes the transition towards AI. While LLMs are trending for all the right reasons, they also pose some risks if not used properly. Yes, LLMs can sometimes produce responses that aren't expected: they can be fake or made-up information, or even biased. This can happen for various reasons. We call this process of LLMs generating misinformation hallucination.
There are some notable approaches to mitigating LLM hallucinations, such as fine-tuning, prompt engineering and retrieval augmented generation (RAG). RAG has been the most talked-about approach to mitigating the hallucinations faced by large language models. Today we will see everything about the RAG approach: what it is, how it works and why it matters.
What is RAG
Retrieval Augmented Generation (RAG) is an approach that supplements an LLM with relevant information retrieved from external knowledge sources, helping improve response accuracy and relevance and mitigating issues like misinformation and outdated knowledge in generated content. So, RAG basically reduces LLM hallucinations by providing contextually relevant responses grounded in the data sources you provide/attach.
RAG Components
The RAG pipeline involves three critical components: the Retrieval component, the Augmentation component and the Generation component.
● Retrieval: This component helps you fetch the relevant information from an external knowledge base, such as a vector database, for any given user query. This component is crucial, as it is the first step in curating meaningful and contextually correct responses.
● Augmentation: This part involves enhancing and adding more relevant context to the user's query using the retrieved information.
● Generation: Finally, a final output is presented to the user with the help of a large language model (LLM). The LLM uses its own knowledge and the provided context to come up with an apt response to the user's query.
Advantages of Retrieval Augmented Generation
● Scalability. The RAG approach helps you scale models by simply updating or adding external data sources, without retraining the entire model.
● Memory efficiency. Traditional models like GPT have limits when it comes to pulling fresh and updated information, and fail to be memory efficient. RAG leverages external databases like a vector database, allowing it to pull in fresh, updated or detailed information quickly when needed.
● Flexibility. By updating or expanding the external knowledge source, you can adapt
RAG to build any AI applications with flexibility.
Systematic RAG Workflow
The systematic RAG workflow consists of three modules: the Retrieval module, Augmentation module and Generation module (as discussed above).
First, the document which forms the source database is divided into chunks. These chunks, transformed into vectors using an embedding model from OpenAI or one of the open source models available from the Hugging Face community, are then stored in a high-dimensional vector database (e.g., SingleStore or Chroma).
When the user inputs a query, the query is embedded into a vector using the same embedding model. Then, chunks whose vectors are closest to the query vector, based on some similarity metric (e.g., cosine similarity), are retrieved. This process is contained in the retrieval module shown in the figure. After that, the retrieved chunks are appended to the user's query in the augmentation module. This step is critical for making sure that the records from the retrieved documents are effectively incorporated with the query. Then, the output from the augmentation module is fed to the generation module, which is responsible for generating an accurate answer to the query by utilizing the retrieved chunks and the prompt through an LLM (like ChatGPT by OpenAI, models from Hugging Face, or Gemini by Google).
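To make the three modules concrete, here is a minimal, illustrative sketch of the retrieve, augment and generate flow in Python. The embedding model, the sample chunks and the prompt wording are assumptions for demonstration only, not part of any specific product:

```python
# Minimal retrieve -> augment -> generate sketch (illustrative only).
import numpy as np
from sentence_transformers import SentenceTransformer

chunks = [
    "SingleStore supports vector search alongside standard SQL.",
    "Perovskite tandem solar cells recently set new efficiency records.",
    "LangChain chains together prompts, retrievers and LLMs.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

query = "What's the latest breakthrough in renewable energy?"
query_vec = model.encode(query, normalize_embeddings=True)

# Retrieval: cosine similarity is a dot product of normalized vectors; keep the top-k chunks.
scores = chunk_vecs @ query_vec
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:2]]

# Augmentation: stuff the retrieved chunks into the prompt.
context = "\n".join(top_chunks)
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Generation: pass the augmented prompt to the LLM of your choice.
print(prompt)
```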
But to make RAG work perfectly, here are some key points to consider:
1. Quality of External Knowledge Source: The quality and relevance of the external
knowledge source used for retrieval are crucial.
2. Embedding Model: The choice of the embedding model used for retrieving relevant
documents or passages from the knowledge source is important.
3. Chunk Size and Retrieval Strategy: Experiment with different chunk sizes to find the optimal
length for context retrieval. Larger chunks may provide more context but could also introduce
irrelevant information. Smaller chunks may focus on specific details but might lack broader
context.
4. Integration with Language Model: The way the retrieved information is integrated with the
language model's generation process is crucial. Techniques like cross-attention or
memory-augmented architectures can be used to effectively incorporate the retrieved
information into the model's output.
5. Evaluation and Fine-tuning: Evaluating the performance of the RAG model on relevant
datasets and tasks is important to identify areas for improvement. Fine-tuning the RAG model
on domain-specific or task-specific data can further enhance its performance.
6. Ethical Considerations: Ensure that the external knowledge source is unbiased and does
not contain offensive or misleading information.
7. Handling Out-of-Date or Incorrect Information: It's important to have strategies in place for
handling situations where the retrieved information is out-of-date or incorrect.
Use SingleStore Database as your vector store, try for free: https://bit.ly/SingleStoreDB
RAG Retrieval Sources
Do you know how RAG applications acquire external knowledge?
RAG systems can leverage various types of retrieval sources to acquire external knowledge.
⮕ Unstructured Data (Text): Plain text is the most widely used retrieval source, covering web pages, articles and other textual sources.
⮕ Semi-Structured Data (PDF): PDF documents, such as research papers, reports, and
manuals, contain a mix of textual and structural information.
⮕ Structured Data (Knowledge Graphs): Knowledge graphs, such as Wikipedia and Freebase, provide structured, curated knowledge in the form of entities and their relationships.
⮕ LLM-Generated Content: Recent advancements have shown that LLMs themselves can
generate high-quality content that can be used as a retrieval source. This approach leverages
the knowledge captured within the LLM's parameters to generate relevant information.
All this data gets converted into embeddings and stored in a vector database. When a user query comes in, it also gets converted into an embedding (the query embedding), and the most relevant answer is retrieved using semantic search. The vector database becomes the knowledge base to search for the contextually relevant answer.
Additionally, one more aspect to consider is retrieval granularity. It refers to the level at which
knowledge is retrieved from the sources.
⮕ Sentence-Level Retrieval: This retrieves individual sentences or short passages that contain relevant information. It strikes a balance between specificity and context, making it suitable for a wide range of tasks.
⮕ Chunk-Level Retrieval: Chunk-level retrieval involves retrieving larger chunks of text, such as
paragraphs or sections. It provides more comprehensive information and context but may
introduce noise and irrelevant details.
⮕ Document-Level Retrieval: Document-level retrieval retrieves entire documents that are
relevant to the query. While it offers the most extensive context, it may require additional
processing to extract the most pertinent information.
Know more about knowledge retrieval in RAG: https://ingestai.io/blog/knowledge-retrieval-in-rag
RAG Tutorial
Let’s build a simple AI application that can fetch the contextually relevant information from our
own data for any given user query.
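The full tutorial is linked later in this guide; as a rough sketch of what such an application looks like with LangChain and SingleStore as the vector store. The file name, table name and model are placeholder assumptions, and it assumes OPENAI_API_KEY and SINGLESTOREDB_URL are set:

```python
# A hedged end-to-end RAG sketch: load -> chunk -> embed -> store -> retrieve -> generate.
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import SingleStoreDB
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = TextLoader("my_data.txt").load()  # placeholder file with your own data
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Embed the chunks and store them in SingleStore.
vectorstore = SingleStoreDB.from_documents(chunks, OpenAIEmbeddings(), table_name="rag_demo")

# Retrieve the most relevant chunks for a query and let the LLM answer from them.
llm = OpenAI(model="gpt-3.5-turbo-instruct")
query = "What does my data say about renewable energy?"
context = "\n\n".join(d.page_content for d in vectorstore.similarity_search(query, k=3))
print(llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {query}"))
```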
Evolution of RAG Over Time
Let's talk about the RAG evolution over time.
1. Naive RAG:
The Naive RAG research paradigm represents the earliest methodology, which gained
prominence shortly after the widespread adoption of ChatGPT. The Naive RAG follows a
traditional process that includes indexing, retrieval, and generation. It is also characterized as a
“Retrieve-Read” framework [Ma et al., 2023a].
2. Advanced RAG:
Advanced RAG has been developed with targeted enhancements to address the shortcomings of Naive RAG. In terms of retrieval quality, Advanced RAG implements pre-retrieval and post-retrieval strategies and refines indexing through techniques such as sliding windows, fine-grained segmentation, and metadata. It has also introduced various methods to optimize the retrieval process [ILIN, 2023].
3. Modular RAG:
The modular RAG structure diverges from the traditional Naive RAG framework, providing
greater versatility and flexibility. It integrates various methods to enhance functional modules,
such as incorporating a search module for similarity retrieval and applying a fine-tuning
approach in the retriever [Lin et al., 2023].
Restructured RAG modules [Yu et al., 2022] and iterative methodologies like [Shao et al., 2023]
have been developed to address specific issues. The modular RAG paradigm is increasingly
becoming the norm in the RAG domain, allowing for either a serialized pipeline or an end-to-end
training approach across multiple modules.
This comprehensive review paper offers a detailed examination of the progression of RAG
paradigms, encompassing the Naive RAG, the Advanced RAG, and the Modular RAG.
RAG Design Choices
When designing the indexing step, there are a few design choices to make:
• Data processing mode
• Indexing model
• Text splitting method
• Chunking hyperparameters
The best embedding models might be different from the best LLMs in general.
When designing the storing step of a RAG pipeline, the two most important decisions are:
• Database choice
• Metadata selection
Finding a vector database can sometimes be confusing because of how many databases are available today. SingleStore started supporting vector storage back in 2017. I would highly recommend choosing SingleStore as your vector database for all your AI/ML applications.
[ Try SingleStore for free: https://bit.ly/SingleStoreDB ]
There are a few things you would need to think about when designing the retrieval step:
• Retrieval strategy
• Retrieval hyperparameters
• Query transformations
Know in detail about each step and useful considerations in this original guide:
https://towardsdatascience.com/designing-rags-dbb9a7c1d729
Chunking Strategies in RAG
But the question is, what should be the right chunking strategy?
Chunking is the method of breaking down large files into more manageable segments/chunks so that LLM applications get proper context and retrieval is easy.
In a video on YouTube, Greg Kamradt provides an overview of different chunking strategies. Let's understand them one by one.
They have been classified into five levels based on the complexity and effectiveness.
⮕ Level 1 : Fixed Size Chunking
This is the crudest and simplest method of segmenting the text. It breaks down the text into chunks of a specified number of characters, regardless of their content or structure. The LangChain and LlamaIndex frameworks offer the CharacterTextSplitter and SentenceSplitter (which defaults to splitting on sentences) classes for this chunking technique.
⮕ Level 2 : Recursive Chunking
Recursive chunking offers an alternative. In this method, we divide the text into smaller chunks in a hierarchical and iterative manner using a set of separators. The LangChain framework offers the RecursiveCharacterTextSplitter class, which splits text using the default separators ("\n\n", "\n", " ", ""); a short code sketch follows these levels.
⮕ Level 3 : Document Based Chunking
Here, chunk boundaries follow the document's own layout (for example, Markdown headers or code blocks), so that each chunk respects the document's inherent structure.
⮕ Level 4 : Semantic Chunking
All of the above levels deal with the content and structure of documents and necessitate maintaining a constant chunk size. Semantic chunking instead aims to extract semantic meaning from embeddings and then assess the semantic relationship between these chunks. The core idea is to keep together chunks that are semantically similar. LlamaIndex has a SemanticSplitterNodeParser class that allows you to split the document into chunks based on the contextual relationship between chunks.
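As promised above, here is a small LangChain sketch of recursive chunking. The sample text and chunk sizes are illustrative values to experiment with:

```python
# Recursive chunking with LangChain's RecursiveCharacterTextSplitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = (
    "RAG systems chunk documents before embedding them.\n\n"
    "Smaller chunks are more precise; larger chunks carry more surrounding context."
)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=80,                        # target characters per chunk
    chunk_overlap=10,                     # overlap preserves context across boundaries
    separators=["\n\n", "\n", " ", ""],   # the default hierarchy of separators
)
for chunk in splitter.split_text(text):
    print(repr(chunk))
```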
RAG Using LangChain
LangChain is a powerful framework for LLM-powered applications.
1. It provides a standard interface for chains, enabling developers to create sequences of calls
that go beyond a single LLM call.
3. It simplifies the process of working with LLMs and provides tools for prompt management,
memory, indexing, and agent-based decision-making.
4. Once the data is stored in the database, LangChain supports various retrieval algorithms.
These include basic semantic search, parent document retriever, self-query retriever, ensemble
retriever, and more.
5. When conducting a search, the retrieval system assigns a score or ranking to each document
based on its relevance to the query.
Here is my guide on implementing RAG using LangChain: A Step-by-Step Guide - https://levelup.gitconnected.com/implementing-rag-using-langchain-and-singlestore-a-step-by-step-guide-2a579da1de0c
Advanced RAG
Let's use a simple query example from the basic RAG explanation, "What's the latest breakthrough in renewable energy?", to better understand these advanced techniques.
⮕ Pre-retrieval optimizations: Before the system begins to search, it optimizes the query for
better outcomes. For our example, Query Transformations and Routing might break down the
query into sub-queries like “latest renewable energy breakthroughs” and “new technology in
renewable energy.”
This ensures the search mechanism is fine-tuned to retrieve the most accurate and relevant
information.
⮕ Enhanced retrieval techniques: During the retrieval phase, Hybrid Search combines
keyword and semantic searches, ensuring a comprehensive scan for information related to our
query. Moreover, by Chunking and Vectorization, the system breaks down extensive documents
into digestible pieces, which are then vectorized.
This means our query doesn’t just pull up general information but seeks out the precise
segments of texts discussing recent innovations in renewable energy.
⮕ Post-retrieval refinements: After retrieval, Reranking and Filtering processes evaluate the
gathered information chunks. Instead of simply using the top ‘k’ matches, these techniques
rigorously assess the relevance of each piece of retrieved data. For our query, this could mean
prioritizing a segment discussing a groundbreaking solar panel efficiency breakthrough over a
more generic update on solar energy.
This step ensures that the information used in generating the response directly answers the
query with the most relevant and recent breakthroughs in renewable energy.
Know more in the original article: https://datasciencedojo.com/blog/rag-vs-finetuning-llm-debate/
Reranking in RAG
In a typical setup, a retriever first fetches a list of candidate documents for the query. Then, a re-ranker mechanism takes this candidate document list and re-ranks the elements. With reranking, we can improve our results by re-organizing them based on certain parameters.
⮕ We can't simply pass every retrieved document to the LLM, since that would overflow its limited context window (context stuffing).
⮕ The basic idea behind reranking is to filter down the total number of documents into a fixed number.
⮕ The re-ranker re-ranks the records, bringing the most relevant items to the top so they can be sent to the LLM.
⮕ Reranking offers a solution by finding records that may not be within the top 3 results and putting them into a smaller set of results that can then be fed into the LLM.
Reranking basically enhances the relevance and precision of retrieved results.
https://medium.aiplanet.com/advanced-rag-cohere-re-ranker-99acc941601c
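One common way to implement a re-ranker is a cross-encoder; the article linked above covers Cohere's Rerank, while the hedged sketch below uses an open-source cross-encoder from sentence-transformers. The model name and candidate texts are illustrative assumptions:

```python
# Rerank first-stage retrieval results with a cross-encoder.
from sentence_transformers import CrossEncoder

query = "latest breakthrough in renewable energy"
candidates = [
    "A general overview of solar energy adoption worldwide.",
    "Researchers announce a record-setting solar panel efficiency breakthrough.",
    "Wind turbine maintenance schedules for offshore farms.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])

# Sort candidates by relevance score and keep only the best ones for the LLM.
reranked = [doc for _, doc in sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)]
print(reranked[:2])
```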
Types of Embedding Models for RAG
How to Select an Embedding Model for Your RAG Application?
Embeddings form the foundation for achieving precise and contextually relevant LLM outputs
across different tasks.
Which encoder you select to generate embeddings is a critical decision, hugely impacting the
overall success of the RAG system. Low quality embeddings lead to poor retrieval.
When selecting an embedding model, consider the vector dimension, average retrieval
performance, and model size.
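As a quick way to compare candidates on one of those criteria, the sketch below prints the vector dimension of two example models from the sentence-transformers hub. The model names are examples, not recommendations:

```python
# Compare the output dimension of candidate embedding models.
from sentence_transformers import SentenceTransformer

for name in ["all-MiniLM-L6-v2", "BAAI/bge-base-en-v1.5"]:
    model = SentenceTransformer(name)
    vec = model.encode("What's the latest breakthrough in renewable energy?")
    print(name, "-> dimension:", len(vec))  # dimension affects storage cost and search speed
```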
Companies such as OpenAI, Cohere, and Voyage consistently release enhanced embedding
models.
Different types of embeddings are designed to address unique challenges and requirements in
different domains.
⮕ Multi-vector embedding models like ColBERT feature late interaction, where the interaction
between query and document representations occurs late in the process, after both have been
independently encoded.
⮕ Long documents have always posed a particular challenge for embedding models. The
limitation on maximum sequence lengths, often rooted in architectures like BERT, leads to
practitioners segmenting documents into smaller chunks. Unfortunately, this segmentation can
result in fragmented semantic meanings and misrepresentation of entire paragraphs.
⮕ Variable dimension embeddings are a unique concept built on Matryoshka Representation
Learning (MRL). MRL learns lower-dimensional embeddings that are nested into the original
embedding, akin to a series of Matryoshka Dolls.
⮕ Code embeddings are a recent development used to integrate AI-powered capabilities into
Integrated Development Environments (IDEs), fundamentally transforming how developers
interact with codebases.
There are several factors that need to be considered while selecting an embedding model.
No matter which embedding model you use, having a robust database is a must for your RAG
application.
Use SingleStore as your vector database to build your AI/ML apps. Sign up & use for free:
https://bit.ly/SingleStoreDB
Semantic Chunking in RAG Applications
Chunking in RAG applications involves breaking down large pieces of data into smaller, manageable segments or "chunks." This process enhances the efficiency and accuracy of information retrieval by enabling the model to handle more precise and relevant portions of data. In RAG systems, when a query is made, the model searches through these chunks to find the most relevant ones.
Chunking is especially useful in scenarios where documents are lengthy or contain diverse
topics, as it ensures that the retrieved data is contextually appropriate and precise.
Naive chunking strategies limit themselves to dividing the text into chunks of a fixed number of words or characters, which is not always effective.
Semantic Chunking is a method that focuses on extracting and preserving the semantic
meaning within text segments. By utilizing embeddings to capture the underlying semantics, this
approach assesses the relationships between different chunks to ensure that similar content is
kept together.
By focusing on the text’s meaning and context, Semantic Chunking significantly enhances
retrieval quality. It’s ideal for maintaining semantic integrity, ensuring coherent and relevant
information retrieval.
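Here is a hedged LlamaIndex sketch of semantic chunking using the SemanticSplitterNodeParser class mentioned earlier. It assumes the llama-index OpenAI embedding package is installed and an OpenAI key is set; the threshold value is just a starting point:

```python
# Semantic chunking: split where embedding similarity between adjacent sentences drops.
from llama_index.core import Document
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

text = (
    "Solar efficiency records keep improving, and tandem cells crossed 30 percent. "
    "Meanwhile, LLMs hallucinate when they lack context. RAG grounds them in retrieved facts."
)

splitter = SemanticSplitterNodeParser(
    buffer_size=1,                        # sentences grouped together when comparing embeddings
    breakpoint_percentile_threshold=95,   # higher threshold -> fewer, larger chunks
    embed_model=OpenAIEmbedding(),
)
nodes = splitter.get_nodes_from_documents([Document(text=text)])
print(len(nodes), "semantic chunks")
```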
Let's see how semantic chunking is better than your naive chunking strategies in my tutorial: https://levelup.gitconnected.com/semantic-chunking-for-enhanced-rag-applications-b6bc92942af0
Retrieval Pain Points in RAG
Effective retrieval is a pain, and you can encounter several issues during this important stage. Here are some common pain points and possible solutions in the retrieval stage.
⮕ Challenge: Retrieved data is not in context, and there can be several reasons for this.
➤ Missed Top Rank Documents: The system sometimes doesn’t include essential documents
that contain the answer in the top results returned by the system’s retrieval component.
➤ Incorrect Specificity: Responses may not provide precise information or adequately address
the specific context of the user’s query
➤ Losing Relevant Context During Reranking: This occurs when documents containing the
answer are retrieved from the database but fail to make it into the context for generating an
answer.
⮕ Proposed Solutions:
➤ Query Augmentation: Query augmentation enables RAG to retrieve information that is in
context by enhancing the user queries with additional contextual details or modifying them to
maximize relevancy. This involves improving the phrasing, adding company-specific context,
and generating sub-questions that help contextualize and generate accurate responses
- Rephrasing
- Hypothetical document embeddings
- Sub-queries
➤ Tweak retrieval strategies: Llama Index offers a range of retrieval strategies, from basic to
advanced, to ensure accurate retrieval in RAG pipelines. By exploring these strategies,
developers can improve the system’s ability to incorporate relevant information into the context
for generating accurate responses.
- Small-to-big sentence window retrieval,
- recursive retrieval
- semantic similarity scoring.
➤ Hyperparameter tuning for chunk size and similarity_top_k: This solution involves adjusting the parameters of the retrieval process in RAG models. More specifically, we can tune the parameters related to chunk size and similarity_top_k. The chunk_size parameter determines the size of the text chunks used for retrieval, while similarity_top_k controls how many of the most similar chunks are retrieved and passed on as context (see the sketch after this list).
➤ Reranking: Reranking retrieval results before they are sent to the language model has
proven to improve RAG systems’ performance significantly.
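As referenced in the hyperparameter-tuning item above, here is a hedged LlamaIndex sketch. The data directory and parameter values are placeholder assumptions, and an OpenAI key is assumed:

```python
# Tune chunk size and the number of retrieved chunks in a LlamaIndex pipeline.
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)  # experiment with chunk_size

documents = SimpleDirectoryReader("./data").load_data()  # placeholder folder of documents
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(similarity_top_k=5)  # experiment with how many chunks to retrieve
print(query_engine.query("What's the latest breakthrough in renewable energy?"))
```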
Know more about the other pain points & possible solutions explained in detail:
https://datasciencedojo.com/blog/rag-challenges-in-llm-applications/
RAG Enhancement Techniques
You need to know some techniques to overcome the different challenges that RAG throws at you.
1. Multi-Query: Unlike traditional methods where only one query is used, Multi-Query generates multiple queries and
retrieves similar documents for each one. Builders utilize Multi-Query primarily for two reasons:
enhancing suboptimal queries and expanding result sets. It addresses users’ imperfect queries
by filling in gaps and retrieves more diverse results, leading to an expanded results set that can
provide better answers than single-query documents.
3. Choosing the right chunk size: Selecting a chunk size that is too small may miss important details, while opting for a larger size could introduce irrelevant information.
4. Incorporation of metadata with indexed vectors:
Adding metadata alongside indexed vectors in the vector database offers significant benefits in
organizing and enhancing search relevance.
5. Improving search relevance with question-based indexing:
LLMs and RAGs offer incredible power by allowing users to express queries in natural language, simplifying data exploration and complex tasks. However, a common challenge arises when there's a disconnect between the concise queries users input and the longer, more detailed documents stored in the system.
6. Improving Search Precision with Mixed Retrieval (Hybrid Search):
While vector search excels at retrieving semantically relevant chunks for queries, it sometimes lacks precision in matching specific keywords. To get the best of both worlds (vector search + full-text search), you need hybrid search.
https://blog.stackademic.com/rag-understanding-the-concept-and-various-enhancement-techniques-608b643bf2e5
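One simple, database-agnostic way to fuse the keyword and vector result lists is reciprocal rank fusion; the sketch below is illustrative and the document IDs are placeholders:

```python
# Fuse keyword-search and vector-search rankings with reciprocal rank fusion (RRF).
def reciprocal_rank_fusion(ranked_lists, k=60):
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]   # from full-text (keyword) search
vector_hits = ["doc1", "doc5", "doc3"]    # from vector (semantic) search
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))  # fused ranking
```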
No matter what RAG technique you choose, you will always need a robust database to store your vector data; make sure to use SingleStore as your vector database.
Depending on your use case, the requirements change, whether it is about selecting a smart model, a chunking strategy, embedding methods and models, vector databases or an evaluation framework.
To make RAG work perfectly, here are some key points to consider:
2. Data Indexing Optimizations: Techniques such as using sliding windows for text chunking and enriching the indexed data with metadata improve retrieval.
3. Query Enhancement: Modifying or expanding the initial user query with synonyms or broader
terms to improve the retrieval of relevant documents.
4. Embedding Model: The choice of the embedding model used for retrieving relevant
documents.
5. Chunk Size & Retrieval Strategy: Experiment with different chunk sizes to find the optimal
length for context retrieval.
6. Integration with Language Model: The way the retrieved information is integrated with the
language model's generation process is crucial.
7. Evaluation & Fine-tuning: Evaluating the performance of the RAG model on relevant datasets
and tasks is important to identify areas for improvement.
8. Ethical Considerations: Ensure that the external knowledge source is unbiased and does not
contain offensive or misleading information.
9. Vector database: Having a vector database that supports fast ingestion, retrieval performance and hybrid search is of utmost importance.
10. Response Summarization: Condensing retrieved text to provide concise and relevant
summaries before final response generation.
11. Re-ranking and Filtering: Adjusting the order of retrieved documents based on relevance
and filtering out less pertinent results to refine the final output.
12. LLM models: Consider LLM models that are robust and fast enough to build your RAG
application.
13. Hybrid Search: Combining traditional keyword-based search with semantic search using
embedding vectors to handle a variety of query complexities.
No matter what RAG technique you choose, you will always need a robust vector database to store your vector data; make sure to use SingleStore as your vector database.
While RAG enhances an LLM's ability to answer from fresh data to a certain extent, it is a must to integrate a semantic cache layer in between: one that stores various user queries and decides whether to generate the prompt enriched with information from the vector database or to answer from the cache.
A semantic caching system aims to identify similar or identical user requests. When a matching
request is found, the system retrieves the corresponding information from the cache, reducing
the need to fetch it from the original source.
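A toy version of that decision logic looks like this. The similarity threshold and embedding model are assumptions, and a production setup would keep the cache in a database such as SingleStore rather than in memory:

```python
# Minimal semantic-cache check before running the full RAG pipeline.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
cache = {}  # cached query -> (embedding, previously generated answer)

def answer_with_cache(query, generate_fn, threshold=0.9):
    q_vec = model.encode(query, normalize_embeddings=True)
    for vec, answer in cache.values():
        if float(np.dot(q_vec, vec)) >= threshold:  # similar enough: reuse the cached answer
            return answer
    answer = generate_fn(query)                     # cache miss: run retrieval + generation
    cache[query] = (q_vec, answer)
    return answer

print(answer_with_cache("What is RAG?", lambda q: "RAG retrieves context before generating."))
print(answer_with_cache("Explain RAG briefly", lambda q: "this is only called on a cache miss"))
```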
There are many solutions that can help you with semantic caching, but I recommend using the SingleStore database.
Why use SingleStore Database as the semantic cache layer?
SingleStoreDB is a real-time, distributed database designed for blazing fast queries with an
architecture that supports a hybrid model for transactional and analytical workloads.
This pairs nicely with generative AI use cases as it allows for reading or writing data for both
training and real-time tasks — without adding complexity and data movement from multiple
products for the same task.
SingleStoreDB also has a built-in plancache to speed up subsequent queries with the same
plan.
Basic RAG is limited in handling complex tasks like summarization, comparison, and multi-part
questions. It is primarily useful for simple questions over small datasets but struggles with more
sophisticated queries.
There are two ways you can improve your RAG pipeline.
1. Improve your data
2. Improve your querying
You can use a framework such as LlamaIndex and its toolkit to improve both.
If you are new to LlamaIndex, it is a framework in Python and TypeScript for building
LLM-enabled applications over various data sources. They offer open-source tools and a paid
service, Llama Cloud, for building and scaling data retrieval systems.
You can improve your data using LlamaParse. LlamaParse is an API created by LlamaIndex to
efficiently parse and represent files for efficient retrieval and context augmentation using
LlamaIndex frameworks.
You can use a vector database like SingleStore database to store the vector embeddings.
[ Try SingleStore for Free: https://bit.ly/SingleStoreDB ]
Another way you can improve the quality of your data is through LlamaHub. LlamaHub is a simple library of all the data loaders/readers that have been created by the community. The goal is to make it extremely easy to connect large language models to a large variety of knowledge sources. It includes data loaders, tools, vector databases, LLMs and more.
Then comes the agentic RAG.
Agents can enhance RAG by incorporating multi-turn interactions, query understanding, tool
use, reflection, and memory, addressing the limitations of naive RAG pipelines.
Agentic RAG allows AI systems to engage in iterative reasoning — understanding the full
context, gathering missing information through back-and-forth dialog, calling external data
sources and APIs as needed, and stitching together multi-part solutions that address the core
problem in a nuanced and tailored way.
This iterative reasoning capability is crucial for enterprises to handle complex use cases across
domains. That’s why many enterprises are adopting agentic RAG over rigid regular RAG.
Components of Agentic RAG:
⮕ Routing: Uses LLM to select the best tool for a query.
⮕ Memory: Retains query history to provide context for future queries.
⮕ Query Planning: Breaks complex questions into simpler ones and aggregates the responses.
Know more about improving your RAG pipeline through this video:
https://www.youtube.com/watch?v=MXPYbjjyHXc
Metrics for RAG Performance
The key dimensions for RAG (Retrieval-Augmented Generation) performance focus on both
retrieval and generation aspects.
Retrieval metrics include context recall, precision, and relevance, ensuring retrieved information
matches the query accurately.
Generation metrics emphasize faithfulness, relevance, and fluency of the generated text.
Key metrics like accuracy, cosine similarity, NDCG, BLEU, and F1 score evaluate overall
correctness, relevance, and quality.
Operational metrics such as latency, user satisfaction, and redundancy address practical
performance concerns.
Together, these metrics provide a comprehensive framework for assessing the effectiveness
and reliability of RAG systems.
Also, no matter what you consider of utmost importance, having a robust data platform for fast data ingestion and retrieval is essential: a data platform that can help you with all types of data and not just vector data.
SingleStore is one such data platform that can be used as a vector database and also for any
real-time AI applications.
Know more about key dimensions & metrics for RAG performance in this article:
https://sunila-gollapudi.medium.com/rag-key-aspects-for-performance-metrics-and-measuremen
t-c41b1aa18499
RAG Approaches
RAG is no longer just about retrieval- it's about smart, self-improving intelligence!
We were all so excited when RAG was first introduced. We still are; this is never ending. I mean, RAG will still remain relevant for at least a year from now (just my opinion).
So, RAG was first introduced by Meta AI researchers in 2020 through their paper, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," to address those kinds of knowledge-intensive tasks.
We saw a surge of simple to advanced RAG chatbots, which are now being taken over by AI agents :)
Coming to the RAG evolution over time: it all started with a simple naive approach to retrieving contextually relevant responses/info and then moved on to what we call today corrective RAG.
While Standard RAG enhances response accuracy by retrieving and incorporating relevant
documents into the generative process, Self-reflective RAG improves upon this by having the
model assess its own outputs, tagging retrieved documents as relevant or irrelevant, and
adjusting its responses accordingly.
Corrective RAG takes this a step further by using an external model to classify retrieved
documents as correct, ambiguous, or incorrect, allowing the generative model to correct its
answers based on this classification.
Together, these approaches represent increasing levels of refinement and accuracy in
generating reliable responses.
Advanced RAG Techniques
Building a simple RAG pipeline is easy, but that alone doesn't yield much. Here are some advanced techniques to consider:
⮕ Query Enhancement: Modifying or expanding the initial user query with synonyms or broader
terms to improve the retrieval of relevant documents.
⮕ Hybrid Search: Combining traditional keyword-based search with semantic search using
embedding vectors to handle a variety of query complexities.
⮕ Fine Tuning Embedding Model: Adjusting a pre-trained model to better understand specific
domain nuances, enhancing the accuracy and relevance of retrieved documents.
⮕ Re-ranking and Filtering: Adjusting the order of retrieved documents based on relevance and
filtering out less pertinent results to refine the final output.
Adopting a robust database that can do hybrid search, has great integrations with AI frameworks, and can help you with fast ingestion and vector storage is very important.
This is where SingleStore database comes handy. Sign up & use it for free:
https://bit.ly/SingleStoreDB
The complete article on advanced RAG techniques by Necati Demir is here: https://blog.demir.io/advanced-rag-implementing-advanced-techniques-to-enhance-retrieval-augmented-generation-systems-0e07301e46f4
Multi-Query Retrieval is a type of query expansion. Query expansion works by extending the
original query with additional terms or phrases that are related or synonymous.
The aim of multi-query is to have an expanded results sets which might be able to answer
questions better than docs from a single query.
MultiQuery Retriever performs an automated tuning process by using LLM to generate several
different queries for a given user input query from different perspectives.
For each query, it retrieves a set of relevant documents and employs a unique concatenation
between all queries to obtain a larger set of potentially relevant documents.
By generating queries for multiple perspectives on the same question, MultiQuery Retriever may
be able to overcome some of the limitations of similarity search and obtain a richer result set.
The MultiQuery Retriever empowers users to perform complex queries across multiple data
sources simultaneously. It leverages a combination of semantic understanding and probabilistic
models to deliver highly relevant results.
You can use multi-query retrievers from LangChain & LlamaIndex.
https://teetracker.medium.com/langchain-llama-index-rag-with-multi-query-retrieval-4e7df1a62f83
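Here is a hedged LangChain sketch of multi-query retrieval; it assumes faiss-cpu and an OpenAI key are available, and the sample texts are illustrative:

```python
# Multi-query retrieval: an LLM rewrites the user query from several perspectives.
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

texts = [
    "Tandem solar cells pushed efficiency past 30 percent.",
    "Offshore wind capacity grew sharply last year.",
    "Battery storage costs continue to fall.",
]
vectorstore = FAISS.from_texts(texts, OpenAIEmbeddings())

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=ChatOpenAI(temperature=0),  # this LLM generates the alternative queries
)
docs = retriever.invoke("What's the latest breakthrough in renewable energy?")
print(len(docs), "unique documents across all generated queries")
```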
Custom RAG Chatbot
Let's build a custom RAG chatbot using LangChain!
RAG makes it possible to chat with our custom data & this is what we all need.
The image below shows a simple workflow of the same. It's a chatbot that uses LangChain
framework to chain everything, that includes a vector database, splitting mechanism, prompt
template, etc.
In this tutorial, I have used a publicly available txt file, chunked the content of the file, converted it into embeddings and stored the embeddings in a vector database like SingleStore. I am using gpt-3.5-turbo-instruct as the LLM to construct the prompt and answer back after receiving the contextually relevant chunks from the vector search.
Now, for any query, the chatbot responds with a proper answer using vector search without hallucinating, since it has the whole knowledge base connected to it (the vector database).
Also, you can understand more about vector databases in my YouTube video:
https://youtu.be/YPppSOk7yI4
Robust and Safe RAG Overview
How to build a robust & safe RAG pipeline?
An attacker can inject malicious passages into retrieval results to induce inaccurate responses.
Yes, despite its popularity, the RAG pipeline can become fragile when some of the retrieved
passages are compromised by malicious actors, a type of attack we term retrieval corruption.
These attacks raise the research question of how to build a robust RAG pipeline.
This paper proposes a defense framework named 'RobustRAG' that aims to perform robust generation even when some of the retrieved passages are corrupted. It follows an isolate-then-aggregate strategy:
(1) it computes LLM responses from each passage in isolation and then
(2) securely aggregates isolated responses to generate the final output.
The isolation operation ensures that the malicious passages cannot affect LLM responses for
other benign passages and thus lays the foundation for robustness.
RobustRAG overview:
In the below image example, one of the three retrieved passages is corrupted. Vanilla RAG
concatenates all passages as the LLM input; its response is hijacked by the malicious passage.
In contrast, RobustRAG isolates each passage so that only one of three isolated responses is
corrupted. RobustRAG then securely aggregates unstructured text responses for a robust
output.
Know more about RobustRAG in the original paper: https://lnkd.in/gtXGfTqJ
Implementing RAG Using LangChain and SingleStore
But to build LLM-powered applications, LLMs are not enough.
You need to have supporting tools, frameworks, integrations and an approach to make sure the applications work as expected.
This article is written with one goal: making sure even a non-technical person can understand and implement RAG step by step: https://levelup.gitconnected.com/implementing-rag-using-langchain-and-singlestore-a-step-by-step-guide-2a579da1de0c
Modular RAG Framework
Modular RAG seamlessly integrates the development paradigms of Naive RAG and Advanced RAG.
Modular RAG presents a highly scalable paradigm, dividing the RAG system into a three-layer
structure of Module Type, Modules, and Operators.
Each Module Type represents a core process in the RAG system, containing multiple functional modules.
The entire RAG system becomes a permutation and combination of multiple modules and
corresponding operators, forming what we refer to as RAG Flow.
Within the Flow, different functional modules can be selected in each module type, and within
each functional module, one or more operators can be chosen.
The Modular RAG organizes the RAG system in a multi-tiered modular form.
Modular RAG is highly scalable, facilitating researchers to propose new Module Types,
Modules, and operators based on a comprehensive understanding of the current RAG
development.
The design and construction of RAG systems become more convenient, allowing users to
customize RAG Flow based on their existing data, usage scenarios, downstream tasks, and
other requirements.
Know more about modular RAG:
https://medium.com/@yufan1602/modular-rag-and-rag-flow-part-%E2%85%B0-e69b32dc13a3
Adaptive RAG
'Adaptive RAG' is another type of agentic RAG that can adapt its strategies to queries of varying complexity.
Know more in the original article: https://medium.com/@infiniflowai/agentic-rag-definition-and-low-code-implementation-d0744815029c
Advanced RAG Using LlamaIndex and Claude 3
Advanced RAG aims to address the limitations of Naive RAG.
Advanced RAG uses more sophisticated LLMs like Claude 3 and AI frameworks and
functionalities from LlamaIndex & LangChain.
The chunking strategies will be applied based on the type of data source and document size.
With LLMs like Claude 3, we see a new breed of advanced RAG known as 'Multimodal RAG'. Multimodal RAG extends retrieval beyond plain text, integrating data modalities such as images, audio, video, and even tactile or olfactory information.
And this has been possible with the rise of multimodal LLMs.
OpenAI’s GPT-4V(ision), Google’s Gemini and Anthropic’s Claude-3 series are some notable
examples of multimodal models that are revolutionizing the AI industry.
Here is the notebook code you can try: https://github.com/singlestore-labs/webinar-code-examples/blob/main/Claude%203%20Multimodal.ipynb
Advanced RAG Using RAPTOR
When working with long-context documents, we cannot just chunk the documents and embed them. Instead, we want a good approach to minimalist document splitting for long-context LLMs. This is where RAPTOR comes into the picture.
Recursive Abstractive Processing for Tree Organized Retrieval (RAPTOR) is a new and powerful indexing and retrieval technique for LLMs. It adopts a bottom-up approach, clustering and summarizing text segments (chunks) to form a hierarchical tree structure.
We can apply this at varying scales; leaves can be:
→ Text chunks from a single doc
→ Full docs
With longer context LLMs, it’s possible to perform this over full documents.
This tree structure is key to how RAPTOR functions, as it captures both high-level and detailed aspects of the text, which is particularly useful for complex thematic queries and multi-step reasoning in question answering tasks.
This process involves segmenting documents into shorter texts called chunks and then embedding the chunks using an embedding model. These embeddings are then clustered by a clustering algorithm. Once clusters are created, each cluster is summarized by an LLM. The summaries generated form nodes in a tree, with higher-level nodes providing more abstract summaries.
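The sketch below illustrates one level of that bottom-up process with scikit-learn's KMeans. RAPTOR itself uses soft clustering and repeats the process recursively, and the chunks here are placeholders:

```python
# One level of the RAPTOR idea: embed chunks, cluster them, then summarize each cluster.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

chunks = [
    "Solar cell efficiency improved again this year.",
    "Tandem solar cells set a new efficiency record.",
    "LLMs hallucinate without grounding.",
    "RAG retrieves context to reduce hallucinations.",
]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(chunks)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)

clusters = {}
for label, chunk in zip(labels, chunks):
    clusters.setdefault(label, []).append(chunk)

# Each cluster would then be summarized by an LLM; the summaries become parent nodes
# in the tree, and the process repeats on the summaries.
for label, members in clusters.items():
    print(f"cluster {label}: {len(members)} chunks -> summarize with an LLM")
```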
Agentic RAG Using LlamaIndex
Agentic RAG is the best solution for your AI applications.
Agentic RAG is more suitable for complex, dynamic research tasks, offering greater flexibility
and precision.
Agentic RAG enhances regular RAG by incorporating reasoning and decision-making capabilities over user data, allowing for more complex queries and autonomous research agents.
Agentic RAG extends regular RAG by incorporating advanced reasoning, multi-step processing,
and tool usage capabilities. Regular RAG retrieves context and generates responses in a single
step, suitable for simple queries.
Key components include building a router query engine, defining query tools, and implementing
multi-document agents.
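Here is a hedged LlamaIndex sketch of the routing piece mentioned above; the directory path and tool descriptions are placeholder assumptions, and an OpenAI key is assumed:

```python
# Agentic routing: an LLM-driven selector picks the right query tool for each question.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

docs = SimpleDirectoryReader("./reports").load_data()  # placeholder folder of documents
index = VectorStoreIndex.from_documents(docs)

qa_tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(),
    description="Useful for answering specific questions over the reports.",
)
summary_tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(response_mode="tree_summarize"),
    description="Useful for summarizing entire documents.",
)

router = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[qa_tool, summary_tool],
)
print(router.query("Summarize the main findings across all reports."))
```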
The framework aims to improve the interaction with large language models (LLMs) by adding
detailed control, oversight, and debugging capabilities, ultimately creating a more sophisticated
research assistant.
Know more in this original article:
https://medium.com/@sulaiman.shamasna/rag-iv-agentic-rag-with-llamaindex-b3d80e09eae3
Learn how to build agentic RAG & AI chatbots from my YouTube channel: https://www.youtube.com/@pavanbelagatti
Building a Multimodal RAG Workflow
Multimodal RAG enhances the model's ability to produce accurate and contextually rich outputs by leveraging diverse data types, leading to more comprehensive and informed AI-generated content.
Let's learn more about multimodal models & build a simple multimodal RAG setup:
https://youtu.be/XNd3MiHTma4
We will be using Anthropic's Claude 3 Haiku model as our multimodal model and SingleStore as our vector database.
Sign up to SingleStore for free to get started: https://bit.ly/SingleStoreDB
Agentic RAG Using CrewAI & LangChain
In the rapidly evolving field of artificial intelligence, Agentic RAG has emerged as a
game-changing approach to information retrieval and generation. This advanced technique
combines the power of Retrieval Augmented Generation (RAG) with autonomous agents,
offering a more dynamic and context-aware method to process and generate information.
As businesses and researchers seek to enhance their AI capabilities, understanding and
implementing Agentic RAG has become crucial to staying ahead in the competitive landscape.
This guide delves into the intricacies of mastering Agentic RAG using two powerful tools:
LangChain and CrewAI. It explores the evolution from traditional RAG to its agentic counterpart,
highlighting the key differences and benefits. The article also examines how LangChain serves
as the foundation for implementing Agentic RAG and demonstrates the ways CrewAI can be
leveraged to create more sophisticated and efficient AI systems.
Live RAG Comparison with Different Vector Databases
But first, let's see how most people are implementing RAG.
See the first part of the image below: on one hand you have OLTP systems, you have your OLAP systems, and now, because you are vectorizing your data, you have your vector systems. These three in combination provide the full context to your LLM.
Let's look at how they do that. On the left-hand side you have the end user asking a query; that query will be vectorized and the query vector will be sent to the vector database, and through vector search you will receive your top-k results.
Those results, along with the associated metadata, will be retrieved from your OLAP and OLTP systems. Then, based on the user query, the application will add more filters, and that will then be sent to the LLM as a prompt; the LLM then answers the user's question/query.
⮕ Vector-capable NoSQL - MongoDB, Redis, Cassandra, etc
⮕ Vector-capable SQL - SingleStore, Rockset, PostgreSQL, ClickHouse, etc
But then, let's also understand: how does your database affect your GenAI app? What do you need?
- You need reliable storage
- efficient analytics
- data consistency
- vector capabilities
- scalability
- concurrency
SingleStore is built keeping all these things in mind. Let's see how.
With SingleStore, you have all of your transactional, analytical and vector data co-located in one single source. So now, when an end user asks a query, the GenAI app will vectorize that query, and within a single query you can do your vector search, full-text search or any other type of analytical filter you may want, with millisecond response times.
You can send all of that to the LLM as a context without any need for stitching responses
together. BTW, SingleStore started supporting vectors long back in 2017 itself. The hybrid
search feature adds an added advantage for your GenAI applications.
Would you like a hands-on, step-by-step guide to understand how SingleStore performs compared to other databases for RAG?
Here is the video where one of the SingleStore engineers compares RAG across some of the most popular databases: https://youtu.be/xONafE5rQHk
How Robust is Your RAG Setup? Let's Evaluate
Let's evaluate using LlamaIndex : https://youtu.be/MP6hHpy213o
In this video, we will delve into the concept of RAG evaluation. We will evaluate the robustness
of our Retrieval-Augmented Generation (RAG) workflow, focusing on the accuracy of generated
responses.
We will start by understanding the importance of evaluation in RAG and see a simple RAG workflow with the different stages involved. We will then understand what happens at each stage and how the evaluation step fits in.
Vectorize helps you build AI apps faster and with less hassle. It automates data extraction, finds
the best vectorization strategy using RAG evaluation, and lets you quickly deploy real-time RAG
pipelines for your unstructured data. Your vector search indexes stay up-to-date, and it
integrates with your existing vector database, so you maintain full control of your data. Vectorize
handles the heavy lifting, freeing you to focus on building robust AI solutions without getting
bogged down by data management.
Meta recently released their new set of advanced models - Llama 3.1
It has three sizes: 8B, 70B, and 405B parameters. Meta AI's testing shows that Llama 3 70B
beats Gemini and Claude in most benchmarks.
Well, this is Meta’s largest ever open source AI model, and the company claims that it has
outperformed the likes of OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet on some
benchmarks.
I am using Llama 3.1 405B Instruct model from Fireworks AI.
You can access different models from here: https://fireworks.ai/models
More details in the video. Please refer to my video: https://youtu.be/aJ6KNsamdZw
Verifying the Correctness of RAG Responses
RAG evaluation is important because it helps ensure the effectiveness of our RAG systems.
Basically, it ensures the RAG pipeline generates coherent responses, and meets end-user
needs.
Once you have a knowledge graph, you can use it to perform retrieval augmented generation (RAG). You can do RAG without even having vectors or vector embeddings. This approach of using knowledge graphs is good for handling questions about things like the relationships and connections between entities.
In the video, I have shown a tutorial on how to build a simple knowledge graph, store it in your database and retrieve the entity relationships for any given user query. The same thing can be extended to your RAG application to retrieve enhanced results/responses.
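As a tiny, database-agnostic illustration of the idea, the sketch below builds a toy graph with networkx and pulls out the relationships around an entity. The entities and relations are made up for demonstration:

```python
# Toy knowledge-graph retrieval: fetch the triples touching an entity to use as LLM context.
import networkx as nx

g = nx.DiGraph()
g.add_edge("SingleStore", "vector search", relation="supports")
g.add_edge("RAG", "vector search", relation="uses")
g.add_edge("RAG", "LLM", relation="augments")

def triples_for(entity):
    out = [(entity, d["relation"], v) for _, v, d in g.out_edges(entity, data=True)]
    inc = [(u, d["relation"], entity) for u, _, d in g.in_edges(entity, data=True)]
    return out + inc

print(triples_for("vector search"))  # these triples become the retrieved context
```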
The only prerequisite to do this tutorial is SingleStore. Sign up & get a free account:
https://bit.ly/SingleStoreDB
Vector RAG vs. Graph RAG
RAG can be implemented using either a database that supports vectors and semantic search or
a knowledge graph, each offering distinct advantages and methodologies for information
retrieval and response generation. The goal remains the same with both approaches, to retrieve
the contextually relevant data/information for the user query.
RAG with a vector database involves converting input queries into vector
representations/embeddings and performing vector search to retrieve relevant data based on
their semantic similarity. The retrieved documents go through an LLM to generate the
responses. This approach is efficient for handling large-scale unstructured data and excels in
contexts where the relationships between data points are not explicitly defined.
In contrast, RAG with a knowledge graph uses the structured relationships and entities within
the graph to retrieve relevant information. The input query is used to perform a search within the
knowledge graph, extracting relevant entities and their relationships.
This structured data is then utilized to generate a response. Knowledge graphs are particularly
useful for applications requiring a deep understanding of the interconnections between data
points, making them ideal for domains where the relationships between entities are crucial.
You don't need a specialized database to do either graph RAG or vector RAG.
Well, both approaches can be possible with SingleStore, you can use it as a vector database
and also for constructing and storing knowledge graphs for graph RAG.
Try SingleStore for free: https://bit.ly/SingleStoreDB
Watch my recent video on enhancing RAG applications using knowledge graphs:
https://youtu.be/rCQpQeJO59A
RAG Evaluation Strategies
The field of RAG evaluation continues to evolve & it is very important for AI/ML/Data engineers
to know these concepts thoroughly.
RAG evaluation includes the evaluation of retrieval & the generation component with the
specific input text.
At a high level, RAG evaluation algorithms can be bifurcated into two categories. 1) Where the
ground truth (the ideal answer) is provided by the evaluator/user 2) Where the ground truth (the
ideal answer) is also generated by another LLM.
For the ease of understanding, the author has further classified these categories into 5
sub-categories.
1. Character based evaluation
2. Word based evaluation
3. Embedding based evaluation
4. Mathematical Framework
5. Experimental based framework
Let’s take a look at each of these evaluation categories:
1. Where the ground truth is provided by the evaluator.
→ Character based evaluation algorithm:
As the name indicates, this algorithm computes a score from the character-by-character difference between the reference (ground truth) and the RAG output.
→ Embedding based evaluation algorithm:
Step 1: Convert the generated text and the reference text into embeddings using an embedding model.
Step 2: Use a distance measure (like cosine similarity) to evaluate the distance between the embeddings of the generated text and the reference text.
2. Where the ground truth is also generated by an LLM (LLM-assisted evaluation)
→ Mathematical Framework — RAGAS Score
RAGAS is one of the most common and comprehensive frameworks to assess RAG accuracy and relevance. RAGAS bifurcates the evaluation into the Retrieval and Generation perspectives.
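A hedged RAGAS sketch is shown below; column names can differ slightly between ragas versions, and the sample strings are made-up illustrations rather than real results:

```python
# Score a single RAG example with RAGAS retrieval and generation metrics.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

sample = {
    "question": ["What's the latest breakthrough in renewable energy?"],
    "contexts": [["Researchers reported tandem solar cells exceeding 30 percent efficiency."]],
    "answer": ["Tandem solar cells recently passed 30 percent efficiency."],
    "ground_truth": ["Tandem solar cells surpassed the 30 percent efficiency mark."],
}

result = evaluate(
    Dataset.from_dict(sample),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores between 0 and 1
```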
RAG is not a silver bullet! It's the cheapest way to improve LLMs, but there are alternatives to retrieval augmented generation (RAG).
If not RAG, then what can we use? We can use fine-tuning and prompt engineering.
Fine-tuning involves training the large language model (LLM) on a specific dataset relevant to
your task. This helps the LLM understand the domain and improve its accuracy for tasks within
that domain.
Prompt engineering is where you focus on crafting informative prompts and instructions for the
LLM. By carefully guiding the LLM with the right questions and context, you can steer it towards
generating more relevant and accurate responses without needing an external information
retrieval step.
Ultimately, the best alternative depends on your specific needs.
Take a look at my article on RAG: https://bit.ly/RAGTutorial
If you like to use a robust database for not just AI/ML applications but also for real-time
analytics, try SingleStore database.
—-----------------------------------------------------------------------------------------------------------------------
Guys, it's that time of year again, the most awaited AI conference in San Francisco, happening
on the 3rd of October 2024.
If you are really interested in attending this conference where you will get to meet some great AI
minds in the industry, let me know. I have some huge discount coupons [100% free] I can
share with you.
My email address is [email protected]
Thank You!!!