How to create a custom Retriever
Overview
Many LLM applications involve retrieving information from external data sources using a Retriever.
A retriever is responsible for retrieving a list of relevant Documents to a given user query
.
The retrieved documents are often formatted into prompts that are fed into an LLM, allowing the LLM to use the information in the to generate an appropriate response (e.g., answering a user question based on a knowledge base).
Interface
To create your own retriever, you need to extend the BaseRetriever
class and implement the following methods:
Method | Description | Required/Optional |
---|---|---|
_get_relevant_documents | Get documents relevant to a query. | Required |
_aget_relevant_documents | Implement to provide async native support. | Optional |
The logic inside of _get_relevant_documents
can involve arbitrary calls to a database or to the web using requests.
By inherting from BaseRetriever
, your retriever automatically becomes a LangChain Runnable and will gain the standard Runnable
functionality out of the box!
You can use a RunnableLambda
or RunnableGenerator
to implement a retriever.
The main benefit of implementing a retriever as a BaseRetriever
vs. a RunnableLambda
(a custom runnable function) is that a BaseRetriever
is a well
known LangChain entity so some tooling for monitoring may implement specialized behavior for retrievers. Another difference
is that a BaseRetriever
will behave slightly differently from RunnableLambda
in some APIs; e.g., the start
event
in astream_events
API will be on_retriever_start
instead of on_chain_start
.
Example
Let's implement a toy retriever that returns all documents whose text contains the text in the user query.
from typing import List
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
class ToyRetriever(BaseRetriever):
"""A toy retriever that contains the top k documents that contain the user query.
This retriever only implements the sync method _get_relevant_documents.
If the retriever were to involve file access or network access, it could benefit
from a native async implementation of `_aget_relevant_documents`.
As usual, with Runnables, there's a default async implementation that's provided
that delegates to the sync implementation running on another thread.
"""
documents: List[Document]
"""List of documents to retrieve from."""
k: int
"""Number of top results to return"""
def _get_relevant_documents(
self, query: str, *, run_manager: CallbackManagerForRetrieverRun
) -> List[Document]:
"""Sync implementations for retriever."""
matching_documents = []
for document in documents:
if len(matching_documents) > self.k:
return matching_documents
if query.lower() in document.page_content.lower():
matching_documents.append(document)
return matching_documents
# Optional: Provide a more efficient native implementation by overriding
# _aget_relevant_documents
# async def _aget_relevant_documents(
# self, query: str, *, run_manager: AsyncCallbackManagerForRetrieverRun
# ) -> List[Document]:
# """Asynchronously get documents relevant to a query.
# Args:
# query: String to find relevant documents for
# run_manager: The callbacks handler to use
# Returns:
# List of relevant documents
# """
Test it 🧪
documents = [
Document(
page_content="Dogs are great companions, known for their loyalty and friendliness.",
metadata={"type": "dog", "trait": "loyalty"},
),
Document(
page_content="Cats are independent pets that often enjoy their own space.",
metadata={"type": "cat", "trait": "independence"},
),
Document(
page_content="Goldfish are popular pets for beginners, requiring relatively simple care.",
metadata={"type": "fish", "trait": "low maintenance"},
),
Document(
page_content="Parrots are intelligent birds capable of mimicking human speech.",
metadata={"type": "bird", "trait": "intelligence"},
),
Document(
page_content="Rabbits are social animals that need plenty of space to hop around.",
metadata={"type": "rabbit", "trait": "social"},
),
]
retriever = ToyRetriever(documents=documents, k=3)
retriever.invoke("that")
[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),
Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]
It's a runnable so it'll benefit from the standard Runnable Interface! 🤩
await retriever.ainvoke("that")
[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),
Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]
retriever.batch(["dog", "cat"])
[[Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'type': 'dog', 'trait': 'loyalty'})],
[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'})]]
async for event in retriever.astream_events("bar", version="v1"):
print(event)
{'event': 'on_retriever_start', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'name': 'ToyRetriever', 'tags': [], 'metadata': {}, 'data': {'input': 'bar'}}
{'event': 'on_retriever_stream', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'name': 'ToyRetriever', 'data': {'chunk': []}}
{'event': 'on_retriever_end', 'name': 'ToyRetriever', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'data': {'output': []}}
Contributing
We appreciate contributions of interesting retrievers!
Here's a checklist to help make sure your contribution gets added to LangChain:
Documentation:
- The retriever contains doc-strings for all initialization arguments, as these will be surfaced in the API Reference.
- The class doc-string for the model contains a link to any relevant APIs used for the retriever (e.g., if the retriever is retrieving from wikipedia, it'll be good to link to the wikipedia API!)
Tests:
- Add unit or integration tests to verify that
invoke
andainvoke
work.
Optimizations:
If the retriever is connecting to external data sources (e.g., an API or a file), it'll almost certainly benefit from an async native optimization!
- Provide a native async implementation of
_aget_relevant_documents
(used byainvoke
)