Description
Checked other resources
- I added a very descriptive title to this issue.
- I searched the LangChain documentation with the integrated search.
- I used the GitHub search to find a similar question and didn't find it.
- I am sure that this is a bug in LangChain rather than my code.
- The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
Example Code
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
import re

# full_book is assumed to hold the full text of the book as a single string
text_splitter = CharacterTextSplitter(
    # Updated separator to match both uppercase and title case chapter headings
    separator="\bCHAPTER\b",  # doesn't work
    # works: separator=r"\bCHAPTER\b",
    chunk_size=500,
    chunk_overlap=0,
    is_separator_regex=True,
)
char_chunks = text_splitter.split_text(full_book)
print([c[0:10] for c in char_chunks])
len(char_chunks), len(char_chunks[0])
# Output with the r-string separator: ['Acknowledg', '1\nIntroduc', '2\nOrganizi']
Error Message and Stack Trace (if applicable)
Non-Working case with: separator="\bCHAPTER\b",
['Acknowledg']
(1, 2996)
Working case with r-string: separator=r"\bCHAPTER\b",
['Acknowledg', '1\nIntroduc', '2\nOrganizi']
(3, 696)
Description
We just spent two hours trying to figure out how to use the recursive/character text splitter with regex separators.
It turned out that neither the docs nor the code had the right information: r-strings are not mentioned anywhere in the docs, and the example doesn't use one either. The docs also say the separators are "interpreted as regex", which is not true.
https://python.langchain.com/docs/how_to/recursive_text_splitter/
is_separator_regex: Whether the separator list (defaulting to ["\n\n", "\n", " ", ""]) should be interpreted as regex.
We thought strings were turned into regexes automatically, but that doesn't seem to be the case: the splitter only escapes the separator strings when is_separator_regex is False.
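The escaping behaviour described here can be sketched with the standard library alone. This is only an illustration of the idea, not LangChain's actual code: `split_on` is a hypothetical helper, and the only assumption carried over from the splitter is that the separator is passed through `re.escape` unless `is_separator_regex` is True.

```python
import re


def split_on(text, separator, is_separator_regex=False):
    """Hypothetical helper: split text on a separator, escaping it unless it is a regex."""
    # When the flag is False the separator is treated literally,
    # so regex metacharacters such as "." are escaped first.
    pattern = separator if is_separator_regex else re.escape(separator)
    # Drop empty pieces so the result only contains real chunks.
    return [piece for piece in re.split(pattern, text) if piece]


literal_parts = split_on("a.b.c", ".")                      # "." matched literally
regex_parts = split_on("a1b22c", r"\d+", is_separator_regex=True)  # digits as separator
```

Either way, the string that reaches `re` is whatever Python produced from the literal in the source code, so the flag cannot recover a backslash sequence that the string literal has already consumed.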
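The solution 🤯 was to use r-strings such as r"^CHAPTER \d+$"; otherwise you get only a single chunk, because the regex is never found as a separator. In a plain string literal, Python converts "\b" into a backspace character before the splitter ever sees it, so the word-boundary pattern never reaches the regex engine.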
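The difference between the plain string and the r-string can be reproduced without LangChain. The snippet below is a minimal sketch using only `re`; the `sample` text is made up for illustration and is not the book from the report.

```python
import re

# Made-up sample text standing in for the book.
sample = "Acknowledgments\nCHAPTER 1\nIntroduction\nCHAPTER 2\nOrganizing"

plain = "\bCHAPTER\b"   # "\b" is a backspace character (\x08) in a plain literal
raw = r"\bCHAPTER\b"    # backslash and "b" are kept, so the regex sees a word boundary

# The plain pattern looks for literal backspace characters and finds none.
plain_parts = re.split(plain, sample)
# The raw pattern splits at every standalone CHAPTER.
raw_parts = re.split(raw, sample)

print(len(plain), len(raw))              # the two separators differ in length
print(len(plain_parts), len(raw_parts))  # one chunk versus three
```

Writing the separator as "\\bCHAPTER\\b" with doubled backslashes would give the same pattern as the r-string.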
Not sure how any of the language-specific separators that contain regexes actually work, e.g. Markdown:
https://github.com/langchain-ai/langchain/blob/master/libs/text-splitters/langchain_text_splitters/character.py#L440-L443
System Info
System Information
------------------
> OS: Darwin
> OS Version: Darwin Kernel Version 23.6.0: Wed Jul 31 20:49:39 PDT 2024; root:xnu-10063.141.1.700.5~1/RELEASE_ARM64_T6000
> Python Version: 3.11.10 (main, Sep 7 2024, 01:03:31) [Clang 15.0.0 (clang-1500.3.9.4)]
Package Information
-------------------
> langchain_core: 0.2.35
> langchain: 0.2.14
> langchain_community: 0.2.12
> langsmith: 0.1.104
> langchain-genai-website: Installed. No version info available.
> langchain_anthropic: 0.1.13
> langchain_aws: 0.1.6
> langchain_cli: 0.0.22
> langchain_experimental: 0.0.64
> langchain_fireworks: 0.1.3
> langchain_google_genai: 1.0.4
> langchain_google_vertexai: 1.0.4
> langchain_groq: 0.1.5
> langchain_openai: 0.1.22
> langchain_text_splitters: 0.2.2
> langserve: 0.1.1
...
> tomlkit: 0.12.0
> typer[all]: Installed. No version info available.
> typing-extensions: 4.12.2
> uvicorn: 0.30.6