Regexp Separator not working OOTB with (Recursive)CharacterSplitter #28407

@jexp

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
import re

text_splitter = CharacterTextSplitter(
    separator="\bCHAPTER\b",  # doesn't work: only a single chunk comes back
    # separator=r"\bCHAPTER\b",  # works (raw string)
    chunk_size=500,
    chunk_overlap=0,
    is_separator_regex=True,
)


char_chunks = text_splitter.split_text(full_book)
print([c[0:10] for c in char_chunks])
len(char_chunks), len(char_chunks[0])

['Acknowledg', '1\nIntroduc', '2\nOrganizi']

Error Message and Stack Trace (if applicable)

Non-Working case with: separator="\bCHAPTER\b",

['Acknowledg']
(1, 2996)

Working case with r-string: separator=r"\bCHAPTER\b",

['Acknowledg', '1\nIntroduc', '2\nOrganizi']
(3, 696)
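
The difference between the two cases can be reproduced with the standard library alone. This is a minimal sketch, assuming a made-up `sample` string in place of `full_book` and calling `re.split` directly rather than going through the splitter: in a normal Python string literal `"\b"` is the backspace character (`\x08`), so the regex engine never sees a word-boundary token.

```python
import re

# Hypothetical stand-in for full_book (not the text from the report).
sample = "Acknowledgments\nCHAPTER 1\nIntroduction\nCHAPTER 2\nOrganizing"

plain = "\bCHAPTER\b"  # "\b" becomes a single backspace character (\x08)
raw = r"\bCHAPTER\b"   # backslashes are kept, so re sees the token \b

# The two literals are different strings before any regex code runs.
print(repr(plain))  # '\x08CHAPTER\x08'
print(repr(raw))    # '\\bCHAPTER\\b'

plain_parts = re.split(plain, sample)  # looks for literal backspaces: no match
raw_parts = re.split(raw, sample)      # splits at each whole word CHAPTER

print(len(plain_parts), len(raw_parts))  # 1 3
```

Writing the separator as `"\\bCHAPTER\\b"` gives the same string as the r-string version.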

Description

We just spent two hours trying to figure out how to use the recursive/character text splitters with regexp separators.

It turned out that neither the docs nor the code had the right information: there is no mention of r-strings anywhere in the docs, and the example doesn't use any either. The docs also say "interpreted as regex", which is not true.

https://python.langchain.com/docs/how_to/recursive_text_splitter/

is_separator_regex: Whether the separator list (defaulting to ["\n\n", "\n", " ", ""]) should be interpreted as regex.

We thought strings were turned into regexps automatically, but that doesn't seem to be the case: the code only escapes the separator (treating it as a literal string) if is_separator_regex is False.
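
The escaping behavior described above can be sketched with `re.escape`. This is only an illustration of the idea, not the library's code: `effective_pattern` is a hypothetical helper name, and the inputs are made up.

```python
import re


def effective_pattern(separator: str, is_separator_regex: bool) -> str:
    """Hypothetical helper: escape the separator unless it is flagged as a regex."""
    return separator if is_separator_regex else re.escape(separator)


text = "a1.2b1x2c"

# Flag off: "." is escaped, so only the literal text "1.2" is a separator.
literal_parts = re.split(effective_pattern("1.2", False), text)

# Flag on: the string goes to re unchanged, so "." matches any character.
regex_parts = re.split(effective_pattern("1.2", True), text)

print(literal_parts)  # ['a', 'b1x2c']
print(regex_parts)    # ['a', 'b', 'c']
```

In this sketch the flag only decides whether escaping happens. The string has already been processed by Python's string-literal rules before it arrives, which is why an r-string (or doubled backslashes) matters.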

See https://github.com/langchain-ai/langchain/blob/master/libs/text-splitters/langchain_text_splitters/character.py#L24-L93

So the solution was 🤯 to use r-strings, e.g. r"^CHAPTER \d+$"; otherwise you get only a single chunk because the regexp is not found as a separator.
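
One caveat when testing an anchored pattern like the one above: in plain `re`, `^` and `$` match only at the start and end of the whole string unless multiline mode is on. The sketch below uses `re.split` directly on a made-up `sample`; whether the splitter passes any flags to `re` is not checked here, so treat the inline `(?m)` as an assumption to verify against your version.

```python
import re

# Hypothetical stand-in for full_book (not the text from the report).
sample = "Acknowledgments\nCHAPTER 1\nIntroduction\nCHAPTER 2\nOrganizing"

# Without multiline mode, ^ can only match at position 0 of the string.
no_flag_parts = re.split(r"^CHAPTER \d+$", sample)

# The inline flag (?m) lets ^ and $ match at every line start and end.
multiline_parts = re.split(r"(?m)^CHAPTER \d+$", sample)

print(len(no_flag_parts))    # 1
print(len(multiline_parts))  # 3
```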

Not sure how any of the language-specific splitters that use regexps actually work?

e.g. Markdown
https://github.com/langchain-ai/langchain/blob/master/libs/text-splitters/langchain_text_splitters/character.py#L440-L443

System Info

System Information
------------------
> OS:  Darwin
> OS Version:  Darwin Kernel Version 23.6.0: Wed Jul 31 20:49:39 PDT 2024; root:xnu-10063.141.1.700.5~1/RELEASE_ARM64_T6000
> Python Version:  3.11.10 (main, Sep  7 2024, 01:03:31) [Clang 15.0.0 (clang-1500.3.9.4)]

Package Information
-------------------
> langchain_core: 0.2.35
> langchain: 0.2.14
> langchain_community: 0.2.12
> langsmith: 0.1.104
> langchain-genai-website: Installed. No version info available.
> langchain_anthropic: 0.1.13
> langchain_aws: 0.1.6
> langchain_cli: 0.0.22
> langchain_experimental: 0.0.64
> langchain_fireworks: 0.1.3
> langchain_google_genai: 1.0.4
> langchain_google_vertexai: 1.0.4
> langchain_groq: 0.1.5
> langchain_openai: 0.1.22
> langchain_text_splitters: 0.2.2
> langserve: 0.1.1
...
> tomlkit: 0.12.0
> typer[all]: Installed. No version info available.
> typing-extensions: 4.12.2
> uvicorn: 0.30.6

Labels: bug
