
[Bug]: SemanticDoubleMergingSplitterNodeParser giving error -> IndexError: list index out of range #17032

Open
amanchaudhary-95 opened this issue Nov 22, 2024 · 1 comment
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

@amanchaudhary-95

Bug Description

I'm trying to split some markdown files using SimpleDirectoryReader and SemanticDoubleMergingSplitterNodeParser. The code is below:

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SemanticDoubleMergingSplitterNodeParser, LanguageConfig

# documents, embed_model, and storage_context are created earlier
# (documents via SimpleDirectoryReader)
config = LanguageConfig(language="english", spacy_model="en_core_web_md")
index = VectorStoreIndex.from_documents(documents,
                                        show_progress=True,
                                        embed_model=embed_model,
                                        storage_context=storage_context,
                                        transformations=[SemanticDoubleMergingSplitterNodeParser(language_config=config,
                                                                                                 max_chunk_size=2000)])

It raises IndexError: list index out of range. The failure appears to be at line 216 of ..llama_index\core\node_parser\text\semantic_double_merging_splitter.py:
chunk = sentences[0] # ""
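
A minimal, self-contained sketch (my assumption: a document with empty text takes the same code path) that seems to trigger the same error without the full index setup:

# Minimal repro sketch: a document whose text yields no sentences.
# Requires the en_core_web_md spaCy model to be installed.
from llama_index.core import Document
from llama_index.core.node_parser import (
    SemanticDoubleMergingSplitterNodeParser,
    LanguageConfig,
)

config = LanguageConfig(language="english", spacy_model="en_core_web_md")
parser = SemanticDoubleMergingSplitterNodeParser(
    language_config=config, max_chunk_size=2000
)

# An empty text leaves `sentences` empty inside _create_initial_chunks,
# so `sentences[0]` raises IndexError.
parser.get_nodes_from_documents([Document(text="")])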

Error Log

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[5], line 4
      1 from llama_index.core.node_parser import SentenceSplitter, SemanticSplitterNodeParser, SemanticDoubleMergingSplitterNodeParser, LanguageConfig
      2 # config = LanguageConfig(language="english", spacy_model="en_core_web_md")
----> 4 index = VectorStoreIndex.from_documents(documents, show_progress=True, embed_model=embed_model, storage_context=storage_context,
      5                                         transformations=[SemanticDoubleMergingSplitterNodeParser(
      6                                                                                                 max_chunk_size=2000)])
      8                                         # transformations=[SemanticSplitterNodeParser(buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model)])

File d:\.venv\LLM\lib\site-packages\llama_index\core\indices\base.py:112, in BaseIndex.from_documents(cls, documents, storage_context, show_progress, callback_manager, transformations, **kwargs)
    109 for doc in documents:
    110     docstore.set_document_hash(doc.get_doc_id(), doc.hash)
--> 112 nodes = run_transformations(
    113     documents,  # type: ignore
    114     transformations,
    115     show_progress=show_progress,
    116     **kwargs,
    117 )
    119 return cls(
    120     nodes=nodes,
    121     storage_context=storage_context,
   (...)
    125     **kwargs,
    126 )

File d:\.venv\LLM\lib\site-packages\llama_index\core\ingestion\pipeline.py:100, in run_transformations(nodes, transformations, in_place, cache, cache_collection, **kwargs)
     98             cache.put(hash, nodes, collection=cache_collection)
     99     else:
--> 100         nodes = transform(nodes, **kwargs)
    102 return nodes

File d:\.venv\LLM\lib\site-packages\llama_index\core\instrumentation\dispatcher.py:311, in Dispatcher.span.<locals>.wrapper(func, instance, args, kwargs)
    308             _logger.debug(f"Failed to reset active_span_id: {e}")
    310 try:
--> 311     result = func(*args, **kwargs)
    312     if isinstance(result, asyncio.Future):
    313         # If the result is a Future, wrap it
    314         new_future = asyncio.ensure_future(result)

File d:\.venv\LLM\lib\site-packages\llama_index\core\node_parser\interface.py:193, in NodeParser.__call__(self, nodes, **kwargs)
    192 def __call__(self, nodes: Sequence[BaseNode], **kwargs: Any) -> List[BaseNode]:
--> 193     return self.get_nodes_from_documents(nodes, **kwargs)

File d:\.venv\LLM\lib\site-packages\llama_index\core\node_parser\interface.py:165, in NodeParser.get_nodes_from_documents(self, documents, show_progress, **kwargs)
    160 doc_id_to_document = {doc.id_: doc for doc in documents}
    162 with self.callback_manager.event(
    163     CBEventType.NODE_PARSING, payload={EventPayload.DOCUMENTS: documents}
    164 ) as event:
--> 165     nodes = self._parse_nodes(documents, show_progress=show_progress, **kwargs)
    166     nodes = self._postprocess_parsed_nodes(nodes, doc_id_to_document)
    168     event.on_end({EventPayload.NODES: nodes})

File d:\.venv\LLM\lib\site-packages\llama_index\core\instrumentation\dispatcher.py:311, in Dispatcher.span.<locals>.wrapper(func, instance, args, kwargs)
    308             _logger.debug(f"Failed to reset active_span_id: {e}")
    310 try:
--> 311     result = func(*args, **kwargs)
    312     if isinstance(result, asyncio.Future):
    313         # If the result is a Future, wrap it
    314         new_future = asyncio.ensure_future(result)

File d:\.venv\LLM\lib\site-packages\llama_index\core\node_parser\text\semantic_double_merging_splitter.py:187, in SemanticDoubleMergingSplitterNodeParser._parse_nodes(self, nodes, show_progress, **kwargs)
    184 nodes_with_progress = get_tqdm_iterable(nodes, show_progress, "Parsing nodes")
    186 for node in nodes_with_progress:
--> 187     nodes = self.build_semantic_nodes_from_documents([node])
    188     all_nodes.extend(nodes)
    189 return all_nodes

File d:\.venv\LLM\lib\site-packages\llama_index\core\node_parser\text\semantic_double_merging_splitter.py:202, in SemanticDoubleMergingSplitterNodeParser.build_semantic_nodes_from_documents(self, documents)
    200 sentences = self.sentence_splitter(text)
    201 sentences = [s.strip() for s in sentences]
--> 202 initial_chunks = self._create_initial_chunks(sentences)
    203 chunks = self._merge_initial_chunks(initial_chunks)
    205 nodes = build_nodes_from_splits(
    206     chunks,
    207     doc,
    208     id_func=self.id_func,
    209 )

File d:\.venv\LLM\lib\site-packages\llama_index\core\node_parser\text\semantic_double_merging_splitter.py:216, in SemanticDoubleMergingSplitterNodeParser._create_initial_chunks(self, sentences)
    214 def _create_initial_chunks(self, sentences: List[str]) -> List[str]:
    215     initial_chunks: List[str] = []
--> 216     chunk = sentences[0]  # ""
    217     new = True
    219     assert self.language_config.nlp is not None

IndexError: list index out of range

Version

llama-index==0.12.1

Steps to Reproduce

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SemanticDoubleMergingSplitterNodeParser, LanguageConfig

# documents, embed_model, and storage_context are created earlier
# (documents via SimpleDirectoryReader)
config = LanguageConfig(language="english", spacy_model="en_core_web_md")
index = VectorStoreIndex.from_documents(documents,
                                        show_progress=True,
                                        embed_model=embed_model,
                                        storage_context=storage_context,
                                        transformations=[SemanticDoubleMergingSplitterNodeParser(language_config=config,
                                                                                                 max_chunk_size=2000)])

Relevant Logs/Tracebacks

No response

@amanchaudhary-95 amanchaudhary-95 added bug Something isn't working triage Issue needs to be triaged/prioritized labels Nov 22, 2024

dosubot bot commented Nov 22, 2024

Hey there, @amanchaudhary-95! I'm here to help you with any bugs, questions, or contributions you have. Let's tackle this together!

The IndexError: list index out of range error you're encountering in the SemanticDoubleMergingSplitterNodeParser is likely due to the sentences list being empty when the _create_initial_chunks method tries to access its first element. This can happen if the sentence_splitter method returns an empty list, which might occur if the input text is empty or if the sentence splitting logic fails to identify any sentences in the text.
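
To see this directly, you can probe the parser's own splitter (the self.sentence_splitter call visible in the traceback). This is a hypothetical check, not code from the report; it assumes the en_core_web_md spaCy model is installed:

# Hypothetical probe of the parser's sentence splitter attribute,
# as seen in the traceback (sentences = self.sentence_splitter(text)).
from llama_index.core.node_parser import (
    SemanticDoubleMergingSplitterNodeParser,
    LanguageConfig,
)

config = LanguageConfig(language="english", spacy_model="en_core_web_md")
parser = SemanticDoubleMergingSplitterNodeParser(
    language_config=config, max_chunk_size=2000
)

# Empty or whitespace-only text can yield an empty list, which is exactly
# the state in which `sentences[0]` raises IndexError.
for sample in ["A sentence. Another one.", "", "   "]:
    print(repr(sample), "->", parser.sentence_splitter(sample))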

Here are a few steps you can take to troubleshoot and potentially resolve this issue:

  1. Check Input Text: Ensure that the documents you're processing are not empty and contain text that can be split into sentences. You might want to add some logging or print statements to verify the content of the documents before they are processed.

  2. Sentence Splitter: Verify that the sentence_splitter is correctly configured and functioning as expected. You can test it separately with sample text to ensure it returns a list of sentences.

  3. Error Handling: Consider adding error handling around the sentence splitting and chunk creation process to catch cases where the sentences list is empty. This can help prevent the IndexError and allow you to handle such cases gracefully (a sketch of one such guard follows this list).

  4. Debugging: Add debugging statements in the build_semantic_nodes_from_documents method to check the output of the sentence_splitter and ensure that it is producing the expected results.

By addressing these areas, you should be able to identify the root cause of the issue and implement a solution to prevent the IndexError.
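
As a concrete illustration of point 3, here is a minimal workaround sketch. The filtering step is an assumption on my part, not a fix in llama-index itself; documents, embed_model, and storage_context come from your existing setup:

# Workaround sketch: skip documents whose text is empty or whitespace-only,
# since those produce an empty `sentences` list in _create_initial_chunks.
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import (
    SemanticDoubleMergingSplitterNodeParser,
    LanguageConfig,
)

config = LanguageConfig(language="english", spacy_model="en_core_web_md")
parser = SemanticDoubleMergingSplitterNodeParser(
    language_config=config, max_chunk_size=2000
)

# Keep only documents that actually contain splittable text.
non_empty_docs = [d for d in documents if d.text and d.text.strip()]

index = VectorStoreIndex.from_documents(
    non_empty_docs,
    show_progress=True,
    embed_model=embed_model,
    storage_context=storage_context,
    transformations=[parser],
)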

To continue talking to Dosu, mention @dosu.
