Bug Description

I'm trying to split some markdown files using `SimpleDirectoryReader` and `SemanticDoubleMergingSplitterNodeParser`. The code is below:

```python
from llama_index.core.node_parser import SemanticDoubleMergingSplitterNodeParser, LanguageConfig

config = LanguageConfig(language="english", spacy_model="en_core_web_md")
index = VectorStoreIndex.from_documents(
    documents,
    show_progress=True,
    embed_model=embed_model,
    storage_context=storage_context,
    transformations=[
        SemanticDoubleMergingSplitterNodeParser(
            language_config=config,
            max_chunk_size=2000,
        )
    ],
)
```

It raises an `IndexError: list index out of range`. The issue seems to be in `..llama_index\core\node_parser\text\semantic_double_merging_splitter.py`, line 216: `chunk = sentences[0]  # ""`.
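Until the parser guards against empty input, one workaround is to drop empty or whitespace-only documents before they reach the splitter. A minimal sketch, using a stand-in `Document` class that only has the `.text` attribute the filter relies on (in the real pipeline the filter would run on the `documents` returned by `SimpleDirectoryReader` before `VectorStoreIndex.from_documents`):

```python
# Workaround sketch: drop documents whose text is empty or whitespace-only
# before they reach SemanticDoubleMergingSplitterNodeParser, which crashes
# when the sentence list for a document comes back empty.
from dataclasses import dataclass


@dataclass
class Document:
    """Minimal stand-in for llama_index's Document; only .text is used here."""
    text: str


def filter_empty_documents(documents):
    """Keep only documents whose text contains non-whitespace characters."""
    return [doc for doc in documents if doc.text and doc.text.strip()]


docs = [Document("# Heading\nSome markdown."), Document(""), Document("   \n")]
print(len(filter_empty_documents(docs)))  # 1 document survives
```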
Error Log
{
"name": "IndexError",
"message": "list index out of range",
"stack": "---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[5], line 4
1 from llama_index.core.node_parser import SentenceSplitter, SemanticSplitterNodeParser, SemanticDoubleMergingSplitterNodeParser, LanguageConfig
2 # config = LanguageConfig(language=\"english\", spacy_model=\"en_core_web_md\")
----> 4 index = VectorStoreIndex.from_documents(documents, show_progress=True, embed_model=embed_model, storage_context=storage_context,
5 transformations=[SemanticDoubleMergingSplitterNodeParser(
6 max_chunk_size=2000)])
8 # transformations=[SemanticSplitterNodeParser(buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model)])
File d:\\.venv\\LLM\\lib\\site-packages\\llama_index\\core\\indices\\base.py:112, in BaseIndex.from_documents(cls, documents, storage_context, show_progress, callback_manager, transformations, **kwargs)
109 for doc in documents:
110 docstore.set_document_hash(doc.get_doc_id(), doc.hash)
--> 112 nodes = run_transformations(
113 documents, # type: ignore
114 transformations,
115 show_progress=show_progress,
116 **kwargs,
117 )
119 return cls(
120 nodes=nodes,
121 storage_context=storage_context,
(...)
125 **kwargs,
126 )
File d:\\.venv\\LLM\\lib\\site-packages\\llama_index\\core\\ingestion\\pipeline.py:100, in run_transformations(nodes, transformations, in_place, cache, cache_collection, **kwargs)
98 cache.put(hash, nodes, collection=cache_collection)
99 else:
--> 100 nodes = transform(nodes, **kwargs)
102 return nodes
File d:\\.venv\\LLM\\lib\\site-packages\\llama_index\\core\\instrumentation\\dispatcher.py:311, in Dispatcher.span.<locals>.wrapper(func, instance, args, kwargs)
308 _logger.debug(f\"Failed to reset active_span_id: {e}\")
310 try:
--> 311 result = func(*args, **kwargs)
312 if isinstance(result, asyncio.Future):
313 # If the result is a Future, wrap it
314 new_future = asyncio.ensure_future(result)
File d:\\.venv\\LLM\\lib\\site-packages\\llama_index\\core\\node_parser\\interface.py:193, in NodeParser.__call__(self, nodes, **kwargs)
192 def __call__(self, nodes: Sequence[BaseNode], **kwargs: Any) -> List[BaseNode]:
--> 193 return self.get_nodes_from_documents(nodes, **kwargs)
File d:\\.venv\\LLM\\lib\\site-packages\\llama_index\\core\\node_parser\\interface.py:165, in NodeParser.get_nodes_from_documents(self, documents, show_progress, **kwargs)
160 doc_id_to_document = {doc.id_: doc for doc in documents}
162 with self.callback_manager.event(
163 CBEventType.NODE_PARSING, payload={EventPayload.DOCUMENTS: documents}
164 ) as event:
--> 165 nodes = self._parse_nodes(documents, show_progress=show_progress, **kwargs)
166 nodes = self._postprocess_parsed_nodes(nodes, doc_id_to_document)
168 event.on_end({EventPayload.NODES: nodes})
File d:\\.venv\\LLM\\lib\\site-packages\\llama_index\\core\\instrumentation\\dispatcher.py:311, in Dispatcher.span.<locals>.wrapper(func, instance, args, kwargs)
308 _logger.debug(f\"Failed to reset active_span_id: {e}\")
310 try:
--> 311 result = func(*args, **kwargs)
312 if isinstance(result, asyncio.Future):
313 # If the result is a Future, wrap it
314 new_future = asyncio.ensure_future(result)
File d:\\.venv\\LLM\\lib\\site-packages\\llama_index\\core\\node_parser\\text\\semantic_double_merging_splitter.py:187, in SemanticDoubleMergingSplitterNodeParser._parse_nodes(self, nodes, show_progress, **kwargs)
184 nodes_with_progress = get_tqdm_iterable(nodes, show_progress, \"Parsing nodes\")
186 for node in nodes_with_progress:
--> 187 nodes = self.build_semantic_nodes_from_documents([node])
188 all_nodes.extend(nodes)
189 return all_nodes
File d:\\.venv\\LLM\\lib\\site-packages\\llama_index\\core\\node_parser\\text\\semantic_double_merging_splitter.py:202, in SemanticDoubleMergingSplitterNodeParser.build_semantic_nodes_from_documents(self, documents)
200 sentences = self.sentence_splitter(text)
201 sentences = [s.strip() for s in sentences]
--> 202 initial_chunks = self._create_initial_chunks(sentences)
203 chunks = self._merge_initial_chunks(initial_chunks)
205 nodes = build_nodes_from_splits(
206 chunks,
207 doc,
208 id_func=self.id_func,
209 )
File d:\\.venv\\LLM\\lib\\site-packages\\llama_index\\core\\node_parser\\text\\semantic_double_merging_splitter.py:216, in SemanticDoubleMergingSplitterNodeParser._create_initial_chunks(self, sentences)
214 def _create_initial_chunks(self, sentences: List[str]) -> List[str]:
215 initial_chunks: List[str] = []
--> 216 chunk = sentences[0] # \"\"
217 new = True
219 assert self.language_config.nlp is not None
IndexError: list index out of range"
}
Version
llama-index==0.12.1
Steps to Reproduce
```python
from llama_index.core.node_parser import SemanticDoubleMergingSplitterNodeParser, LanguageConfig

config = LanguageConfig(language="english", spacy_model="en_core_web_md")
index = VectorStoreIndex.from_documents(
    documents,
    show_progress=True,
    embed_model=embed_model,
    storage_context=storage_context,
    transformations=[
        SemanticDoubleMergingSplitterNodeParser(
            language_config=config,
            max_chunk_size=2000,
        )
    ],
)
```
Relevant Logs/Tracebacks
No response
Hey there, @amanchaudhary-95! I'm here to help you with any bugs, questions, or contributions you have. Let's tackle this together!
The `IndexError: list index out of range` you're encountering in `SemanticDoubleMergingSplitterNodeParser` is most likely caused by the `sentences` list being empty when `_create_initial_chunks` accesses its first element. This happens when `sentence_splitter` returns an empty list, for example because the input text is empty or the sentence-splitting logic finds no sentences in it.
Here are a few steps you can take to troubleshoot and potentially resolve this issue:

1. **Check the input text**: Ensure the documents you're processing are not empty and contain text that can be split into sentences. Add logging or print statements to verify the content of the documents before they are processed.
2. **Test the sentence splitter**: Verify that `sentence_splitter` is configured correctly and returns a non-empty list when run on sample text.
3. **Add error handling**: Wrap the sentence splitting and chunk creation in error handling that catches the empty-`sentences` case, so it is handled gracefully instead of raising an `IndexError`.
4. **Debug the parser**: Add debugging statements in `build_semantic_nodes_from_documents` to inspect the output of `sentence_splitter` and confirm it produces the expected results.

By addressing these areas, you should be able to identify the root cause and prevent the `IndexError`.
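The guard this advice points at can be sketched as follows. This is a simplified, self-contained stand-in for the library's `_create_initial_chunks`, not the actual patch: the greedy length-based grouping below replaces the real semantic-merging logic, and only the empty-input guard is the point being illustrated.

```python
from typing import List


def create_initial_chunks(sentences: List[str], max_chunk_size: int = 2000) -> List[str]:
    """Greedily group sentences into chunks, guarded against empty input.

    Simplified stand-in for the library method: without the guard, an
    empty `sentences` list makes `sentences[0]` raise IndexError.
    """
    if not sentences:  # guard: empty input previously crashed the parser
        return []
    chunks: List[str] = []
    chunk = sentences[0]
    for sentence in sentences[1:]:
        if len(chunk) + len(sentence) + 1 <= max_chunk_size:
            chunk = f"{chunk} {sentence}"
        else:
            chunks.append(chunk)
            chunk = sentence
    chunks.append(chunk)
    return chunks


print(create_initial_chunks([]))                  # []
print(create_initial_chunks(["A.", "B.", "C."]))  # ['A. B. C.']
```

The same early-return pattern could be applied one level up, in `build_semantic_nodes_from_documents`, by skipping documents whose sentence list comes back empty.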