feat(pipeline): implement page chunking for converting large documents (parallelizable)#3162
Open

div-dhingra wants to merge 3 commits into docling-project:main from
Conversation
Contributor

✅ DCO Check Passed: Thanks @div-dhingra, all your commits are properly signed off. 🎉

**Merge Protections**

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
**Related Documentation**

2 document(s) may need updating based on files changed in this PR:

**Docling: How can I set up and run docling-serve on a MacBook Pro using Docker, and what performance and stability considerations should I be aware of?**

View Suggested Changes

@@ -31,6 +31,18 @@
**Stability and Troubleshooting:**
- Use ARM64 images and ensure all models are present to avoid runtime errors.
- Memory management has been improved through the integration of mimalloc (a high-performance memory allocator), which significantly reduces memory growth during document processing. While earlier versions experienced memory leaks, this optimization addresses those concerns. Still, monitor memory usage for long-running sessions and restart the container if you observe unusual resource consumption.
+
+ **Processing Large Documents with Page Chunking:**
+
+ For processing exceptionally large documents (e.g., 1000+ page PDFs) within memory constraints, configure the `page_chunk_size` option. This splits large PDFs into smaller page chunks during conversion, preventing Out-Of-Memory (OOM) errors without requiring additional memory allocation beyond the recommended 8GB minimum.
+
+ Page chunking can be configured in two ways:
+ - In `PipelineOptions`: Set `page_chunk_size` when creating the pipeline options
+ - In `BatchConcurrencySettings`: Set `settings.perf.page_chunk_size`
+
+ When enabled, chunks are treated as independent documents and processed in parallel based on `doc_batch_concurrency` settings. Consider increasing concurrency when using page chunking to process multiple chunks simultaneously (e.g., for a 500-page document with `page_chunk_size=50`, you could use `doc_batch_concurrency=10` to process all 10 chunks in parallel).
+
+ Example values: Start with 50 or 100 pages per chunk and adjust based on your document size and available memory. Smaller chunk sizes reduce peak memory usage but increase overhead; larger chunks process faster but require more memory per chunk.
**Memory Debugging Endpoints:**
**What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?**

View Suggested Changes

@@ -16,6 +16,7 @@
- `generate_page_images`, `generate_picture_images`: Extract page/picture images
- `force_backend_text`: Force backend text extraction
- `layout_custom_config`, `table_structure_custom_config`: Custom model configs for layout/table structure (see Table Structure Models section below)
+ - `page_chunk_size` (default: None): Number of pages to process as a single chunk. When processing large PDFs (e.g., 1000+ pages), this limits memory usage and allows streaming chunked results instead of waiting for the entire document. If None, the entire document is processed at once. See Page Chunking section below for details.
- Additional options for chart extraction, picture description, and more
---
@@ -67,6 +68,37 @@
result = converter.convert(source="scanned.pdf")
text = result.document.export_to_text(traverse_pictures=True)
markdown = result.document.export_to_markdown(traverse_pictures=True)
+```
+
+- **Page Chunking for Large Documents**: The `page_chunk_size` option enables processing large documents (e.g., 1000+ page PDFs) in chunks to prevent memory overflow (OOM) errors. When configured, the document is split into page chunks that are treated as independent documents and processed based on `doc_batch_concurrency`. Key behaviors:
+ - **Using `convert()` vs `convert_all()`**: When `page_chunk_size` is enabled, `convert()` only returns the first chunk. Use `convert_all()` to stream all chunks of the document.
+ - **Parallelization**: Chunks are treated as independent documents, so increase `doc_batch_concurrency` to process multiple chunks in parallel (e.g., 500 total pages / 50 page chunks = 10 concurrency for 10 chunks).
+ - **Default Behavior**: If `page_chunk_size` is not set (None), the entire document is processed at once.
+
+```python
+from docling.datamodel.pipeline_options import PdfPipelineOptions
+from docling.datamodel.base_models import InputFormat
+from docling.document_converter import DocumentConverter, PdfFormatOption
+from docling.datamodel.settings import settings
+
+# Configure chunking for large PDFs
+pipeline_options = PdfPipelineOptions()
+pipeline_options.page_chunk_size = 50 # Process 50 pages per chunk
+
+# Increase concurrency to parallelize chunks
+settings.perf.doc_batch_concurrency = 10 # Process 10 chunks in parallel
+settings.perf.doc_batch_size = 10
+
+converter = DocumentConverter(
+ format_options={
+ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
+ }
+)
+
+# Use convert_all() to stream all chunks
+for result in converter.convert_all(["large_document.pdf"]):
+ print(f"Processed chunk: pages {result.input.limits.page_range}")
+ # Process each chunk as it completes
```
- **Notes**: Only PDF supports image resolution adjustment. For more details, see [pipeline options code](https://github.com/docling-project/docling/blob/ae4fdbbb09fd377bb271e9b2efe541873eeb2990/docling/datamodel/pipeline_options.py#L891-L1336) and [example](https://app.dosu.dev/097760a8-135e-4789-8234-90c8837d7f1c/documents/9640186d-61e1-4ca1-9d8a-b82b3ee6bff8). Refer to the Python SDK documentation for usage of `format_options`. See the API reference for details on new preset/custom config fields and deprecated options.
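The chunk-sizing guidance above (smaller chunks lower peak memory, larger chunks reduce per-chunk overhead) comes down to simple arithmetic. A minimal sketch, assuming a hypothetical helper (`chunk_plan` is not part of Docling's API):

```python
import math

def chunk_plan(total_pages: int, page_chunk_size: int) -> tuple[int, int]:
    # Hypothetical helper: how many chunks a document splits into, and how
    # many pages land in the final (possibly short) chunk.
    num_chunks = math.ceil(total_pages / page_chunk_size)
    last_chunk_pages = total_pages - (num_chunks - 1) * page_chunk_size
    return num_chunks, last_chunk_pages

# 500 pages at page_chunk_size=50 gives 10 chunks, matching the example
# above where doc_batch_concurrency=10 processes them all in parallel.
print(chunk_plan(500, 50))    # → (10, 50)
print(chunk_plan(1042, 100))  # → (11, 42)
```

The chunk count is a natural ceiling for `doc_batch_concurrency` when converting a single document: setting concurrency above it leaves workers idle.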
Force-pushed from e28e8ec to 02ea9d6.
Codecov Report❌ Patch coverage is
I, Divpreet Dhingra <[email protected]>, hereby add my Signed-off-by to this commit: 4d41038

I, Divpreet Dhingra <[email protected]>, hereby add my Signed-off-by to this commit: ad9d79c

Signed-off-by: Divpreet Dhingra <[email protected]>
Force-pushed from 02ea9d6 to 7269f43.
Summary:
Context:
When converting massive documents (e.g., 1,000+ page PDFs), Docling attempts to process the entire document at once. This frequently leads to memory overflow (Out-Of-Memory / OOM) errors or heavy swap usage. While Docling currently has a `PageChunker`, it operates post-conversion on fully parsed `DoclingDocument`s, which doesn't protect the system from memory exhaustion during the actual parsing pipeline.

This PR:

Introduces a pre-conversion page chunking mechanism at the input level. By configuring `page_chunk_size`, massive documents are logically partitioned into smaller, bite-sized `InputDocument` chunks before entering the conversion pipeline. Because these chunks are treated internally as independent documents, they can be seamlessly parallelized, allowing Docling to stream results iteratively and garbage-collect heavy ML tensors on the fly.
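The partitioning idea can be sketched as follows; the function and names here are illustrative, not Docling's internal API:

```python
# Minimal sketch of pre-conversion partitioning: split a document's page
# range into independent (start, end) chunks, each of which can then back
# its own InputDocument and be converted by a separate worker.
def partition_pages(total_pages, page_chunk_size):
    if page_chunk_size is None:
        # Default: no chunking, the whole document is one unit of work.
        return [(1, total_pages)]
    return [
        (start, min(start + page_chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, page_chunk_size)
    ]

# Each range can be converted, streamed back, and garbage-collected
# independently of the others.
print(partition_pages(120, 50))  # → [(1, 50), (51, 100), (101, 120)]
```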
Key Technical Changes:
- Added `page_chunk_size` to `BatchConcurrencySettings`. If None (default), documents are processed entirely at once to preserve existing behavior.
- Implemented chunk splitting in `DocumentConverter`. This allows `doc_batch_concurrency` to pick up independent chunks of a single document and process them in parallel across multiple workers.
- Added a `ReferenceCountedBackend` wrapper to prevent premature C++ garbage collection crashes when multiple threads share the same PDFium parser (shared backend).
- Added `test_page_chunking.py` to validate 12 core scenarios: (Local vs. Memory Stream vs. Mixed) × (Concurrent vs. Sequential) × (Chunked vs. Unchunked).

Issue resolved by this Pull Request: #3088
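The reference-counting idea behind the shared-backend wrapper can be sketched roughly as follows. This is an illustrative stand-in under assumed names, not the PR's actual `ReferenceCountedBackend` implementation:

```python
import threading

class RefCountedResource:
    """Sketch of reference counting for a shared parser backend: the wrapped
    resource is only closed once the last chunk worker releases it, so one
    thread cannot free a backend another thread is still using."""

    def __init__(self, resource, close_fn):
        self._resource = resource
        self._close_fn = close_fn
        self._count = 0
        self._lock = threading.Lock()

    def acquire(self):
        with self._lock:
            self._count += 1
            return self._resource

    def release(self):
        with self._lock:
            self._count -= 1
            if self._count == 0:
                self._close_fn(self._resource)

closed = []
backend = RefCountedResource("pdfium-doc", closed.append)
backend.acquire(); backend.acquire()   # two chunk workers share the parser
backend.release()                      # first worker done: backend stays open
backend.release()                      # last worker done: backend is closed
print(closed)  # → ['pdfium-doc']
```

The same acquire/release pairing would wrap each chunk's access to the shared PDFium document, deferring teardown until every chunk finishes.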
Checklist: