
feat(pipeline): implement page chunking for converting large documents (parallelizable)#3162

Open
div-dhingra wants to merge 3 commits into docling-project:main from div-dhingra:feat/page-chunking

Conversation

@div-dhingra

Summary:

Context:
When converting massive documents (e.g., 1,000+ page PDFs), Docling attempts to process the entire document at once. This frequently leads to out-of-memory (OOM) errors or heavy swap usage. While Docling currently has a PageChunker, it operates post-conversion on fully parsed DoclingDocuments, so it does not protect the system from memory exhaustion during the parsing pipeline itself.

This PR:
Introduces a pre-conversion page chunking mechanism at the input level. By configuring page_chunk_size, massive documents are logically partitioned into smaller InputDocument chunks before entering the conversion pipeline.

Because these chunks are treated internally as independent documents, they can be seamlessly parallelized, allowing Docling to stream results iteratively and garbage-collect heavy ML tensors on the fly.
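The partitioning itself is simple arithmetic. A minimal illustrative sketch (not the PR's actual implementation; `chunk_page_ranges` is a hypothetical helper) of how a page count splits into inclusive, 1-based page ranges of at most `chunk_size` pages:

```python
# Illustrative sketch only: `chunk_page_ranges` is a hypothetical helper,
# not code from this PR.
def chunk_page_ranges(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split total_pages into (start, end) ranges of up to chunk_size pages."""
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]

# A 120-page document with chunk_size=50 yields three independent chunks.
print(chunk_page_ranges(120, 50))  # [(1, 50), (51, 100), (101, 120)]
```

Each range can then back its own InputDocument, which is what makes the chunks independently schedulable across workers.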

Key Technical Changes:

  • Pipeline Settings: Adds page_chunk_size to BatchConcurrencySettings. If None (default), documents are processed entirely at once to preserve existing behavior.
  • Concurrent Execution: Chunks are dynamically yielded to the chunkify generator in DocumentConverter. This allows doc_batch_concurrency to pick up independent chunks of a single document and process them in parallel across multiple workers.
  • Memory Stream Safety: Safely handles memory streams (e.g., direct S3 downloads via BytesIO). It preferentially materializes streams to a temporary local file so each thread can spawn a thread-safe parser (independent backends). If materialization fails, it falls back to a new ReferenceCountedBackend wrapper to prevent premature C++ garbage collection crashes when multiple threads share the same PDFium parser (shared backend).
  • Testing: Adds a robust parameterized test matrix in test_page_chunking.py to validate 12 core scenarios (Local vs. Memory Stream vs. Mixed) × (Concurrent vs. Sequential) × (Chunked vs. Unchunked).
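Taken together, the changes above mean enabling chunking should only require setting `page_chunk_size` and raising document-level concurrency. A minimal usage sketch, assuming the option names described in this PR (`settings.perf.page_chunk_size`, `settings.perf.doc_batch_concurrency`) land as-is and that `large.pdf` is a placeholder path:

```python
# Sketch only: option names are taken from this PR's description and
# may change before merge.
from docling.datamodel.settings import settings
from docling.document_converter import DocumentConverter

settings.perf.page_chunk_size = 50        # split inputs into 50-page chunks
settings.perf.doc_batch_concurrency = 4   # convert up to 4 chunks in parallel

converter = DocumentConverter()

# convert_all() streams one result per chunk as each completes, so heavy
# per-chunk state can be released instead of accumulating for the whole file.
for result in converter.convert_all(["large.pdf"]):
    print(result.input.file, result.status)
```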

Issue resolved by this Pull Request: #3088

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

@github-actions
Contributor

github-actions bot commented Mar 20, 2026

DCO Check Passed

Thanks @div-dhingra, all your commits are properly signed off. 🎉

@mergify

mergify bot commented Mar 20, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@dosubot

dosubot bot commented Mar 20, 2026

Related Documentation

2 document(s) may need updating based on files changed in this PR:

Docling

How can I set up and run docling-serve on a MacBook Pro using Docker, and what performance and stability considerations should I be aware of?
@@ -31,6 +31,18 @@
 **Stability and Troubleshooting:**
 - Use ARM64 images and ensure all models are present to avoid runtime errors.
 - Memory management has been improved through the integration of mimalloc (a high-performance memory allocator), which significantly reduces memory growth during document processing. While earlier versions experienced memory leaks, this optimization addresses those concerns. Still, monitor memory usage for long-running sessions and restart the container if you observe unusual resource consumption.
+
+  **Processing Large Documents with Page Chunking:**
+  
+  For processing exceptionally large documents (e.g., 1000+ page PDFs) within memory constraints, configure the `page_chunk_size` option. This splits large PDFs into smaller page chunks during conversion, preventing Out-Of-Memory (OOM) errors without requiring additional memory allocation beyond the recommended 8GB minimum.
+  
+  Page chunking can be configured in two ways:
+  - In `PipelineOptions`: Set `page_chunk_size` when creating the pipeline options
+  - In `BatchConcurrencySettings`: Set `settings.perf.page_chunk_size`
+  
+  When enabled, chunks are treated as independent documents and processed in parallel based on `doc_batch_concurrency` settings. Consider increasing concurrency when using page chunking to process multiple chunks simultaneously (e.g., for a 500-page document with `page_chunk_size=50`, you could use `doc_batch_concurrency=10` to process all 10 chunks in parallel).
+  
+  Example values: Start with 50 or 100 pages per chunk and adjust based on your document size and available memory. Smaller chunk sizes reduce peak memory usage but increase overhead; larger chunks process faster but require more memory per chunk.
 
   **Memory Debugging Endpoints:**
   


What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?
@@ -16,6 +16,7 @@
     - `generate_page_images`, `generate_picture_images`: Extract page/picture images
     - `force_backend_text`: Force backend text extraction
     - `layout_custom_config`, `table_structure_custom_config`: Custom model configs for layout/table structure (see Table Structure Models section below)
+    - `page_chunk_size` (default: None): Number of pages to process as a single chunk. When processing large PDFs (e.g., 1000+ pages), this limits memory usage and allows streaming chunked results instead of waiting for the entire document. If None, the entire document is processed at once. See Page Chunking section below for details.
     - Additional options for chart extraction, picture description, and more
 
 ---
@@ -67,6 +68,37 @@
 result = converter.convert(source="scanned.pdf")
 text = result.document.export_to_text(traverse_pictures=True)
 markdown = result.document.export_to_markdown(traverse_pictures=True)
+```
+
+- **Page Chunking for Large Documents**: The `page_chunk_size` option enables processing large documents (e.g., 1000+ page PDFs) in chunks to prevent memory overflow (OOM) errors. When configured, the document is split into page chunks that are treated as independent documents and processed based on `doc_batch_concurrency`. Key behaviors:
+    - **Using `convert()` vs `convert_all()`**: When `page_chunk_size` is enabled, `convert()` only returns the first chunk. Use `convert_all()` to stream all chunks of the document.
+    - **Parallelization**: Chunks are treated as independent documents, so increase `doc_batch_concurrency` to process multiple chunks in parallel (e.g., a 500-page document with `page_chunk_size=50` yields 10 chunks, so `doc_batch_concurrency=10` processes them all at once).
+    - **Default Behavior**: If `page_chunk_size` is not set (None), the entire document is processed at once.
+
+```python
+from docling.datamodel.pipeline_options import PdfPipelineOptions
+from docling.datamodel.base_models import InputFormat
+from docling.document_converter import DocumentConverter, PdfFormatOption
+from docling.datamodel.settings import settings
+
+# Configure chunking for large PDFs
+pipeline_options = PdfPipelineOptions()
+pipeline_options.page_chunk_size = 50  # Process 50 pages per chunk
+
+# Increase concurrency to parallelize chunks
+settings.perf.doc_batch_concurrency = 10  # Process 10 chunks in parallel
+settings.perf.doc_batch_size = 10
+
+converter = DocumentConverter(
+    format_options={
+        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
+    }
+)
+
+# Use convert_all() to stream all chunks
+for result in converter.convert_all(["large_document.pdf"]):
+    print(f"Processed chunk: pages {result.input.limits.page_range}")
+    # Process each chunk as it completes
 ```
 
 - **Notes**: Only PDF supports image resolution adjustment. For more details, see [pipeline options code](https://github.com/docling-project/docling/blob/ae4fdbbb09fd377bb271e9b2efe541873eeb2990/docling/datamodel/pipeline_options.py#L891-L1336) and [example](https://app.dosu.dev/097760a8-135e-4789-8234-90c8837d7f1c/documents/9640186d-61e1-4ca1-9d8a-b82b3ee6bff8). Refer to the Python SDK documentation for usage of `format_options`. See the API reference for details on new preset/custom config fields and deprecated options.




@div-dhingra changed the title from "Feat(pipeline): implement page chunking for converting large documents (parallelizable)" to "feat(pipeline): implement page chunking for converting large documents (parallelizable)" Mar 20, 2026
@codecov

codecov bot commented Mar 21, 2026

Codecov Report

❌ Patch coverage is 34.57944% with 70 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
docling/document_converter.py | 32.03% | 70 Missing ⚠️


I, Divpreet Dhingra <[email protected]>, hereby add my Signed-off-by to this commit: 4d41038
I, Divpreet Dhingra <[email protected]>, hereby add my Signed-off-by to this commit: ad9d79c

Signed-off-by: Divpreet Dhingra <[email protected]>
