feat(pipeline): implement page chunking for converting large documents (parallelizable)#3162
Open

div-dhingra wants to merge 3 commits into docling-project:main from
Conversation
Contributor

✅ DCO Check Passed: Thanks @div-dhingra, all your commits are properly signed off. 🎉

**Merge Protections**

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
**Related Documentation**

2 document(s) may need updating based on files changed in this PR:

**Docling: How can I set up and run docling-serve on a MacBook Pro using Docker, and what performance and stability considerations should I be aware of?**

View Suggested Changes

@@ -31,6 +31,18 @@
**Stability and Troubleshooting:**
- Use ARM64 images and ensure all models are present to avoid runtime errors.
- Memory management has been improved through the integration of mimalloc (a high-performance memory allocator), which significantly reduces memory growth during document processing. While earlier versions experienced memory leaks, this optimization addresses those concerns. Still, monitor memory usage for long-running sessions and restart the container if you observe unusual resource consumption.
+
+ **Processing Large Documents with Page Chunking:**
+
+ For processing exceptionally large documents (e.g., 1000+ page PDFs) within memory constraints, configure the `page_chunk_size` option. This splits large PDFs into smaller page chunks during conversion, preventing Out-Of-Memory (OOM) errors without requiring additional memory allocation beyond the recommended 8GB minimum.
+
+ Page chunking can be configured in two ways:
+ - In `PipelineOptions`: Set `page_chunk_size` when creating the pipeline options
+ - In `BatchConcurrencySettings`: Set `settings.perf.page_chunk_size`
+
+ When enabled, chunks are treated as independent documents and processed in parallel based on `doc_batch_concurrency` settings. Consider increasing concurrency when using page chunking to process multiple chunks simultaneously (e.g., for a 500-page document with `page_chunk_size=50`, you could use `doc_batch_concurrency=10` to process all 10 chunks in parallel).
+
+ Example values: Start with 50 or 100 pages per chunk and adjust based on your document size and available memory. Smaller chunk sizes reduce peak memory usage but increase overhead; larger chunks process faster but require more memory per chunk.
**Memory Debugging Endpoints:**
**What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?**

View Suggested Changes

@@ -16,6 +16,7 @@
- `generate_page_images`, `generate_picture_images`: Extract page/picture images
- `force_backend_text`: Force backend text extraction
- `layout_custom_config`, `table_structure_custom_config`: Custom model configs for layout/table structure (see Table Structure Models section below)
+ - `page_chunk_size` (default: None): Number of pages to process as a single chunk. When processing large PDFs (e.g., 1000+ pages), this limits memory usage and allows streaming chunked results instead of waiting for the entire document. If None, the entire document is processed at once. See Page Chunking section below for details.
- Additional options for chart extraction, picture description, and more
---
@@ -67,6 +68,37 @@
result = converter.convert(source="scanned.pdf")
text = result.document.export_to_text(traverse_pictures=True)
markdown = result.document.export_to_markdown(traverse_pictures=True)
+```
+
+- **Page Chunking for Large Documents**: The `page_chunk_size` option enables processing large documents (e.g., 1000+ page PDFs) in chunks to prevent memory overflow (OOM) errors. When configured, the document is split into page chunks that are treated as independent documents and processed based on `doc_batch_concurrency`. Key behaviors:
+ - **Using `convert()` vs `convert_all()`**: When `page_chunk_size` is enabled, `convert()` only returns the first chunk. Use `convert_all()` to stream all chunks of the document.
+ - **Parallelization**: Chunks are treated as independent documents, so increase `doc_batch_concurrency` to process multiple chunks in parallel (e.g., 500 total pages / 50 page chunks = 10 concurrency for 10 chunks).
+ - **Default Behavior**: If `page_chunk_size` is not set (None), the entire document is processed at once.
+
+```python
+from docling.datamodel.pipeline_options import PdfPipelineOptions
+from docling.datamodel.base_models import InputFormat
+from docling.document_converter import DocumentConverter, PdfFormatOption
+from docling.datamodel.settings import settings
+
+# Configure chunking for large PDFs
+pipeline_options = PdfPipelineOptions()
+pipeline_options.page_chunk_size = 50 # Process 50 pages per chunk
+
+# Increase concurrency to parallelize chunks
+settings.perf.doc_batch_concurrency = 10 # Process 10 chunks in parallel
+settings.perf.doc_batch_size = 10
+
+converter = DocumentConverter(
+ format_options={
+ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
+ }
+)
+
+# Use convert_all() to stream all chunks
+for result in converter.convert_all(["large_document.pdf"]):
+ print(f"Processed chunk: pages {result.input.limits.page_range}")
+ # Process each chunk as it completes
```
- **Notes**: Only PDF supports image resolution adjustment. For more details, see [pipeline options code](https://github.com/docling-project/docling/blob/ae4fdbbb09fd377bb271e9b2efe541873eeb2990/docling/datamodel/pipeline_options.py#L891-L1336) and [example](https://app.dosu.dev/097760a8-135e-4789-8234-90c8837d7f1c/documents/9640186d-61e1-4ca1-9d8a-b82b3ee6bff8). Refer to the Python SDK documentation for usage of `format_options`. See the API reference for details on new preset/custom config fields and deprecated options.
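The chunk-sizing guidance above (smaller chunks lower peak memory, larger chunks reduce per-chunk overhead) comes down to simple arithmetic. A minimal sketch, assuming a hypothetical helper (`chunk_plan` is not part of Docling's API):

```python
import math

def chunk_plan(total_pages: int, page_chunk_size: int) -> tuple[int, int]:
    # Hypothetical helper: how many chunks a document splits into, and how
    # many pages land in the final (possibly short) chunk.
    num_chunks = math.ceil(total_pages / page_chunk_size)
    last_chunk_pages = total_pages - (num_chunks - 1) * page_chunk_size
    return num_chunks, last_chunk_pages

# 500 pages at page_chunk_size=50 gives 10 chunks, matching the example
# above where doc_batch_concurrency=10 processes them all in parallel.
print(chunk_plan(500, 50))    # → (10, 50)
print(chunk_plan(1042, 100))  # → (11, 42)
```

The chunk count is a natural ceiling for `doc_batch_concurrency` when converting a single document: setting concurrency above it leaves workers idle.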
Force-pushed from e28e8ec to 02ea9d6.
Codecov Report❌ Patch coverage is
I, Divpreet Dhingra <[email protected]>, hereby add my Signed-off-by to this commit: 4d41038

I, Divpreet Dhingra <[email protected]>, hereby add my Signed-off-by to this commit: ad9d79c

Signed-off-by: Divpreet Dhingra <[email protected]>
Force-pushed from 02ea9d6 to 7269f43.
Summary:
Context:
When converting massive documents (e.g., 1,000+ page PDFs), Docling attempts to process the entire document at once. This frequently leads to memory overflow (Out-Of-Memory / OOM) errors or heavy swap usage. While Docling currently has a `PageChunker`, it operates post-conversion on fully parsed `DoclingDocument`s, which doesn't protect the system from memory exhaustion during the actual parsing pipeline.

This PR:

Introduces a pre-conversion page chunking mechanism at the input level. By configuring `page_chunk_size`, massive documents are logically partitioned into smaller, bite-sized `InputDocument` chunks before entering the conversion pipeline. Because these chunks are treated internally as independent documents, they can be seamlessly parallelized, allowing Docling to stream results iteratively and garbage-collect heavy ML tensors on the fly.
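The partitioning idea can be sketched as follows; the function and names here are illustrative, not Docling's internal API:

```python
# Minimal sketch of pre-conversion partitioning: split a document's page
# range into independent (start, end) chunks, each of which can then back
# its own InputDocument and be converted by a separate worker.
def partition_pages(total_pages, page_chunk_size):
    if page_chunk_size is None:
        # Default: no chunking, the whole document is one unit of work.
        return [(1, total_pages)]
    return [
        (start, min(start + page_chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, page_chunk_size)
    ]

# Each range can be converted, streamed back, and garbage-collected
# independently of the others.
print(partition_pages(120, 50))  # → [(1, 50), (51, 100), (101, 120)]
```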
Key Technical Changes:
- Added `page_chunk_size` to `BatchConcurrencySettings`. If None (default), documents are processed entirely at once to preserve existing behavior.
- Implemented chunk splitting in `DocumentConverter`. This allows `doc_batch_concurrency` to pick up independent chunks of a single document and process them in parallel across multiple workers.
- Added a `ReferenceCountedBackend` wrapper to prevent premature C++ garbage collection crashes when multiple threads share the same PDFium parser (shared backend).
- Added `test_page_chunking.py` to validate 12 core scenarios: (Local vs. Memory Stream vs. Mixed) × (Concurrent vs. Sequential) × (Chunked vs. Unchunked).

Issue resolved by this Pull Request: #3088
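The reference-counting idea behind the shared-backend wrapper can be sketched roughly as follows. This is an illustrative stand-in under assumed names, not the PR's actual `ReferenceCountedBackend` implementation:

```python
import threading

class RefCountedResource:
    """Sketch of reference counting for a shared parser backend: the wrapped
    resource is only closed once the last chunk worker releases it, so one
    thread cannot free a backend another thread is still using."""

    def __init__(self, resource, close_fn):
        self._resource = resource
        self._close_fn = close_fn
        self._count = 0
        self._lock = threading.Lock()

    def acquire(self):
        with self._lock:
            self._count += 1
            return self._resource

    def release(self):
        with self._lock:
            self._count -= 1
            if self._count == 0:
                self._close_fn(self._resource)

closed = []
backend = RefCountedResource("pdfium-doc", closed.append)
backend.acquire(); backend.acquire()   # two chunk workers share the parser
backend.release()                      # first worker done: backend stays open
backend.release()                      # last worker done: backend is closed
print(closed)  # → ['pdfium-doc']
```

The same acquire/release pairing would wrap each chunk's access to the shared PDFium document, deferring teardown until every chunk finishes.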
Checklist: