@superwiboy

What problem does this PR solve?

Docling requires Python 3.12, but the Dockerfile installs 3.11, which causes compatibility issues. In addition, docling_parser.py uses outdated API patterns and does not take advantage of the newer configuration options that improve accuracy and error handling.

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

Changes

Dockerfile

  • Updated base stage: uv python install 3.11 → 3.12

pyproject.toml

  • Added dependency: docling>=2.60.0,<3.0.0

deepdoc/parser/docling_parser.py (276 insertions, 70 deletions)

Modern Docling API integration:

  • PdfPipelineOptions with TableFormerMode.ACCURATE for better table extraction
  • EasyOcrOptions configuration for OCR
  • PyPdfiumDocumentBackend support
  • Converter instance caching to avoid repeated initialization

Enhanced error handling:

  • try/except blocks with detailed logging (exc_info=True)
  • Failure state caching to prevent DoS from repeated init attempts
  • Graceful degradation when optional features are unavailable
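
The failure-state caching described above can be sketched as follows. This is a minimal illustration rather than the code in this PR: the `CachedFactory` class, its `build` argument, and the `_init_failed` flag are hypothetical names, and the real parser may structure this differently.

```python
import logging

logger = logging.getLogger(__name__)


class CachedFactory:
    """Builds an expensive object once and remembers a failed attempt."""

    def __init__(self, build):
        self._build = build        # hypothetical zero-argument factory
        self._instance = None      # cached object after a successful build
        self._init_failed = False  # set after the first failed attempt

    def get(self):
        if self._instance is not None:
            return self._instance
        if self._init_failed:
            # Do not retry: a broken environment would fail again on every call.
            return None
        try:
            self._instance = self._build()
        except Exception:
            # exc_info=True attaches the full traceback to the log record.
            logger.error("Converter initialization failed", exc_info=True)
            self._init_failed = True
            return None
        return self._instance
```

Callers treat a `None` result as "feature unavailable" and fall back, so a broken install costs one failed initialization instead of one per request.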

Security hardening:

  • Path traversal prevention: sanitize filenames before writing temp files
  • Safe backend instantiation with exception handling
  • Input validation for file paths
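
One way to sanitize a filename before writing a temp file is sketched below. The helper name `safe_temp_path`, the allow-list of characters, and the `upload` fallback name are assumptions made for illustration; the PR's actual sanitization may differ.

```python
import os
import re
from pathlib import Path


def safe_temp_path(filename: str, tmp_dir: str) -> Path:
    """Return a path inside tmp_dir for an untrusted filename (hypothetical helper)."""
    # Normalise Windows separators, then keep only the final path component.
    name = os.path.basename(filename.replace("\\", "/"))
    # Replace anything outside a conservative allow-list of characters.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    # Names such as "", "." or ".." carry no usable filename.
    if name.strip(".") == "":
        name = "upload"
    base = Path(tmp_dir).resolve()
    target = (base / name).resolve()
    # Final check: the resolved target must sit directly inside tmp_dir.
    if target.parent != base:
        raise ValueError(f"unsafe filename: {filename!r}")
    return target
```

With this sketch, an input such as `../../etc/passwd` is reduced to `passwd` inside the temp directory, and the final parent check rejects anything that still resolves elsewhere, for example through a symlink.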

Example of improved configuration:

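from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions, TableFormerMode, TableStructureOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

# Before: basic initialization
conv = DocumentConverter()

# After: configured once, then served from the cache
def _get_converter(self):
    # Reuse the cached converter if it was already built
    if self._converter is not None:
        return self._converter

    # Enable table structure recognition in the more accurate TableFormer mode
    pipeline_options = PdfPipelineOptions(
        do_table_structure=True,
        table_structure_options=TableStructureOptions(
            mode=TableFormerMode.ACCURATE,
            do_cell_matching=True
        )
    )
    self._converter = DocumentConverter(format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            # PdfFormatOption takes the backend class, not an instance
            backend=PyPdfiumDocumentBackend
        )
    })
    return self._converter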

All changes maintain backward compatibility. Existing DoclingParser() usage is unchanged.

@dosubot (bot) added the size:L and 🌈 python labels on Dec 31, 2025
@superwiboy superwiboy closed this Jan 2, 2026