fix(pdf): propagate hyperlinks to DoclingDocument text items #3131

Open
hussainarslan wants to merge 4 commits into docling-project:main from hussainarslan:fix/propagate-pdf-hyperlinks

@hussainarslan

docling-parse already extracts PdfHyperlink objects with bounding rectangles and URIs into SegmentedPdfPage.hyperlinks, and TextItem already has a hyperlink field. However, the PDF pipeline never matched hyperlink annotations to text clusters — the data was available but never propagated.

Add spatial matching of PDF hyperlinks to text clusters during page assembly, then pass the resolved hyperlink through the reading order model to the final DoclingDocument.

Changes:

  • Add hyperlink field to TextElement (base_models.py)
  • Add _match_hyperlink() to PageAssembleModel that spatially matches cluster bboxes against hyperlink annotation rects, aggregating coverage per URI to handle wrapped links with multiple rects
  • Thread hyperlink= through add_text(), add_heading(), add_list_item() calls in ReadingOrderModel
  • Drop hyperlink on text merge when constituent clusters disagree
  • Fall back to Path when AnyUrl validation fails (matches HTML backend)
  • Regenerate affected ground truth files
  • Add unit tests for _match_hyperlink() edge cases
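For readers skimming the change list, a minimal sketch of how per-URI coverage matching and the merge rule could work. This is not the code from this PR: `Rect`, `match_hyperlink`, `merged_hyperlink`, the 0.5 coverage threshold, and the top-left coordinate origin are all assumptions made for illustration, and treating "no link" versus "some link" as a disagreement is one possible reading of the merge rule.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Rect:
    """Axis-aligned box with a top-left origin (hypothetical stand-in type)."""
    l: float
    t: float
    r: float
    b: float

    def area(self) -> float:
        return max(0.0, self.r - self.l) * max(0.0, self.b - self.t)

    def intersection_area(self, other: "Rect") -> float:
        w = min(self.r, other.r) - max(self.l, other.l)
        h = min(self.b, other.b) - max(self.t, other.t)
        return max(0.0, w) * max(0.0, h)


def match_hyperlink(
    cluster: Rect,
    links: Iterable[Tuple[str, Rect]],
    min_coverage: float = 0.5,  # assumed threshold, not taken from the PR
) -> Optional[str]:
    """Return the URI whose rects cover the largest share of the cluster."""
    cluster_area = cluster.area()
    if cluster_area <= 0.0:
        return None
    # Sum the overlap per URI so a wrapped link (several rects) counts once.
    covered = defaultdict(float)
    for uri, rect in links:
        covered[uri] += cluster.intersection_area(rect)
    if not covered:
        return None
    best_uri = max(covered, key=covered.get)
    coverage = min(1.0, covered[best_uri] / cluster_area)
    return best_uri if coverage >= min_coverage else None


def merged_hyperlink(links: Sequence[Optional[str]]) -> Optional[str]:
    """Keep a hyperlink across a text merge only if every part agrees."""
    return links[0] if len(set(links)) == 1 else None
```

Summing coverage per URI, rather than testing each rect on its own, is what lets a link wrapped over two lines match a cluster that neither rect covers alone.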

Closes #3096

Checklist:

  • Documentation — not needed (bug fix, no new API)
  • Examples — not needed (transparent behavior change)
  • Tests — added (10 unit tests covering all edge cases)

Signed-off-by: hussainarslan <[email protected]>
@github-actions
Contributor

github-actions bot commented Mar 14, 2026

DCO Check Passed

Thanks @hussainarslan, all your commits are properly signed off. 🎉

@mergify

mergify bot commented Mar 14, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
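The title rule above can be checked locally before pushing. A small sketch, assuming Mergify's `~=` operator behaves like an ordinary regular-expression search; the helper name `is_conventional` is hypothetical:

```python
import re

# Pattern copied from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


print(is_conventional("fix(pdf): propagate hyperlinks to DoclingDocument text items"))
```

The scope in parentheses and the `!` breaking-change marker are both optional, so `feat!: ...` and `docs: ...` pass as well.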

@codecov

codecov bot commented Mar 14, 2026

Codecov Report

❌ Patch coverage is 97.84946% with 2 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
...models/stages/page_assemble/page_assemble_model.py | 97.72% | 2 Missing ⚠️

Member

@PeterStaar-IBM PeterStaar-IBM left a comment

very nice!

@PeterStaar-IBM PeterStaar-IBM requested a review from cau-git March 15, 2026 06:17
self.options = options

@staticmethod
def _match_hyperlink(
Member

@hussainarslan For the hyperlinks that are not matched, it might be nice to simply propagate them with a different content layer?

With this approach, you might lose hyperlinks that were not matched.

Author

@hussainarslan hussainarslan Mar 15, 2026

@PeterStaar-IBM Great idea! The FURNITURE or NOTES content-layer could work well for missed hyperlinks. If that sounds reasonable, I can add it. But happy to adjust if you have another suggestion.

Author

@PeterStaar-IBM I have added logic that recovers those unmatched hyperlinks as reference items. Let me know if that works.

Track consumed hyperlink indices during cluster matching so that
hyperlinks which don't meet the overlap threshold are not silently
dropped. Unmatched hyperlinks that overlap text clusters are
materialized as synthetic REFERENCE TextElements. Also propagate
hyperlinks through FORMULA items in reading-order assembly.

Signed-off-by: macbook <[email protected]>
I, hussainarslan <[email protected]>, hereby add my Signed-off-by to this commit: 71a8d90

Signed-off-by: hussainarslan <[email protected]>
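The recovery step described in the commit message above could look roughly like the following. This is a sketch under stated assumptions, not the PR's implementation: `Box`, `Element`, `recover_unmatched`, and the `"reference"` label string are hypothetical names, and "overlaps a text cluster" is taken to mean any non-empty intersection.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Set, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box with a top-left origin (hypothetical stand-in type)."""
    l: float
    t: float
    r: float
    b: float


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share a region of non-zero area."""
    return min(a.r, b.r) > max(a.l, b.l) and min(a.b, b.b) > max(a.t, b.t)


@dataclass
class Element:
    label: str
    bbox: Box
    hyperlink: Optional[str] = None


def recover_unmatched(
    links: Sequence[Tuple[str, Box]],
    consumed: Set[int],
    clusters: Sequence[Box],
) -> List[Element]:
    """Turn hyperlinks that no cluster consumed into synthetic reference elements."""
    recovered = []
    for idx, (uri, rect) in enumerate(links):
        if idx in consumed:
            continue  # already attached to a cluster during matching
        # Only keep links that still touch some text on the page.
        if any(overlaps(rect, cluster) for cluster in clusters):
            recovered.append(Element(label="reference", bbox=rect, hyperlink=uri))
    return recovered
```

Tracking consumed indices rather than URIs keeps two annotations with the same target distinct, so one matched occurrence does not hide another unmatched one.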
@ceberam
Member

ceberam commented Mar 20, 2026

@hussainarslan could you please check the failed tests? It may be that you need to regenerate the ground truth files, since now you propagate hyperlinks and thus you create more content. Check Reference test documents for more details.

Update groundtruth files for 2206.01062, 2305.03393v1, and
textbox.docx to reflect hyperlink fields on text items and
new REFERENCE items for unmatched hyperlinks.

Signed-off-by: hussainarslan <[email protected]>
@hussainarslan
Author

@ceberam I have regenerated the ground truth files and pushed them.

Development

Successfully merging this pull request may close these issues.

Could not detect hyperlink

3 participants