fix(pdf): propagate hyperlinks to DoclingDocument text items#3131
hussainarslan wants to merge 4 commits into docling-project:main
docling-parse already extracts PdfHyperlink objects with bounding rectangles and URIs into SegmentedPdfPage.hyperlinks, and TextItem already has a hyperlink field. However, the PDF pipeline never matched hyperlink annotations to text clusters; the data was available but never propagated.

Add spatial matching of PDF hyperlinks to text clusters during page assembly, then pass the resolved hyperlink through the reading order model to the final DoclingDocument.

Changes:
- Add hyperlink field to TextElement (base_models.py)
- Add _match_hyperlink() to PageAssembleModel that spatially matches cluster bboxes against hyperlink annotation rects, aggregating coverage per URI to handle wrapped links with multiple rects
- Thread hyperlink= through add_text(), add_heading(), add_list_item() calls in ReadingOrderModel
- Drop hyperlink on text merge when constituent clusters disagree
- Fall back to Path when AnyUrl validation fails (matches HTML backend)
- Regenerate affected ground truth files
- Add unit tests for _match_hyperlink() edge cases

Closes docling-project#3096

Signed-off-by: hussainarslan <[email protected]>
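For readers following the description, here is a minimal sketch of the per-URI coverage idea. It is not the code in this PR: the `Rect` class, the `match_hyperlink` function, and the `min_coverage` threshold of 0.5 are hypothetical names and values chosen for illustration. It assumes a top-left coordinate origin and that the rects of one link do not overlap each other.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Rect:
    """Hypothetical axis-aligned box: (l, t) is top-left, (r, b) is bottom-right."""

    l: float
    t: float
    r: float
    b: float

    def area(self) -> float:
        return max(0.0, self.r - self.l) * max(0.0, self.b - self.t)

    def intersection_area(self, other: "Rect") -> float:
        w = min(self.r, other.r) - max(self.l, other.l)
        h = min(self.b, other.b) - max(self.t, other.t)
        return max(0.0, w) * max(0.0, h)


def match_hyperlink(
    cluster: Rect,
    links: List[Tuple[Rect, str]],
    min_coverage: float = 0.5,  # assumed threshold, not taken from the PR
) -> Optional[str]:
    """Return the URI whose rects cover the largest share of the cluster, if any."""
    cluster_area = cluster.area()
    if cluster_area <= 0:
        return None

    # Sum the overlap per URI so that a link wrapped over several lines
    # (several rects, one URI) is scored as a single link.
    coverage: Dict[str, float] = defaultdict(float)
    for rect, uri in links:
        coverage[uri] += cluster.intersection_area(rect)

    if not coverage:
        return None
    best_uri, best_area = max(coverage.items(), key=lambda kv: kv[1])
    return best_uri if best_area / cluster_area >= min_coverage else None
```

In this sketch a two-line cluster whose link wraps across both lines only clears the threshold because the two rects are summed under the same URI; either rect alone would fall short.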
✅ DCO Check Passed. Thanks @hussainarslan, all your commits are properly signed off. 🎉
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates: This rule is failing. When test data is updated, we require two reviewers.
🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
        self.options = options

    @staticmethod
    def _match_hyperlink(
@hussainarslan For the hyperlinks that are not matched, it might be nice to simply propagate them with a different content layer?
With this approach, you might lose hyperlinks that were not matched.
@PeterStaar-IBM Great idea! The FURNITURE or NOTES content-layer could work well for missed hyperlinks. If that sounds reasonable, I can add it. But happy to adjust if you have another suggestion.
> @hussainarslan For the hyperlinks that are not matched, it might be nice to simply propagate them with a different content layer?
> With this approach, you might lose hyperlinks that were not matched.

@PeterStaar-IBM I have added logic that recovers those unmatched hyperlinks as reference items. Let me know if that works.
Track consumed hyperlink indices during cluster matching so that hyperlinks which don't meet the overlap threshold are not silently dropped. Unmatched hyperlinks that overlap text clusters are materialized as synthetic REFERENCE TextElements. Also propagate hyperlinks through FORMULA items in reading-order assembly. Signed-off-by: macbook <[email protected]>
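The bookkeeping described in this commit message can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's code: `overlap_area`, `assign_links`, the tuple-based `Box` type, and the 0.5 threshold are all assumptions. Leftover links are returned as plain URIs here rather than materialized as REFERENCE TextElements.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom), top-left origin


def overlap_area(a: Box, b: Box) -> float:
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def assign_links(
    clusters: List[Box],
    links: List[Tuple[Box, str]],
    min_coverage: float = 0.5,  # assumed threshold, not taken from the PR
) -> Tuple[List[Optional[str]], List[str]]:
    """Return (one URI or None per cluster, URIs of unmatched links that touch text)."""
    consumed: Set[int] = set()
    assigned: List[Optional[str]] = []

    for cluster in clusters:
        c_area = overlap_area(cluster, cluster)  # area of the cluster itself
        per_uri: Dict[str, float] = defaultdict(float)
        idx_by_uri: Dict[str, List[int]] = defaultdict(list)
        for i, (box, uri) in enumerate(links):
            area = overlap_area(cluster, box)
            if area > 0:
                per_uri[uri] += area
                idx_by_uri[uri].append(i)

        best = max(per_uri.items(), key=lambda kv: kv[1]) if per_uri else None
        if best and c_area > 0 and best[1] / c_area >= min_coverage:
            assigned.append(best[0])
            consumed.update(idx_by_uri[best[0]])  # remember which links were used
        else:
            assigned.append(None)

    # Links that were never consumed but still overlap some text cluster are
    # kept for later recovery instead of being dropped.
    leftovers = [
        uri
        for i, (box, uri) in enumerate(links)
        if i not in consumed and any(overlap_area(box, c) > 0 for c in clusters)
    ]
    return assigned, leftovers
```

Tracking indices rather than URIs means a link that appears twice on a page is only marked as used where it actually matched.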
I, hussainarslan <[email protected]>, hereby add my Signed-off-by to this commit: 71a8d90 Signed-off-by: hussainarslan <[email protected]>
@hussainarslan could you please check the failed tests? It may be that you need to regenerate the ground truth files, since now you propagate hyperlinks and thus you create more content. Check Reference test documents for more details.
Update groundtruth files for 2206.01062, 2305.03393v1, and textbox.docx to reflect hyperlink fields on text items and new REFERENCE items for unmatched hyperlinks. Signed-off-by: hussainarslan <[email protected]>
@ceberam I have regenerated the ground truth files and pushed them.