christinestraub (Contributor) commented Oct 25, 2024

This PR adds support for link extraction in the PDF hi_res strategy. The partition_pdf() function now extracts hyperlinks from PDF documents when called with strategy="hi_res".

Summary

  • Added functionality to support link extraction in the hi_res flow.
  • Enhanced the word extraction used for link extraction in both the fast and hi_res flows, resulting in more accurate start_index and text values in the links metadata.
  • Updated the ingest fixture update workflow so that it no longer skips the Astra DB source test.

Testing

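from unstructured.partition.pdf import partition_pdf

# Partition a PDF with embedded hyperlinks using the hi_res strategy.
elements = partition_pdf(
    filename="example-docs/pdf/embedded-link.pdf",
    strategy="hi_res",
)
# The first element should carry three links in its metadata.
assert len(elements[0].metadata.links) == 3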
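
For reviewers who want to look at the extracted links rather than just count them, here is a minimal sketch of flattening the links metadata into rows. It assumes each entry in metadata.links is a dict with "text", "url", and "start_index" keys; the summary above only mentions start_index and text, so the "url" key and the exact shape are assumptions. The collect_links helper and the sample elements are hypothetical stand-ins, used so the snippet runs without the library or a PDF on disk.

```python
from types import SimpleNamespace


def collect_links(elements):
    """Flatten link metadata from a list of partitioned elements.

    Assumes each element exposes `metadata.links` as a list of dicts (or None)
    with "text", "url", and "start_index" keys. This shape is an assumption.
    """
    rows = []
    for position, element in enumerate(elements):
        # Treat a missing or None `links` attribute as "no links".
        links = getattr(element.metadata, "links", None) or []
        for link in links:
            rows.append(
                {
                    "element": position,
                    "text": link.get("text"),
                    "url": link.get("url"),
                    "start_index": link.get("start_index"),
                }
            )
    return rows


# Hypothetical stand-in elements; real ones would come from partition_pdf().
sample_elements = [
    SimpleNamespace(
        text="See the docs and the repo.",
        metadata=SimpleNamespace(
            links=[
                {"text": "docs", "url": "https://example.com/docs", "start_index": 8},
                {"text": "repo", "url": "https://example.com/repo", "start_index": 21},
            ]
        ),
    ),
    SimpleNamespace(text="No links here.", metadata=SimpleNamespace(links=None)),
]

for row in collect_links(sample_elements):
    print(row)
```

With real output, passing the list returned by partition_pdf() to collect_links should work the same way, provided the metadata has the shape assumed here.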

christinestraub force-pushed the feat/support-link-extraction-in-pdf-hi_res branch from 48d0233 to 881d6fc October 28, 2024 17:25
christinestraub marked this pull request as ready for review October 29, 2024 21:13
christinestraub and others added 4 commits October 30, 2024 00:49
…ixtures update (#3765)

This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: christinestraub <[email protected]>
…df-hi_res

# Conflicts:
#	CHANGELOG.md
#	unstructured/__version__.py
christinestraub added this pull request to the merge queue Oct 31, 2024
Merged via the queue into main with commit df156eb Oct 31, 2024
41 checks passed
christinestraub deleted the feat/support-link-extraction-in-pdf-hi_res branch October 31, 2024 17:40
temp-adelyn pushed a commit to temp-adelyn/unstructured that referenced this pull request Mar 3, 2025
…#3753)

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: christinestraub <[email protected]>
Co-authored-by: cragwolfe <[email protected]>