feat: support pdf link extraction in hi_res strategy #3753

christinestraub · 2024-10-25T06:37:58Z

This PR adds link extraction to the PDF hi_res strategy. partition_pdf() can now extract hyperlinks from PDF documents when called with strategy="hi_res".

Summary

- Added functionality to support link extraction in the hi_res flow.
- Enhanced the word extraction used for link extraction in both the fast and hi_res flows, resulting in more accurate start_index and text values in links metadata.
- Updated the ingest fixture update workflow so that it no longer skips the Astra DB source test.
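To make the second point concrete, here is a minimal sketch of how word-level extraction can feed link metadata. It is not the code from this PR: the `Word` class, `center_inside()`, and `link_from_annotation()` are hypothetical names, and the sketch assumes that each word carries a bounding box plus its character offset in the element text, and that a link entry is a dict with `text`, `url`, and `start_index` keys.

```python
from dataclasses import dataclass
from typing import Optional

# (x1, y1, x2, y2); the coordinate convention is an assumption of this sketch.
BBox = tuple[float, float, float, float]


@dataclass(frozen=True)
class Word:
    """Hypothetical word record: text, offset in the element text, and box."""

    text: str
    start_index: int
    bbox: BBox


def center_inside(inner: BBox, outer: BBox) -> bool:
    """Return True if the center point of `inner` lies within `outer`."""
    cx = (inner[0] + inner[2]) / 2
    cy = (inner[1] + inner[3]) / 2
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]


def link_from_annotation(
    words: list[Word], annotation_bbox: BBox, url: str
) -> Optional[dict]:
    """Build a link entry from the words covered by a link annotation.

    The link text is the covered words joined by single spaces, and
    start_index is the offset of the first covered word.
    """
    covered = [w for w in words if center_inside(w.bbox, annotation_bbox)]
    if not covered:
        return None
    covered.sort(key=lambda w: w.start_index)
    return {
        "text": " ".join(w.text for w in covered),
        "url": url,
        "start_index": covered[0].start_index,
    }
```

Under these assumptions, the accuracy of `text` and `start_index` depends directly on how cleanly words and their boxes are extracted, which is why better word extraction would improve the links metadata in both flows.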

Testing

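from unstructured.partition.pdf import partition_pdf

# Partition the sample PDF with the hi_res strategy.
elements = partition_pdf(
    filename="example-docs/pdf/embedded-link.pdf",
    strategy="hi_res",
)
# The first element should carry the three embedded hyperlinks.
assert len(elements[0].metadata.links) == 3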
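Beyond counting links, the extracted values can be checked for consistency with the element text. The sketch below assumes that each entry in `metadata.links` is a dict with `text`, `url`, and `start_index` keys, and that `start_index` is a character offset into the element's text. `collect_links()` and `link_is_consistent()` are hypothetical helpers written for illustration, not part of the library.

```python
from typing import Any, Iterable


def collect_links(elements: Iterable[Any]) -> list[dict]:
    """Flatten the links of all elements into one list of rows.

    Each row is the link dict plus the index of the element it came from.
    Elements whose metadata has no links (missing or None) are skipped.
    """
    rows = []
    for index, element in enumerate(elements):
        links = getattr(element.metadata, "links", None) or []
        for link in links:
            rows.append({"element_index": index, **link})
    return rows


def link_is_consistent(element_text: str, link: dict) -> bool:
    """Check that the link text appears in the element text at start_index."""
    text = link.get("text") or ""
    start = link.get("start_index", -1)
    if not text or start < 0:
        return False
    return element_text[start : start + len(text)] == text
```

With the `elements` list from the test above, `collect_links(elements)` would return one row per hyperlink, and `link_is_consistent(elements[0].text, row)` could be asserted for each row that belongs to the first element.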

Co-authored-by: ryannikolaidis, christinestraub, cragwolfe

christinestraub added 7 commits October 24, 2024 15:34

refactor: organize PDF link extraction functions in fast strategy

e713ef6

move url_metadata related functions from pdf.py to pdfminer_processing.py

c010c20

add ability to get urls_metadata from process_data_with_pdfminer

ef718e2

return layouts_urls_metadata in process_file_with_pdfminer()

7a863bf

add parameter layouts_urls_metadata to process_file_with_pdfminer()

485827e

update logic to get layouts url metadata

7223a13

add elements links using _get_links_in_element()

881d6fc

christinestraub force-pushed the feat/support-link-extraction-in-pdf-hi_res branch from 48d0233 to 881d6fc October 28, 2024 17:25

christinestraub and others added 11 commits October 28, 2024 10:26

Merge branch 'main' into feat/support-link-extraction-in-pdf-hi_res

12b53e5

test: fix lint errors

18d670c

update changelog.md and version.py

abfd803

remove unnecessary map function

6b44370

fix import error

7e8ad86

test: fix lint errors

9a3b59f

fix list index out of range

2ad76f8

fix: list index out of range

0445006

test: add unit test for hi_res link extraction

a656b22

refactor: move document_to_element_list from common module to pdf module

406b9d3

feat: support pdf link extraction in hi_res strategy <- Ingest test fixtures update (#3760)

9cf2f81

This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: christinestraub

christinestraub marked this pull request as ready for review October 29, 2024 21:13

christinestraub requested review from badGarnet and cragwolfe October 29, 2024 21:24

cragwolfe reviewed Oct 29, 2024

CHANGELOG.md (outdated, resolved)

cragwolfe reviewed Oct 29, 2024

test_unstructured_ingest/expected-structured-output/biomed-api/65/11/main.PMC6312790.pdf.json (outdated, resolved)

christinestraub and others added 4 commits October 30, 2024 00:49

chore: release version 0.16.4

c2cfd66

refactor: rename get_word_bounding_box_from_element() to `get_words_from_obj()`

d2332ca

feat: enhance word extraction from PDFMiner objects

6fa0c09

feat: support pdf link extraction in hi_res strategy <- Ingest test fixtures update (#3761)

c5772f3

This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: christinestraub

christinestraub requested a review from cragwolfe October 30, 2024 16:24

ci: add envs for astradb credentials to ingest-test-fixtures-update-pr workflow

89c97d2

feat: support pdf link extraction in hi_res strategy <- Ingest test fixtures update (#3765)

7cc86ba

This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: christinestraub

christinestraub requested a review from ryannikolaidis October 30, 2024 19:51

cragwolfe and others added 2 commits October 30, 2024 20:01

Merge branch 'main' into feat/support-link-extraction-in-pdf-hi_res

909af5a

deps: pin unstructured-ingest to 0.2.1

ae94895

christinestraub enabled auto-merge October 31, 2024 07:35

Merge branch 'refs/heads/main' into feat/support-link-extraction-in-pdf-hi_res

29fe122

Conflicts: CHANGELOG.md, unstructured/__version__.py

cragwolfe approved these changes Oct 31, 2024

christinestraub added this pull request to the merge queue Oct 31, 2024

Merged via the queue into main with commit df156eb Oct 31, 2024
41 checks passed

christinestraub deleted the feat/support-link-extraction-in-pdf-hi_res branch October 31, 2024 17:40

artdent mentioned this pull request Jan 28, 2025

feat/Allow PDF partitioning without unstructured_inference #2128

Open


4 participants
