Tags: Unstructured-IO/unstructured
feat: support pdf link extraction in hi_res strategy (#3753)

This PR adds support for link extraction in the pdf `hi_res` strategy. The `partition_pdf()` function now supports link extraction when using the `hi_res` strategy, allowing users to extract hyperlinks from PDF documents.

### Summary
- Added functionality to support link extraction in the `hi_res` flow
- Enhanced the word extraction used for link extraction in both the `fast` and `hi_res` flows, resulting in more accurate `start_index` and `text` values in the `links` metadata
- Updated the ingest fixture update workflow to not skip the Astra DB source test

### Testing
```
from unstructured.partition.pdf import partition_pdf

# Partition a PDF that contains embedded hyperlinks using the hi_res strategy
elements = partition_pdf(
    filename="example-docs/pdf/embedded-link.pdf",
    strategy="hi_res",
)
assert len(elements[0].metadata.links) == 3
```

---------
Co-authored-by: ryannikolaidis
Co-authored-by: christinestraub
Co-authored-by: cragwolfe
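The `start_index` and `text` fields of the `links` metadata can be sanity-checked against an element's text. The following is a minimal sketch, not the library's own code: it assumes each link is a dict with `text`, `url`, and `start_index` keys and that `start_index` is an offset into the element text. The `Link` type and the `misaligned_links` helper are hypothetical names introduced here for illustration.

```python
from typing import Optional, TypedDict


class Link(TypedDict):
    # Hypothetical mirror of the link metadata shape; field names are assumptions.
    text: Optional[str]
    url: str
    start_index: int


def misaligned_links(element_text: str, links: list[Link]) -> list[Link]:
    """Return the links whose text is not found at start_index in element_text."""
    bad = []
    for link in links:
        text = link["text"] or ""
        start = link["start_index"]
        # A link is aligned when slicing the element text at start_index
        # reproduces the link text exactly.
        if start < 0 or element_text[start:start + len(text)] != text:
            bad.append(link)
    return bad


# Made-up sample data: the second link is deliberately off by one character.
sample_text = "See the docs and the repo."
sample_links: list[Link] = [
    {"text": "docs", "url": "https://example.com/docs", "start_index": 8},
    {"text": "repo", "url": "https://example.com/repo", "start_index": 20},
]
```

With real output, one could pass `element.text` and `element.metadata.links` for each element that has links; an empty result would suggest the offsets line up under the assumption above.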
feat/remove ingest code, use new dep for tests (#3595)

### Description
Alternative to #3572 but maintaining all ingest tests, running them by pulling in the latest version of unstructured-ingest.

---------
Co-authored-by: ryannikolaidis
Co-authored-by: rbiseck3
Co-authored-by: Christine Straub
Co-authored-by: christinestraub
build(release): release commit for 0.15.14 (#3709)

### Summary
- cut release for version `0.15.14`
- ignore `vectara` ingest test due to a weird error occurring in: https://github.com/Unstructured-IO/unstructured/actions/runs/11256744351/job/31317150581?pr=3709
fix: correctly install mesa-gl for arm (#3647)

### Summary
Fixes the `arm64` image builds, which will be available again starting in version `0.15.13`. A fix was implemented upstream in Unstructured-IO/base-images#47 and a workaround that installed `x86` packages in the `unstructured` repo was removed.

### Testing
See [this job](https://github.com/Unstructured-IO/unstructured/actions/runs/10948943594/job/30401108059?pr=3647) for a successful `arm64` build on the feature branch.
fix: temporarily disable arm64 build (#3624)

### Summary
Per [this job](https://github.com/Unstructured-IO/unstructured/actions/runs/10842120429/job/30087252047), `arm64` builds are currently failing, likely because the workaround for the broken `mesa-gl` package from the `wolfi` repository only works for `amd64`. This change temporarily disables the `arm64` build in order to push out the latest `amd64` image with security patches. The change will then be reverted and the `arm64` image fixed.

- Unstructured-IO/base-images#44