Tags: Unstructured-IO/unstructured
Tags
Fix: partially filled inferred layout mark as not extracted (#4169) This PR fixes an issue where elements with partially filled extracted text is marked as extracted. ## bug scenario This PR adds a new unit test to show case a scenario: - during merging inferred and extracted layout the function `aggregate_embedded_text_by_block` aggregates extracted text that falls into an inferred element; and if all text has the flag `is_extracted` being `"true"` the inferred element is marked as such as well - however, there can be a case where the extracted text only partially fills the inferred element. There might be text in the inferred element region that are not present as extracted text (i.e., require OCR). But the current logic would still mark this inferred element as `is_extracted = "true"` ## Fix The fix adds another check in the function `aggregate_embedded_text_by_block`: if the intersect over union of between the source regions and target region cross a given threshold. This new check correctly identifies the case in the unit test that the inferred element should be be marked a `is_extracted = "false"`.
fix: pin deltalake<1.3.0 for ARM64 Docker builds (#4157) ## Summary - Pin `deltalake<1.3.0` to fix ARM64 Docker build failures ## Problem `deltalake` 1.3.0 is missing Linux ARM64 wheels due to a builder OOM issue on their CI. When pip can't find a wheel, it tries to build from source, which fails because the Wolfi base image doesn't have a C compiler (`cc`). This causes the `unstructured-ingest[delta-table]` install to fail, breaking the ARM64 Docker image. delta-io/delta-rs#4041 ## Solution Temporarily pin `deltalake<1.3.0` until: - deltalake publishes ARM64 wheels for 1.3.0+, OR - unstructured-ingest adds the pin to its `delta-table` extra ## Test plan - [ ] ARM64 Docker build succeeds 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Pins a dependency to unblock ARM64 builds and publishes a patch release. > > - Add `deltalake<1.3.0` to `requirements/ingest/ingest.txt` to avoid missing Linux ARM64 wheels breaking Docker builds > - Bump version to `0.18.26` and add corresponding CHANGELOG entry > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit b4f15b4. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> Co-authored-by: Claude Opus 4.5 <[email protected]>
fix(deps): Update security updates [SECURITY] (#4154) This PR contains the following updates: | Package | Change | [Age](https://docs.renovatebot.com/merge-confidence/) | [Confidence](https://docs.renovatebot.com/merge-confidence/) | |---|---|---|---| | [filelock](https://redirect.github.com/tox-dev/py-filelock) | `==3.20.0` → `==3.20.1` |  |  | | [marshmallow](https://redirect.github.com/marshmallow-code/marshmallow) ([changelog](https://marshmallow.readthedocs.io/en/latest/changelog.html)) | `==3.26.1` → `==3.26.2` |  |  | | [pypdf](https://redirect.github.com/py-pdf/pypdf) ([changelog](https://pypdf.readthedocs.io/en/latest/meta/CHANGELOG.html)) | `==6.3.0` → `==6.4.0` |  |  | | [urllib3](https://redirect.github.com/urllib3/urllib3) ([changelog](https://redirect.github.com/urllib3/urllib3/blob/main/CHANGES.rst)) | `==2.5.0` → `==2.6.0` |  |  | ### GitHub Vulnerability Alerts #### [CVE-2025-68146](https://redirect.github.com/tox-dev/filelock/security/advisories/GHSA-w853-jp5j-5j7f) ### Impact A Time-of-Check-Time-of-Use (TOCTOU) race condition allows local attackers to corrupt or truncate arbitrary user files through symlink attacks. The vulnerability exists in both Unix and Windows lock file creation where filelock checks if a file exists before opening it with O_TRUNC. An attacker can create a symlink pointing to a victim file in the time gap between the check and open, causing os.open() to follow the symlink and truncate the target file. **Who is impacted:** All users of filelock on Unix, Linux, macOS, and Windows systems. The vulnerability cascades to dependent libraries: - **virtualenv users**: Configuration files can be overwritten with virtualenv metadata, leaking sensitive paths - **PyTorch users**: CPU ISA cache or model checkpoints can be corrupted, causing crashes or ML pipeline failures - **poetry/tox users**: through using virtualenv or filelock on their own. Attack requires local filesystem access and ability to create symlinks (standard user permissions on Unix; Developer Mode on Windows 10+). Exploitation succeeds within 1-3 attempts when lock file paths are predictable. ### Patches Fixed in version **3.20.1**. **Unix/Linux/macOS fix:** Added O_NOFOLLOW flag to os.open() in UnixFileLock.\_acquire() to prevent symlink following. **Windows fix:** Added GetFileAttributesW API check to detect reparse points (symlinks/junctions) before opening files in WindowsFileLock.\_acquire(). **Users should upgrade to filelock 3.20.1 or later immediately.** ### Workarounds If immediate upgrade is not possible: 1. Use SoftFileLock instead of UnixFileLock/WindowsFileLock (note: different locking semantics, may not be suitable for all use cases) 2. Ensure lock file directories have restrictive permissions (chmod 0700) to prevent untrusted users from creating symlinks 3. Monitor lock file directories for suspicious symlinks before running trusted applications **Warning:** These workarounds provide only partial mitigation. The race condition remains exploitable. Upgrading to version 3.20.1 is strongly recommended. ______________________________________________________________________ ## Technical Details: How the Exploit Works ### The Vulnerable Code Pattern **Unix/Linux/macOS** (`src/filelock/_unix.py:39-44`): ```python def _acquire(self) -> None: ensure_directory_exists(self.lock_file) open_flags = os.O_RDWR | os.O_TRUNC # (1) Prepare to truncate if not Path(self.lock_file).exists(): # (2) CHECK: Does file exist? open_flags |= os.O_CREAT fd = os.open(self.lock_file, open_flags, ...) # (3) USE: Open and truncate ``` **Windows** (`src/filelock/_windows.py:19-28`): ```python def _acquire(self) -> None: raise_on_not_writable_file(self.lock_file) # (1) Check writability ensure_directory_exists(self.lock_file) flags = os.O_RDWR | os.O_CREAT | os.O_TRUNC # (2) Prepare to truncate fd = os.open(self.lock_file, flags, ...) # (3) Open and truncate ``` ### The Race Window The vulnerability exists in the gap between operations: **Unix variant:** ``` Time Victim Thread Attacker Thread ---- ------------- --------------- T0 Check: lock_file exists? → False T1 ↓ RACE WINDOW T2 Create symlink: lock → victim_file T3 Open lock_file with O_TRUNC → Follows symlink → Opens victim_file → Truncates victim_file to 0 bytes! ☠️ ``` **Windows variant:** ``` Time Victim Thread Attacker Thread ---- ------------- --------------- T0 Check: lock_file writable? T1 ↓ RACE WINDOW T2 Create symlink: lock → victim_file T3 Open lock_file with O_TRUNC → Follows symlink/junction → Opens victim_file → Truncates victim_file to 0 bytes! ☠️ ``` ### Step-by-Step Attack Flow **1. Attacker Setup:** ```python # Attacker identifies target application using filelock lock_path = "/tmp/myapp.lock" # Predictable lock path victim_file = "/home/victim/.ssh/config" # High-value target ``` **2. Attacker Creates Race Condition:** ```python import os import threading def attacker_thread(): # Remove any existing lock file try: os.unlink(lock_path) except FileNotFoundError: pass # Create symlink pointing to victim file os.symlink(victim_file, lock_path) print(f"[Attacker] Created: {lock_path} → {victim_file}") # Launch attack threading.Thread(target=attacker_thread).start() ``` **3. Victim Application Runs:** ```python from filelock import UnixFileLock # Normal application code lock = UnixFileLock("/tmp/myapp.lock") lock.acquire() # ← VULNERABILITY TRIGGERED HERE # At this point, /home/victim/.ssh/config is now 0 bytes! ``` **4. What Happens Inside os.open():** On Unix systems, when `os.open()` is called: ```c // Linux kernel behavior (simplified) int open(const char *pathname, int flags) { struct file *f = path_lookup(pathname); // Resolves symlinks by default! if (flags & O_TRUNC) { truncate_file(f); // ← Truncates the TARGET of the symlink } return file_descriptor; } ``` Without `O_NOFOLLOW` flag, the kernel follows the symlink and truncates the target file. ### Why the Attack Succeeds Reliably **Timing Characteristics:** - **Check operation** (Path.exists()): ~100-500 nanoseconds - **Symlink creation** (os.symlink()): ~1-10 microseconds - **Race window**: ~1-5 microseconds (very small but exploitable) - **Thread scheduling quantum**: ~1-10 milliseconds **Success factors:** 1. **Tight loop**: Running attack in a loop hits the race window within 1-3 attempts 2. **CPU scheduling**: Modern OS thread schedulers frequently context-switch during I/O operations 3. **No synchronization**: No atomic file creation prevents the race 4. **Symlink speed**: Creating symlinks is extremely fast (metadata-only operation) ### Real-World Attack Scenarios **Scenario 1: virtualenv Exploitation** ```python # Victim runs: python -m venv /tmp/myenv # Attacker racing to create: os.symlink("/home/victim/.bashrc", "/tmp/myenv/pyvenv.cfg") # Result: /home/victim/.bashrc overwritten with: # home = /usr/bin/python3 # include-system-site-packages = false # version = 3.11.2 # ← Original .bashrc contents LOST + virtualenv metadata LEAKED to attacker ``` **Scenario 2: PyTorch Cache Poisoning** ```python # Victim runs: import torch # PyTorch checks CPU capabilities, uses filelock on cache # Attacker racing to create: os.symlink("/home/victim/.torch/compiled_model.pt", "/home/victim/.cache/torch/cpu_isa_check.lock") # Result: Trained ML model checkpoint truncated to 0 bytes # Impact: Weeks of training lost, ML pipeline DoS ``` ### Why Standard Defenses Don't Help **File permissions don't prevent this:** - Attacker doesn't need write access to victim_file - os.open() with O_TRUNC follows symlinks using the *victim's* permissions - The victim process truncates its own file **Directory permissions help but aren't always feasible:** - Lock files often created in shared /tmp directory (mode 1777) - Applications may not control lock file location - Many apps use predictable paths in user-writable directories **File locking doesn't prevent this:** - The truncation happens *during* the open() call, before any lock is acquired - fcntl.flock() only prevents concurrent lock acquisition, not symlink attacks ### Exploitation Proof-of-Concept Results From empirical testing with the provided PoCs: **Simple Direct Attack** (`filelock_simple_poc.py`): - Success rate: 33% per attempt (1 in 3 tries) - Average attempts to success: 2.1 - Target file reduced to 0 bytes in \<100ms **virtualenv Attack** (`weaponized_virtualenv.py`): - Success rate: ~90% on first attempt (deterministic timing) - Information leaked: File paths, Python version, system configuration - Data corruption: Complete loss of original file contents **PyTorch Attack** (`weaponized_pytorch.py`): - Success rate: 25-40% per attempt - Impact: Application crashes, model loading failures - Recovery: Requires cache rebuild or model retraining **Discovered and reported by:** George Tsigourakos (@​tsigouris007) #### [CVE-2025-68480](https://redirect.github.com/marshmallow-code/marshmallow/security/advisories/GHSA-428g-f7cq-pgp5) ### Impact `Schema.load(data, many=True)` is vulnerable to denial of service attacks. A moderately sized request can consume a disproportionate amount of CPU time. ### Patches 4.1.2, 3.26.2 ### Workarounds ```py # Fail fast def load_many(schema, data, **kwargs): if not isinstance(data, list): raise ValidationError(['Invalid input type.']) return [schema.load(item, **kwargs) for item in data] ``` #### [CVE-2025-66019](https://redirect.github.com/py-pdf/pypdf/security/advisories/GHSA-jfx9-29x2-rv3j) ### Impact An attacker who uses this vulnerability can craft a PDF which leads to a memory usage of up to 1 GB per stream. This requires parsing the content stream of a page using the LZWDecode filter. This is a follow up to [GHSA-jfx9-29x2-rv3j](https://redirect.github.com/py-pdf/pypdf/security/advisories/GHSA-jfx9-29x2-rv3j) to align the default limit with the one for *zlib*. ### Patches This has been fixed in [pypdf==6.4.0](https://redirect.github.com/py-pdf/pypdf/releases/tag/6.4.0). ### Workarounds If users cannot upgrade yet, use the line below to overwrite the default in their code: ```python pypdf.filters.LZW_MAX_OUTPUT_LENGTH = 75_000_000 ``` #### [CVE-2025-66418](https://redirect.github.com/urllib3/urllib3/security/advisories/GHSA-gm62-xv2j-4w53) ## Impact urllib3 supports chained HTTP encoding algorithms for response content according to RFC 9110 (e.g., `Content-Encoding: gzip, zstd`). However, the number of links in the decompression chain was unbounded allowing a malicious server to insert a virtually unlimited number of compression steps leading to high CPU usage and massive memory allocation for the decompressed data. ## Affected usages Applications and libraries using urllib3 version 2.5.0 and earlier for HTTP requests to untrusted sources unless they disable content decoding explicitly. ## Remediation Upgrade to at least urllib3 v2.6.0 in which the library limits the number of links to 5. If upgrading is not immediately possible, use [`preload_content=False`](https://urllib3.readthedocs.io/en/2.5.0/advanced-usage.html#streaming-and-i-o) and ensure that `resp.headers["content-encoding"]` contains a safe number of encodings before reading the response content. #### [CVE-2025-66471](https://redirect.github.com/urllib3/urllib3/security/advisories/GHSA-2xpw-w6gg-jr37) ### Impact urllib3's [streaming API](https://urllib3.readthedocs.io/en/2.5.0/advanced-usage.html#streaming-and-i-o) is designed for the efficient handling of large HTTP responses by reading the content in chunks, rather than loading the entire response body into memory at once. When streaming a compressed response, urllib3 can perform decoding or decompression based on the HTTP `Content-Encoding` header (e.g., `gzip`, `deflate`, `br`, or `zstd`). The library must read compressed data from the network and decompress it until the requested chunk size is met. Any resulting decompressed data that exceeds the requested amount is held in an internal buffer for the next read operation. The decompression logic could cause urllib3 to fully decode a small amount of highly compressed data in a single operation. This can result in excessive resource consumption (high CPU usage and massive memory allocation for the decompressed data; CWE-409) on the client side, even if the application only requested a small chunk of data. ### Affected usages Applications and libraries using urllib3 version 2.5.0 and earlier to stream large compressed responses or content from untrusted sources. `stream()`, `read(amt=256)`, `read1(amt=256)`, `read_chunked(amt=256)`, `readinto(b)` are examples of `urllib3.HTTPResponse` method calls using the affected logic unless decoding is disabled explicitly. ### Remediation Upgrade to at least urllib3 v2.6.0 in which the library avoids decompressing data that exceeds the requested amount. If your environment contains a package facilitating the Brotli encoding, upgrade to at least Brotli 1.2.0 or brotlicffi 1.2.0.0 too. These versions are enforced by the `urllib3[brotli]` extra in the patched versions of urllib3. ### Credits The issue was reported by @​Cycloctane. Supplemental information was provided by @​stamparm during a security audit performed by [7ASecurity](https://7asecurity.com/) and facilitated by [OSTIF](https://ostif.org/). --- ### Release Notes <details> <summary>tox-dev/py-filelock (filelock)</summary> ### [`v3.20.1`](https://redirect.github.com/tox-dev/filelock/releases/tag/3.20.1) [Compare Source](https://redirect.github.com/tox-dev/py-filelock/compare/3.20.0...3.20.1) <!-- Release notes generated using configuration in .github/release.yml at main --> ##### What's Changed - CVE-2025-68146: Fix TOCTOU symlink vulnerability in lock file creation by [@​gaborbernat](https://redirect.github.com/gaborbernat) in [tox-dev/filelock#461](https://redirect.github.com/tox-dev/filelock/pull/461) **Full Changelog**: <tox-dev/filelock@3.20.0...3.20.1> </details> <details> <summary>marshmallow-code/marshmallow (marshmallow)</summary> ### [`v3.26.2`](https://redirect.github.com/marshmallow-code/marshmallow/blob/HEAD/CHANGELOG.rst#3262-2025-12-19) [Compare Source](https://redirect.github.com/marshmallow-code/marshmallow/compare/3.26.1...3.26.2) Bug fixes: - :cve:`2025-68480`: Merge error store messages without rebuilding collections. Thanks 카푸치노 for reporting and :user:`deckar01` for the fix. </details> <details> <summary>py-pdf/pypdf (pypdf)</summary> ### [`v6.4.0`](https://redirect.github.com/py-pdf/pypdf/blob/HEAD/CHANGELOG.md#Version-641-2025-12-07) [Compare Source](https://redirect.github.com/py-pdf/pypdf/compare/6.3.0...6.4.0) ##### Performance Improvements (PI) - Optimize loop for layout mode text extraction ([#​3543](https://redirect.github.com/py-pdf/pypdf/issues/3543)) ##### Bug Fixes (BUG) - Do not fail on choice field without /Opt key ([#​3540](https://redirect.github.com/py-pdf/pypdf/issues/3540)) ##### Documentation (DOC) - Document possible issues with merge\_page and clipping ([#​3546](https://redirect.github.com/py-pdf/pypdf/issues/3546)) - Add some notes about library security ([#​3545](https://redirect.github.com/py-pdf/pypdf/issues/3545)) ##### Maintenance (MAINT) - Use CORE\_FONT\_METRICS for widths where possible ([#​3526](https://redirect.github.com/py-pdf/pypdf/issues/3526)) [Full Changelog](https://redirect.github.com/py-pdf/pypdf/compare/6.4.0...6.4.1) </details> <details> <summary>urllib3/urllib3 (urllib3)</summary> ### [`v2.6.0`](https://redirect.github.com/urllib3/urllib3/blob/HEAD/CHANGES.rst#260-2025-12-05) [Compare Source](https://redirect.github.com/urllib3/urllib3/compare/2.5.0...2.6.0) \================== ## Security - Fixed a security issue where streaming API could improperly handle highly compressed HTTP content ("decompression bombs") leading to excessive resource consumption even when a small amount of data was requested. Reading small chunks of compressed data is safer and much more efficient now. (`GHSA-2xpw-w6gg-jr37 <https://github.com/urllib3/urllib3/security/advisories/GHSA-2xpw-w6gg-jr37>`\_\_) - Fixed a security issue where an attacker could compose an HTTP response with virtually unlimited links in the `Content-Encoding` header, potentially leading to a denial of service (DoS) attack by exhausting system resources during decoding. The number of allowed chained encodings is now limited to 5. (`GHSA-gm62-xv2j-4w53 <https://github.com/urllib3/urllib3/security/advisories/GHSA-gm62-xv2j-4w53>`\_\_) .. caution:: - If urllib3 is not installed with the optional `urllib3[brotli]` extra, but your environment contains a Brotli/brotlicffi/brotlipy package anyway, make sure to upgrade it to at least Brotli 1.2.0 or brotlicffi 1.2.0.0 to benefit from the security fixes and avoid warnings. Prefer using `urllib3[brotli]` to install a compatible Brotli package automatically. - If you use custom decompressors, please make sure to update them to respect the changed API of `urllib3.response.ContentDecoder`. ## Features - Enabled retrieval, deletion, and membership testing in `HTTPHeaderDict` using bytes keys. (`#​3653 <https://github.com/urllib3/urllib3/issues/3653>`\_\_) - Added host and port information to string representations of `HTTPConnection`. (`#​3666 <https://github.com/urllib3/urllib3/issues/3666>`\_\_) - Added support for Python 3.14 free-threading builds explicitly. (`#​3696 <https://github.com/urllib3/urllib3/issues/3696>`\_\_) ## Removals - Removed the `HTTPResponse.getheaders()` method in favor of `HTTPResponse.headers`. Removed the `HTTPResponse.getheader(name, default)` method in favor of `HTTPResponse.headers.get(name, default)`. (`#​3622 <https://github.com/urllib3/urllib3/issues/3622>`\_\_) ## Bugfixes - Fixed redirect handling in `urllib3.PoolManager` when an integer is passed for the retries parameter. (`#​3649 <https://github.com/urllib3/urllib3/issues/3649>`\_\_) - Fixed `HTTPConnectionPool` when used in Emscripten with no explicit port. (`#​3664 <https://github.com/urllib3/urllib3/issues/3664>`\_\_) - Fixed handling of `SSLKEYLOGFILE` with expandable variables. (`#​3700 <https://github.com/urllib3/urllib3/issues/3700>`\_\_) ## Misc - Changed the `zstd` extra to install `backports.zstd` instead of `zstandard` on Python 3.13 and before. (`#​3693 <https://github.com/urllib3/urllib3/issues/3693>`\_\_) - Improved the performance of content decoding by optimizing `BytesQueueBuffer` class. (`#​3710 <https://github.com/urllib3/urllib3/issues/3710>`\_\_) - Allowed building the urllib3 package with newer setuptools-scm v9.x. (`#​3652 <https://github.com/urllib3/urllib3/issues/3652>`\_\_) - Ensured successful urllib3 builds by setting Hatchling requirement to >= 1.27.0. (`#​3638 <https://github.com/urllib3/urllib3/issues/3638>`\_\_) </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 👻 **Immortal**: This PR will be recreated if closed unmerged. Get [config help](https://redirect.github.com/renovatebot/renovate/discussions) if that's undesired. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://redirect.github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0Mi42Ni4zIiwidXBkYXRlZEluVmVyIjoiNDIuNjYuMyIsInRhcmdldEJyYW5jaCI6Im1haW4iLCJsYWJlbHMiOlsic2VjdXJpdHkiXX0=--> Co-authored-by: utic-renovate[bot] <235200891+utic-renovate[bot]@users.noreply.github.com>
fix(deps): Bump fonttools to address cve (#4125) <!-- CURSOR_SUMMARY --> > [!NOTE] > Constrain fonttools to >=4.60.2 (CVE-2025-66034), bump extras to 4.61.0, switch setup_ingest to ubuntu-latest-m, and release 0.18.22. > > - **Dependencies**: > - Constrain `fonttools>=4.60.2` in `requirements/deps/constraints.txt` to address CVE-2025-66034. > - Bump `fonttools` to `4.61.0` in `requirements/extra-*.txt`; refresh files via uv and align constraint references. > - **CI**: > - Update `setup_ingest` job in `.github/workflows/ci.yml` to run on `ubuntu-latest-m`. > - **Release**: > - Bump version to `0.18.22` and update `CHANGELOG.md`. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 6ec072e. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY -->
chore: dependency updates to resolve cves (#4124) <!-- CURSOR_SUMMARY --> > [!NOTE] > Release 0.18.21 with broad dependency pin updates across requirements (notably unstructured-inference 1.1.2) to remediate CVEs. > > - **Release** > - Set version to `0.18.21` and update `CHANGELOG.md`. > - **Dependencies** > - Upgrade `unstructured-inference` to `1.1.2` in `requirements/extra-pdf-image.txt` to address CVEs. > - Refresh pins across `requirements/*.txt` (base, dev, test, and extras), including updates like `certifi`, `click`, `pypdf`, `pypandoc`, `paddlepaddle`, `torch`/`torchvision`, `google-auth` stack, `protobuf`, `safetensors`, etc.; normalize pip-compile headers and constraint paths. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit face075. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY -->
fix: sanitize MSG attachment filenames to prevent path traversal (GHS… ( #4117) Summary Fixes path traversal vulnerability in email and MSG attachment filename handling (GHSA-gm8q-m8mv-jj5m). Changes Security Fix Sanitizes attachment filenames in _AttachmentPartitioner for both email.py and msg.py Uses os.path.basename() to strip path components from filenames Normalizes backslashes to forward slashes to handle Windows paths on Unix systems Removes null bytes and other control characters Handles edge cases (empty strings, ".", "..") Defaults to "unknown" for invalid or dangerous filenames Test Coverage Added 17 comprehensive tests covering: Path traversal attempts (../../../etc/passwd) Absolute Unix paths (/etc/passwd) Absolute Windows paths (C:\Windows\System32\config\sam) Null byte injection (file\x00.txt) Dot and dotdot filenames (. and ..) Missing/empty filenames Complex mixed path separators Valid filenames (ensuring they pass through unchanged) Test Results ✅ All 17 new security tests pass ✅ All 129 existing tests pass ✅ No regressions Security Impact Prevents attackers from using malicious attachment filenames to write files outside the intended directory, which could lead to arbitrary file write vulnerabilities. Changes include comprehensive test coverage for various attack vectors and a version bump to 0.18.18. --------- Co-authored-by: Claude <[email protected]>
PreviousNext