Improve OTA provider index refresh resilience by TheJulianJES · Pull Request #1845 · zigpy/zigpy

TheJulianJES · 2026-06-22T22:08:42Z

DRAFT. Maybe this should be split up into three PRs. When reviewing, review the individual commits.

Follow-up to #1837. Makes the per-provider index refresh handle slow, failing, and permanently-dead providers gracefully. Branch: tjj/ota-provider-health.

Changes

Refresh provider indexes concurrently. get_ota_images() previously refreshed each compatible provider's index sequentially, so one slow or unreachable provider delayed every other by up to the 20 s fetch timeout — and dead providers are real (the ThirdReality and Inovelli servers shut down permanently), so the stalls stack up. Refreshes are now gathered concurrently; safe because each refresh only rebinds its own provider's cache bucket and swallows its own exceptions.
Exponential backoff for failed index downloads. A failed download previously set _index_last_updated in a finally block, locking the provider out for the full 24 h expiration: a transient network blip at startup meant no OTA images from that provider for a day. Failures now retry after an exponentially growing delay (30 min base, doubling per consecutive failure, capped at the 24 h TTL), tracked separately from the success timestamp. invalidate_provider_caches() clears the failure state via the new provider.invalidate_index(), so a user-initiated re-check can always retry immediately.
Expire images of providers down for a prolonged time. A provider that cannot be reached cannot withdraw images either, so a permanently dead provider would keep offering its last-known images until restart. When a refresh fails and the last successful refresh is older than STALE_INDEX_EXPIRATION_TIME (3 days), the provider's cached images are dropped with a single warning. Age-based (not failure-count-based) to avoid flapping update entities on intermittent connectivity; recovery repopulates on the next check.
Warn about unverifiable trusted images once per refresh. The "trusted image lacks a SHA3-256 checksum" warning lived in the per-device matching path, so it fired for every device on every check cycle. The check now runs when a provider's index is refreshed: unverifiable trusted images are excluded from the cache up front (so they no longer appear in downgrades either) and the warning fires once per refresh, re-arming naturally on invalidation.

Notes

New provider state: _index_last_success (monotone, not reset by invalidation, so it tracks true staleness), plus failure count/timestamp. New constants INDEX_RETRY_DELAY (30 min) and STALE_INDEX_EXPIRATION_TIME (3 days).
The backoff exponent is capped (min(failures-1, 16)) to avoid a timedelta overflow.

Testing

New tests for concurrent refresh (a provider whose index load blocks until a second provider's load starts — a sequential refresh would stall), exponential backoff lifecycle, stale-provider expiry + recovery, and the once-per-refresh warning. Full suite passes; zigpy/ota/ at 99% coverage.

A slow or unreachable provider previously delayed every other provider by up to the 20 second fetch timeout, since indexes were refreshed sequentially. Dead providers are real (the ThirdReality and Inovelli servers shut down permanently), so the stalls stack up in practice. Refreshes are now gathered concurrently; this is safe because each refresh only rebinds its own provider's cache bucket and swallows its own exceptions.

A failed index download previously set _index_last_updated in a finally block, locking the provider out for the full 24 hour expiration time: a transient network blip at startup meant no OTA images from that provider for a day. Failed downloads now retry after an exponentially growing delay (30 minutes, doubling per consecutive failure, capped at the expiration time), tracked separately from the success timestamp. invalidate_provider_caches() clears the failure state via the new provider.invalidate_index(), so a user-initiated re-check can always retry immediately. The new _index_last_success timestamp is not reset by invalidation and will be used to expire images of providers that have been unreachable for a prolonged time.

A provider that cannot be reached cannot withdraw images either, so a permanently dead provider (e.g. the shut-down ThirdReality and Inovelli servers) would keep offering its last-known images until restart. When an index refresh fails and the last successful refresh is older than STALE_INDEX_EXPIRATION_TIME (3 days), the provider's cached images are now dropped with a single warning. The threshold is age-based rather than failure-count-based so intermittent connectivity does not cause update entities to flap, and recovery repopulates the images on the next check.

The missing-SHA3-256-checksum warning lived in the per-device matching path, so it fired for every device on every check cycle. The check now runs when a provider's index is refreshed: unverifiable trusted images are excluded from the cache up front (so they no longer appear in downgrades either) and the warning fires once per refresh, re-arming naturally on invalidation.

codecov · 2026-06-22T22:11:15Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.57%. Comparing base (43dea44) to head (8c447d2).

Additional details and impacted files

@@           Coverage Diff           @@
##              dev    #1845   +/-   ##
=======================================
  Coverage   99.56%   99.57%           
=======================================
  Files          64       64           
  Lines       13233    13263   +30     
=======================================
+ Hits        13176    13206   +30     
  Misses         57       57

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

zigpy-review-bot

Reviewed the full backoff / stale-expiry state machine and traced each new test against it — the logic holds up well. Notes below are all non-blocking.

Verified:

Backoff lifecycle. _index_last_updated is deliberately not advanced on failure anymore (only on success), so the 24h gate stays open for a down provider and the new _index_failures/_index_last_failure backoff is what throttles retries. Traced test_ota_index_download_failure_backoff step by step — the 30min → doubling → 24h-cap progression and the invalidate_index() reset all behave as asserted.
Stale expiry is genuinely once. It keys off _index_last_success (which invalidate_index() deliberately leaves untouched), and the single-warning behavior relies on del self._image_cache[provider] making the second pass early-return. Cached images always imply a prior success, so _index_last_success is never spuriously epoch when images exist. ✓
Trusted-image filter move is a real fix, not just a relocation. Moving the SHA3-256 check from _remove_colliding_images (upgrades-only) to index-refresh time means unverifiable trusted images are now also kept out of downgrades (built independently from candidates in get_ota_images), which the old placement missed. The leftover assert img.metadata.checksum is not None in _remove_colliding_images stays valid, since any trusted image reaching it now necessarily has a sha3-256 checksum.
No downstream usage of load_index / invalidate_provider_caches / _index_* in zha or zha-device-handlers, so the internal reshuffle carries no pinned-API risk. Full tests/ota/ suite green locally (119 passed); CI + patch coverage green.

Minor (optional): the failure-counter state is now split across two layers — load_index() owns the reset (_index_failures = 0) while the manager owns the increment (record_index_failure()). Correct as written because the timeout is enforced caller-side, but it does mean any future direct load_index() caller would silently skip the backoff. A one-line back-reference on record_index_failure (or asserting the invariant) would harden it.

(Noted it's still marked DRAFT — posting as a comment, not an approval.)

zigpy-review-bot · 2026-06-22T22:13:47Z

+        return True
+
+    def record_index_failure(self) -> None:
+        """Record a failed index download, growing the retry backoff."""


Optional: the increment lives here but the reset-on-success lives in load_index(), and a failure is only ever recorded if the caller wraps the download (the manager does, for the timeout). Worth a short note here that this must be paired with the manager's try/except so a future direct load_index() caller doesn't quietly lose the backoff.

TheJulianJES added 4 commits June 22, 2026 20:40

zigpy-review-bot reviewed Jun 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve OTA provider index refresh resilience#1845

Improve OTA provider index refresh resilience#1845
TheJulianJES wants to merge 4 commits into
zigpy:devfrom
TheJulianJES:tjj/ota-provider-health

TheJulianJES commented Jun 22, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 22, 2026

Uh oh!

zigpy-review-bot left a comment

Uh oh!

zigpy-review-bot Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

TheJulianJES commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Notes

Testing

Uh oh!

codecov Bot commented Jun 22, 2026

Codecov Report

Uh oh!

zigpy-review-bot left a comment

Choose a reason for hiding this comment

Uh oh!

zigpy-review-bot Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

TheJulianJES commented Jun 22, 2026 •

edited

Loading