Skip to content

Improve OTA provider index refresh resilience#1845

Draft
TheJulianJES wants to merge 4 commits into
zigpy:devfrom
TheJulianJES:tjj/ota-provider-health
Draft

Improve OTA provider index refresh resilience#1845
TheJulianJES wants to merge 4 commits into
zigpy:devfrom
TheJulianJES:tjj/ota-provider-health

Conversation

@TheJulianJES

@TheJulianJES TheJulianJES commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

DRAFT. Maybe this should be split up into three PRs. When reviewing, review the individual commits.

Follow-up to #1837. Makes the per-provider index refresh handle slow, failing, and permanently-dead providers gracefully. Branch: tjj/ota-provider-health.

Changes

  • Refresh provider indexes concurrently. get_ota_images() previously refreshed each compatible provider's index sequentially, so one slow or unreachable provider delayed every other by up to the 20 s fetch timeout — and dead providers are real (the ThirdReality and Inovelli servers shut down permanently), so the stalls stack up. Refreshes are now gathered concurrently; safe because each refresh only rebinds its own provider's cache bucket and swallows its own exceptions.
  • Exponential backoff for failed index downloads. A failed download previously set _index_last_updated in a finally block, locking the provider out for the full 24 h expiration: a transient network blip at startup meant no OTA images from that provider for a day. Failures now retry after an exponentially growing delay (30 min base, doubling per consecutive failure, capped at the 24 h TTL), tracked separately from the success timestamp. invalidate_provider_caches() clears the failure state via the new provider.invalidate_index(), so a user-initiated re-check can always retry immediately.
  • Expire images of providers down for a prolonged time. A provider that cannot be reached cannot withdraw images either, so a permanently dead provider would keep offering its last-known images until restart. When a refresh fails and the last successful refresh is older than STALE_INDEX_EXPIRATION_TIME (3 days), the provider's cached images are dropped with a single warning. Age-based (not failure-count-based) to avoid flapping update entities on intermittent connectivity; recovery repopulates on the next check.
  • Warn about unverifiable trusted images once per refresh. The "trusted image lacks a SHA3-256 checksum" warning lived in the per-device matching path, so it fired for every device on every check cycle. The check now runs when a provider's index is refreshed: unverifiable trusted images are excluded from the cache up front (so they no longer appear in downgrades either) and the warning fires once per refresh, re-arming naturally on invalidation.

Notes

  • New provider state: _index_last_success (monotone, not reset by invalidation, so it tracks true staleness), plus failure count/timestamp. New constants INDEX_RETRY_DELAY (30 min) and STALE_INDEX_EXPIRATION_TIME (3 days).
  • The backoff exponent is capped (min(failures-1, 16)) to avoid a timedelta overflow.

Testing

New tests for concurrent refresh (a provider whose index load blocks until a second provider's load starts — a sequential refresh would stall), exponential backoff lifecycle, stale-provider expiry + recovery, and the once-per-refresh warning. Full suite passes; zigpy/ota/ at 99% coverage.

A slow or unreachable provider previously delayed every other provider
by up to the 20 second fetch timeout, since indexes were refreshed
sequentially. Dead providers are real (the ThirdReality and Inovelli
servers shut down permanently), so the stalls stack up in practice.
Refreshes are now gathered concurrently; this is safe because each
refresh only rebinds its own provider's cache bucket and swallows its
own exceptions.
A failed index download previously set _index_last_updated in a
finally block, locking the provider out for the full 24 hour expiration
time: a transient network blip at startup meant no OTA images from that
provider for a day. Failed downloads now retry after an exponentially
growing delay (30 minutes, doubling per consecutive failure, capped at
the expiration time), tracked separately from the success timestamp.

invalidate_provider_caches() clears the failure state via the new
provider.invalidate_index(), so a user-initiated re-check can always
retry immediately. The new _index_last_success timestamp is not reset
by invalidation and will be used to expire images of providers that
have been unreachable for a prolonged time.
A provider that cannot be reached cannot withdraw images either, so a
permanently dead provider (e.g. the shut-down ThirdReality and Inovelli
servers) would keep offering its last-known images until restart. When
an index refresh fails and the last successful refresh is older than
STALE_INDEX_EXPIRATION_TIME (3 days), the provider's cached images are
now dropped with a single warning. The threshold is age-based rather
than failure-count-based so intermittent connectivity does not cause
update entities to flap, and recovery repopulates the images on the
next check.
The missing-SHA3-256-checksum warning lived in the per-device matching
path, so it fired for every device on every check cycle. The check now
runs when a provider's index is refreshed: unverifiable trusted images
are excluded from the cache up front (so they no longer appear in
downgrades either) and the warning fires once per refresh, re-arming
naturally on invalidation.
@codecov

codecov Bot commented Jun 22, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.57%. Comparing base (43dea44) to head (8c447d2).

Additional details and impacted files
@@           Coverage Diff           @@
##              dev    #1845   +/-   ##
=======================================
  Coverage   99.56%   99.57%           
=======================================
  Files          64       64           
  Lines       13233    13263   +30     
=======================================
+ Hits        13176    13206   +30     
  Misses         57       57           

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@zigpy-review-bot zigpy-review-bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the full backoff / stale-expiry state machine and traced each new test against it — the logic holds up well. Notes below are all non-blocking.

Verified:

  • Backoff lifecycle. _index_last_updated is deliberately not advanced on failure anymore (only on success), so the 24h gate stays open for a down provider and the new _index_failures/_index_last_failure backoff is what throttles retries. Traced test_ota_index_download_failure_backoff step by step — the 30min → doubling → 24h-cap progression and the invalidate_index() reset all behave as asserted.
  • Stale expiry is genuinely once. It keys off _index_last_success (which invalidate_index() deliberately leaves untouched), and the single-warning behavior relies on del self._image_cache[provider] making the second pass early-return. Cached images always imply a prior success, so _index_last_success is never spuriously epoch when images exist. ✓
  • Trusted-image filter move is a real fix, not just a relocation. Moving the SHA3-256 check from _remove_colliding_images (upgrades-only) to index-refresh time means unverifiable trusted images are now also kept out of downgrades (built independently from candidates in get_ota_images), which the old placement missed. The leftover assert img.metadata.checksum is not None in _remove_colliding_images stays valid, since any trusted image reaching it now necessarily has a sha3-256 checksum.
  • No downstream usage of load_index / invalidate_provider_caches / _index_* in zha or zha-device-handlers, so the internal reshuffle carries no pinned-API risk. Full tests/ota/ suite green locally (119 passed); CI + patch coverage green.

Minor (optional): the failure-counter state is now split across two layers — load_index() owns the reset (_index_failures = 0) while the manager owns the increment (record_index_failure()). Correct as written because the timeout is enforced caller-side, but it does mean any future direct load_index() caller would silently skip the backoff. A one-line back-reference on record_index_failure (or asserting the invariant) would harden it.

(Noted it's still marked DRAFT — posting as a comment, not an approval.)

Comment thread zigpy/ota/providers.py
return True

def record_index_failure(self) -> None:
"""Record a failed index download, growing the retry backoff."""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: the increment lives here but the reset-on-success lives in load_index(), and a failure is only ever recorded if the caller wraps the download (the manager does, for the timeout). Worth a short note here that this must be paired with the manager's try/except so a future direct load_index() caller doesn't quietly lose the backoff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants