Improve OTA provider index refresh resilience#1845
Conversation
A slow or unreachable provider previously delayed every other provider by up to the 20 second fetch timeout, since indexes were refreshed sequentially. Dead providers are real (the ThirdReality and Inovelli servers shut down permanently), so the stalls stack up in practice. Refreshes are now gathered concurrently; this is safe because each refresh only rebinds its own provider's cache bucket and swallows its own exceptions.
A failed index download previously set _index_last_updated in a finally block, locking the provider out for the full 24 hour expiration time: a transient network blip at startup meant no OTA images from that provider for a day. Failed downloads now retry after an exponentially growing delay (30 minutes, doubling per consecutive failure, capped at the expiration time), tracked separately from the success timestamp. invalidate_provider_caches() clears the failure state via the new provider.invalidate_index(), so a user-initiated re-check can always retry immediately. The new _index_last_success timestamp is not reset by invalidation and will be used to expire images of providers that have been unreachable for a prolonged time.
A provider that cannot be reached cannot withdraw images either, so a permanently dead provider (e.g. the shut-down ThirdReality and Inovelli servers) would keep offering its last-known images until restart. When an index refresh fails and the last successful refresh is older than STALE_INDEX_EXPIRATION_TIME (3 days), the provider's cached images are now dropped with a single warning. The threshold is age-based rather than failure-count-based so intermittent connectivity does not cause update entities to flap, and recovery repopulates the images on the next check.
The missing-SHA3-256-checksum warning lived in the per-device matching path, so it fired for every device on every check cycle. The check now runs when a provider's index is refreshed: unverifiable trusted images are excluded from the cache up front (so they no longer appear in downgrades either) and the warning fires once per refresh, re-arming naturally on invalidation.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## dev #1845 +/- ##
=======================================
Coverage 99.56% 99.57%
=======================================
Files 64 64
Lines 13233 13263 +30
=======================================
+ Hits 13176 13206 +30
Misses 57 57 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
zigpy-review-bot
left a comment
There was a problem hiding this comment.
Reviewed the full backoff / stale-expiry state machine and traced each new test against it — the logic holds up well. Notes below are all non-blocking.
Verified:
- Backoff lifecycle.
_index_last_updatedis deliberately not advanced on failure anymore (only on success), so the 24h gate stays open for a down provider and the new_index_failures/_index_last_failurebackoff is what throttles retries. Tracedtest_ota_index_download_failure_backoffstep by step — the 30min → doubling → 24h-cap progression and theinvalidate_index()reset all behave as asserted. - Stale expiry is genuinely once. It keys off
_index_last_success(whichinvalidate_index()deliberately leaves untouched), and the single-warning behavior relies ondel self._image_cache[provider]making the second pass early-return. Cached images always imply a prior success, so_index_last_successis never spuriously epoch when images exist. ✓ - Trusted-image filter move is a real fix, not just a relocation. Moving the SHA3-256 check from
_remove_colliding_images(upgrades-only) to index-refresh time means unverifiable trusted images are now also kept out ofdowngrades(built independently fromcandidatesinget_ota_images), which the old placement missed. The leftoverassert img.metadata.checksum is not Nonein_remove_colliding_imagesstays valid, since any trusted image reaching it now necessarily has a sha3-256 checksum. - No downstream usage of
load_index/invalidate_provider_caches/_index_*in zha or zha-device-handlers, so the internal reshuffle carries no pinned-API risk. Fulltests/ota/suite green locally (119 passed); CI + patch coverage green.
Minor (optional): the failure-counter state is now split across two layers — load_index() owns the reset (_index_failures = 0) while the manager owns the increment (record_index_failure()). Correct as written because the timeout is enforced caller-side, but it does mean any future direct load_index() caller would silently skip the backoff. A one-line back-reference on record_index_failure (or asserting the invariant) would harden it.
(Noted it's still marked DRAFT — posting as a comment, not an approval.)
| return True | ||
|
|
||
| def record_index_failure(self) -> None: | ||
| """Record a failed index download, growing the retry backoff.""" |
There was a problem hiding this comment.
Optional: the increment lives here but the reset-on-success lives in load_index(), and a failure is only ever recorded if the caller wraps the download (the manager does, for the timeout). Worth a short note here that this must be paired with the manager's try/except so a future direct load_index() caller doesn't quietly lose the backoff.
DRAFT. Maybe this should be split up into three PRs. When reviewing, review the individual commits.
Follow-up to #1837. Makes the per-provider index refresh handle slow, failing, and permanently-dead providers gracefully. Branch:
tjj/ota-provider-health.Changes
get_ota_images()previously refreshed each compatible provider's index sequentially, so one slow or unreachable provider delayed every other by up to the 20 s fetch timeout — and dead providers are real (the ThirdReality and Inovelli servers shut down permanently), so the stalls stack up. Refreshes are now gathered concurrently; safe because each refresh only rebinds its own provider's cache bucket and swallows its own exceptions._index_last_updatedin afinallyblock, locking the provider out for the full 24 h expiration: a transient network blip at startup meant no OTA images from that provider for a day. Failures now retry after an exponentially growing delay (30 min base, doubling per consecutive failure, capped at the 24 h TTL), tracked separately from the success timestamp.invalidate_provider_caches()clears the failure state via the newprovider.invalidate_index(), so a user-initiated re-check can always retry immediately.STALE_INDEX_EXPIRATION_TIME(3 days), the provider's cached images are dropped with a single warning. Age-based (not failure-count-based) to avoid flapping update entities on intermittent connectivity; recovery repopulates on the next check.Notes
_index_last_success(monotone, not reset by invalidation, so it tracks true staleness), plus failure count/timestamp. New constantsINDEX_RETRY_DELAY(30 min) andSTALE_INDEX_EXPIRATION_TIME(3 days).min(failures-1, 16)) to avoid atimedeltaoverflow.Testing
New tests for concurrent refresh (a provider whose index load blocks until a second provider's load starts — a sequential refresh would stall), exponential backoff lifecycle, stale-provider expiry + recovery, and the once-per-refresh warning. Full suite passes;
zigpy/ota/at 99% coverage.