Queue a pending re-interview for a device's next check-in #1851
Draft
TheJulianJES wants to merge 9 commits into
When a Zigbee device boots after an OTA, it sends a `QueryNextImageCommand` with its running `current_file_version`. If that version differs from the previously cached value for this cluster, schedule `device.reinterview()` to rebuild endpoints, clusters, and quirks against the new firmware. This makes post-OTA recovery stateless: it works for slow-rebooting devices that miss the post-flash polling window in `update_firmware()`, devices flashed outside HA, and HA restarts during the reboot window.
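The version comparison described above can be sketched in isolation. This is a minimal illustration, not zigpy's listener: `OtaVersionWatcher`, `handle_query_next_image()` and the `on_version_change` callback are hypothetical names, and the callback stands in for scheduling `device.reinterview()`.

```python
from typing import Callable, Optional


class OtaVersionWatcher:
    """Hypothetical helper: remembers the last file version a device reported."""

    def __init__(self, on_version_change: Callable[[int, int], None]) -> None:
        self._cached_version: Optional[int] = None
        self._on_version_change = on_version_change

    def handle_query_next_image(self, current_file_version: int) -> bool:
        """Cache the reported version; return True if a change was detected."""
        previous = self._cached_version
        self._cached_version = current_file_version
        # Without a cached baseline there is nothing to compare against.
        if previous is None or previous == current_file_version:
            return False
        # Stand-in for scheduling a re-interview of the device.
        self._on_version_change(previous, current_file_version)
        return True
```

Because the decision depends only on the cached value and the incoming query, it does not matter who flashed the device or whether the host restarted in between, provided the cached value survives.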
Replace the fixed post-OTA `read_attributes` / `image_notify` / reinterview sequence with a confirmation step that resolves on whichever signal arrives first:

- A successful active read of `current_file_version` (kept from the old flow as a probe for quietly-rebooting mains devices, but now best-effort: exhausted retries no longer abort the post-OTA steps).
- A new `QueryNextImageCommand` whose `current_file_version` differs from the value the OTA cluster cached pre-flash. The OTA cluster's listener also schedules a re-interview when this happens.
- A `Device_annce` from this device after the post-flash reboot.

If none arrive within `POST_OTA_CONFIRMATION_TIMEOUT` (5 min), `update_firmware()` still returns SUCCESS: recovery is deferred to the version-change listener if the device wakes up later.

After confirmation, a best-effort `image_notify` is sent so the device refreshes its OTA query cache, and a re-interview is driven directly unless the version-change listener already replaced this device with a re-interviewed one.

This fixes the slow-reboot race where the `read_attributes` retry budget was exhausted before the device finished rebooting (e.g. Hue LLC020 ~102s), causing `update_firmware()` to throw and skip `image_notify` and reinterview.

Claude-Session: https://claude.ai/code/session_01VnDASUDhbLfANUM9UbnAeo
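A "first signal wins, timeout is not an error" wait can be sketched with `asyncio.wait`. This is an illustration under assumptions, not the PR's code: `wait_for_first_signal()` and the signal names are hypothetical, and each signal is modelled as a plain future that some listener would resolve.

```python
import asyncio
from typing import Dict, Optional, Tuple


async def wait_for_first_signal(
    signals: Dict[str, asyncio.Future], timeout: float
) -> Optional[str]:
    """Return the name of the first signal to resolve, or None on timeout."""
    done, _pending = await asyncio.wait(
        list(signals.values()),
        timeout=timeout,
        return_when=asyncio.FIRST_COMPLETED,
    )
    for name, fut in signals.items():
        if fut in done:
            return name
    # Timeout: the caller carries on instead of raising.
    return None


async def demo() -> Tuple[Optional[str], Optional[str]]:
    loop = asyncio.get_running_loop()
    names = ("probe", "query_next_image", "announce")
    signals = {name: loop.create_future() for name in names}
    # Simulate the device announcing itself shortly after the flash.
    loop.call_later(0.01, signals["announce"].set_result, None)
    first = await wait_for_first_signal(signals, timeout=1.0)
    # A device that stays silent: the wait times out and returns None.
    silent = {"probe": loop.create_future()}
    timed_out = await wait_for_first_signal(silent, timeout=0.01)
    return first, timed_out
```

Returning `None` on timeout rather than raising mirrors the described behaviour where the post-OTA steps continue even when no confirmation arrives.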
Switch the post-OTA wait's announce signal from a `device.zdo` listener for `device_announce` to an `application.add_listener` for `device_joined`. `device_joined` is the existing application-level event for new joins / NWK changes (including the Device_annce path), so we don't add a new ZDO-level listener. The version-change signal still covers same-NWK rejoins. Also adds a test verifying that a `device_joined` event for a different device does NOT resolve this device's wait.
Pull request overview
This PR adds a “pending re-interview” mechanism so post-OTA (and firmware-version-change) re-interviews reliably occur for sleepy end devices by deferring the work to the device’s next check-in, while still attempting an immediate re-interview when the device is currently awake.
Changes:
- Add `Device.reinterview_pending` + `Device.schedule_reinterview_on_checkin()` and arm/clear logic tied to a per-device check-in action.
- Trigger queued + immediate re-interview on OTA `QueryNextImage` firmware version changes, and queue re-interviews earlier in `update_firmware()` with a confirmation wait that can resolve via multiple signals.
- Add/adjust tests covering the new pending flag semantics, check-in action execution/cooldown, and the updated post-OTA confirmation flow.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| zigpy/zcl/clusters/general.py | Schedules re-interview on OTA QueryNextImage firmware version changes, including queueing for next check-in. |
| zigpy/device.py | Implements reinterview_pending, check-in action registry/triggering, and updates OTA update_firmware() confirmation behavior. |
| zigpy/application.py | Triggers check-in actions on stack-reported join/announce and clears pending re-interview on successful swap. |
| tests/test_zcl_clusters.py | Adds tests validating version-change-triggered queued + immediate re-interviews. |
| tests/test_device.py | Adds tests for check-in actions, queued re-interviews, and post-OTA confirmation behavior. |
- Rename `POST_OTA_PROBE_ATTEMPTS` to `POST_OTA_PROBE_RETRIES`: `read_attributes(retries=N)` performs N+1 total attempts, so the old name misrepresented the behavior (which matches dev's existing `retries=10`).
- Document that the QueryNextImage confirmation deliberately accepts any query when no pre-flash baseline was cached: the wait only establishes that the device is alive after the flash.
- Use `asyncio.create_task()` instead of `asyncio.ensure_future()` for the probe task.

Claude-Session: https://claude.ai/code/session_01VnDASUDhbLfANUM9UbnAeo
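The retries-versus-attempts distinction behind the rename can be shown with a generic retry loop. This is a sketch of the semantics described above, not zigpy's `read_attributes()` implementation; `call_with_retries()` and `count_attempts()` are hypothetical helpers.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def call_with_retries(func: Callable[[], Awaitable[T]], retries: int) -> T:
    """Call func() once, then retry up to `retries` more times on timeout."""
    last_error: Optional[BaseException] = None
    for _attempt in range(retries + 1):  # one initial try plus N retries
        try:
            return await func()
        except asyncio.TimeoutError as exc:
            last_error = exc
    assert last_error is not None
    raise last_error


async def count_attempts(retries: int) -> int:
    """Count how often an always-failing call is attempted."""
    attempts = 0

    async def always_times_out() -> None:
        nonlocal attempts
        attempts += 1
        raise asyncio.TimeoutError

    try:
        await call_with_retries(always_times_out, retries)
    except asyncio.TimeoutError:
        pass
    return attempts
```

With these semantics, `retries=10` yields eleven attempts in total, which is why a constant named `..._ATTEMPTS` would be off by one.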
Sleepy end devices only receive requests for a short window after they send something themselves. Add a small named-action registry on `Device`: `register_checkin_action()` queues a coroutine factory that is run whenever the device shows signs of being awake (any inbound packet, or a stack-reported join/announce). The coroutine returns `True` to unregister the action or `False`/raises to retry on a later check-in, gated by a per-action cooldown (default 5 minutes) so a frequently reporting device does not hammer a failing action. Actions run as application tasks, not device tasks: device-owned tasks are cancelled when a re-interview swaps the device object, which would cancel an action that is itself driving the re-interview. Poll Control check-ins now also request fast polling while check-in actions are pending, keeping the device awake for the queued work. Claude-Session: https://claude.ai/code/session_01VnDASUDhbLfANUM9UbnAeo
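A registry of this shape can be sketched as follows. It is a simplified illustration, not the PR's implementation: `CheckinRegistry`, `register()` and `on_checkin()` are hypothetical names, actions are awaited inline rather than started as application tasks, and the clock is injectable so the cooldown can be exercised without sleeping.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List


@dataclass
class CheckinAction:
    factory: Callable[[], Awaitable[bool]]
    cooldown: float = 300.0  # seconds between attempts (5 minutes)
    last_attempt: float = float("-inf")


class CheckinRegistry:
    """Hypothetical named-action registry with a per-action cooldown."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._actions: Dict[str, CheckinAction] = {}
        self._clock = clock

    def register(
        self, name: str, factory: Callable[[], Awaitable[bool]], cooldown: float = 300.0
    ) -> None:
        self._actions[name] = CheckinAction(factory, cooldown)

    @property
    def pending(self) -> List[str]:
        return sorted(self._actions)

    async def on_checkin(self) -> None:
        """Run every action whose cooldown has elapsed."""
        now = self._clock()
        for name, action in list(self._actions.items()):
            if now - action.last_attempt < action.cooldown:
                continue  # attempted too recently, wait for a later check-in
            action.last_attempt = now
            try:
                done = await action.factory()
            except Exception:
                done = False  # a raising action is retried later
            if done:
                self._actions.pop(name, None)
```

Calling `on_checkin()` from every inbound-packet hook is cheap when nothing is registered, and the cooldown bounds how often a failing action is retried.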
- Re-registering an existing action now mutates the stored `_CheckinAction` in place instead of replacing it: a running attempt unregisters itself by object identity on success, so a replacement object would have left completed actions permanently registered.
- Document why neither `remove_checkin_action()` nor `on_remove()` cancels a running attempt: `_device_reinterviewed()` calls the old device's `on_remove()` mid-swap, and an action driving that very re-interview must not cancel its own call stack.

Claude-Session: https://claude.ai/code/session_01VnDASUDhbLfANUM9UbnAeo
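The identity problem in the first point can be reproduced with a plain dictionary. The names here (`Action`, `unregister_if_same()`, `reregister_replacing()`, `reregister_in_place()`) are hypothetical and only model the described behaviour; they are not the PR's `_CheckinAction` code.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Action:
    factory: Callable[[], bool]
    cooldown: float


def unregister_if_same(registry: Dict[str, Action], name: str, finished: Action) -> bool:
    """Remove `name` only if the stored object is the one that just finished."""
    if registry.get(name) is finished:
        del registry[name]
        return True
    return False


def reregister_replacing(
    registry: Dict[str, Action], name: str, factory: Callable[[], bool], cooldown: float
) -> None:
    # Buggy variant: a new object breaks the identity check of a running attempt.
    registry[name] = Action(factory, cooldown)


def reregister_in_place(
    registry: Dict[str, Action], name: str, factory: Callable[[], bool], cooldown: float
) -> None:
    # Fixed variant: update the existing object so its identity is preserved.
    existing = registry.get(name)
    if existing is None:
        registry[name] = Action(factory, cooldown)
    else:
        existing.factory = factory
        existing.cooldown = cooldown
```

With the replacing variant, an attempt that started before the re-registration holds a reference to the old object, so its identity check fails and the entry is never removed.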
Add a `reinterview_pending` flag to `Device` (a request timestamp, mirroring `last_seen`), settable via the public `schedule_reinterview_on_checkin()`. While set, a check-in action is armed that drives `reinterview()` whenever the device is next heard from, retrying on later check-ins (with a cooldown) until the re-interview succeeds. The flag setter fires `device_reinterview_pending_updated` so it can be persisted.

Post-OTA wiring: `update_firmware()` queues the re-interview right after a successful flash, before waiting for confirmation, so a sleepy end device that misses the whole confirmation window is re-interviewed when it next reports in. The confirmation wait also resolves when a `device_reinterviewed` event for the device fires (e.g. the check-in path already completed the re-interview mid-wait). The OTA cluster's version-change listener queues the flag too, so its immediate re-interview attempt is retried if the device goes back to sleep.

A successful re-interview clears the flag on the new device object; the flag is deliberately not copied to the re-interview shadow. `clone()` copies the raw field without firing events or arming actions. The coordinator itself is never queued.

Claude-Session: https://claude.ai/code/session_01VnDASUDhbLfANUM9UbnAeo
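The flag semantics can be sketched with a small stand-in class. `SketchDevice`, its `armed` attribute and its `events` list are hypothetical simplifications: `armed` stands in for the registered check-in action and `events` for the fired `device_reinterview_pending_updated` notifications.

```python
import time
from typing import Callable, List, Optional


class SketchDevice:
    """Hypothetical stand-in for a device carrying a pending re-interview flag."""

    def __init__(
        self, is_coordinator: bool = False, clock: Callable[[], float] = time.time
    ) -> None:
        self.is_coordinator = is_coordinator
        self._clock = clock
        self._reinterview_pending: Optional[float] = None
        self.armed = False  # stand-in for "check-in action registered"
        self.events: List[Optional[float]] = []  # stand-in for fired events

    @property
    def reinterview_pending(self) -> Optional[float]:
        return self._reinterview_pending

    @reinterview_pending.setter
    def reinterview_pending(self, value: Optional[float]) -> None:
        self._reinterview_pending = value
        # Keep "flag set" and "action armed" in lockstep for every writer.
        self.armed = value is not None
        self.events.append(value)

    def schedule_reinterview_on_checkin(self) -> bool:
        """Queue a re-interview; return False if the request is refused."""
        if self.is_coordinator:
            return False  # the coordinator is never queued
        if self._reinterview_pending is None:
            # Only the first request stamps the time; repeats keep it.
            self.reinterview_pending = self._clock()
        return True
```

Routing all writes through the setter is what lets any caller get arming and persistence notifications without extra bookkeeping.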
9a34f5a to
adfd765
Compare
Codecov Report
✅ All modified and coverable lines are covered by tests.
| Coverage Diff | dev | #1851 | +/- |
|---|---|---|---|
| Coverage | 99.47% | 99.47% | -0.01% |
| Files | 57 | 57 | |
| Lines | 12026 | 12156 | +130 |
| Hits | 11963 | 12092 | +129 |
| Misses | 63 | 64 | +1 |
DRAFT / EXPERIMENTAL.
Note
Part of a four-PR series making post-OTA re-interviews reliable for sleepy end devices. Merge order (first to last):
Makes post-OTA re-interviews reliable for sleepy end devices by deferring them to the device's next check-in. Branch: `tjj/reinterview-on-checkin`, stacked on `tjj/checkin-actions`. Persistence of the flag follows in `tjj/reinterview-pending-persistence`.
Problem
Even with the post-OTA confirmation wait, a sleepy end device can sleep through the whole 5-minute window and through the version-change listener's immediate re-interview attempt (it may doze off again mid-interview). The device then keeps stale endpoints/clusters/quirks until it happens to send another OTA query — often a day later — or is manually re-paired.
Changes
- `Device.reinterview_pending`: a request timestamp mirroring `last_seen` (`None` = not pending). The setter fires `device_reinterview_pending_updated` (for DB persistence in the follow-up) and maintains the invariant "flag set ⇔ re-interview check-in action armed", so every writer (OTA, future ZHA use, DB restore) gets arming for free.
- `Device.schedule_reinterview_on_checkin()`: public entry point; refuses the coordinator; keeps the original request timestamp on repeat calls. Also usable by ZHA later, e.g. "Reconfigure device" on a sleeping device could queue instead of failing.
- The re-interview check-in action defers (returns `False`) while `ota_in_progress` / `initializing` / `reinterviewing`, otherwise drives `reinterview()`. Success is detected as "this device object was swapped out of `application.devices`", which is necessary because `reinterview()` swallows its own errors. Retries on later check-ins with the standard cooldown until the swap happens.
- `update_firmware()` queues the re-interview immediately after a successful flash, before the confirmation wait, so the timeout path is now fully covered. The confirmation wait also resolves on a `device_reinterviewed` event (the check-in path may complete the re-interview mid-wait). The version-change listener queues the flag too before its immediate attempt, so that attempt failing (device back asleep) is no longer a dead end.
- `clone()` copies the raw field without firing events or arming actions (needed for the DB snapshot clone in the follow-up).
Notes
Re-interview traffic cannot retrigger the action: during a re-interview the shadow device owns `application.devices[ieee]`, so inbound packets route to the shadow's (empty) registry; outside of it, the running-task guard applies.
Testing
New tests: flag/event/arming semantics, coordinator refusal, busy-guard deferral, swap-based success detection, an end-to-end queued re-interview driven by an inbound packet (flag and action gone on the new device), `update_firmware()` queueing before a timed-out confirmation, silent `clone()` copy, confirmation-wait resolution on `device_reinterviewed`, and the version-change listener queueing the flag. Full suite passes.
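The busy guard and the swap-based success detection listed above can be sketched with stand-in objects. `FakeDevice` and `reinterview_action()` are hypothetical; a plain dict plays the role of `application.devices`, and only the `ota_in_progress` guard is modelled.

```python
import asyncio
from typing import Dict


class FakeDevice:
    """Hypothetical device whose reinterview() never raises."""

    def __init__(self, ieee: str, devices: Dict[str, "FakeDevice"], will_succeed: bool) -> None:
        self.ieee = ieee
        self.ota_in_progress = False
        self._devices = devices
        self._will_succeed = will_succeed

    async def reinterview(self) -> None:
        # Failures are swallowed, so the caller cannot rely on exceptions.
        if self._will_succeed:
            # A successful re-interview swaps in a fresh device object.
            self._devices[self.ieee] = FakeDevice(self.ieee, self._devices, True)


async def reinterview_action(device: FakeDevice, devices: Dict[str, FakeDevice]) -> bool:
    """Return True when done, False to retry on a later check-in."""
    if device.ota_in_progress:
        return False  # busy: defer without touching the device
    await device.reinterview()
    # Success means this object no longer owns its registry slot.
    return devices.get(device.ieee) is not device
```

Checking object identity after the call is the only signal available in this model, since the re-interview itself reports nothing to its caller.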