Queue a pending re-interview for a device's next check-in by TheJulianJES · Pull Request #1851 · zigpy/zigpy

TheJulianJES · 2026-07-03T04:57:44Z

DRAFT / EXPERIMENTAL.

Note

Part of a four-PR series making post-OTA re-interviews reliable for sleepy end devices. Merge order (first to last):

#1849: Confirm post-OTA reboot via announce/version-change instead of a fixed retry budget
#1850: Add generic per-device check-in action mechanism
#1851: Queue a pending re-interview for a device's next check-in (this PR)
#1852: Persist pending re-interview requests in the application database

Makes post-OTA re-interviews reliable for sleepy end devices by deferring them to the device's next check-in. Branch: tjj/reinterview-on-checkin, stacked on tjj/checkin-actions. Persistence of the flag follows in tjj/reinterview-pending-persistence.

Problem

Even with the post-OTA confirmation wait, a sleepy end device can sleep through the whole 5-minute window and through the version-change listener's immediate re-interview attempt (it may doze off again mid-interview). The device then keeps stale endpoints/clusters/quirks until it happens to send another OTA query — often a day later — or is manually re-paired.

Changes

Device.reinterview_pending — a request timestamp mirroring last_seen (None = not pending). The setter fires device_reinterview_pending_updated (for DB persistence in the follow-up) and maintains the invariant flag set ⇔ re-interview check-in action armed, so every writer (OTA, future ZHA use, DB restore) gets arming for free.
Device.schedule_reinterview_on_checkin() — public entry point; refuses the coordinator; keeps the original request timestamp on repeat calls. Also usable by ZHA later, e.g. "Reconfigure device" on a sleeping device could queue instead of failing.
The check-in action defers (False) while ota_in_progress/initializing/reinterviewing, otherwise drives reinterview(). Success is detected as "this device object was swapped out of application.devices" — necessary because reinterview() swallows its own errors. Retries on later check-ins with the standard cooldown until the swap happens.
OTA wiring. update_firmware() queues the re-interview immediately after a successful flash, before the confirmation wait — the timeout path is now fully covered. The confirmation wait also resolves on a device_reinterviewed event (the check-in path may complete the re-interview mid-wait). The version-change listener queues the flag too before its immediate attempt, so that attempt failing (device back asleep) is no longer a dead end.
Clearing. A successful re-interview clears the flag on the new device object (defensively, via the setter — quirk constructors may copy attributes from the device they replace); the flag is deliberately not copied to the re-interview shadow. On failure the old device stays registered with flag and action intact. clone() copies the raw field without firing events or arming actions (needed for the DB snapshot clone in the follow-up).
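The flag/arming invariant and the public entry point described in the list above can be sketched roughly as follows. This is a toy model, not the code from this PR: `SketchDevice`, `armed_actions`, `events`, and the boolean return value are hypothetical names chosen for illustration, and the sketch assumes that a repeat call neither changes the timestamp nor fires the event again.

```python
import time
from typing import Optional


class SketchDevice:
    """Toy stand-in for a device; not zigpy's actual Device class."""

    def __init__(self, is_coordinator: bool = False) -> None:
        self.is_coordinator = is_coordinator
        self._reinterview_pending: Optional[float] = None
        self.armed_actions = set()  # hypothetical check-in action registry
        self.events = []  # records (event name, value) pairs instead of notifying listeners

    @property
    def reinterview_pending(self) -> Optional[float]:
        return self._reinterview_pending

    @reinterview_pending.setter
    def reinterview_pending(self, value: Optional[float]) -> None:
        self._reinterview_pending = value
        # Invariant: flag set <=> re-interview check-in action armed.
        if value is None:
            self.armed_actions.discard("reinterview")
        else:
            self.armed_actions.add("reinterview")
        self.events.append(("device_reinterview_pending_updated", value))

    def schedule_reinterview_on_checkin(self) -> bool:
        """Queue a re-interview; returns False when the request is refused."""
        if self.is_coordinator:
            return False  # the coordinator is never queued
        if self._reinterview_pending is None:
            # Only the first request stamps the time; repeat calls keep it.
            self.reinterview_pending = time.time()
        return True
```

Because arming lives in the setter, any writer that assigns the flag (including a later DB restore) arms the action without extra code, which is the point of routing every write through one place.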

Notes

Re-interview traffic cannot retrigger the action: during a re-interview the shadow device owns application.devices[ieee], so inbound packets route to the shadow's (empty) registry; outside of it, the running-task guard applies.
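The busy-guard deferral and the swap-based success check from the Changes section can be illustrated with a small model. `ToyApp`, `ToyDevice`, and `reinterview_action` are hypothetical names; the only behaviour taken from the description is that the action returns `False` while the device is busy, that `reinterview()` swallows its own errors, and that success is inferred from the device object having been replaced in the application's device table.

```python
import asyncio


class ToyApp:
    """Toy application holding the device table; not zigpy's ControllerApplication."""

    def __init__(self, succeed: bool) -> None:
        self.devices = {}
        self.succeed = succeed

    async def interview(self, old_device) -> None:
        if not self.succeed:
            raise RuntimeError("device went back to sleep")
        # A successful re-interview registers a fresh device object.
        self.devices[old_device.ieee] = ToyDevice(old_device.ieee, self)


class ToyDevice:
    def __init__(self, ieee: str, app: ToyApp) -> None:
        self.ieee = ieee
        self.app = app
        self.ota_in_progress = False
        self.initializing = False
        self.reinterviewing = False

    async def reinterview(self) -> None:
        # Errors are swallowed, so the caller cannot observe failure directly.
        try:
            await self.app.interview(self)
        except Exception:
            pass


async def reinterview_action(device: ToyDevice) -> bool:
    """Return True once the re-interview is done, False to retry on a later check-in."""
    if device.ota_in_progress or device.initializing or device.reinterviewing:
        return False  # busy: defer
    await device.reinterview()
    # Success means this object is no longer the registered one for its address.
    return device.app.devices.get(device.ieee) is not device
```

An identity check (`is not`) rather than an equality check is what makes the detection robust here: the replacement device has the same address but is a different object.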

Testing

New tests: flag/event/arming semantics, coordinator refusal, busy-guard deferral, swap-based success detection, an end-to-end queued re-interview driven by an inbound packet (flag and action gone on the new device), update_firmware() queueing before a timed-out confirmation, silent clone() copy, confirmation-wait resolution on device_reinterviewed, and the version-change listener queueing the flag. Full suite passes.

When a Zigbee device boots after an OTA, it sends a `QueryNextImageCommand` with its running `current_file_version`. If that version differs from the previously cached value for this cluster, schedule `device.reinterview()` to rebuild endpoints, clusters, and quirks against the new firmware. This makes post-OTA recovery stateless: it works for slow-rebooting devices that miss the post-flash polling window in `update_firmware()`, devices flashed outside HA, and HA restarts during the reboot window.
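A minimal sketch of the version-change check described in this commit message, using a hypothetical `VersionWatcher` class rather than the real OTA cluster. It assumes that the first query after startup (no cached baseline) only records the version and does not trigger a re-interview; the source does not spell out that case for the listener.

```python
from typing import Optional


class VersionWatcher:
    """Toy model of a listener that tracks the last reported firmware version."""

    def __init__(self) -> None:
        self.cached_version: Optional[int] = None
        self.reinterviews_scheduled = 0  # stands in for scheduling device.reinterview()

    def on_query_next_image(self, current_file_version: int) -> None:
        previous = self.cached_version
        self.cached_version = current_file_version
        # Assumption: without a baseline there is nothing to compare against.
        if previous is not None and previous != current_file_version:
            self.reinterviews_scheduled += 1
```

Since the decision depends only on the cached value and the incoming command, it needs no state tied to a particular `update_firmware()` call, which is why it also covers devices flashed by other means.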

Replace the fixed post-OTA read_attributes / image_notify / reinterview sequence with a confirmation step that resolves on whichever signal arrives first:

- A successful active read of `current_file_version` (kept from the old flow as a probe for quietly-rebooting mains devices, but now best-effort: exhausted retries no longer abort the post-OTA steps).
- A new `QueryNextImageCommand` whose `current_file_version` differs from the value the OTA cluster cached pre-flash. The OTA cluster's listener also schedules a re-interview when this happens.
- A `Device_annce` from this device after the post-flash reboot.

If none arrive within `POST_OTA_CONFIRMATION_TIMEOUT` (5 min), update_firmware() still returns SUCCESS; recovery is deferred to the version-change listener if the device wakes up later.

After confirmation, a best-effort `image_notify` is sent so the device refreshes its OTA query cache, and a re-interview is driven directly unless the version-change listener already replaced this device with a re-interviewed one.

This fixes the slow-reboot race where the read_attributes retry budget was exhausted before the device finished rebooting (e.g. Hue LLC020 ~102s), causing update_firmware() to throw and skip image_notify and reinterview.

Claude-Session: https://claude.ai/code/session_01VnDASUDhbLfANUM9UbnAeo
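The "first signal wins, otherwise time out" pattern from this commit message can be sketched with plain asyncio. `wait_for_confirmation` and the signal names are hypothetical; the real wait is wired to cluster and application events, whereas this sketch uses `asyncio.Event` objects as stand-ins.

```python
import asyncio
from typing import Optional


async def wait_for_confirmation(
    signals: dict[str, asyncio.Event], timeout: float
) -> Optional[str]:
    """Return the name of a signal that fired, or None if the timeout expired."""
    tasks = {asyncio.create_task(ev.wait()): name for name, ev in signals.items()}
    try:
        done, _pending = await asyncio.wait(
            list(tasks), timeout=timeout, return_when=asyncio.FIRST_COMPLETED
        )
        if not done:
            return None  # timed out: the caller carries on without confirmation
        # If several signals fired together, which one is reported is arbitrary.
        return tasks[next(iter(done))]
    finally:
        # Cancel the waiters that lost the race and let the cancellations settle.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


async def demo() -> tuple:
    signals = {
        "probe": asyncio.Event(),
        "query": asyncio.Event(),
        "announce": asyncio.Event(),
    }
    # Simulate the device announcing itself shortly after the flash.
    asyncio.get_running_loop().call_later(0.01, signals["announce"].set)
    first = await wait_for_confirmation(signals, timeout=1.0)
    timed_out = await wait_for_confirmation({"probe": asyncio.Event()}, timeout=0.01)
    return first, timed_out
```

Returning `None` on timeout instead of raising mirrors the described behaviour that a missed window is not treated as a failed update.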

Switch the post-OTA wait's announce signal from a `device.zdo` listener for `device_announce` to an `application.add_listener` for `device_joined`. `device_joined` is the existing application-level event for new joins / NWK changes (including the Device_annce path), so we don't add a new ZDO-level listener. The version-change signal still covers same-NWK rejoins. Also adds a test verifying that a `device_joined` event for a different device does NOT resolve this device's wait.
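The per-device filtering that the added test checks can be sketched as a listener that ignores joins from other devices. `JoinWaiter` is a hypothetical helper; the sketch assumes a listener object whose method name matches the event name and which receives the joined device as its only argument.

```python
import asyncio
from types import SimpleNamespace


class JoinWaiter:
    """Resolves a future when the awaited device (and only that device) joins."""

    def __init__(self, ieee: str, loop: asyncio.AbstractEventLoop) -> None:
        self.ieee = ieee
        self.future = loop.create_future()

    def device_joined(self, device) -> None:
        # Joins from other devices must not resolve this device's wait.
        if device.ieee == self.ieee and not self.future.done():
            self.future.set_result(device)


async def demo() -> tuple:
    waiter = JoinWaiter("aa:bb", asyncio.get_running_loop())
    waiter.device_joined(SimpleNamespace(ieee="cc:dd"))  # a different device
    resolved_by_other = waiter.future.done()
    waiter.device_joined(SimpleNamespace(ieee="aa:bb"))  # the awaited device
    return resolved_by_other, waiter.future.done()
```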

Copilot

Pull request overview

This PR adds a “pending re-interview” mechanism so post-OTA (and firmware-version-change) re-interviews reliably occur for sleepy end devices by deferring the work to the device’s next check-in, while still attempting an immediate re-interview when the device is currently awake.

Changes:

Add Device.reinterview_pending + Device.schedule_reinterview_on_checkin() and arm/clear logic tied to a per-device check-in action.
Trigger queued + immediate re-interview on OTA QueryNextImage firmware version changes, and queue re-interviews earlier in update_firmware() with a confirmation wait that can resolve via multiple signals.
Add/adjust tests covering the new pending flag semantics, check-in action execution/cooldown, and the updated post-OTA confirmation flow.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

| File | Description |
| --- | --- |
| zigpy/zcl/clusters/general.py | Schedules re-interview on OTA `QueryNextImage` firmware version changes, including queueing for next check-in. |
| zigpy/device.py | Implements `reinterview_pending`, check-in action registry/triggering, and updates OTA `update_firmware()` confirmation behavior. |
| zigpy/application.py | Triggers check-in actions on stack-reported join/announce and clears pending re-interview on successful swap. |
| tests/test_zcl_clusters.py | Adds tests validating version-change-triggered queued + immediate re-interviews. |
| tests/test_device.py | Adds tests for check-in actions, queued re-interviews, and post-OTA confirmation behavior. |


- Rename `POST_OTA_PROBE_ATTEMPTS` to `POST_OTA_PROBE_RETRIES`: `read_attributes(retries=N)` performs N+1 total attempts, so the old name misrepresented the behavior (which matches dev's existing `retries=10`).
- Document that the QueryNextImage confirmation deliberately accepts any query when no pre-flash baseline was cached: the wait only establishes that the device is alive after the flash.
- Use `asyncio.create_task()` instead of `asyncio.ensure_future()` for the probe task.

Claude-Session: https://claude.ai/code/session_01VnDASUDhbLfANUM9UbnAeo
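The retries-versus-attempts distinction behind the rename can be shown with a generic retry helper. `call_with_retries` is a hypothetical function written for this illustration (it is not zigpy's `read_attributes`), and it assumes `retries` is non-negative.

```python
def call_with_retries(func, retries: int):
    """Call func once, then up to `retries` more times while it times out."""
    last_exc = None
    for _attempt in range(retries + 1):  # retries=N means N + 1 total attempts
        try:
            return func()
        except TimeoutError as exc:
            last_exc = exc
    raise last_exc


calls = []


def always_times_out():
    calls.append(1)
    raise TimeoutError("no response")


try:
    call_with_retries(always_times_out, retries=10)
except TimeoutError:
    pass

total_attempts = len(calls)  # 11: the initial attempt plus 10 retries
```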


Sleepy end devices only receive requests for a short window after they send something themselves. Add a small named-action registry on `Device`: `register_checkin_action()` queues a coroutine factory that is run whenever the device shows signs of being awake (any inbound packet, or a stack-reported join/announce). The coroutine returns `True` to unregister the action or `False`/raises to retry on a later check-in, gated by a per-action cooldown (default 5 minutes) so a frequently reporting device does not hammer a failing action.

Actions run as application tasks, not device tasks: device-owned tasks are cancelled when a re-interview swaps the device object, which would cancel an action that is itself driving the re-interview.

Poll Control check-ins now also request fast polling while check-in actions are pending, keeping the device awake for the queued work.

Claude-Session: https://claude.ai/code/session_01VnDASUDhbLfANUM9UbnAeo
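A rough model of the registry semantics described in this commit message: a named coroutine factory, `True` to unregister, `False` or an exception to retry, and a cooldown between attempts. `CheckinRegistry`, `CheckinAction`, and the injectable `clock` are hypothetical; for brevity the sketch awaits each action inline rather than spawning application tasks as the commit describes.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class CheckinAction:
    factory: Callable[[], Awaitable[bool]]
    cooldown: float = 300.0  # seconds; mirrors the 5-minute default
    last_attempt: Optional[float] = None
    running: bool = False


class CheckinRegistry:
    """Toy per-device registry; not the implementation from the PR."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self.actions = {}
        self._clock = clock

    def register(self, name: str, factory, cooldown: float = 300.0) -> None:
        self.actions[name] = CheckinAction(factory, cooldown)

    async def on_checkin(self) -> None:
        """Call when the device shows signs of being awake."""
        now = self._clock()
        for name, action in list(self.actions.items()):
            if action.running:
                continue  # an attempt is already in flight
            if (
                action.last_attempt is not None
                and now - action.last_attempt < action.cooldown
            ):
                continue  # still cooling down after the previous attempt
            action.last_attempt = now
            action.running = True
            try:
                done = await action.factory()
            except Exception:
                done = False  # a raising action is retried like a False result
            finally:
                action.running = False
            if done and self.actions.get(name) is action:
                del self.actions[name]
```

Passing the clock in as a parameter is a convenience of this sketch so the cooldown can be exercised without sleeping.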

- Re-registering an existing action now mutates the stored `_CheckinAction` in place instead of replacing it: a running attempt unregisters itself by object identity on success, so a replacement object would have left completed actions permanently registered.
- Document why neither `remove_checkin_action()` nor `on_remove()` cancels a running attempt: `_device_reinterviewed()` calls the old device's `on_remove()` mid-swap, and an action driving that very re-interview must not cancel its own call stack.

Claude-Session: https://claude.ai/code/session_01VnDASUDhbLfANUM9UbnAeo
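Why in-place mutation matters when unregistration is done by object identity can be seen in a small sketch. `Action`, `register`, and `unregister_if_same` are hypothetical helpers, and the `replace` switch exists only to contrast the two behaviours.

```python
class Action:
    def __init__(self, factory) -> None:
        self.factory = factory


def register(registry: dict, name: str, factory, replace: bool = False) -> Action:
    existing = registry.get(name)
    if existing is not None and not replace:
        existing.factory = factory  # mutate in place: object identity is preserved
        return existing
    registry[name] = Action(factory)  # replacing creates a new object
    return registry[name]


def unregister_if_same(registry: dict, name: str, action: Action) -> bool:
    """Remove the entry only if it is still the exact object that was running."""
    if registry.get(name) is action:
        del registry[name]
        return True
    return False


# In-place update: the running attempt can still unregister itself.
reg_a = {}
running_a = register(reg_a, "reinterview", factory=None)
register(reg_a, "reinterview", factory=None)
removed_in_place = unregister_if_same(reg_a, "reinterview", running_a)

# Replacement: the identity check fails and the entry stays registered.
reg_b = {}
running_b = register(reg_b, "reinterview", factory=None)
register(reg_b, "reinterview", factory=None, replace=True)
removed_after_replace = unregister_if_same(reg_b, "reinterview", running_b)
```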

Add a `reinterview_pending` flag to `Device` (a request timestamp, mirroring `last_seen`), settable via the public `schedule_reinterview_on_checkin()`. While set, a check-in action is armed that drives `reinterview()` whenever the device is next heard from, retrying on later check-ins (with a cooldown) until the re-interview succeeds. The flag setter fires `device_reinterview_pending_updated` so it can be persisted.

Post-OTA wiring: `update_firmware()` queues the re-interview right after a successful flash, before waiting for confirmation, so a sleepy end device that misses the whole confirmation window is re-interviewed when it next reports in. The confirmation wait also resolves when a `device_reinterviewed` event for the device fires (e.g. the check-in path already completed the re-interview mid-wait). The OTA cluster's version-change listener queues the flag too, so its immediate re-interview attempt is retried if the device goes back to sleep.

A successful re-interview clears the flag on the new device object; the flag is deliberately not copied to the re-interview shadow. `clone()` copies the raw field without firing events or arming actions. The coordinator itself is never queued.

Claude-Session: https://claude.ai/code/session_01VnDASUDhbLfANUM9UbnAeo

codecov · 2026-07-03T05:21:15Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.47%. Comparing base (925ea3d) to head (e24c3d4).

Additional details and impacted files

@@            Coverage Diff             @@
##              dev    #1851      +/-   ##
==========================================
- Coverage   99.47%   99.47%   -0.01%     
==========================================
  Files          57       57              
  Lines       12026    12156     +130     
==========================================
+ Hits        11963    12092     +129     
- Misses         63       64       +1


TheJulianJES added 3 commits July 3, 2026 03:35

This was referenced Jul 3, 2026

Confirm post-OTA reboot via announce/version-change instead of a fixed retry budget #1849

Draft

Add generic per-device check-in action mechanism #1850

Draft

Persist pending re-interview requests in the application database #1852

Draft

TheJulianJES requested a review from Copilot July 3, 2026 05:00

Copilot started reviewing on behalf of TheJulianJES July 3, 2026 05:01

Copilot AI reviewed Jul 3, 2026


Comment thread zigpy/device.py

Comment thread zigpy/device.py

TheJulianJES added 5 commits July 3, 2026 07:10

Cover the unanswered post-OTA probe path in tests

f06690b

Claude-Session: https://claude.ai/code/session_01VnDASUDhbLfANUM9UbnAeo

TheJulianJES force-pushed the tjj/reinterview-on-checkin branch from 9a34f5a to adfd765 Compare July 3, 2026 05:18

Cover the numeric-timestamp path of the reinterview_pending setter

e24c3d4

Claude-Session: https://claude.ai/code/session_01VnDASUDhbLfANUM9UbnAeo


TheJulianJES wants to merge 9 commits into zigpy:dev from TheJulianJES:tjj/reinterview-on-checkin
