Queue a pending re-interview for a device's next check-in #1851

Draft
TheJulianJES wants to merge 9 commits into zigpy:dev from TheJulianJES:tjj/reinterview-on-checkin

TheJulianJES commented Jul 3, 2026

Contributor

DRAFT / EXPERIMENTAL.

Note

Part of a four-PR series making post-OTA re-interviews reliable for sleepy end devices. Merge order (first to last):

  1. #1849: Confirm post-OTA reboot via announce/version-change instead of a fixed retry budget
  2. #1850: Add generic per-device check-in action mechanism
  3. #1851: Queue a pending re-interview for a device's next check-in (this PR)
  4. #1852: Persist pending re-interview requests in the application database

Makes post-OTA re-interviews reliable for sleepy end devices by deferring them to the device's next check-in. Branch: tjj/reinterview-on-checkin, stacked on tjj/checkin-actions. Persistence of the flag follows in tjj/reinterview-pending-persistence.

Problem

Even with the post-OTA confirmation wait, a sleepy end device can sleep through the whole 5-minute window and through the version-change listener's immediate re-interview attempt (it may doze off again mid-interview). The device then keeps stale endpoints/clusters/quirks until it happens to send another OTA query — often a day later — or is manually re-paired.

Changes

  • Device.reinterview_pending — a request timestamp mirroring last_seen (None = not pending). The setter fires device_reinterview_pending_updated (for DB persistence in the follow-up) and maintains the invariant flag set ⇔ re-interview check-in action armed, so every writer (OTA, future ZHA use, DB restore) gets arming for free.
  • Device.schedule_reinterview_on_checkin() — public entry point; refuses the coordinator; keeps the original request timestamp on repeat calls. Also usable by ZHA later, e.g. "Reconfigure device" on a sleeping device could queue instead of failing.
  • The check-in action defers (False) while ota_in_progress/initializing/reinterviewing, otherwise drives reinterview(). Success is detected as "this device object was swapped out of application.devices" — necessary because reinterview() swallows its own errors. Retries on later check-ins with the standard cooldown until the swap happens.
  • OTA wiring. update_firmware() queues the re-interview immediately after a successful flash, before the confirmation wait — the timeout path is now fully covered. The confirmation wait also resolves on a device_reinterviewed event (the check-in path may complete the re-interview mid-wait). The version-change listener queues the flag too before its immediate attempt, so that attempt failing (device back asleep) is no longer a dead end.
  • Clearing. A successful re-interview clears the flag on the new device object (defensively, via the setter — quirk constructors may copy attributes from the device they replace); the flag is deliberately not copied to the re-interview shadow. On failure the old device stays registered with flag and action intact. clone() copies the raw field without firing events or arming actions (needed for the DB snapshot clone in the follow-up).
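
The deferral and swap-based success check described in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `ToyApp`, `ToyDevice`, the `will_succeed` knob, and `reinterview_checkin_action` are hypothetical names. Only `ota_in_progress`, `initializing`, `reinterviewing`, `reinterview()`, and `application.devices` are taken from the description.

```python
import asyncio


class ToyApp:
    """Hypothetical stand-in for the application; it only holds the registry."""

    def __init__(self) -> None:
        self.devices = {}


class ToyDevice:
    """Hypothetical stand-in for a device with the busy flags named above."""

    def __init__(self, application: ToyApp, ieee: str) -> None:
        self.application = application
        self.ieee = ieee
        self.ota_in_progress = False
        self.initializing = False
        self.reinterviewing = False
        self.will_succeed = True  # toy knob: does the fake re-interview work?
        self.reinterview_calls = 0

    async def reinterview(self) -> None:
        # Mimics the described behaviour: failures are swallowed, and a success
        # replaces this object in the registry with a freshly built one.
        self.reinterview_calls += 1
        if self.will_succeed:
            self.application.devices[self.ieee] = ToyDevice(
                self.application, self.ieee
            )


async def reinterview_checkin_action(device: ToyDevice) -> bool:
    """Return True when done (unregister), False to retry on a later check-in."""
    if device.ota_in_progress or device.initializing or device.reinterviewing:
        return False  # busy: defer without starting a re-interview
    await device.reinterview()
    # reinterview() reports nothing, so success is inferred from the swap.
    return device.application.devices.get(device.ieee) is not device
```

Checking object identity rather than a return value is what lets the action work even though the re-interview itself never raises.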

Notes

Re-interview traffic cannot retrigger the action: during a re-interview the shadow device owns application.devices[ieee], so inbound packets route to the shadow's (empty) registry; outside of it, the running-task guard applies.

Testing

New tests: flag/event/arming semantics, coordinator refusal, busy-guard deferral, swap-based success detection, an end-to-end queued re-interview driven by an inbound packet (flag and action gone on the new device), update_firmware() queueing before a timed-out confirmation, silent clone() copy, confirmation-wait resolution on device_reinterviewed, and the version-change listener queueing the flag. Full suite passes.

When a Zigbee device boots after an OTA, it sends a `QueryNextImageCommand`
with its running `current_file_version`. If that version differs from the
previously cached value for this cluster, schedule `device.reinterview()`
to rebuild endpoints, clusters, and quirks against the new firmware.
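
As a rough illustration of the version comparison described above, the sketch below caches the last reported version and fires a callback when it changes. `VersionChangeDetector` and `on_change` are hypothetical names, and ignoring the very first report (no cached baseline) is an assumption made for this sketch, not something the commit message states.

```python
from __future__ import annotations

from typing import Callable


class VersionChangeDetector:
    """Toy per-cluster cache of the last reported current_file_version."""

    def __init__(self, on_change: Callable[[int, int], None]) -> None:
        self._cached: int | None = None
        self._on_change = on_change

    def handle_query_next_image(self, current_file_version: int) -> bool:
        """Record the reported version; return True if a change was detected."""
        previous = self._cached
        self._cached = current_file_version
        # Assumption: with no baseline there is nothing to compare against.
        if previous is not None and previous != current_file_version:
            self._on_change(previous, current_file_version)
            return True
        return False
```

In the real flow the callback would be the place to schedule the re-interview; here it only receives the old and new version numbers.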

This makes post-OTA recovery stateless: it works for slow-rebooting devices
that miss the post-flash polling window in `update_firmware()`, devices
flashed outside HA, and HA restarts during the reboot window.

Replace the fixed post-OTA read_attributes / image_notify / reinterview
sequence with a confirmation step that resolves on whichever signal
arrives first:

- A successful active read of `current_file_version` (kept from the old
  flow as a probe for quietly-rebooting mains devices, but now
  best-effort: exhausted retries no longer abort the post-OTA steps).
- A new `QueryNextImageCommand` whose `current_file_version` differs from
  the value the OTA cluster cached pre-flash. The OTA cluster's listener
  also schedules a re-interview when this happens.
- A `Device_annce` from this device after the post-flash reboot.

If none arrive within `POST_OTA_CONFIRMATION_TIMEOUT` (5 min),
update_firmware() still returns SUCCESS — recovery is deferred to the
version-change listener if the device wakes up later.
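
The "whichever signal arrives first" wait can be sketched with plain asyncio as below. This is an illustrative sketch under stated assumptions, not the PR's implementation: `wait_for_first_signal` and the dictionary of named `asyncio.Event` objects are hypothetical. Only the 5-minute `POST_OTA_CONFIRMATION_TIMEOUT` value comes from the text.

```python
from __future__ import annotations

import asyncio

POST_OTA_CONFIRMATION_TIMEOUT = 5 * 60  # seconds, per the commit message


async def wait_for_first_signal(
    signals: dict[str, asyncio.Event],
    timeout: float = POST_OTA_CONFIRMATION_TIMEOUT,
) -> str | None:
    """Return the name of the first signal that fires, or None on timeout."""
    tasks = {
        asyncio.create_task(event.wait()): name for name, event in signals.items()
    }
    try:
        done, _pending = await asyncio.wait(
            list(tasks), timeout=timeout, return_when=asyncio.FIRST_COMPLETED
        )
    finally:
        # Cancel the waiters that did not fire and let them finish cleanly.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
    if not done:
        return None  # timeout: the caller may still treat the update as done
    return tasks[next(iter(done))]
```

A `None` result corresponds to the timeout path, where recovery is left to whatever runs when the device wakes up later.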

After confirmation, a best-effort `image_notify` is sent so the device
refreshes its OTA query cache, and a re-interview is driven directly
unless the version-change listener already replaced this device with a
re-interviewed one.

This fixes the slow-reboot race where the read_attributes retry budget
was exhausted before the device finished rebooting (e.g. Hue LLC020
~102s), causing update_firmware() to throw and skip image_notify and
reinterview.

Claude-Session: https://claude.ai/code/session_01VnDASUDhbLfANUM9UbnAeo

Switch the post-OTA wait's announce signal from a `device.zdo` listener for
`device_announce` to an `application.add_listener` for `device_joined`.

`device_joined` is the existing application-level event for new joins / NWK
changes (including the Device_annce path), so we don't add a new ZDO-level
listener. The version-change signal still covers same-NWK rejoins.

Also adds a test verifying that a `device_joined` event for a different
device does NOT resolve this device's wait.
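
The filtering that test guards against can be illustrated with a small listener object. This is a sketch only: `JoinedDevice` and `JoinConfirmation` are hypothetical, and it assumes the application invokes a `device_joined(device)` method on registered listeners.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class JoinedDevice:
    """Hypothetical minimal device record carrying only an IEEE address."""

    ieee: str


class JoinConfirmation:
    """Listener that resolves only for the device it is waiting on."""

    def __init__(self, ieee: str) -> None:
        self.ieee = ieee
        self.confirmed = asyncio.Event()

    def device_joined(self, device: JoinedDevice) -> None:
        # Joins from other devices must not resolve this device's wait.
        if device.ieee == self.ieee:
            self.confirmed.set()
```

A waiter would then await `confirmed.wait()` alongside the other confirmation signals.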

Copilot AI left a comment

Contributor

Pull request overview

This PR adds a “pending re-interview” mechanism so post-OTA (and firmware-version-change) re-interviews reliably occur for sleepy end devices by deferring the work to the device’s next check-in, while still attempting an immediate re-interview when the device is currently awake.

Changes:

  • Add Device.reinterview_pending + Device.schedule_reinterview_on_checkin() and arm/clear logic tied to a per-device check-in action.
  • Trigger queued + immediate re-interview on OTA QueryNextImage firmware version changes, and queue re-interviews earlier in update_firmware() with a confirmation wait that can resolve via multiple signals.
  • Add/adjust tests covering the new pending flag semantics, check-in action execution/cooldown, and the updated post-OTA confirmation flow.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| zigpy/zcl/clusters/general.py | Schedules re-interview on OTA QueryNextImage firmware version changes, including queueing for next check-in. |
| zigpy/device.py | Implements reinterview_pending, check-in action registry/triggering, and updates OTA update_firmware() confirmation behavior. |
| zigpy/application.py | Triggers check-in actions on stack-reported join/announce and clears pending re-interview on successful swap. |
| tests/test_zcl_clusters.py | Adds tests validating version-change-triggered queued + immediate re-interviews. |
| tests/test_device.py | Adds tests for check-in actions, queued re-interviews, and post-OTA confirmation behavior. |

Comment thread zigpy/device.py
Comment thread zigpy/device.py

- Rename `POST_OTA_PROBE_ATTEMPTS` to `POST_OTA_PROBE_RETRIES`:
  `read_attributes(retries=N)` performs N+1 total attempts, so the old
  name misrepresented the behavior (which matches dev's existing
  `retries=10`).
- Document that the QueryNextImage confirmation deliberately accepts any
  query when no pre-flash baseline was cached: the wait only establishes
  that the device is alive after the flash.
- Use `asyncio.create_task()` instead of `asyncio.ensure_future()` for
  the probe task.
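
The retries-versus-attempts distinction behind the rename is easy to get wrong, so here is a tiny sketch of the counting convention. `call_with_retries` is a hypothetical helper written for illustration; it is not `read_attributes()` and makes no claim about how that method is implemented beyond the "N retries means N+1 attempts" rule stated above.

```python
def call_with_retries(operation, retries: int):
    """Call operation once, then retry up to `retries` more times on timeout.

    Returns (result, attempts_made); raises TimeoutError if every attempt fails.
    """
    attempts = 0
    last_error = None
    for _ in range(retries + 1):  # one initial attempt plus `retries` retries
        attempts += 1
        try:
            return operation(), attempts
        except TimeoutError as exc:
            last_error = exc
    raise TimeoutError(f"gave up after {attempts} attempts") from last_error
```

With `retries=10`, a permanently failing operation is invoked 11 times before the error propagates.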

Claude-Session: https://claude.ai/code/session_01VnDASUDhbLfANUM9UbnAeo

Sleepy end devices only receive requests for a short window after they
send something themselves. Add a small named-action registry on
`Device`: `register_checkin_action()` queues a coroutine factory that is
run whenever the device shows signs of being awake (any inbound packet,
or a stack-reported join/announce). The coroutine returns `True` to
unregister the action or `False`/raises to retry on a later check-in,
gated by a per-action cooldown (default 5 minutes) so a frequently
reporting device does not hammer a failing action.
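
A minimal sketch of such a registry, assuming a simplified shape, is shown below. `CheckinRegistry`, `CheckinAction`, `register`, and `on_checkin` are hypothetical names chosen for this example (the PR's entry point is `register_checkin_action()` on `Device`), and the injectable clock exists only to make the cooldown easy to demonstrate.

```python
from __future__ import annotations

import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class CheckinAction:
    factory: Callable[[], Awaitable[bool]]  # builds a fresh coroutine per attempt
    cooldown: float = 300.0  # seconds between attempts (5 minutes)
    last_attempt: float | None = None


class CheckinRegistry:
    """Toy named-action registry run whenever the device is heard from."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._actions: dict[str, CheckinAction] = {}
        self._clock = clock

    def register(self, name, factory, cooldown: float = 300.0) -> None:
        self._actions[name] = CheckinAction(factory, cooldown)

    @property
    def pending(self) -> set[str]:
        return set(self._actions)

    async def on_checkin(self) -> None:
        now = self._clock()
        for name, action in list(self._actions.items()):
            last = action.last_attempt
            if last is not None and now - last < action.cooldown:
                continue  # still cooling down after the previous attempt
            action.last_attempt = now
            try:
                finished = await action.factory()
            except Exception:
                finished = False  # a raising action is retried later
            if finished and self._actions.get(name) is action:
                del self._actions[name]
```

Unlike this sketch, which awaits each action inline, the commit runs actions as application tasks so that a device swap cannot cancel them.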

Actions run as application tasks, not device tasks: device-owned tasks
are cancelled when a re-interview swaps the device object, which would
cancel an action that is itself driving the re-interview.

Poll Control check-ins now also request fast polling while check-in
actions are pending, keeping the device awake for the queued work.

Claude-Session: https://claude.ai/code/session_01VnDASUDhbLfANUM9UbnAeo

- Re-registering an existing action now mutates the stored
  `_CheckinAction` in place instead of replacing it: a running attempt
  unregisters itself by object identity on success, so a replacement
  object would have left completed actions permanently registered.
- Document why neither `remove_checkin_action()` nor `on_remove()`
  cancels a running attempt: `_device_reinterviewed()` calls the old
  device's `on_remove()` mid-swap, and an action driving that very
  re-interview must not cancel its own call stack.
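
The identity problem in the first bullet can be reproduced in a few lines. The sketch below uses hypothetical names (`Action`, `register_replacing`, `register_in_place`, `unregister_if_same`) and a plain dict as the registry; it only illustrates why replacing the stored object breaks identity-based unregistration.

```python
class Action:
    """Toy action record; stands in for the stored per-action object."""

    def __init__(self, factory, cooldown: float) -> None:
        self.factory = factory
        self.cooldown = cooldown


def register_replacing(registry: dict, name: str, factory, cooldown: float) -> None:
    """Problematic variant: always stores a brand-new object."""
    registry[name] = Action(factory, cooldown)


def register_in_place(registry: dict, name: str, factory, cooldown: float) -> None:
    """Fixed variant: update the existing object so its identity is stable."""
    existing = registry.get(name)
    if existing is None:
        registry[name] = Action(factory, cooldown)
    else:
        existing.factory = factory
        existing.cooldown = cooldown


def unregister_if_same(registry: dict, name: str, action: Action) -> bool:
    """What a finished attempt does: remove itself only if it is still stored."""
    if registry.get(name) is action:
        del registry[name]
        return True
    return False
```

If a running attempt holds a reference to the stored object and the action is re-registered mid-run, only the in-place variant lets the attempt unregister itself on success.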

Claude-Session: https://claude.ai/code/session_01VnDASUDhbLfANUM9UbnAeo

Add a `reinterview_pending` flag to `Device` (a request timestamp,
mirroring `last_seen`), settable via the public
`schedule_reinterview_on_checkin()`. While set, a check-in action is
armed that drives `reinterview()` whenever the device is next heard
from, retrying on later check-ins (with a cooldown) until the
re-interview succeeds. The flag setter fires
`device_reinterview_pending_updated` so it can be persisted.
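
The flag, its setter, and the public entry point could look roughly like the sketch below. `SketchDevice`, the `checkin_actions` set, and the `events` list are hypothetical stand-ins for the real action registry and event dispatch. Returning `False` for the coordinator is an assumption, since the text does not say how the coordinator is refused.

```python
from __future__ import annotations

import time


class SketchDevice:
    """Toy device showing the flag/setter relationship; not zigpy's Device."""

    def __init__(self, is_coordinator: bool = False) -> None:
        self.is_coordinator = is_coordinator
        self._reinterview_pending: float | None = None
        self.checkin_actions: set[str] = set()  # hypothetical action registry
        self.events: list[tuple[str, float | None]] = []  # hypothetical event log

    @property
    def reinterview_pending(self) -> float | None:
        return self._reinterview_pending

    @reinterview_pending.setter
    def reinterview_pending(self, value: float | None) -> None:
        self._reinterview_pending = value
        # Keep the check-in action armed exactly while the flag is set.
        if value is None:
            self.checkin_actions.discard("reinterview")
        else:
            self.checkin_actions.add("reinterview")
        self.events.append(("device_reinterview_pending_updated", value))

    def schedule_reinterview_on_checkin(self) -> bool:
        if self.is_coordinator:
            return False  # assumption: refusal is signalled by returning False
        if self._reinterview_pending is None:
            # Repeat calls keep the original request timestamp.
            self.reinterview_pending = time.time()
        return True
```

Routing every write through the setter is what lets any caller arm or disarm the action without knowing about the registry.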

Post-OTA wiring: `update_firmware()` queues the re-interview right
after a successful flash, before waiting for confirmation — a sleepy
end device that misses the whole confirmation window is re-interviewed
when it next reports in. The confirmation wait also resolves when a
`device_reinterviewed` event for the device fires (e.g. the check-in
path already completed the re-interview mid-wait). The OTA cluster's
version-change listener queues the flag too, so its immediate
re-interview attempt is retried if the device goes back to sleep.

A successful re-interview clears the flag on the new device object; the
flag is deliberately not copied to the re-interview shadow. `clone()`
copies the raw field without firing events or arming actions. The
coordinator itself is never queued.

Claude-Session: https://claude.ai/code/session_01VnDASUDhbLfANUM9UbnAeo

TheJulianJES force-pushed the tjj/reinterview-on-checkin branch from 9a34f5a to adfd765 on July 3, 2026 05:18

codecov Bot commented Jul 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.47%. Comparing base (925ea3d) to head (e24c3d4).

Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1851      +/-   ##
==========================================
- Coverage   99.47%   99.47%   -0.01%     
==========================================
  Files          57       57              
  Lines       12026    12156     +130     
==========================================
+ Hits        11963    12092     +129     
- Misses         63       64       +1     
