fix: willmsg not published in takeover #11868

qzhuyan · 2023-11-02T12:51:03Z

Fixes #10551
Following the MQTT 5.0 spec, I drew the state diagram of the will message (willmsg).

stateDiagram-v2
  [*] --> will_msg_present: will_msg present in connect
  will_msg_present --> scheduled: server conn \nabnormal shutdown (a.)
  will_msg_present --> scheduled: client conn \nabnormal shutdown (a.)
  will_msg_present --> scheduled: take over (d.)
  will_msg_present --> [*]: disconnect normally (RC 0) (e.)\n will_msg removed
  scheduled --> cancelled: reconnected (c.)
  scheduled --> publish: session ends timeout (a.)
  scheduled --> publish: will_delay timeout (a.)
  will_msg_present --> publish: session ends (clean start) (f.)
  cancelled --> [*] :will_msg removed
  publish --> [*] :will_msg removed

(a.) MQTT-3.1.2-8
(b.) MQTT-3.1.2-10
(c.) MQTT-3.1.3-9
(d.) MQTT-3.1.4-3
(e.) MQTT-3.14.4-3
(f.) ch 3.1.3.2.2

It can be simplified to the following (in EMQX terms):

stateDiagram-v2
[*] --> will_msg_present: will_msg present in connect
will_msg_present --> scheduled: sock_close, takeover\n connection abort
will_msg_present --> [*]: disconnect RC 0\n will_msg_removed
will_msg_present --> publish: discard, kick\n session ends
scheduled --> cancelled: reconnect\n taken over
scheduled --> publish: expired\nsession ends
scheduled --> publish: will_delay timeout (a.)
cancelled --> [*] :will_msg removed
publish --> [*] :will_msg removed
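
The simplified diagram can also be read as a pure transition function. The following is only a minimal sketch of that reading, not code from this PR: the module name `willmsg_fsm_sketch`, the function `next/2`, the event atoms, and the `done` atom (standing in for the `[*]` end state) are all assumptions made for illustration.

```erlang
-module(willmsg_fsm_sketch).
-export([next/2]).

%% Hypothetical model of the simplified diagram; not EMQX code.
%% States: will_msg_present | scheduled | cancelled | publish | done
%% Events: the edge labels of the diagram, spelled as atoms.
-spec next(State :: atom(), Event :: atom()) -> atom().

%% Connection aborted: publishing is deferred.
next(will_msg_present, sock_close) -> scheduled;
next(will_msg_present, takeover) -> scheduled;
%% Clean disconnect: will_msg is removed and nothing is published.
next(will_msg_present, disconnect_rc0) -> done;
%% Session ends immediately: publish right away.
next(will_msg_present, discard) -> publish;
next(will_msg_present, kick) -> publish;
%% A deferred publish is cancelled when the client comes back.
next(scheduled, reconnect) -> cancelled;
next(scheduled, taken_over) -> cancelled;
%% Otherwise the deferred publish fires.
next(scheduled, expired) -> publish;
next(scheduled, will_delay_timeout) -> publish;
%% Both outcomes end with will_msg removed.
next(cancelled, will_msg_removed) -> done;
next(publish, will_msg_removed) -> done.
```

Any (state, event) pair that is not an edge in the diagram fails with `function_clause`, which makes unexpected transitions visible instead of silently ignoring them.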

So the following applies:

1. will_msg is stored in the conn proc heap when connected.
2. No will_msg is published if will_msg is absent.
3. Receiving MQTT DISCONNECT with RC 0 from the client removes the will_msg from the proc heap.
4. When the session ends, EMQX publishes the will_msg.
   This applies to 'discard', 'internal_error', 'kick' and 'expired'.
5. EMQX defers will_msg publishing on connection abort.
   This applies to 'takeover' and 'sock_close' (socket error); it is also the default error handling.
6. will_msg and the scheduled willmsg publishing must be removed after publish [note 4.]

Notes:

1. EMQX names two takeover scenarios to reflect the term 'takenover' in MQTT 5.0:
   a. Discard: takeover with clean_session = True
   b. Takeover: takeover with clean_session = False
2. For 'internal_error', we don't have a chance to keep the session?
3. 'kick' in EMQX also removes the session.
4. An already triggered will_msg publishing will not work if will_msg is absent.

Summary

`🤖 Generated by Copilot at b497f84`

This pull request fixes a bug with will messages not being published after session takeover, adds a feature to gracefully close the transport layer when shutting down a connection, refactors and simplifies the code for handling will messages and channel termination, and updates the test suite to cover different scenarios for session takeover. It also adds a change log file changes/ce/fix-11868.en.md to document the bug fix.

PR Checklist

Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:

Added tests for the changes
Added property-based tests for code which performs user input validation
Changed lines covered in coverage report
Change log has been added to changes/(ce|ee)/(feat|perf|fix|breaking)-<PR-id>.en.md files
For internal contributor: there is a jira ticket to track this change
Created PR to emqx-docs if documentation update is required, or link to a follow-up jira ticket
Schema changes are backward compatible

Checklist for CI (.github/workflows) changes

If changed package build workflow, pass this action (manual trigger)
Change log has been added to changes/ dir for user-facing artifacts update

keynslug

💯

apps/emqx/test/emqx_takeover_SUITE.erl

apps/emqx/src/emqx_channel.erl

apps/emqx/test/emqx_takeover_SUITE.erl

apps/emqx/src/emqx_channel.erl

qzhuyan · 2023-11-10T10:57:21Z

I made some updates, but I need to dig more into the possible shutdown reasons and the connection states in which willmsg publishing is allowed.

keynslug

👍🏼

keynslug · 2023-11-13T04:19:59Z

apps/emqx/test/emqx_takeover_SUITE.erl

+    ?assertNot(IsWill2),
+    emqtt:stop(CPid2),
+    emqtt:stop(CPidSub),
+    ?assert(not is_process_alive(CPid1)),


Tiny nit. The assert_client_exit/3 call several lines before returns only when CPid1 process exits, so this line seems kinda unnecessary.

I will leave it; it is test code, and assertions just help.

apps/emqx/test/emqx_shared_sub_SUITE.erl

apps/emqx/src/emqx_channel.erl

keynslug · 2023-11-13T04:28:06Z

apps/emqx/src/emqx_channel.erl

+    %% a. expired (session expired)
+    %% c. discarded (Session ends because another process starts new session with the same clientid)
+    %% b. kicked. (kicked by operation)
+    %% d. internal_error (maybe not recoverable)


Agree. It seems a bit unsafe to assume that in the event of hitting internal error publishing should still work.

I just listed them and may have missed something; the entire call path and call stack are hard to follow.

I found this in spec:

In the case of a Server shutdown or failure, the Server MAY defer publication of Will Messages until a subsequent restart. If this happens, there might be a delay between the time the Server experienced failure and when the Will Message is published.

So I think it is OK to publish, not publish, or delay publishing the willmsg on a server error.

keynslug · 2023-11-13T04:38:03Z

apps/emqx/src/emqx_channel.erl

 terminate({shutdown, Reason}, Channel) when
-    Reason =:= discarded;
-    Reason =:= takenover
+    Reason =:= expired orelse
+        Reason =:= takenover orelse
+        Reason =:= kicked orelse
+        Reason =:= discarded
 ->
-    run_terminate_hook(Reason, Channel);
-terminate(Reason, Channel = #channel{clientinfo = ClientInfo, will_msg = WillMsg}) ->
-    %% since will_msg is set to undefined as soon as it is published,
-    %% if will_msg still exists when the session is terminated, it
-    %% must be published immediately.
-    WillMsg =/= undefined andalso publish_will_msg(ClientInfo, WillMsg),
-    run_terminate_hook(Reason, Channel).
+    Channel1 = maybe_publish_will_msg(Reason, Channel),
+    run_terminate_hook(Reason, Channel1);


Nit. This clause's logic is identical to the next catch-all clause, the only difference is taking the inner atom from shutdown tuple. I'd argue this makes the shutdown logic harder to follow.

In general, I suspect that one thing that could help to make things simpler to follow is separating maybe_publish_will_msg into a set of 2 functions: one is for scheduling willmsg publishing (e.g. essentially called only when sock_closed), another one is for deciding what to do on channel termination (when there's no point to schedule anything). This should also eliminate the need to do something explicitly on handle_call(kick, ...), i.e. here.

Nit. This clause's logic is identical to the next catch-all clause, the only difference is taking the inner atom from shutdown tuple. I'd argue this makes the shutdown logic harder to follow.

I thought the same and did remove it, until I found that run_terminate_hook relies on the Reason (kicked) instead of {shutdown, kicked}. The hooks are exposed to users, so I think it is better to keep it.

In general, I suspect that one thing that could help to make things simpler to follow is separating maybe_publish_will_msg into a set of 2 functions: one is for scheduling willmsg publishing (e.g. essentially called only when sock_closed), another one is for deciding what to do on channel termination (when there's no point to schedule anything). This should also eliminate the need to do something explicitly on handle_call(kick, ...), i.e. here.

I found that scheduling willmsg publishing will not always work, because the process may terminate before the timer fires.
A simple fix would be to extend the process lifetime, but that is costly and cannot be done in terminate without changing the whole call stack.

Ah, good catch. However, it also means that the terminate hook will now be called with expired instead of {shutdown, expired} as before, although I'm not sure whether the former behavior is important to preserve here.

I found that scheduling willmsg publishing will not always work, because the process may terminate before the timer fires.

I mean, that seems the reason why the logic is hard to follow: currently maybe_publish_will_msg is called both from sock_closed context (where scheduling timers makes sense) and terminate context (where scheduling timers doesn't make sense). Yet some of its codepaths end up with timers being scheduled anyway, which is confusing: it's very hard to tell in which context they are scheduled. Thus it seems worthwhile to try to separate this into maybe_schedule_will_msg/1 and maybe_publish_will_msg_on_terminate/2, at least on the first look.
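
To make the proposed split concrete, here is a minimal sketch of what the two functions could look like. It is not the code in this PR: the channel is a plain map instead of the real `#channel{}` record, the `will_delay` key, the return shapes (`no_timer`, `{start_timer, Delay}`, a list of messages to publish) and the module name are assumptions, and `Reason` is deliberately ignored even though a real version would probably branch on it.

```erlang
-module(willmsg_split_sketch).
-export([maybe_schedule_will_msg/1, maybe_publish_will_msg_on_terminate/2]).

%% Hypothetical sketch: Channel is a map such as
%% #{will_msg => Msg | undefined, will_delay => Seconds}.

%% sock_closed context: never publishes, only says whether a timer is needed.
maybe_schedule_will_msg(#{will_msg := undefined} = Channel) ->
    {no_timer, Channel};
maybe_schedule_will_msg(#{will_msg := _Msg, will_delay := Delay} = Channel) ->
    {{start_timer, Delay}, Channel}.

%% terminate context: never starts a timer. It returns the messages to
%% publish right now and clears will_msg so it cannot be published twice.
maybe_publish_will_msg_on_terminate(_Reason, #{will_msg := undefined} = Channel) ->
    {[], Channel};
maybe_publish_will_msg_on_terminate(_Reason, #{will_msg := Msg} = Channel) ->
    {[Msg], Channel#{will_msg := undefined}}.
```

With this shape, each call site states its context through the function it calls, so a timer can only ever originate from the sock_closed path.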

In general, I suspect that one thing that could help to make things simpler to follow is separating maybe_publish_will_msg into a set of 2 functions: one is for scheduling willmsg publishing (e.g. essentially called only when sock_closed), another one is for deciding what to do on channel termination (when there's no point to schedule anything). This should also eliminate the need to do something explicitly on handle_call(kick, ...), i.e. here.

I found that scheduling willmsg publishing will not always work, because the process may terminate before the timer fires. A simple fix would be to extend the process lifetime, but that is costly and cannot be done in terminate without changing the whole call stack.

I was wrong. We can only schedule willmsg publishing when the session expiry is > 0.
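
The constraint can be written down as a small decision function. This is a sketch under the MQTT 5.0 rule that the will message is published when the Will Delay Interval has passed or the session ends, whichever happens first; the module name `willmsg_delay_sketch`, the function `plan/2` and its return values are hypothetical and do not appear in the PR.

```erlang
-module(willmsg_delay_sketch).
-export([plan/2]).

%% Hypothetical helper. Both intervals are in seconds.
-spec plan(WillDelay :: non_neg_integer(), SessionExpiry :: non_neg_integer()) ->
    publish_now | {schedule_after, pos_integer()}.

%% No will delay: nothing to wait for.
plan(0, _SessionExpiry) -> publish_now;
%% Session ends with the connection: there is no session left to hold a timer.
plan(_WillDelay, 0) -> publish_now;
%% Otherwise wait for whichever of the two comes first.
plan(WillDelay, SessionExpiry) -> {schedule_after, min(WillDelay, SessionExpiry)}.
```

For example, `plan(10, 3)` yields `{schedule_after, 3}` because the session ends before the will delay elapses, while `plan(10, 0)` yields `publish_now`.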

qzhuyan · 2023-11-15T13:03:36Z

apps/emqx/src/emqx_channel.erl

+    Reason =:= expired orelse
+        Reason =:= discarded orelse
+        Reason =:= kicked orelse
+        Reason =:= ?chan_terminating orelse


this keeps the old behavior.

apps/emqx/test/emqx_takeover_SUITE.erl

apps/emqx/src/emqx_channel.erl

because kick means shutdown connection AND delete session

Cases covered: will delay > session expire, and will delay < session expire. Timer-triggered events are handled in sequence; the case of (will delay == session expire) is excluded.

apps/emqx/src/emqx_channel.erl

zmstone · 2024-02-29T08:36:49Z

kick should not trigger the will message, for security reasons.

qzhuyan · 2024-02-29T09:29:43Z

kick should not trigger the will message, for security reasons.

We decided to handle it in a separate PR.

qzhuyan force-pushed the fix/william/takeover-pub-willmsg branch 4 times, most recently from 482d4af to a4c202e Compare November 6, 2023 14:22

qzhuyan marked this pull request as ready for review November 6, 2023 14:41

qzhuyan requested review from lafirest and a team as code owners November 6, 2023 14:41

keynslug reviewed Nov 7, 2023

qzhuyan commented Nov 7, 2023

apps/emqx/src/emqx_channel.erl

qzhuyan commented Nov 7, 2023

apps/emqx/src/emqx_channel.erl

qzhuyan force-pushed the fix/william/takeover-pub-willmsg branch from a4c202e to f60eb8b Compare November 10, 2023 10:55

qzhuyan force-pushed the fix/william/takeover-pub-willmsg branch 2 times, most recently from 805ff6e to 6f779dc Compare November 11, 2023 15:51

keynslug reviewed Nov 13, 2023

qzhuyan commented Nov 15, 2023

qzhuyan force-pushed the fix/william/takeover-pub-willmsg branch from 6aff0da to 23c7e59 Compare November 15, 2023 21:11

qzhuyan commented Nov 16, 2023

apps/emqx/test/emqx_takeover_SUITE.erl

qzhuyan commented Nov 16, 2023

apps/emqx/test/emqx_takeover_SUITE.erl

qzhuyan commented Nov 16, 2023

apps/emqx/src/emqx_channel.erl

qzhuyan force-pushed the fix/william/takeover-pub-willmsg branch 2 times, most recently from 78f4b14 to a0c8c06 Compare November 22, 2023 14:01

qzhuyan requested review from ieQu1, kjellwinblad, sstrigler, JimMoen and HJianBo as code owners November 22, 2023 14:01

qzhuyan changed the base branch from release-53 to master November 22, 2023 14:02

qzhuyan closed this Nov 23, 2023

qzhuyan reopened this Nov 23, 2023

qzhuyan added 6 commits February 22, 2024 16:39

fix(mqtt): ensure publish willmsg when session expires

9da4896

fix(kick): defer willmsg publish when conn terminates

6243cf0

because kick means shutdown connection AND delete session

test(takeover): add back the delay when takeover

b76c701

fix: handle delayed willmsg, part 1

dd62280

test(willmsg): test will delay and session expires

5397402

Cases covered: will delay > session expire, and will delay < session expire. Timer-triggered events are handled in sequence; the case of (will delay == session expire) is excluded.

test(willmsg): session taken over before willmsg delay /session expire

6311b58

qzhuyan force-pushed the fix/william/takeover-pub-willmsg branch from a0c8c06 to 125091a Compare February 22, 2024 16:04

qzhuyan added 2 commits February 22, 2024 17:12

fix: maybe send willmsg

e5a3574

chore(willmsg): add come comments

2ff33f9

qzhuyan force-pushed the fix/william/takeover-pub-willmsg branch from 125091a to 2ff33f9 Compare February 22, 2024 16:14

zmstone reviewed Feb 23, 2024

apps/emqx/src/emqx_channel.erl

apps/emqx/src/emqx_channel.erl

qzhuyan marked this pull request as draft February 23, 2024 11:22

qzhuyan commented Feb 26, 2024

apps/emqx/src/emqx_channel.erl

qzhuyan commented Feb 26, 2024

apps/emqx/src/emqx_channel.erl

qzhuyan commented Feb 26, 2024

apps/emqx/src/emqx_channel.erl

qzhuyan commented Feb 26, 2024

apps/emqx/src/emqx_channel.erl

qzhuyan commented Feb 26, 2024

apps/emqx/src/emqx_channel.erl

qzhuyan added 5 commits February 26, 2024 22:24

chore: clean in tests

975c742

refactor: update notes for willmsg

d5247cb

chore(willmsg): remove unreachable code

88fc67b

chore: update some comments

c8f9ffd

feat(quic): mqtt 5.0 graceful shutdown in takeover

6c7b774

qzhuyan force-pushed the fix/william/takeover-pub-willmsg branch from 26fc157 to 6c7b774 Compare February 26, 2024 22:06

qzhuyan marked this pull request as ready for review February 27, 2024 11:16

zmstone approved these changes Feb 29, 2024

qzhuyan merged commit b82973b into emqx:master Feb 29, 2024
166 checks passed

qzhuyan deleted the fix/william/takeover-pub-willmsg branch February 29, 2024 12:14

BrewTestBot mentioned this pull request Mar 28, 2024

emqx 5.6.0 Homebrew/homebrew-core#167392

Merged
