Support for "slow nodes": un- and re-peering of nodes that Broker drops due to excessive message I/O lag #3981
Force-pushed from c5e6c30 to 775ea2e.
This is nearly ready, but I've marked it as a draft until some final tweaks are in place, including finalizing the Broker-side PR and getting #3980 merged first.
The review thread below is anchored on this hunk:

```zeek
@load base/frameworks/telemetry

module Cluster;
```
Could we move this logic into `frameworks/broker` instead? Also wondering how much the existing `broker.log` already covers the logging done here.

I'm also a bit wary about the `Cluster::node_slow()` event we're introducing to users, and wonder whether a less generic `Broker::peer_forced_disconnect` (or something better named; I find even "slow" subjective) would be sufficient.
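For illustration, the Broker-scoped event suggested above might be declared as follows. This is purely a sketch: the name `peer_forced_disconnect` is the reviewer's hypothetical suggestion, and the signature is an assumption mirroring existing peering events such as `Broker::peer_lost`, not code from this PR.

```zeek
module Broker;

export {
	## Hypothetical event sketch: raised when the local endpoint
	## deliberately drops a peering because the peer fell too far
	## behind in message I/O. Name and signature are assumptions,
	## modeled on Broker::peer_lost/Broker::peer_removed.
	global peer_forced_disconnect: event(endpoint: Broker::EndpointInfo,
	    msg: string);
}
```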
Ah yeah, I did wonder about Broker vs. Cluster frameworks for this one. I landed on the Cluster framework because the `node_up` and `node_down` events fit well, to me, with the fact that a node is overwhelmed. That's also why I chose "slow node" as the moniker, and I'm happy to change it. Basically, there's the Broker-specific fact that it un-peered a node due to backpressure, and the cluster-level observation that a node is overwhelmed.

`broker.log` currently captures the event via its existing logging of removed/lost peerings in status updates, if you know the message string to look for. If Broker had more explicit handling of the reasons why things happen, there'd be nothing new to do at that level. (I realized while working on this that Broker generally has more detail available than Zeek reports, for example when a peer rejects a peering's new TCP connection because it's redundant to an already existing peering.)
I just noticed that `Cluster::node_down` only triggers as a derivative of `Broker::peer_lost` (i.e., from the perspective of the node connected to), but not of `Broker::peer_removed` (the perspective of the node establishing the connection). If we support this from either end, the Cluster framework will automatically report nodes de-peered due to backpressure as down, and we don't need to do anything in there.

Let me try the above, and add an explicit backpressure-induced de-peering event to Broker.
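A minimal sketch of the symmetric handling described above, assuming Broker's existing `peer_lost`/`peer_removed` signatures and the optional `id` field of `Cluster::Node`; the name-lookup helper is hypothetical, standing in for the mapping the Cluster framework keeps internally:

```zeek
module Cluster;

# Hypothetical helper: map a Broker endpoint ID to a cluster node name.
function lookup_node_name(id: string): string
	{
	for ( name, n in Cluster::nodes )
		if ( n?$id && n$id == id )
			return name;
	return "";
	}

# The framework already derives Cluster::node_down from this event ...
event Broker::peer_lost(endpoint: Broker::EndpointInfo, msg: string)
	{
	local name = lookup_node_name(endpoint$id);
	if ( name != "" )
		event Cluster::node_down(name, endpoint$id);
	}

# ... so handling Broker::peer_removed the same way covers the
# connecting side too, including backpressure-induced de-peerings.
event Broker::peer_removed(endpoint: Broker::EndpointInfo, msg: string)
	{
	local name = lookup_node_name(endpoint$id);
	if ( name != "" )
		event Cluster::node_down(name, endpoint$id);
	}
```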
This adds a `zeek_slow_peers` counter, labeled by the neighboring peer that a given node has determined to be slow. Here the node "worker" has observed node "proxy" falling behind once:

```
# HELP zeek_slow_peers_total Number of peering drops due to a neighbor falling too far behind in message I/O
# TYPE zeek_slow_peers_total counter
zeek_slow_peers_total{endpoint="worker",peer="proxy"} 1
```
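For context, a counter like this could be registered via Zeek's telemetry framework roughly as below. This is a sketch, not the PR's actual code: the `Cluster::node_slow` handler and its signature are assumptions based on the event name discussed in this thread.

```zeek
@load base/frameworks/telemetry

module Cluster;

# Register a counter family labeled by peer; name and help text mirror
# the exposition output above.
global slow_peers_cf = Telemetry::register_counter_family([
	$prefix="zeek",
	$name="slow_peers",
	$unit="1",
	$help_text="Number of peering drops due to a neighbor falling too far behind in message I/O",
	$labels=vector("peer")
	]);

# Hypothetical handler: bump the counter when this node decides a
# neighboring peer is slow. Event name/signature are assumptions.
event Cluster::node_slow(name: string)
	{
	Telemetry::counter_family_inc(slow_peers_cf, vector(name));
	}
```

Note that only the `peer` label needs supplying here; the telemetry framework attaches the `endpoint` label itself based on the local node's name.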
Force-pushed from 775ea2e to 25dac7a.
This supersedes #3951, building on @Neverlord's work. It currently also stacks on top of #3980 until that's merged.