Support for "slow nodes": un- and re-peering of nodes that Broker drops due to excessive message I/O lag #3981
Force-pushed from c5e6c30 to 775ea2e.
This is nearly ready, but I've marked it as a draft until some final tweaks are in place, including finalizing the Broker-side PR and getting #3980 merged first.
The review thread below is anchored on this hunk:

```zeek
@load base/frameworks/telemetry

module Cluster;
```
Could we move this logic into `frameworks/broker` instead? Also wondering how much the existing `broker.log` already covers the logging done here.

I'm also a bit wary about the `Cluster::node_slow()` event we're introducing to users, and wonder whether a less generic `Broker::peer_forced_disconnect` (or something better named; I find even "slow" subjective) would be sufficient.
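For illustration, the Broker-scoped event suggested above might be declared as follows. This is purely a sketch: the name `peer_forced_disconnect` is the reviewer's hypothetical suggestion, and the signature is an assumption mirroring existing peering events such as `Broker::peer_lost`, not code from this PR.

```zeek
module Broker;

export {
	## Hypothetical event sketch: raised when the local endpoint
	## deliberately drops a peering because the peer fell too far
	## behind in message I/O. Name and signature are assumptions,
	## modeled on Broker::peer_lost/Broker::peer_removed.
	global peer_forced_disconnect: event(endpoint: Broker::EndpointInfo,
	    msg: string);
}
```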
Ah yeah, I did wonder about Broker vs. Cluster frameworks for this one. I landed on the Cluster framework because the `node_up` and `node_down` events fit well, to me, with the fact that a node is overwhelmed. That's also why I chose "slow node" as the moniker, and I'm happy to change it. Basically, there's the Broker-specific fact that it un-peered a node due to backpressure, and the cluster-level observation that a node is overwhelmed.

`broker.log` currently captures the event via its existing logging of removed/lost peerings in status updates, if you know the message string to look for. If Broker had more explicit handling of the reasons why things happen, there'd be nothing new to do at that level. (I realized while working on this that Broker generally has more detail available than Zeek reports, for example when a peer rejects a peering's new TCP connection because it's redundant to an already existing peering.)
I just noticed that `Cluster::node_down` only triggers as a derivative of `Broker::peer_lost` (i.e., from the perspective of the node connected to), but not of `Broker::peer_removed` (the perspective of the node establishing the connection). If we support this from either end, the Cluster framework will automatically report nodes de-peered due to backpressure as down, and we don't need to do anything in there.

Let me try the above, and add an explicit backpressure-induced de-peering event to Broker.
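A minimal sketch of the symmetric handling described above, assuming Broker's existing `peer_lost`/`peer_removed` signatures and the optional `id` field of `Cluster::Node`; the name-lookup helper is hypothetical, standing in for the mapping the Cluster framework keeps internally:

```zeek
module Cluster;

# Hypothetical helper: map a Broker endpoint ID to a cluster node name.
function lookup_node_name(id: string): string
	{
	for ( name, n in Cluster::nodes )
		if ( n?$id && n$id == id )
			return name;
	return "";
	}

# The framework already derives Cluster::node_down from this event ...
event Broker::peer_lost(endpoint: Broker::EndpointInfo, msg: string)
	{
	local name = lookup_node_name(endpoint$id);
	if ( name != "" )
		event Cluster::node_down(name, endpoint$id);
	}

# ... so handling Broker::peer_removed the same way covers the
# connecting side too, including backpressure-induced de-peerings.
event Broker::peer_removed(endpoint: Broker::EndpointInfo, msg: string)
	{
	local name = lookup_node_name(endpoint$id);
	if ( name != "" )
		event Cluster::node_down(name, endpoint$id);
	}
```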
This adds a `zeek_slow_peers` counter, labeled by the neighboring peer that a given node has determined to be slow. Here the node "worker" has observed node "proxy" falling behind once:

```
# HELP zeek_slow_peers_total Number of peering drops due to a neighbor falling too far behind in message I/O
# TYPE zeek_slow_peers_total counter
zeek_slow_peers_total{endpoint="worker",peer="proxy"} 1
```
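For context, a counter like this could be registered via Zeek's telemetry framework roughly as below. This is a sketch, not the PR's actual code: the `Cluster::node_slow` handler and its signature are assumptions based on the event name discussed in this thread.

```zeek
@load base/frameworks/telemetry

module Cluster;

# Register a counter family labeled by peer; name and help text mirror
# the exposition output above.
global slow_peers_cf = Telemetry::register_counter_family([
	$prefix="zeek",
	$name="slow_peers",
	$unit="1",
	$help_text="Number of peering drops due to a neighbor falling too far behind in message I/O",
	$labels=vector("peer")
	]);

# Hypothetical handler: bump the counter when this node decides a
# neighboring peer is slow. Event name/signature are assumptions.
event Cluster::node_slow(name: string)
	{
	Telemetry::counter_family_inc(slow_peers_cf, vector(name));
	}
```

Note that only the `peer` label needs supplying here; the telemetry framework attaches the `endpoint` label itself based on the local node's name.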
Force-pushed from 775ea2e to 25dac7a.
This supersedes #3951, building on @Neverlord's work. It currently also stacks on top of #3980 until that's merged.