
Support for "slow nodes": un- and re-peering of nodes that Broker drops due to excessive message I/O lag #3981

Draft
ckreibich wants to merge 4 commits into master from topic/christian/disconnect-slow-peers
Conversation

ckreibich
Member

This supersedes #3951, building on @Neverlord's work. It currently also stacks on top of #3980 until that's merged.

@ckreibich ckreibich force-pushed the topic/christian/disconnect-slow-peers branch 2 times, most recently from c5e6c30 to 775ea2e on October 18, 2024 16:57
@ckreibich ckreibich marked this pull request as draft October 21, 2024 05:29
@ckreibich
Member Author

This is nearly ready but I've marked it as a draft until some final tweaks are in place, including finalizing the Broker-side PR, and getting #3980 merged first.


@load base/frameworks/telemetry

module Cluster;
Contributor


Could we move this logic into frameworks/broker instead? I'm also wondering how much of the logging done here the existing broker.log already covers.

I'm also a bit wary about Cluster::node_slow() - we're not introducing any users of it here, and I wonder if a less generic Broker::peer_forced_disconnect (or something better named - I find even "slow" subjective) would be sufficient.
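For illustration, here is a minimal sketch of what such a Broker-scoped event could look like. The name peer_forced_disconnect is just the hypothetical suggestion from this comment, not an existing Zeek API, and the signature is an assumption:

@load base/frameworks/broker

# Hypothetical sketch only: this event does not exist in Zeek; it stands in
# for the Broker-scoped alternative to a generic Cluster::node_slow().
module Broker;

export {
	## Generated when Broker drops a peering because the peer fell too far
	## behind in message I/O.
	global peer_forced_disconnect: event(endpoint: EndpointInfo, msg: string);
}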

Member Author


Ah yeah — I did wonder about Broker vs. Cluster frameworks for this one. I landed on the Cluster framework because, to me, the node_up and node_down events pair naturally with the notion of a node being overwhelmed. That's also why I chose "slow node" as the moniker, and I'm happy to change it. Basically, there's the Broker-specific fact that it un-peered a node due to backpressure, and the cluster-level observation that a node is overwhelmed.

Broker.log currently captures the event via its existing logging of removed/lost peerings in status updates, if you know the message string to look for. If we handled the reasons behind such Broker events more explicitly, there'd be nothing new to do at that level. (While working on this I realized that Broker generally has more detail available than Zeek reports, for example when a peer rejects a peering's new TCP connection because it's redundant to an already existing peering.)

I just noticed that Cluster::node_down only triggers as a derivative of Broker::peer_lost (i.e., from the perspective of the node being connected to), but not of Broker::peer_removed (the perspective of the node establishing the connection). If we support this from either end, the Cluster framework will automatically report nodes un-peered due to backpressure as down, and we won't need to do anything in there.

Let me try the above, and add an explicit backpressure-induced un-peering event to Broker.
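For reference, a minimal sketch of the symmetric handling described above, assuming that Cluster::Node's optional id field carries the Broker endpoint ID; the lookup is only illustrative, since the Cluster framework maintains its own mapping internally:

@load base/frameworks/cluster

# Sketch: mirror the existing Broker::peer_lost handling for
# Broker::peer_removed, so that un-peerings observed from either end of a
# connection mark the affected node as down.
event Broker::peer_removed(endpoint: Broker::EndpointInfo, msg: string)
	{
	for ( name, n in Cluster::nodes )
		{
		# The id field is populated once the node's Broker endpoint is known.
		if ( n?$id && n$id == endpoint$id )
			{
			event Cluster::node_down(name, endpoint$id);
			break;
			}
		}
	}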

ckreibich and others added 4 commits November 4, 2024 12:44
This adds a zeek_slow_peers counter, labeled by the neighboring peer that a
given node has determined to be slow. Here the node "worker" has observed node
"proxy" falling behind once:

# HELP zeek_slow_peers_total Number of peering drops due to a neighbor falling too far behind in message I/O
# TYPE zeek_slow_peers_total counter
zeek_slow_peers_total{endpoint="worker",peer="proxy"} 1
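For reference, a hedged sketch of how such a counter could be registered from script-land with Zeek's telemetry framework. The option values mirror the output above; the actual registration in the PR may differ, report_slow_peer is a hypothetical helper, and the endpoint label is assumed to be added by the metrics export itself:

@load base/frameworks/telemetry

module Cluster;

# Sketch only: register a per-peer counter. The Prometheus export appends
# the _total suffix, yielding zeek_slow_peers_total as shown above.
global slow_peers_cf = Telemetry::register_counter_family([
	$prefix="zeek",
	$name="slow_peers",
	$help_text="Number of peering drops due to a neighbor falling too far behind in message I/O",
	$label_names=vector("peer")
	]);

# Hypothetical helper: bump the counter for a peer we just un-peered.
function report_slow_peer(peer_name: string)
	{
	Telemetry::counter_family_inc(slow_peers_cf, vector(peer_name));
	}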
@ckreibich ckreibich force-pushed the topic/christian/disconnect-slow-peers branch from 775ea2e to 25dac7a on November 4, 2024 20:44