Eliminate hangs/throttling of node heartbeat #52176
Conversation
cc @kubernetes/sig-node-pr-reviews
Force-pushed from f19a6a9 to ae71fc6 (compare)
@@ -395,6 +395,12 @@ func (r *Request) Context(ctx context.Context) *Request {
	return r
}

// Throttle sets a rate limiter for the request.
Add comment: It overrides the current throttle from the client, if any
This makes sense to me and is a minimal solution.
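For readers unfamiliar with the pattern under review, here is a minimal self-contained sketch of a per-request rate limiter override: a request builder keeps an optional limiter that, when set, takes precedence over the client-wide default. The `request` type and the use of `golang.org/x/time/rate` are stand-ins for illustration; the real method in the diff presumably takes client-go's own rate limiter type.

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

// request stands in for a REST request builder; the real client-go type is
// richer, but the override pattern is the same.
type request struct {
	limiter *rate.Limiter // per-request limiter; nil means "use the client default"
}

// Throttle sets a rate limiter for this request only, overriding whatever
// limiter the client was constructed with.
func (r *request) Throttle(l *rate.Limiter) *request {
	r.limiter = l
	return r
}

func main() {
	// A heartbeat-style request that must never wait behind other queued
	// API calls: give it an effectively unlimited limiter.
	r := (&request{}).Throttle(rate.NewLimiter(rate.Inf, 1))
	fmt.Println("limiter overridden:", r.limiter != nil)
}
```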
Force-pushed from ae71fc6 to ec7b310 (compare)
pkg/kubelet/kubelet_node_status.go (outdated)
client: restclient,
decorator: func(req *rest.Request) *rest.Request {
	ctx, cancelFunc := context.WithTimeout(context.Background(), kl.nodeStatusUpdateFrequency)
	cancelFuncs = append(cancelFuncs, cancelFunc)
Not sure it matters in the context of the current usage, but the decorated client here isn't thread-safe. Maybe worth a comment, at least?
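To illustrate the thread-safety concern, here is a hedged sketch of a decorator that attaches a per-request timeout and collects cancel funcs; the `requestDecorator` type and its mutex are hypothetical, added only to show what protecting the shared slice against concurrent use might look like.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// requestDecorator attaches a fresh timeout context to every request it sees.
// The mutex is only needed because cancel funcs are collected in a shared
// slice, which is the thread-safety concern raised above.
type requestDecorator struct {
	timeout time.Duration

	mu          sync.Mutex
	cancelFuncs []context.CancelFunc
}

func (d *requestDecorator) decorate(parent context.Context) context.Context {
	ctx, cancel := context.WithTimeout(parent, d.timeout)

	d.mu.Lock()
	d.cancelFuncs = append(d.cancelFuncs, cancel)
	d.mu.Unlock()

	return ctx
}

// cancelAll releases every outstanding context, e.g. after a sync round.
func (d *requestDecorator) cancelAll() {
	d.mu.Lock()
	defer d.mu.Unlock()
	for _, cancel := range d.cancelFuncs {
		cancel()
	}
	d.cancelFuncs = nil
}

func main() {
	d := &requestDecorator{timeout: 10 * time.Second}
	ctx := d.decorate(context.Background())
	_ = ctx // the context would be attached to the outgoing heartbeat request
	d.cancelAll()
}
```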
Force-pushed from ec7b310 to 616c669 (compare)
Force-pushed from 616c669 to 0fc1d99 (compare)
stackdriver flake
pkg/kubelet/kubelet_node_status.go (outdated)
// each request should not take longer than the update frequency
ctx, cancelFunc := context.WithTimeout(context.Background(), kl.nodeStatusUpdateFrequency)
// free the context after twice that length of time
time.AfterFunc(2*kl.nodeStatusUpdateFrequency, cancelFunc)
Cool idea
Latest revision looks really clearly done to me.
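A self-contained sketch of the idea in the diff above: each heartbeat attempt gets its own deadline, and `time.AfterFunc` releases the context's resources a little later so no one has to track cancel funcs explicitly. `newHeartbeatContext` is an illustrative name, not the kubelet's.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// newHeartbeatContext gives each status update its own deadline and arranges
// for the context's resources to be released shortly after that deadline,
// without anyone having to keep a list of cancel funcs.
func newHeartbeatContext(updateFrequency time.Duration) context.Context {
	// Each request should not take longer than the update frequency.
	ctx, cancel := context.WithTimeout(context.Background(), updateFrequency)
	// Free the context after twice that length of time; by then the request
	// has long finished or has already been abandoned by the deadline above.
	time.AfterFunc(2*updateFrequency, cancel)
	return ctx
}

func main() {
	ctx := newHeartbeatContext(100 * time.Millisecond)
	<-ctx.Done()
	fmt.Println("heartbeat attempt ended:", ctx.Err()) // context deadline exceeded
}
```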
/assign @derekwaynecarr
This is a critical bug fix; labeling appropriately for the 1.8 milestone. /approve
/test pull-kubernetes-e2e-gce-etcd3
@liggitt @alexef I believe the issue described in #41916 is actually an issue with the underlying TCP connection. The fix in this merge implements a timeout on the HTTP request, and does not close and set up a new TCP connection, right? This is also discussed in kubernetes-retired/kube-aws#598
This does not force closing the TCP connection, but does establish a new one (as demonstrated by the accompanying test, which hangs the first TCP connection and ensures a separate connection is established).
@liggitt @RyPeck Experiment shows that the connection is NOT reestablished on a K8s 1.8.3 cluster. A simple way to try it out on a kubelet is to start dropping all packets destined to the API server, then observe how long it takes for the "ESTABLISHED" connection to disappear and new connection attempts to appear. We've also had a 1.8.3 cluster go crazy for some 18 minutes when an ELB was scaled down and half of the ELB "endpoints" disappeared. Investigation points to the kubelet not bothering to re-establish connections to the live nodes. What we did see happening was 10-second-spaced retries/failures from the kubelet to report its status, which result in a timeout waiting for the HTTP response. This particular timeout, however, does not result in the connection being re-established; the next attempt simply reuses the same TCP pipe. If I read the attached test correctly, it ensures that each status update attempt is itself timed out, and that there are multiple attempts happening. This does not prove that the TCP connection is reestablished.
We also encountered the problem referenced in #48638 on 1.7.8. This PR did not resolve the issue.
The attempt counter is only incremented when a new connection is established. Does that not demonstrate that the timeout causes the hung connection to be abandoned and a new connection created?
Can you give more details or a reproducer, possibly as an additional unit test in the same vein as the one included in this PR?
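For reference, a rough sketch of how such a test can distinguish connection reuse from a fresh dial, in the spirit of the test described above: wrap the server's listener and count `Accept` calls, hang the first request, and check that a timed-out attempt is followed by a second accepted connection. All names here are illustrative, not the actual test in this PR.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// countingListener wraps a net.Listener and counts Accept calls, so we can
// tell whether the client reused an existing TCP connection or dialed a new one.
type countingListener struct {
	net.Listener
	accepted int64
}

func (l *countingListener) Accept() (net.Conn, error) {
	c, err := l.Listener.Accept()
	if err == nil {
		atomic.AddInt64(&l.accepted, 1)
	}
	return c, err
}

func main() {
	var calls int64
	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Hang the first request to simulate an unresponsive apiserver.
		if atomic.AddInt64(&calls, 1) == 1 {
			<-r.Context().Done() // blocks until the client gives up and disconnects
		}
	}))
	cl := &countingListener{Listener: srv.Listener}
	srv.Listener = cl
	srv.Start()
	defer srv.Close()

	client := &http.Client{Timeout: 200 * time.Millisecond}
	if _, err := client.Get(srv.URL); err != nil {
		fmt.Println("first attempt:", err) // expected: timeout on the hung handler
	}
	if resp, err := client.Get(srv.URL); err == nil {
		resp.Body.Close()
	}
	// Two accepts means the timed-out connection was abandoned, not reused.
	fmt.Println("connections accepted:", atomic.LoadInt64(&cl.accepted))
}
```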
Sure, let me start with details. Environment: AWS, Kubernetes 1.7.8, CoreOS 1576.4.0. HA masters behind an ELB.
Shortly thereafter, the controller manager reports:
This happens on each kubelet and there is now a production outage due to #45126. We restart kubelets on all worker nodes and now this happens:
The pods take several minutes to get rescheduled and restarted, resolving the outage. We have since switched to NLBs, mitigating this particular failure mode. However, we still occasionally see individual nodes become NotReady. The best reproduction steps I can suggest at this time are here: #41916 (comment). I'll look into reproducing the specific ELB condition in a cloud-provider-independent way as well.
I was able to catch one of these with tcpdump today. Here's the relevant section: The kubelet (10.188.48.76) stops communicating with the apiserver (10.188.21.18) at 12:03:24. It reports an error updating node status at 12:03:34 and continues until restarted at 12:06:39. In the interval between 12:03:24 and 12:06:39, there is no new TCP connection established and the only communication between the kubelet and the apiserver is a backoff of TCP retransmissions.
Hi all, is there any update on this issue?
Any mitigation recommendation? We are seeing this often in our cluster as well.
Primarily to mitigate a Kubernetes issue with kubelets losing their connection to the masters (~06:20 in the morning for us) and, after a period, evicting pods until that connection is reestablished. Some of the related issues and pull requests:
- kubernetes/kubernetes#41916
- kubernetes/kubernetes#48638
- kubernetes/kubernetes#52176
Using an NLB is a suggested mitigation from kubernetes/kubernetes#52176 (comment). NLBs do not support security groups, so we delete the ELB SG and open 443 to the world on the master security group.
Under 1.9.3 and 1.7.12, we are still seeing this failure cause an outage of a large fraction of production nodes roughly every week or two, across nine clusters in six different AWS regions. Here's a procedure to reproduce the problem under controlled conditions on Kubernetes 1.9.3. I'm not sure how to turn this into a unit test yet.

Intent: In production, we see load balancers in front of the apiserver become unresponsive and stop sending ACKs on established TCP connections. After ~15 minutes, the kubelet socket's TCP retransmission delay is exhausted, the kernel considers the connection failed, and the Go runtime establishes a new TCP connection. 15 minutes is much longer than the usual node heartbeat and eviction timeouts, so the control plane treats the node as gone long before the kernel gives up on the connection.

This procedure demonstrates that the kubelet cannot recover from a stalled TCP connection even though #52176 works as intended. The test simulates the effects of a remote TCP socket becoming unresponsive (not sending ACKs) by interposing a local TCP proxy between the kubelet and the apiserver, then suspending the proxy child process handling the connection. This doesn't precisely reproduce the production failure: in the test the remote socket continues sending ACKs, and as a consequence the TCP retransmission timeout is never hit and the local socket never fails. Even so, we can reliably get the kubelet to fail.

Procedure:
Summary: A stalled TCP connection to the apiserver still brings down an active node, with all the consequences outlined in kubernetes-retired/kube-aws#598, #41916, #48638, #44557, #48670, and Financial-Times/content-k8s-provisioner@d290fea. I'm not sure where else to put responsibility for maintaining this connection, so my first suggestion would be to have HeartbeatClient force its connection(s) to close on request timeout.
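One possible shape of that "close on request timeout" suggestion, sketched with plain net/http rather than the kubelet's actual client machinery; `heartbeatClient`, `updateStatus`, and the apiserver URL are hypothetical names used only for illustration.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// heartbeatClient wraps an http.Client and tears down its pooled connections
// whenever a status update fails or times out, so the next attempt dials
// fresh instead of queueing behind a stalled TCP connection.
type heartbeatClient struct {
	client    *http.Client
	transport *http.Transport
	timeout   time.Duration
}

func (h *heartbeatClient) updateStatus(url string) error {
	ctx, cancel := context.WithTimeout(context.Background(), h.timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := h.client.Do(req)
	if err != nil {
		// The request timed out or failed: drop pooled connections so the
		// retry does not reuse a connection the kernel still thinks is fine.
		h.transport.CloseIdleConnections()
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	tr := &http.Transport{}
	h := &heartbeatClient{
		client:    &http.Client{Transport: tr},
		transport: tr,
		timeout:   2 * time.Second,
	}
	// Placeholder URL; a real heartbeat would target the node status endpoint.
	fmt.Println(h.updateStatus("https://apiserver.example:6443/healthz"))
}
```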
@jfoy - Just so I understand: your whole cluster basically busts, right? We're seeing the same thing, where nodes go into NotReady and all of a sudden the entire cluster can't do anything.
@bobbytables Often not the entire cluster, but a big swath of it, yeah. It seems to be random based on which IP address each node's kubelet resolved the ELB name to, the last time it had to start a connection to the apiserver. The nodes drop out for ~15 minutes then suddenly recover, causing much of the load on the cluster to be rescheduled.
Cool, I'm literally sitting in a post-mortem room talking about this. I'm copying your note.
@jfoy @bobbytables We noticed that it tended to occur by availability zone.
This appears to describe the upstream bug: kubernetes/client-go#342
@jfoy We just finished troubleshooting exactly the same issue with AWS NLB in US-East1 with a number of different clusters (a mix of 1.7.8 and 1.9.3): #61917. It's a semi-random event that seems to have gotten worse in the last couple of weeks, to the extent that I'm seeing it multiple times per day now. It seems to happen more often at night. I suspect sheer over-utilization in the region is driving reduced network quality, increasing the chance of socket failure.
@stephbu Thanks, that's really useful to know (and disappointing!)
The unit test here assumes a good TCP connection, which might not be the case in the real world.
@zhouhaibing089 That's right. The problem seems to occur when the kernel thinks the socket is still good and is in a TCP retransmission-retry state. I suspect there's a way to recreate that situation in a unit test via raw sockets, but I haven't had time to develop that idea.
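As a starting point, here is a rough sketch of the kind of harness the reproduction above describes: a local TCP proxy that can be told to stop forwarding while leaving its sockets open, simulating a load balancer that goes silent. `stallableProxy` and the backend address are assumptions for illustration, not part of any existing Kubernetes test.

```go
package main

import (
	"fmt"
	"net"
	"sync/atomic"
)

// stallableProxy forwards TCP traffic to a backend until Stall is called,
// after which it silently stops copying bytes in either direction while
// keeping both sockets open, mimicking an unresponsive load balancer.
type stallableProxy struct {
	backend string
	stalled int64
}

func (p *stallableProxy) Stall() { atomic.StoreInt64(&p.stalled, 1) }

func (p *stallableProxy) Serve(l net.Listener) error {
	for {
		c, err := l.Accept()
		if err != nil {
			return err
		}
		go p.handle(c)
	}
}

func (p *stallableProxy) handle(client net.Conn) {
	server, err := net.Dial("tcp", p.backend)
	if err != nil {
		client.Close()
		return
	}
	pipe := func(dst, src net.Conn) {
		buf := make([]byte, 32*1024)
		for {
			n, err := src.Read(buf)
			if err != nil {
				return
			}
			if atomic.LoadInt64(&p.stalled) == 1 {
				return // stop forwarding, but leave both sockets open
			}
			if _, err := dst.Write(buf[:n]); err != nil {
				return
			}
		}
	}
	go pipe(server, client)
	go pipe(client, server)
}

func main() {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	p := &stallableProxy{backend: "127.0.0.1:6443"} // assumed apiserver address
	go p.Serve(l)
	fmt.Println("proxying", l.Addr(), "->", p.backend)
	// A test would point the kubelet (or a client) at l.Addr(), then call
	// p.Stall() to simulate the load balancer going unresponsive.
	select {}
}
```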
I'm afraid I haven't had any time to look into this further and write a concrete test case, but in case it's helpful: as I mentioned in #48670 (comment), we have been running that patch in production since August and have not seen this issue recur.
We observed the same issue as reported in #48670, tested the PR for our case, and it all works. Is there any possibility of reopening that PR, as it appears others could benefit from it?
We've moved over to NLBs entirely and we're still hitting the issue on a node-by-node basis. I'm going to see what we can do for additional instrumentation to find out who's dropping the ball between the apiserver, the NLB, and the kubelet. EDIT: moving the conversation over to #48638
Fixes #48638
Fixes #50304
Stops the kubelet from wedging when updating node status if it is unable to establish a TCP connection.
Note that this only affects the node status loop. The pod sync loop would still hang until the dead TCP connections time out, so more work is needed to keep the sync loop responsive in the face of network issues, but this change lets existing pods coast without the node controller trying to evict them.
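A hedged sketch of the loop-level effect described here: each status update attempt gets its own deadline, so a wedged attempt fails fast and the loop retries, whereas a loop without such a guard (like the pod sync loop) would simply block. `updateStatusWithRetries` is a hypothetical stand-in, not the kubelet's actual function.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// updateStatusWithRetries gives every attempt its own deadline, so one wedged
// connection can only cost a single update period before the next retry.
func updateStatusWithRetries(updateFrequency time.Duration, tries int,
	update func(ctx context.Context) error) error {

	var lastErr error
	for i := 0; i < tries; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), updateFrequency)
		lastErr = update(ctx)
		cancel()
		if lastErr == nil {
			return nil
		}
	}
	return fmt.Errorf("unable to update node status after %d tries: %w", tries, lastErr)
}

func main() {
	err := updateStatusWithRetries(100*time.Millisecond, 5, func(ctx context.Context) error {
		<-ctx.Done() // simulate a status update that never completes
		return errors.New("request timed out")
	})
	fmt.Println(err)
}
```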