Eliminate hangs/throttling of node heartbeat #52176
Conversation
cc @kubernetes/sig-node-pr-reviews
Force-pushed from f19a6a9 to ae71fc6 (compare)
@@ -395,6 +395,12 @@ func (r *Request) Context(ctx context.Context) *Request {
	return r
}

// Throttle sets a rate limiter for the request.
Add comment: It overrides the current throttle from the client, if any
This makes sense to me and is a minimal solution.
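For readers unfamiliar with the pattern under review, here is a minimal self-contained sketch of a per-request rate limiter override: a request builder keeps an optional limiter that, when set, takes precedence over the client-wide default. The `request` type and the use of `golang.org/x/time/rate` are stand-ins for illustration; the real method in the diff presumably takes client-go's own rate limiter type.

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

// request stands in for a REST request builder; the real client-go type is
// richer, but the override pattern is the same.
type request struct {
	limiter *rate.Limiter // per-request limiter; nil means "use the client default"
}

// Throttle sets a rate limiter for this request only, overriding whatever
// limiter the client was constructed with.
func (r *request) Throttle(l *rate.Limiter) *request {
	r.limiter = l
	return r
}

func main() {
	// A heartbeat-style request that must never wait behind other queued
	// API calls: give it an effectively unlimited limiter.
	r := (&request{}).Throttle(rate.NewLimiter(rate.Inf, 1))
	fmt.Println("limiter overridden:", r.limiter != nil)
}
```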
Force-pushed from ae71fc6 to ec7b310 (compare)
pkg/kubelet/kubelet_node_status.go (outdated)
client: restclient,
decorator: func(req *rest.Request) *rest.Request {
	ctx, cancelFunc := context.WithTimeout(context.Background(), kl.nodeStatusUpdateFrequency)
	cancelFuncs = append(cancelFuncs, cancelFunc)
Not sure it matters in the context of the current usage, but the decorated client here isn't thread-safe. Maybe worth a comment, at least?
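To illustrate the thread-safety concern, here is a hedged sketch of a decorator that attaches a per-request timeout and collects cancel funcs; the `requestDecorator` type and its mutex are hypothetical, added only to show what protecting the shared slice against concurrent use might look like.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// requestDecorator attaches a fresh timeout context to every request it sees.
// The mutex is only needed because cancel funcs are collected in a shared
// slice, which is the thread-safety concern raised above.
type requestDecorator struct {
	timeout time.Duration

	mu          sync.Mutex
	cancelFuncs []context.CancelFunc
}

func (d *requestDecorator) decorate(parent context.Context) context.Context {
	ctx, cancel := context.WithTimeout(parent, d.timeout)

	d.mu.Lock()
	d.cancelFuncs = append(d.cancelFuncs, cancel)
	d.mu.Unlock()

	return ctx
}

// cancelAll releases every outstanding context, e.g. after a sync round.
func (d *requestDecorator) cancelAll() {
	d.mu.Lock()
	defer d.mu.Unlock()
	for _, cancel := range d.cancelFuncs {
		cancel()
	}
	d.cancelFuncs = nil
}

func main() {
	d := &requestDecorator{timeout: 10 * time.Second}
	ctx := d.decorate(context.Background())
	_ = ctx // the context would be attached to the outgoing heartbeat request
	d.cancelAll()
}
```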
Force-pushed from ec7b310 to 616c669 (compare)
Force-pushed from 616c669 to 0fc1d99 (compare)
stackdriver flake
pkg/kubelet/kubelet_node_status.go (outdated)
// each request should not take longer than the update frequency
ctx, cancelFunc := context.WithTimeout(context.Background(), kl.nodeStatusUpdateFrequency)
// free the context after twice that length of time
time.AfterFunc(2*kl.nodeStatusUpdateFrequency, cancelFunc)
Cool idea
Latest revision looks really clearly done to me.
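A self-contained sketch of the idea in the diff above: each heartbeat attempt gets its own deadline, and `time.AfterFunc` releases the context's resources a little later so no one has to track cancel funcs explicitly. `newHeartbeatContext` is an illustrative name, not the kubelet's.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// newHeartbeatContext gives each status update its own deadline and arranges
// for the context's resources to be released shortly after that deadline,
// without anyone having to keep a list of cancel funcs.
func newHeartbeatContext(updateFrequency time.Duration) context.Context {
	// Each request should not take longer than the update frequency.
	ctx, cancel := context.WithTimeout(context.Background(), updateFrequency)
	// Free the context after twice that length of time; by then the request
	// has long finished or has already been abandoned by the deadline above.
	time.AfterFunc(2*updateFrequency, cancel)
	return ctx
}

func main() {
	ctx := newHeartbeatContext(100 * time.Millisecond)
	<-ctx.Done()
	fmt.Println("heartbeat attempt ended:", ctx.Err()) // context deadline exceeded
}
```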
/assign @derekwaynecarr
This is a critical bug fix; labeling appropriately for the 1.8 milestone. /approve
/test pull-kubernetes-e2e-gce-etcd3
@liggitt @alexef I believe the issue described in #41916 is actually an issue with the underlying TCP connection. The fix in this merge implements a timeout on the HTTP request, and does not close and set up a new TCP connection, right? This is also discussed in kubernetes-retired/kube-aws#598
This does not force closing the TCP connection, but does establish a new one (as demonstrated by the accompanying test, which hangs the first TCP connection and ensures a separate connection is established).
@liggitt @RyPeck Experiment shows that the connection is NOT reestablished on a K8s 1.8.3 cluster. A simple way to try it out on a kubelet is to start dropping all packets destined to the API server, then observe how long it takes for the "ESTABLISHED" connection to disappear and new connection attempts to appear. We've also had a 1.8.3 cluster go crazy for some 18 minutes when an ELB was scaled down and half of the ELB "endpoints" disappeared. Investigation points to the kubelet not bothering to re-establish connections to the live nodes. What we did see happening was 10-second-spaced retries/failures from the kubelet to report its status, which result in a timeout waiting for the HTTP response. This particular timeout, however, does not result in the connection being re-established; the next attempt simply reuses the same TCP pipe. If I read the attached test correctly, it ensures that each status update attempt is itself timed out, and that there are multiple attempts happening. This does not prove that the TCP connection is reestablished.
We also encountered the problem referenced in #48638 on 1.7.8. This PR did not resolve the issue.
The attempt counter is only incremented when a new connection is established. Does that not demonstrate that the timeout causes the hung connection to be abandoned and a new connection created?
Can you give more details or a reproducer, possibly as an additional unit test in the same vein as the one included in this PR?
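For reference, a rough sketch of how such a test can distinguish connection reuse from a fresh dial, in the spirit of the test described above: wrap the server's listener and count `Accept` calls, hang the first request, and check that a timed-out attempt is followed by a second accepted connection. All names here are illustrative, not the actual test in this PR.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// countingListener wraps a net.Listener and counts Accept calls, so we can
// tell whether the client reused an existing TCP connection or dialed a new one.
type countingListener struct {
	net.Listener
	accepted int64
}

func (l *countingListener) Accept() (net.Conn, error) {
	c, err := l.Listener.Accept()
	if err == nil {
		atomic.AddInt64(&l.accepted, 1)
	}
	return c, err
}

func main() {
	var calls int64
	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Hang the first request to simulate an unresponsive apiserver.
		if atomic.AddInt64(&calls, 1) == 1 {
			<-r.Context().Done() // blocks until the client gives up and disconnects
		}
	}))
	cl := &countingListener{Listener: srv.Listener}
	srv.Listener = cl
	srv.Start()
	defer srv.Close()

	client := &http.Client{Timeout: 200 * time.Millisecond}
	if _, err := client.Get(srv.URL); err != nil {
		fmt.Println("first attempt:", err) // expected: timeout on the hung handler
	}
	if resp, err := client.Get(srv.URL); err == nil {
		resp.Body.Close()
	}
	// Two accepts means the timed-out connection was abandoned, not reused.
	fmt.Println("connections accepted:", atomic.LoadInt64(&cl.accepted))
}
```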
Sure, let me start with details. Environment: AWS, Kubernetes 1.7.8, CoreOS 1576.4.0. HA masters behind an ELB.
Shortly thereafter, the controller manager reports:
This happens on each kubelet and there is now a production outage due to #45126. We restart kubelets on all worker nodes and now this happens:
The pods take several minutes to get rescheduled and restarted, resolving the outage. We have since switched to NLBs, mitigating this particular failure mode. However, we still occasionally see individual nodes become NotReady. The best reproduction steps I can suggest at this time are here: #41916 (comment). I'll look into reproducing the specific ELB condition in a cloud-provider-independent way as well.
I was able to catch one of these with tcpdump today. Here's the relevant section: The kubelet (10.188.48.76) stops communicating with the apiserver (10.188.21.18) at 12:03:24. It reports an error updating node status at 12:03:34 and continues until restarted at 12:06:39. In the interval between 12:03:24 and 12:06:39, there is no new TCP connection established and the only communication between the kubelet and the apiserver is a backoff of TCP retransmissions.
Hi all, is there any update on this issue?
Any mitigation recommendation? We are seeing this often in our cluster as well.
Primarily to mitigate a Kubernetes issue with kubelets losing their connection to the masters (~06:20 in the morning for us) and, after a period, evicting pods until that connection is reestablished. Some of the related issues and pull requests:
- kubernetes/kubernetes#41916
- kubernetes/kubernetes#48638
- kubernetes/kubernetes#52176
Using an NLB is a suggested mitigation from kubernetes/kubernetes#52176 (comment). NLBs do not support security groups, so we delete the ELB SG and open 443 to the world on the master security group.
Under 1.9.3 and 1.7.12, we are still seeing this failure cause an outage of a large fraction of production nodes roughly every week or two, across nine clusters in six different AWS regions. Here's a procedure to reproduce the problem under controlled conditions on Kubernetes 1.9.3. I'm not sure how to turn this into a unit test yet.

Intent: In production, we see load balancers in front of the apiserver become unresponsive and stop sending ACKs on established TCP connections. After ~15 minutes, the kubelet socket's TCP retransmission delay is exhausted, the kernel considers the connection failed, and the Go runtime establishes a new TCP connection. 15 minutes is much longer than the usual node heartbeat and eviction timeouts, so the control plane treats the node as gone long before the kernel gives up on the connection.

This procedure demonstrates that the kubelet cannot recover from a stalled TCP connection even though #52176 works as intended. The test simulates the effects of a remote TCP socket becoming unresponsive (not sending ACKs) by interposing a local TCP proxy between the kubelet and the apiserver, then suspending the proxy child process handling the connection. This doesn't precisely reproduce the production failure: in the test the remote socket continues sending ACKs, and as a consequence the TCP retransmission timeout is never hit and the local socket never fails. Even so, we can reliably get the kubelet to fail.

Procedure:
Summary: A stalled TCP connection to the apiserver still brings down an active node, with all the consequences outlined in kubernetes-retired/kube-aws#598, #41916, #48638, #44557, #48670, and Financial-Times/content-k8s-provisioner@d290fea. I'm not sure where else to put responsibility for maintaining this connection, so my first suggestion would be to have HeartbeatClient force its connection(s) to close on request timeout.
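One possible shape of that "close on request timeout" suggestion, sketched with plain net/http rather than the kubelet's actual client machinery; `heartbeatClient`, `updateStatus`, and the apiserver URL are hypothetical names used only for illustration.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// heartbeatClient wraps an http.Client and tears down its pooled connections
// whenever a status update fails or times out, so the next attempt dials
// fresh instead of queueing behind a stalled TCP connection.
type heartbeatClient struct {
	client    *http.Client
	transport *http.Transport
	timeout   time.Duration
}

func (h *heartbeatClient) updateStatus(url string) error {
	ctx, cancel := context.WithTimeout(context.Background(), h.timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := h.client.Do(req)
	if err != nil {
		// The request timed out or failed: drop pooled connections so the
		// retry does not reuse a connection the kernel still thinks is fine.
		h.transport.CloseIdleConnections()
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	tr := &http.Transport{}
	h := &heartbeatClient{
		client:    &http.Client{Transport: tr},
		transport: tr,
		timeout:   2 * time.Second,
	}
	// Placeholder URL; a real heartbeat would target the node status endpoint.
	fmt.Println(h.updateStatus("https://apiserver.example:6443/healthz"))
}
```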
@jfoy - Just so I understand: your whole cluster basically busts, right? We're seeing the same thing, where nodes go into NotReady and all of a sudden the entire cluster can't do anything.
@bobbytables Often not the entire cluster, but a big swath of it, yeah. It seems to be random based on which IP address each node's kubelet resolved the ELB name to, the last time it had to start a connection to the apiserver. The nodes drop out for ~15 minutes then suddenly recover, causing much of the load on the cluster to be rescheduled.
Cool, I'm literally sitting in a post-mortem room talking about this. I'm copying your note.
@jfoy @bobbytables We noticed that it tended to occur by availability zone.
This appears to describe the upstream bug: kubernetes/client-go#342
@jfoy We just finished troubleshooting exactly the same issue with AWS NLB in US-East1 with a number of different clusters (a mix of 1.7.8 and 1.9.3): #61917. It's a semi-random event that seems to have gotten worse in the last couple of weeks, to the extent that I'm seeing it multiple times per day now. It seems to happen more often at night. I suspect sheer over-utilization in the region is driving reduced network quality, increasing the chance of socket failure.
@stephbu Thanks, that's really useful to know (and disappointing!)
The unit test here assumes a good TCP connection, which might not be the case in the real world.
@zhouhaibing089 That's right. The problem seems to occur when the kernel thinks the socket is still good and is in a TCP retransmission-retry state. I suspect there's a way to recreate that situation in a unit test via raw sockets, but I haven't had time to develop that idea.
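As a starting point, here is a rough sketch of the kind of harness the reproduction above describes: a local TCP proxy that can be told to stop forwarding while leaving its sockets open, simulating a load balancer that goes silent. `stallableProxy` and the backend address are assumptions for illustration, not part of any existing Kubernetes test.

```go
package main

import (
	"fmt"
	"net"
	"sync/atomic"
)

// stallableProxy forwards TCP traffic to a backend until Stall is called,
// after which it silently stops copying bytes in either direction while
// keeping both sockets open, mimicking an unresponsive load balancer.
type stallableProxy struct {
	backend string
	stalled int64
}

func (p *stallableProxy) Stall() { atomic.StoreInt64(&p.stalled, 1) }

func (p *stallableProxy) Serve(l net.Listener) error {
	for {
		c, err := l.Accept()
		if err != nil {
			return err
		}
		go p.handle(c)
	}
}

func (p *stallableProxy) handle(client net.Conn) {
	server, err := net.Dial("tcp", p.backend)
	if err != nil {
		client.Close()
		return
	}
	pipe := func(dst, src net.Conn) {
		buf := make([]byte, 32*1024)
		for {
			n, err := src.Read(buf)
			if err != nil {
				return
			}
			if atomic.LoadInt64(&p.stalled) == 1 {
				return // stop forwarding, but leave both sockets open
			}
			if _, err := dst.Write(buf[:n]); err != nil {
				return
			}
		}
	}
	go pipe(server, client)
	go pipe(client, server)
}

func main() {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	p := &stallableProxy{backend: "127.0.0.1:6443"} // assumed apiserver address
	go p.Serve(l)
	fmt.Println("proxying", l.Addr(), "->", p.backend)
	// A test would point the kubelet (or a client) at l.Addr(), then call
	// p.Stall() to simulate the load balancer going unresponsive.
	select {}
}
```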
I'm afraid I haven't had any time to look into this further and write a concrete test case, but in case it's helpful: as I mentioned in #48670 (comment), we have been running that patch in production since August and have not seen this issue recur.
We observed the same issue as reported in #48670, tested the PR for our case, and it all works. Is there any possibility of reopening that PR, as it appears others could benefit from it?
We've moved over to NLBs entirely and we're still hitting the issue on a node-by-node basis. I'm going to see what we can do for additional instrumentation to find out who's dropping the ball between the apiserver, the NLB, and the kubelet. EDIT: moving the conversation over to #48638
Fixes #48638
Fixes #50304
Stops the kubelet from wedging when updating node status if it is unable to establish a TCP connection.
Note that this only affects the node status loop. The pod sync loop would still hang until the dead TCP connections time out, so more work is needed to keep the sync loop responsive in the face of network issues, but this change lets existing pods coast without the node controller trying to evict them.
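A hedged sketch of the loop-level effect described here: each status update attempt gets its own deadline, so a wedged attempt fails fast and the loop retries, whereas a loop without such a guard (like the pod sync loop) would simply block. `updateStatusWithRetries` is a hypothetical stand-in, not the kubelet's actual function.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// updateStatusWithRetries gives every attempt its own deadline, so one wedged
// connection can only cost a single update period before the next retry.
func updateStatusWithRetries(updateFrequency time.Duration, tries int,
	update func(ctx context.Context) error) error {

	var lastErr error
	for i := 0; i < tries; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), updateFrequency)
		lastErr = update(ctx)
		cancel()
		if lastErr == nil {
			return nil
		}
	}
	return fmt.Errorf("unable to update node status after %d tries: %w", tries, lastErr)
}

func main() {
	err := updateStatusWithRetries(100*time.Millisecond, 5, func(ctx context.Context) error {
		<-ctx.Done() // simulate a status update that never completes
		return errors.New("request timed out")
	})
	fmt.Println(err)
}
```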