Description
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
One of the cluster worker nodes was deleted from OpenStack.
Pods running on this node were rescheduled onto different nodes but got stuck in ContainerCreating. They remained stuck for 20+ minutes; the controller manager cannot reconcile the volume attachment without manual intervention, such as restarting it. See the end of this report for the actions that resolve it.
What you expected to happen:
The pod should be started correctly on a different node, with its volumes attached.
How to reproduce it (as minimally and precisely as possible):
- Use k8s 1.7.0 with the OpenStack cloud provider
- Deploy https://gist.github.com/kars7e/2e4597e854fb952b098b5470120bebdf to the cluster
- Delete the node from the cluster by deleting the backing instance from OpenStack (sketched below)
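A minimal sketch of these steps, assuming the manifests from the gist are saved locally and that the openstack CLI has access to the cluster's project (the manifest file name and the instance name are assumptions):

# Deploy the PVC + Deployment saved from the gist linked above
kubectl apply -f webapp-staging.yaml

# Once the pod is Running, delete that node's backing instance directly in OpenStack
# (instance name taken from this report; adjust for your environment)
openstack server delete k8s-node-2-31217f04-941c-48f2-b36e-8a97a3bf7515

# Watch the rescheduled pod stay stuck in ContainerCreating
kubectl get pods -w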
Anything else we need to know?:
The underlying issue is that Cinder volumes get detached automatically when the instance is deleted, but k8s does not register this fact and keeps throwing:
Multi-Attach error for volume "pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c" Volume is already exclusively attached to one node and can't be attached to another
It seems that the attach/detach controller attempts the detach, but cannot handle the fact that the volume is already in the "available" state (k8s-node-2 is the node that was deleted):
{"log":"E0805 19:36:33.196285 6 nestedpendingoperations.go:262] Operation for \"\\\"kubernetes.io/cinder/9dd4110b-e9f2-4eba-a2b5-22b6082b2c1b\\\"\" failed. No retries permitted until 2017-08-05 19:37:37.196173574 +0000 UTC (durationBeforeRetry 1m4s). Error: DetachVolume.Detach failed for volume \"pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c\" (UniqueName: \"kubernetes.io/cinder/9dd4110b-e9f2-4eba-a2b5-22b6082b2c1b\") on node \"k8s-node-2-31217f04-941c-48f2-b36e-8a97a3bf7515\" : can not detach volume kubernetes-dynamic-pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c, its status is available.\n","stream":"stderr","time":"2017-08-05T19:36:33.196396348Z"}
Inspecting the pod shows the following events (note: this pod eventually started because the controller manager was restarted):
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
11m 11m 1 default-scheduler Normal Scheduled Successfully assigned webapp-staging-3564021476-tq2dc to k8s-node-1-31217f04-941c-48f2-b36e-8a97a3bf7515
11m 11m 1 kubelet, k8s-node-1-31217f04-941c-48f2-b36e-8a97a3bf7515 Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "default-token-bw55f"
11m 2m 5144 attachdetach Warning FailedAttachVolume Multi-Attach error for volume "pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c" Volume is already exclusively attached to one node and can't be attached to another
9m 2m 4 kubelet, k8s-node-1-31217f04-941c-48f2-b36e-8a97a3bf7515 Warning FailedMount Unable to mount volumes for pod "webapp-staging-3564021476-tq2dc_default(63941f31-7a14-11e7-bd74-fa163eae2160)": timeout expired waiting for volumes to attach/mount for pod "default"/"webapp-staging-3564021476-tq2dc". list of unattached/unmounted volumes=[mypvc]
9m 2m 4 kubelet, k8s-node-1-31217f04-941c-48f2-b36e-8a97a3bf7515 Warning FailedSync Error syncing pod
1m 1m 1 kubelet, k8s-node-1-31217f04-941c-48f2-b36e-8a97a3bf7515 Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c"
1m 1m 1 kubelet, k8s-node-1-31217f04-941c-48f2-b36e-8a97a3bf7515 spec.containers{nginx} Normal Pulled Container image "nginx:1.7.9" already present on machine
1m 1m 1 kubelet, k8s-node-1-31217f04-941c-48f2-b36e-8a97a3bf7515 spec.containers{nginx} Normal Created Created container
1m 1m 1 kubelet, k8s-node-1-31217f04-941c-48f2-b36e-8a97a3bf7515 spec.containers{nginx} Normal Started
Also, an excerpt from the controller manager log (with --v=4):
{"log":"I0805 19:36:33.156440 6 deployment_controller.go:562] Started syncing deployment \"default/webapp-staging\" (2017-08-05 19:36:33.156430897 +0000 UTC)\n","stream":"stderr","time":"2017-08-05T19:36:33.156586574Z"}
{"log":"I0805 19:36:33.157358 6 progress.go:231] Queueing up deployment \"webapp-staging\" for a progress check now\n","stream":"stderr","time":"2017-08-05T19:36:33.157472884Z"}
{"log":"I0805 19:36:33.157409 6 deployment_controller.go:564] Finished syncing deployment \"default/webapp-staging\" (954.263µs)\n","stream":"stderr","time":"2017-08-05T19:36:33.157484805Z"}
{"log":"E0805 19:36:33.196088 6 openstack_volumes.go:263] can not detach volume kubernetes-dynamic-pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c, its status is available.\n","stream":"stderr","time":"2017-08-05T19:36:33.196206222Z"}
{"log":"E0805 19:36:33.196104 6 attacher.go:394] Error detaching volume \"9dd4110b-e9f2-4eba-a2b5-22b6082b2c1b\" from node \"k8s-node-2-31217f04-941c-48f2-b36e-8a97a3bf7515\": can not detach volume kubernetes-dynamic-pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c, its status is available.\n","stream":"stderr","time":"2017-08-05T19:36:33.196226294Z"}
{"log":"I0805 19:36:33.196143 6 actual_state_of_world.go:478] Add new node \"k8s-node-2-31217f04-941c-48f2-b36e-8a97a3bf7515\" to nodesToUpdateStatusFor\n","stream":"stderr","time":"2017-08-05T19:36:33.196230931Z"}
{"log":"I0805 19:36:33.196157 6 actual_state_of_world.go:486] Report volume \"kubernetes.io/cinder/9dd4110b-e9f2-4eba-a2b5-22b6082b2c1b\" as attached to node \"k8s-node-2-31217f04-941c-48f2-b36e-8a97a3bf7515\"\n","stream":"stderr","time":"2017-08-05T19:36:33.196234322Z"}
{"log":"E0805 19:36:33.196285 6 nestedpendingoperations.go:262] Operation for \"\\\"kubernetes.io/cinder/9dd4110b-e9f2-4eba-a2b5-22b6082b2c1b\\\"\" failed. No retries permitted until 2017-08-05 19:37:37.196173574 +0000 UTC (durationBeforeRetry 1m4s). Error: DetachVolume.Detach failed for volume \"pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c\" (UniqueName: \"kubernetes.io/cinder/9dd4110b-e9f2-4eba-a2b5-22b6082b2c1b\") on node \"k8s-node-2-31217f04-941c-48f2-b36e-8a97a3bf7515\" : can not detach volume kubernetes-dynamic-pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c, its status is available.\n","stream":"stderr","time":"2017-08-05T19:36:33.196396348Z"}
{"log":"I0805 19:36:33.212211 6 node_status_updater.go:76] Could not update node status. Failed to find node \"k8s-node-2-31217f04-941c-48f2-b36e-8a97a3bf7515\" in NodeInformer cache. Error: 'node \"k8s-node-2-31217f04-941c-48f2-b36e-8a97a3bf7515\" not found'\n","stream":"stderr","time":"2017-08-05T19:36:33.212356492Z"}
{"log":"W0805 19:36:33.212444 6 reconciler.go:262] Multi-Attach error for volume \"pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c\" (UniqueName: \"kubernetes.io/cinder/9dd4110b-e9f2-4eba-a2b5-22b6082b2c1b\") from node \"k8s-node-1-31217f04-941c-48f2-b36e-8a97a3bf7515\" Volume is already exclusively attached to one node and can't be attached to another\n","stream":"stderr","time":"2017-08-05T19:36:33.2
Note: the following operations will resolve this issue (sketched below):
- Restarting the active controller manager
- Attaching the volume in question to a random instance
- Deleting and recreating the pod
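A sketch of those workarounds; how the controller manager gets restarted depends on how it is deployed (a kubelet-managed container is assumed below), and the target instance name and pod name are either taken from this report or assumed:

# 1) Restart the active controller manager (mechanism depends on the deployment;
#    for a kubelet-managed static pod, restarting its container is enough)
docker restart $(docker ps -q --filter name=kube-controller-manager)

# 2) Attach the stuck volume to any instance so that a subsequent detach can succeed
#    (instance name is an assumption)
openstack server add volume some-other-instance 9dd4110b-e9f2-4eba-a2b5-22b6082b2c1b

# 3) Delete the pod and let the Deployment recreate it
kubectl delete pod webapp-staging-3564021476-tq2dc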
Environment:
- Kubernetes version (use kubectl version):
  Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d70ffb940d12c648df2b6647580c150b8f113704", GitTreeState:"clean", BuildDate:"2017-08-04T07:14:35Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: OpenStack Mitaka
- OS (e.g. from /etc/os-release): Ubuntu 16.04.2 LTS (Xenial Xerus)
- Kernel (e.g. uname -a): Linux k8s-master-1-31217f04-941c-48f2-b36e-8a97a3bf7515 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: kargo (kubespray)
- Others: