Description
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
One of the cluster worker nodes was deleted from OpenStack.
Pods running on this node were rescheduled onto different nodes but got stuck in ContainerCreating. They remained stuck for 20+ minutes; the controller manager cannot reconcile the volume attachment without manual intervention, such as restarting it. See the end of this report for the actions that resolve it.
What you expected to happen:
The pod should be started correctly on a different node, with its volumes attached.
How to reproduce it (as minimally and precisely as possible):
- Use k8s 1.7.0 with the OpenStack cloud provider
- Deploy https://gist.github.com/kars7e/2e4597e854fb952b098b5470120bebdf to the cluster
- Delete the node from the cluster by deleting the backing instance from OpenStack (sketched below)
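A minimal sketch of these steps, assuming the manifests from the gist are saved locally and that the openstack CLI has access to the cluster's project (the manifest file name and the instance name are assumptions):

# Deploy the PVC + Deployment saved from the gist linked above
kubectl apply -f webapp-staging.yaml

# Once the pod is Running, delete that node's backing instance directly in OpenStack
# (instance name taken from this report; adjust for your environment)
openstack server delete k8s-node-2-31217f04-941c-48f2-b36e-8a97a3bf7515

# Watch the rescheduled pod stay stuck in ContainerCreating
kubectl get pods -w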
Anything else we need to know?:
The underlying issue is that Cinder volumes get detached automatically when the instance is deleted, but k8s does not register this fact and keeps throwing:
Multi-Attach error for volume "pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c" Volume is already exclusively attached to one node and can't be attached to another
It seems that the attach/detach controller attempts the detach, but cannot handle the fact that the volume is already in the "available" state (k8s-node-2 is the node that was deleted):
{"log":"E0805 19:36:33.196285 6 nestedpendingoperations.go:262] Operation for \"\\\"kubernetes.io/cinder/9dd4110b-e9f2-4eba-a2b5-22b6082b2c1b\\\"\" failed. No retries permitted until 2017-08-05 19:37:37.196173574 +0000 UTC (durationBeforeRetry 1m4s). Error: DetachVolume.Detach failed for volume \"pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c\" (UniqueName: \"kubernetes.io/cinder/9dd4110b-e9f2-4eba-a2b5-22b6082b2c1b\") on node \"k8s-node-2-31217f04-941c-48f2-b36e-8a97a3bf7515\" : can not detach volume kubernetes-dynamic-pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c, its status is available.\n","stream":"stderr","time":"2017-08-05T19:36:33.196396348Z"}
Inspecting the pod shows the following events (note: this pod eventually started because the controller manager was restarted):
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
11m 11m 1 default-scheduler Normal Scheduled Successfully assigned webapp-staging-3564021476-tq2dc to k8s-node-1-31217f04-941c-48f2-b36e-8a97a3bf7515
11m 11m 1 kubelet, k8s-node-1-31217f04-941c-48f2-b36e-8a97a3bf7515 Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "default-token-bw55f"
11m 2m 5144 attachdetach Warning FailedAttachVolume Multi-Attach error for volume "pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c" Volume is already exclusively attached to one node and can't be attached to another
9m 2m 4 kubelet, k8s-node-1-31217f04-941c-48f2-b36e-8a97a3bf7515 Warning FailedMount Unable to mount volumes for pod "webapp-staging-3564021476-tq2dc_default(63941f31-7a14-11e7-bd74-fa163eae2160)": timeout expired waiting for volumes to attach/mount for pod "default"/"webapp-staging-3564021476-tq2dc". list of unattached/unmounted volumes=[mypvc]
9m 2m 4 kubelet, k8s-node-1-31217f04-941c-48f2-b36e-8a97a3bf7515 Warning FailedSync Error syncing pod
1m 1m 1 kubelet, k8s-node-1-31217f04-941c-48f2-b36e-8a97a3bf7515 Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c"
1m 1m 1 kubelet, k8s-node-1-31217f04-941c-48f2-b36e-8a97a3bf7515 spec.containers{nginx} Normal Pulled Container image "nginx:1.7.9" already present on machine
1m 1m 1 kubelet, k8s-node-1-31217f04-941c-48f2-b36e-8a97a3bf7515 spec.containers{nginx} Normal Created Created container
1m 1m 1 kubelet, k8s-node-1-31217f04-941c-48f2-b36e-8a97a3bf7515 spec.containers{nginx} Normal Started
Also, an excerpt from the controller manager log (with --v=4):
{"log":"I0805 19:36:33.156440 6 deployment_controller.go:562] Started syncing deployment \"default/webapp-staging\" (2017-08-05 19:36:33.156430897 +0000 UTC)\n","stream":"stderr","time":"2017-08-05T19:36:33.156586574Z"}
{"log":"I0805 19:36:33.157358 6 progress.go:231] Queueing up deployment \"webapp-staging\" for a progress check now\n","stream":"stderr","time":"2017-08-05T19:36:33.157472884Z"}
{"log":"I0805 19:36:33.157409 6 deployment_controller.go:564] Finished syncing deployment \"default/webapp-staging\" (954.263µs)\n","stream":"stderr","time":"2017-08-05T19:36:33.157484805Z"}
{"log":"E0805 19:36:33.196088 6 openstack_volumes.go:263] can not detach volume kubernetes-dynamic-pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c, its status is available.\n","stream":"stderr","time":"2017-08-05T19:36:33.196206222Z"}
{"log":"E0805 19:36:33.196104 6 attacher.go:394] Error detaching volume \"9dd4110b-e9f2-4eba-a2b5-22b6082b2c1b\" from node \"k8s-node-2-31217f04-941c-48f2-b36e-8a97a3bf7515\": can not detach volume kubernetes-dynamic-pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c, its status is available.\n","stream":"stderr","time":"2017-08-05T19:36:33.196226294Z"}
{"log":"I0805 19:36:33.196143 6 actual_state_of_world.go:478] Add new node \"k8s-node-2-31217f04-941c-48f2-b36e-8a97a3bf7515\" to nodesToUpdateStatusFor\n","stream":"stderr","time":"2017-08-05T19:36:33.196230931Z"}
{"log":"I0805 19:36:33.196157 6 actual_state_of_world.go:486] Report volume \"kubernetes.io/cinder/9dd4110b-e9f2-4eba-a2b5-22b6082b2c1b\" as attached to node \"k8s-node-2-31217f04-941c-48f2-b36e-8a97a3bf7515\"\n","stream":"stderr","time":"2017-08-05T19:36:33.196234322Z"}
{"log":"E0805 19:36:33.196285 6 nestedpendingoperations.go:262] Operation for \"\\\"kubernetes.io/cinder/9dd4110b-e9f2-4eba-a2b5-22b6082b2c1b\\\"\" failed. No retries permitted until 2017-08-05 19:37:37.196173574 +0000 UTC (durationBeforeRetry 1m4s). Error: DetachVolume.Detach failed for volume \"pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c\" (UniqueName: \"kubernetes.io/cinder/9dd4110b-e9f2-4eba-a2b5-22b6082b2c1b\") on node \"k8s-node-2-31217f04-941c-48f2-b36e-8a97a3bf7515\" : can not detach volume kubernetes-dynamic-pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c, its status is available.\n","stream":"stderr","time":"2017-08-05T19:36:33.196396348Z"}
{"log":"I0805 19:36:33.212211 6 node_status_updater.go:76] Could not update node status. Failed to find node \"k8s-node-2-31217f04-941c-48f2-b36e-8a97a3bf7515\" in NodeInformer cache. Error: 'node \"k8s-node-2-31217f04-941c-48f2-b36e-8a97a3bf7515\" not found'\n","stream":"stderr","time":"2017-08-05T19:36:33.212356492Z"}
{"log":"W0805 19:36:33.212444 6 reconciler.go:262] Multi-Attach error for volume \"pvc-7da59477-7a13-11e7-a1c3-fa163ec6b87c\" (UniqueName: \"kubernetes.io/cinder/9dd4110b-e9f2-4eba-a2b5-22b6082b2c1b\") from node \"k8s-node-1-31217f04-941c-48f2-b36e-8a97a3bf7515\" Volume is already exclusively attached to one node and can't be attached to another\n","stream":"stderr","time":"2017-08-05T19:36:33.2
Note: the following operations will resolve this issue (sketched below):
- Restarting the active controller manager
- Attaching the volume in question to a random instance
- Deleting and recreating the pod
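A sketch of those workarounds; how the controller manager gets restarted depends on how it is deployed (a kubelet-managed container is assumed below), and the target instance name and pod name are either taken from this report or assumed:

# 1) Restart the active controller manager (mechanism depends on the deployment;
#    for a kubelet-managed static pod, restarting its container is enough)
docker restart $(docker ps -q --filter name=kube-controller-manager)

# 2) Attach the stuck volume to any instance so that a subsequent detach can succeed
#    (instance name is an assumption)
openstack server add volume some-other-instance 9dd4110b-e9f2-4eba-a2b5-22b6082b2c1b

# 3) Delete the pod and let the Deployment recreate it
kubectl delete pod webapp-staging-3564021476-tq2dc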
Environment:
- Kubernetes version (use kubectl version):
  Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d70ffb940d12c648df2b6647580c150b8f113704", GitTreeState:"clean", BuildDate:"2017-08-04T07:14:35Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: OpenStack Mitaka
- OS (e.g. from /etc/os-release): Ubuntu 16.04.2 LTS (Xenial Xerus)
- Kernel (e.g. uname -a): Linux k8s-master-1-31217f04-941c-48f2-b36e-8a97a3bf7515 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: kargo (kubespray)
- Others: