
Pods with volumes stuck in ContainerCreating after cluster node is powered off in vCenter #248

Closed
BaluDontu opened this issue Aug 20, 2017 · 3 comments

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
One of the cluster worker nodes was powered off in vCenter.
Pods running on this node were rescheduled onto different nodes but got stuck in ContainerCreating. Attaching the volume on the new node failed with the error "Multi-Attach error for volume pvc-xxx, Volume is already exclusively attached to one node and can't be attached to another", so the application running in the pod had no data available, because its volume was not attached to the new node. The volume was still attached to the powered-off node, and the attach/detach controller made no attempt to detach it, so every attach on the new node failed with the Multi-Attach error. The pod stayed stuck for 6 minutes, until the attach/detach controller forcefully detached the volume from the powered-off node. Once the volume was detached from the powered-off node, it was successfully attached on the new node and the application had its data available again.

What you expected to happen:
The pod should start correctly on a different node, with its volumes attached quickly. For an application to be down for 6 minutes is very bad.

How to reproduce it (as minimally and precisely as possible):

  • Use Kubernetes 1.7.0+ with the vSphere cloud provider.
  • Delete the node from the cluster by powering it off from vCenter (a scripted version of this step is sketched below).
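
For convenience, the power-off step can be scripted against vCenter. The following is a minimal sketch using govmomi; the vCenter URL, the credentials, and the VM name "node2" are placeholders for your environment, not values taken from this report:

```go
package main

import (
	"context"
	"log"
	"net/url"

	"github.com/vmware/govmomi"
	"github.com/vmware/govmomi/find"
)

func main() {
	ctx := context.Background()

	// Placeholder vCenter endpoint and credentials.
	u, err := url.Parse("https://user:password@vcenter.example.com/sdk")
	if err != nil {
		log.Fatal(err)
	}
	client, err := govmomi.NewClient(ctx, u, true /* insecure */)
	if err != nil {
		log.Fatal(err)
	}

	finder := find.NewFinder(client.Client, true)
	dc, err := finder.DefaultDatacenter(ctx)
	if err != nil {
		log.Fatal(err)
	}
	finder.SetDatacenter(dc)

	// The worker node VM to power off (placeholder name).
	vm, err := finder.VirtualMachine(ctx, "node2")
	if err != nil {
		log.Fatal(err)
	}

	// Hard power-off, equivalent to doing it from the vCenter UI.
	task, err := vm.PowerOff(ctx)
	if err != nil {
		log.Fatal(err)
	}
	if err := task.Wait(ctx); err != nil {
		log.Fatal(err)
	}
	log.Println("node2 powered off; pods with volumes should now get stuck in ContainerCreating")
}
```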

Anything else we need to know?:

Below are the logs from the controller manager.

{"log":"I0818 23:27:01.933677       1 controller_utils.go:285] Recording status change NodeNotReady event message for node node2\n","stream":"stderr","time":"2017-08-18T23:27:01.933908764Z"}
{"log":"I0818 23:27:01.934012       1 controller_utils.go:203] Update ready status of pods on node [node2]\n","stream":"stderr","time":"2017-08-18T23:27:01.934110529Z"}
{"log":"I0818 23:27:01.934384       1 event.go:218] Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"\", Name:\"node2\", UID:\"14b8f7fb-846c-11e7-addd-0050569ccb52\", APIVersion:\"\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'NodeNotReady' Node node2 status is now: NodeNotReady\n","stream":"stderr","time":"2017-08-18T23:27:01.93448004Z"}
{"log":"I0818 23:27:01.939006       1 controller_utils.go:220] Updating ready status of pod redis-master-2401852951-w92hb to false\n","stream":"stderr","time":"2017-08-18T23:27:01.939064356Z"}
{"log":"I0818 23:27:01.944968       1 controller_utils.go:220] Updating ready status of pod heapster-v1.2.0-1794790591-94z4s to false\n","stream":"stderr","time":"2017-08-18T23:27:01.945054955Z"}
{"log":"I0818 23:27:01.956548       1 controller_utils.go:220] Updating ready status of pod kube-proxy-1lq8f to false\n","stream":"stderr","time":"2017-08-18T23:27:01.956639977Z"}
{"log":"W0818 23:27:02.043661       1 vsphere.go:625] VM node2, is not in poweredOn state\n","stream":"stderr","time":"2017-08-18T23:27:02.043753193Z"}
{"log":"I0818 23:27:02.043759       1 controller_utils.go:274] Recording Deleting Node node2 because it's not present according to cloud provider event message for node node2\n","stream":"stderr","time":"2017-08-18T23:27:02.043785273Z"}
{"log":"I0818 23:27:02.044235       1 event.go:218] Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"\", Name:\"node2\", UID:\"14b8f7fb-846c-11e7-addd-0050569ccb52\", APIVersion:\"\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'DeletingNode' Node node2 event: Deleting Node node2 because it's not present according to cloud provider\n","stream":"stderr","time":"2017-08-18T23:27:02.044320987Z"}
{"log":"I0818 23:27:02.050426       1 attach_detach_controller.go:465] error removing node \"node2\" from desired-state-of-world: failed to delete node \"node2\" from list of nodes managed by attach/detach controller--the node still contains 1 volumes in its list of volumes to attach\n","stream":"stderr","time":"2017-08-18T23:27:02.050519226Z"}
{"log":"I0818 23:27:06.964846       1 gc_controller.go:162] Found orphaned Pod redis-master-2401852951-w92hb assigned to the Node node2. Deleting.\n","stream":"stderr","time":"2017-08-18T23:27:06.96761137Z"}
{"log":"I0818 23:27:06.964850       1 gc_controller.go:62] PodGC is force deleting Pod: default:redis-master-2401852951-w92hb\n","stream":"stderr","time":"2017-08-18T23:27:06.967620391Z"}
{"log":"I0818 23:27:06.987612       1 replica_set.go:455] Too few \"default\"/\"redis-master-2401852951\" replicas, need 1, creating 1\n","stream":"stderr","time":"2017-08-18T23:27:06.987687983Z"}
{"log":"I0818 23:27:07.005200       1 gc_controller.go:166] Forced deletion of orphaned Pod redis-master-2401852951-w92hb succeeded\n","stream":"stderr","time":"2017-08-18T23:27:07.005265456Z"}
{"log":"I0818 23:27:07.027452       1 event.go:218] Event(v1.ObjectReference{Kind:\"ReplicaSet\", Namespace:\"default\", Name:\"redis-master-2401852951\", UID:\"3f1729f7-846c-11e7-addd-0050569ccb52\", APIVersion:\"extensions\", ResourceVersion:\"909\", FieldPath:\"\"}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: redis-master-2401852951-wcx6h\n","stream":"stderr","time":"2017-08-18T23:27:07.027541497Z"}
{"log":"I0818 23:27:07.044256       1 nodecontroller.go:553] NodeController observed a Node deletion: node2\n","stream":"stderr","time":"2017-08-18T23:27:07.044322304Z"}
{"log":"I0818 23:27:07.044362       1 controller_utils.go:274] Recording Removing Node node2 from NodeController event message for node node2\n","stream":"stderr","time":"2017-08-18T23:27:07.048265998Z"}
{"log":"I0818 23:27:07.044485       1 event.go:218] Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"\", Name:\"node2\", UID:\"14b8f7fb-846c-11e7-addd-0050569ccb52\", APIVersion:\"\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'RemovingNode' Node node2 event: Removing Node node2 from NodeController\n","stream":"stderr","time":"2017-08-18T23:27:07.048273963Z"}
{"log":"W0818 23:27:07.054184       1 reconciler.go:262] Multi-Attach error for volume \"pvc-2f53fe7e-846c-11e7-addd-0050569ccb52\" (UniqueName: \"kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-2f53fe7e-846c-11e7-addd-0050569ccb52.vmdk\") from node \"node3\" Volume is already exclusively attached to one node and can't be attached to another\n","stream":"stderr","time":"2017-08-18T23:27:07.05423019Z"}
{"log":"I0818 23:27:07.054279       1 event.go:218] Event(v1.ObjectReference{Kind:\"Pod\", Namespace:\"default\", Name:\"redis-master-2401852951-wcx6h\", UID:\"bff886d1-846c-11e7-addd-0050569ccb52\", APIVersion:\"v1\", ResourceVersion:\"945\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedAttachVolume' Multi-Attach error for volume \"pvc-2f53fe7e-846c-11e7-addd-0050569ccb52\" Volume is already exclusively attached to one node and can't be attached to another\n","stream":"stderr","time":"2017-08-18T23:27:07.054350162Z"}
{"log":"W0818 23:27:07.054184       1 reconciler.go:262] Multi-Attach error for volume \"pvc-2f53fe7e-846c-11e7-addd-0050569ccb52\" (UniqueName: \"kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-2f53fe7e-846c-11e7-addd-0050569ccb52.vmdk\") from node \"node3\" Volume is already exclusively attached to one node and can't be attached to another\n","stream":"stderr","time":"2017-08-18T23:27:07.05423019Z"}
{"log":"I0818 23:27:07.054279       1 event.go:218] Event(v1.ObjectReference{Kind:\"Pod\", Namespace:\"default\", Name:\"redis-master-2401852951-wcx6h\", UID:\"bff886d1-846c-11e7-addd-0050569ccb52\", APIVersion:\"v1\", ResourceVersion:\"945\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedAttachVolume' Multi-Attach error for volume \"pvc-2f53fe7e-846c-11e7-addd-0050569ccb52\" Volume is already exclusively attached to one node and can't be attached to another\n","stream":"stderr","time":"2017-08-18T23:27:07.054350162Z"}
{"log":"W0818 23:27:07.154442       1 reconciler.go:262] Multi-Attach error for volume \"pvc-2f53fe7e-846c-11e7-addd-0050569ccb52\" (UniqueName: \"kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-2f53fe7e-846c-11e7-addd-0050569ccb52.vmdk\") from node \"node3\" Volume is already exclusively attached to one node and can't be attached to another\n","stream":"stderr","time":"2017-08-18T23:27:07.154554127Z"}
{"log":"I0818 23:27:07.155116       1 event.go:218] Event(v1.ObjectReference{Kind:\"Pod\", Namespace:\"default\", Name:\"redis-master-2401852951-wcx6h\", UID:\"bff886d1-846c-11e7-addd-0050569ccb52\", APIVersion:\"v1\", ResourceVersion:\"945\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedAttachVolume' Multi-Attach error for volume \"pvc-2f53fe7e-846c-11e7-addd-0050569ccb52\" Volume is already exclusively attached to one node and can't be attached to another\n","stream":"stderr","time":"2017-08-18T23:27:07.15522094Z"}
{"log":"W0818 23:27:07.254711       1 reconciler.go:262] Multi-Attach error for volume \"pvc-2f53fe7e-846c-11e7-addd-0050569ccb52\" (UniqueName: \"kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-2f53fe7e-846c-11e7-addd-0050569ccb52.vmdk\") from node \"node3\" Volume is already exclusively attached to one node and can't be attached to another\n","stream":"stderr","time":"2017-08-18T23:27:07.254837081Z"}

**At the end of 6 minutes, the attach/detach controller attempts a force detach, after which attaching the volume on the new node succeeds.**

**{"log":"W0818 23:33:07.140938**       1 reconciler.go:228] attacherDetacher.DetachVolume started for volume \"pvc-2f53fe7e-846c-11e7-addd-0050569ccb52\" (UniqueName: \"kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-2f53fe7e-846c-11e7-addd-0050569ccb52.vmdk\") on node \"node2\" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching\n","stream":"stderr","time":"2017-08-18T23:33:07.140998742Z"}
{"log":"I0818 23:33:07.141103       1 attacher.go:231]   Detach in attacher.go with deviceMountPath: \"[vsanDatastore] kubevols/kubernetes-dynamic-pvc-2f53fe7e-846c-11e7-addd-0050569ccb52.vmdk\"\n","stream":"stderr","time":"2017-08-18T23:33:07.141180226Z"}
{"log":"I0818 23:33:07.141201       1 attacher.go:233]   Calling DiskIsAttached in attacher.go with volPath: \"[vsanDatastore] kubevols/kubernetes-dynamic-pvc-2f53fe7e-846c-11e7-addd-0050569ccb52.vmdk\"\n","stream":"stderr","time":"2017-08-18T23:33:07.141248186Z"}
{"log":"I0818 23:33:07.141303       1 vsphere.go:1003]   DiskIsAttached with volPath: \"[vsanDatastore] kubevols/kubernetes-dynamic-pvc-2f53fe7e-846c-11e7-addd-0050569ccb52.vmdk\" and nodeName: \"node2\"\n","stream":"stderr","time":"2017-08-18T23:33:07.141325846Z"}
{"log":"I0818 23:33:07.413977       1 vsphere.go:1041]   DiskIsAttached result with volPath: \"[vsanDatastore] kubevols/kubernetes-dynamic-pvc-2f53fe7e-846c-11e7-addd-0050569ccb52.vmdk\" and nodeName: \"node2\" is isAttached: true and err: \u003cnil\u003e\n","stream":"stderr","time":"2017-08-18T23:33:07.414101776Z"}
{"log":"I0818 23:33:07.414265       1 attacher.go:250]   Calling DetachDisk in attacher.go with volPath: \"[vsanDatastore] kubevols/kubernetes-dynamic-pvc-2f53fe7e-846c-11e7-addd-0050569ccb52.vmdk\"\n","stream":"stderr","time":"2017-08-18T23:33:07.414329139Z"}
{"log":"I0818 23:33:07.414359       1 vsphere.go:1225]   DetachDisk with volPath: \"[vsanDatastore] kubevols/kubernetes-dynamic-pvc-2f53fe7e-846c-11e7-addd-0050569ccb52.vmdk\" and nodeName: \"node2\"\n","stream":"stderr","time":"2017-08-18T23:33:07.41448605Z"}
{"log":"I0818 23:33:08.855287       1 vsphere.go:1282]   DetachDisk result with volPath: \"[vsanDatastore] kubevols/kubernetes-dynamic-pvc-2f53fe7e-846c-11e7-addd-0050569ccb52.vmdk\" and nodeName: \"node2\" is err: \u003cnil\u003e\n","stream":"stderr","time":"2017-08-18T23:33:08.855487181Z"}
{"log":"I0818 23:33:08.855387       1 operation_generator.go:348] DetachVolume.Detach succeeded for volume \"pvc-2f53fe7e-846c-11e7-addd-0050569ccb52\" (UniqueName: \"kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-2f53fe7e-846c-11e7-addd-0050569ccb52.vmdk\") on node \"node2\" \n","stream":"stderr","time":"2017-08-18T23:33:08.855531178Z"}
{"log":"I0818 23:33:08.945540       1 reconciler.go:272] attacherDetacher.AttachVolume started for volume \"pvc-2f53fe7e-846c-11e7-addd-0050569ccb52\" (UniqueName: \"kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-2f53fe7e-846c-11e7-addd-0050569ccb52.vmdk\") from node \"node3\" \n","stream":"stderr","time":"2017-08-18T23:33:08.945844415Z"}
{"log":"I0818 23:33:11.951325       1 reconciler.go:272] attacherDetacher.AttachVolume started for volume \"pvc-2f53fe7e-846c-11e7-addd-0050569ccb52\" (UniqueName: \"kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-2f53fe7e-846c-11e7-addd-0050569ccb52.vmdk\") from node \"node3\" \n","stream":"stderr","time":"2017-08-18T23:33:11.951450457Z"}
{"log":"I0818 23:33:12.581761       1 operation_generator.go:275] AttachVolume.Attach succeeded for volume \"pvc-2f53fe7e-846c-11e7-addd-0050569ccb52\" (UniqueName: \"kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-2f53fe7e-846c-11e7-addd-0050569ccb52.vmdk\") from node \"node3\" \n","stream":"stderr","time":"2017-08-18T23:33:12.581846373Z"}

The issue is caused by a recent change in the Kubernetes 1.7 release, kubernetes#45346, which added the "Multi-Attach error for volume" handling.
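
To illustrate that handling, here is a minimal sketch of the exclusivity check; the types and the state map are invented for this example and do not mirror the upstream code:

```go
package main

import "fmt"

// attachedTo tracks which node each volume is currently attached to, standing
// in for the controller's actual state of the world.
var attachedTo = map[string]string{
	// Still attached to the powered-off node:
	"pvc-2f53fe7e-846c-11e7-addd-0050569ccb52": "node2",
}

// attach refuses to attach a volume that is exclusively held by another node.
func attach(volume, node string) error {
	if holder, ok := attachedTo[volume]; ok && holder != node {
		return fmt.Errorf("Multi-Attach error for volume %q: already exclusively attached to node %q and can't be attached to %q",
			volume, holder, node)
	}
	attachedTo[volume] = node
	return nil
}

func main() {
	// node3 cannot attach until the volume is detached from node2, so the
	// rescheduled pod stays in ContainerCreating.
	if err := attach("pvc-2f53fe7e-846c-11e7-addd-0050569ccb52", "node3"); err != nil {
		fmt.Println(err)
	}
}
```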

I think this is a very serious problem for us because, unlike on GCE or AWS, vSphere volumes are not detached automatically when the node is powered off.

@tusharnt @divyenpatel @luomiao @rohitjogvmw

@BaluDontu

Kubernetes issue tracking this - kubernetes#50944

@BaluDontu commented Aug 23, 2017

The PR kubernetes#50944 has been merged. I have created a new cherry-pick request onto the 1.7 branch: kubernetes#51226.

I will close this issue once the cherry-pick onto the 1.7 release branch is merged.

@BaluDontu

The cherry-pick into the 1.7 release branch, kubernetes#51226, has been merged.
