Allow attach of volumes to multiple nodes for vSphere #51066
Conversation
Hi @BaluDontu. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @saad-ali
```
@@ -143,11 +143,20 @@ func (rc *reconciler) isMultiAttachForbidden(volumeSpec *volume.Spec) bool {
		volumeSpec.Volume.Cinder != nil {
		return true
	}
	// Check for volume types which do not fail when trying to multi-attach
	if volumeSpec.Volume.VsphereVolume != nil {
```
This is not necessary since the default behavior should return false. I don't want this to become a list of every plugin that exists.
I missed it. Will remove it.
```
	if volumeSpec.PersistentVolume.Spec.VsphereVolume != nil {
		return false
	}

	if len(volumeSpec.PersistentVolume.Spec.AccessModes) == 0 {
```
I would be OK being more aggressive here and replacing the `volumeSpec.PersistentVolume.Spec.AccessModes` check altogether with a whitelist (as above).
Shouldn't we check access modes for other cloud providers? Other cloud providers might use the ReadWriteMany or ReadOnlyMany access modes for volumes, and in that case the attach/detach controller should allow multi-attach. Removing the entire access-mode check might cause a problem for other providers.
```
	if len(volumeSpec.PersistentVolume.Spec.AccessModes) == 0 {
		// No access mode specified so we don't know for sure. Let the attacher fail if needed
		return false
	}

	// check if this volume is allowed to be attached to multiple PODs/nodes, if yes, return false
	for _, accessMode := range volumeSpec.PersistentVolume.Spec.AccessModes {
		if accessMode == v1.ReadWriteMany || accessMode == v1.ReadOnlyMany {
			return false
		}
	}
```
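As a quick standalone illustration of how this access-mode loop behaves (a hypothetical example, not part of the PR; the `k8s.io/api/core/v1` import path assumes a current client library layout and may differ in older releases):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// allowsMultiAttach mirrors the loop quoted above: a persistent volume carrying
// ReadWriteMany or ReadOnlyMany is treated as safe to attach to multiple nodes.
func allowsMultiAttach(modes []v1.PersistentVolumeAccessMode) bool {
	for _, m := range modes {
		if m == v1.ReadWriteMany || m == v1.ReadOnlyMany {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(allowsMultiAttach([]v1.PersistentVolumeAccessMode{v1.ReadWriteOnce})) // false: multi-attach forbidden
	fmt.Println(allowsMultiAttach([]v1.PersistentVolumeAccessMode{v1.ReadOnlyMany}))  // true: multi-attach allowed
}
```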
Discussed offline. OK with leaving this as is for now; modifying it is also an issue for other storage plugins.
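A rough sketch of what the whitelist approach suggested above might look like (hypothetical; the function name and the specific plugins listed are illustrative, not necessarily the set the reviewer had in mind):

```go
package reconcilersketch

import v1 "k8s.io/api/core/v1"

// multiAttachForbiddenByWhitelist forbids multi-attach only for an explicit
// list of plugins known to be single-attach, and defaults to false for
// everything else instead of inspecting AccessModes.
func multiAttachForbiddenByWhitelist(spec *v1.PersistentVolumeSpec) bool {
	if spec.GCEPersistentDisk != nil ||
		spec.AWSElasticBlockStore != nil ||
		spec.Cinder != nil {
		// Known single-attach block devices (illustrative set).
		return true
	}
	// Default: do not forbid; let the attacher fail later if the volume
	// really cannot be attached to more than one node.
	return false
}
```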
/ok-to-test
Force-pushed from 82ce983 to cfdff1a.
/lgtm Please modify the first comment on this page to include a release note.
@wojtek-t for 1.7 cherry pick approval
/retest
/retest
/retest Review the full test history for this PR.
/retest
/lgtm
@jingxu97 : It took a minute or so for Kubernetes to recognize that the VM was powered off. After that, it deleted all the pods on the powered-off node (node1 in this case) and immediately started attaching the volume on the new node (node3 in this case), after which the pod became available. In total it took around 40-50 seconds for the pod to become available.
@jingxu97 : The PR still says the lgtm label is not present. Can you try it again?
/test all [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue
…-upstream-release-1.7
Automatic merge from submit-queue
Automated cherry pick of #51066 upstream release 1.7
Cherry pick of #51066 on release-1.7: #51066: Allow attach of volumes to multiple nodes for vSphere
@rohitjogvmw @divyenpatel @tusharnt @luomiao
```release-note
vSphere: Allow attach of volumes to multiple nodes for vSphere.
```
Commit found in the "release-1.7" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.
Automatic merge from submit-queue (batch tested with PRs 67745, 67432, 67569, 67825, 67943). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Fix VMWare VM freezing bug by reverting #51066

**What this PR does / why we need it**: kube-controller-manager, vSphere specific: when the controller tries to attach a volume to Node A that is already attached to Node B, Node A freezes until the volume is attached. Kubernetes continues to try to attach the volume, as it thinks it is 'multi-attachable' when it's not. #51066 is the culprit.

**Which issue(s) this PR fixes**: Fixes vmware-archive#500 / vmware-archive#502 (same issue)

**Special notes for your reviewer**:
- Repro: vSphere installation, any k8s version from 1.8 and above, pod with attached PV/PVC/VMDK:
  1. Cordon the node which the pod is in.
  2. `kubectl delete po/[pod] --force --grace-period=0`
  3. The pod is immediately rescheduled to a new node. Grab the new node from a `kubectl describe [pod]` and attempt to ping it or SSH into it.
  4. You can see that pings/SSH fail to reach the new node. `kubectl get node` shows it as 'NotReady'. The new node is frozen until the volume is attached - usually a 1 minute freeze for 1 volume in a low-load cluster, and many minutes more with higher loads and more volumes involved.
- Patch verification: tested a custom patched 1.9.10 kube-controller-manager with #51066 reverted, and the above bug is resolved - can't repro it anymore. The new node doesn't freeze at all, and attaching happens quite quickly, in a few seconds.

**Release note**:
```
Fix VSphere VM Freezing bug by reverting #51066
```
When this change was applied/discussed, what was the vSphere integration expected to do in the "normal" case? When a pod is restarted on another node, this change makes the volume attach on the new node first, before the detach on the previous node is executed. We see kernel panics where the volume is attached before it is detached from the other node, so I am asking to find out if there is a bug lurking elsewhere even though this particular change was reverted. Looking at the discussion, it seems that the normal case would be handled by the vSphere cloud provider, but what exactly is the expected behaviour here? Should it report an error when asked to attach a volume that is attached to a healthy node, without attempting the attach? Should it detach as part of attach?
To clarify this for future users who may stumble on this: the change made here got reverted in #67825. The Kubernetes versions that no longer include the change introduced here are:
This is a fix for issue #50944, where a volume cannot be attached to a new node after the node to which it was previously attached is powered off.
Current behaviour:
One of the cluster worker nodes was powered off in vCenter.
Pods running on this node were rescheduled on different nodes but got stuck in ContainerCreating. Attaching the volume on the new node failed with the error "Multi-Attach error for volume pvc-xxx, Volume is already exclusively attached to one node and can't be attached to another", so the application running in the pod had no data available because the volume was not attached to the new node. Since the volume was still attached to the powered-off node, every attempt to attach it on the new node failed with the same "Multi-Attach error". It stayed stuck for 6 minutes, until the attach/detach controller forcefully detached the volume from the powered-off node. Only after those 6 minutes, once the volume was detached from the powered-off node, was it successfully attached on the new node and the data became available to the application.
What is expected to happen:
I would want the attach/detach controller to go ahead with attaching the volume on the new node where the pod got provisioned, instead of waiting for the volume to be detached from the powered-off node. It is OK to eventually detach the volume from the powered-off node after 6 minutes. This way the application downtime is low and the pods are up as soon as possible.
With this fix, the attach/detach controller skips the multi-attach check for vSphere volumes/persistent volumes.
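Assembling the diff fragments quoted in the review thread above, the changed check amounts to roughly the following (a sketch of the logic, not the verbatim merged code; inline-volume plugin checks other than Cinder are elided):

```go
package reconcilersketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/volume"
)

// isMultiAttachForbidden, roughly as changed by this PR: vSphere persistent
// volumes are exempted from the multi-attach check, so the attach/detach
// controller attaches the volume to the new node without waiting for the
// detach from the powered-off node.
func isMultiAttachForbidden(volumeSpec *volume.Spec) bool {
	if volumeSpec.Volume != nil {
		// Inline volume types known to fail when multi-attached
		// (other plugin checks elided).
		if volumeSpec.Volume.Cinder != nil {
			return true
		}
	}

	if volumeSpec.PersistentVolume != nil {
		// Persistent volume types which do not fail when trying to multi-attach.
		if volumeSpec.PersistentVolume.Spec.VsphereVolume != nil {
			return false
		}

		if len(volumeSpec.PersistentVolume.Spec.AccessModes) == 0 {
			// No access mode specified so we don't know for sure.
			// Let the attacher fail if needed.
			return false
		}

		// A volume with ReadWriteMany or ReadOnlyMany may be attached to
		// multiple pods/nodes.
		for _, accessMode := range volumeSpec.PersistentVolume.Spec.AccessModes {
			if accessMode == v1.ReadWriteMany || accessMode == v1.ReadOnlyMany {
				return false
			}
		}
		return true
	}

	// Not enough information: default to allowing the attach and let the
	// attacher fail if the volume cannot be multi-attached.
	return false
}
```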
@jingxu97 @saad-ali : Can you please take a look?
@tusharnt @divyenpatel @rohitjogvmw @luomiao