
Allow attach of volumes to multiple nodes for vSphere #51066

Conversation

BaluDontu
Contributor

@BaluDontu BaluDontu commented Aug 22, 2017

This is a fix for issue #50944, where a volume cannot be attached to a new node after the node to which it was previously attached is powered off.

Current behaviour:
One of the cluster worker nodes was powered off in vCenter.
Pods running on this node were rescheduled to different nodes but got stuck in ContainerCreating: attaching the volume to the new node failed with "Multi-Attach error for volume pvc-xxx, Volume is already exclusively attached to one node and can't be attached to another", so the application running in the pod had no data available because the volume was not attached to the new node. Since the volume was still attached to the powered-off node, every attempt to attach it to the new node failed with the same Multi-Attach error. The pod stayed stuck for 6 minutes, until the attach/detach controller forcefully detached the volume from the powered-off node. Only then was the volume successfully attached to the new node and the application's data became available again.

What is expected to happen:
The attach/detach controller should go ahead and attach the volume to the new node where the pod was provisioned, instead of waiting for the volume to be detached from the powered-off node. It is fine to eventually detach the volume from the powered-off node after 6 minutes. This way application downtime is low and the pods come back up as soon as possible.

The fix exempts vSphere volumes/persistent volumes from the multi-attach check in the attach/detach controller. A simplified sketch of the resulting logic is shown below.
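For illustration only, here is a minimal, self-contained sketch of the kind of check this PR changes in the reconciler's isMultiAttachForbidden. The stand-in types and the surrounding structure are assumptions reconstructed from the diff hunks quoted further down, not the exact upstream code.

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes volume types involved; the real
// controller works on *volume.Spec and the v1 API types.
type VolumeSource struct {
	AzureDisk     *struct{}
	Cinder        *struct{}
	VsphereVolume *struct{}
}

type AccessMode string

const (
	ReadWriteOnce AccessMode = "ReadWriteOnce"
	ReadOnlyMany  AccessMode = "ReadOnlyMany"
	ReadWriteMany AccessMode = "ReadWriteMany"
)

type Spec struct {
	Volume      *VolumeSource
	AccessModes []AccessMode
}

// isMultiAttachForbidden mirrors the shape of the reconciler check touched by
// this PR: some plugins are always forbidden to multi-attach, vSphere volumes
// are explicitly exempted (the change in this PR), and everything else falls
// back to the declared access modes.
func isMultiAttachForbidden(spec *Spec) bool {
	if spec.Volume != nil {
		// Volume types known to fail when multi-attached.
		if spec.Volume.AzureDisk != nil || spec.Volume.Cinder != nil {
			return true
		}
		// The change in this PR: never treat vSphere volumes as
		// multi-attach-forbidden, so a new attach is not blocked while
		// the old node still reports the volume as attached.
		if spec.Volume.VsphereVolume != nil {
			return false
		}
	}
	if len(spec.AccessModes) == 0 {
		// No access mode specified, so we don't know for sure; let the
		// attacher fail later if needed.
		return false
	}
	for _, mode := range spec.AccessModes {
		if mode == ReadWriteMany || mode == ReadOnlyMany {
			return false
		}
	}
	return true
}

func main() {
	vsphere := &Spec{Volume: &VolumeSource{VsphereVolume: &struct{}{}}}
	fmt.Println(isMultiAttachForbidden(vsphere)) // false: attach to the new node proceeds
}
```

With the vSphere early return, the reconciler no longer reports a Multi-Attach error while the volume is still recorded as attached to the powered-off node.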

@jingxu97 @saad-ali: Can you please take a look?

@tusharnt @divyenpatel @rohitjogvmw @luomiao

Allow attach of volumes to multiple nodes for vSphere

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 22, 2017
@k8s-ci-robot
Contributor

Hi @BaluDontu. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 22, 2017
@k8s-github-robot k8s-github-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. release-note-label-needed labels Aug 22, 2017
@BaluDontu
Contributor Author

/assign @saad-ali

@@ -143,11 +143,20 @@ func (rc *reconciler) isMultiAttachForbidden(volumeSpec *volume.Spec) bool {
		volumeSpec.Volume.Cinder != nil {
		return true
	}
	// Check for volume types which do not fail when trying to multi-attach
	if volumeSpec.Volume.VsphereVolume != nil {
Member

This is not necessary since the default behavior should return false. I don't want this to become a list of every plugin that exists.

Contributor Author

I missed it. Will remove it.

	if volumeSpec.PersistentVolume.Spec.VsphereVolume != nil {
		return false
	}

if len(volumeSpec.PersistentVolume.Spec.AccessModes) == 0 {
Member

I would be OK with being more aggressive here and replacing the volumeSpec.PersistentVolume.Spec.AccessModes check altogether with a whitelist (as above).

Contributor Author

Shouldn't we still check access modes for other cloud providers? Other providers might use the ReadWriteMany or ReadOnlyMany access modes for volumes, in which case the attach/detach controller should allow multi-attach. Removing the entire access-mode check might cause a problem for those providers.

	if len(volumeSpec.PersistentVolume.Spec.AccessModes) == 0 {
		// No access mode specified so we don't know for sure. Let the attacher fail if needed
		return false
	}

	// check if this volume is allowed to be attached to multiple PODs/nodes, if yes, return false
	for _, accessMode := range volumeSpec.PersistentVolume.Spec.AccessModes {
		if accessMode == v1.ReadWriteMany || accessMode == v1.ReadOnlyMany {
			return false
		}
	}

Member

Discussed offline. OK with leaving this as is for now; modifying it is an issue that also affects other storage plugins.
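For reference, the more aggressive whitelist approach suggested above (and deliberately not adopted in this PR) might look roughly like the following sketch. It reuses the hypothetical stand-in Spec type from the earlier example, and the plugin set is purely illustrative.

```go
// Hypothetical alternative discussed above: instead of inferring multi-attach
// support from access modes, keep an explicit whitelist of volume plugins
// known to tolerate attachment to multiple nodes. Illustrative only; this is
// not what was merged.
func isMultiAttachForbiddenWhitelist(spec *Spec) bool {
	// Whitelisted plugins: the attach/detach controller may attach them to
	// a new node even while they appear attached elsewhere.
	if spec.Volume != nil && spec.Volume.VsphereVolume != nil {
		return false
	}
	// Every other plugin is treated as single-attach only.
	return true
}
```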

@saad-ali
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 22, 2017
@BaluDontu BaluDontu force-pushed the MultiAttachVolumeIssueVsphere branch from 82ce983 to cfdff1a Compare August 22, 2017 01:05
@saad-ali
Member

/lgtm
/approve

Please modify the first comment on this page to include a release note.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 22, 2017
@k8s-github-robot k8s-github-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. labels Aug 22, 2017
@saad-ali saad-ali added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. release-note-label-needed labels Aug 22, 2017
@saad-ali saad-ali added this to the v1.7 milestone Aug 22, 2017
@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 22, 2017
@saad-ali
Member

@wojtek-t for 1.7 cherry pick approval

@BaluDontu
Contributor Author

/retest

@wojtek-t
Member

/retest

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@BaluDontu
Contributor Author

/retest

@BaluDontu
Contributor Author

@saad-ali @jingxu97: Can you provide an lgtm on this PR?

@jingxu97
Contributor

/lgtm.
One question: after the VM was powered off, how long did it take for the pods to be recreated on other nodes?

@BaluDontu
Contributor Author

@jingxu97: It took a minute or so for Kubernetes to recognize that the VM was powered off.

After that, it deleted all the pods on the powered-off node (node1 in this case) and immediately started attaching the volume on the new node (node3 in this case), after which the pod became available. It took around 40-50 seconds for the pod to become available.

  1. Now node1 is powered off. Node controller recognizes this and eventually deletes the orphaned pod.
{"log":"W0821 19:27:12.632675       1 vsphere.go:1446] VM node1, is not in poweredOn state\n","stream":"stderr","time":"2017-08-21T19:27:12.632845421Z"}
{"log":"I0821 19:27:33.459577       1 controller_utils.go:285] Recording status change NodeNotReady event message for node node1\n","stream":"stderr","time":"2017-08-21T19:27:33.459956784Z"}{"log":"I0821 19:27:33.459653       1 controller_utils.go:203] Update ready status of pods on node [node1]\n","stream":"stderr","time":"2017-08-21T19:27:33.459975458Z"}
{"log":"I0821 19:27:33.459791       1 event.go:217] Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"\", Name:\"node1\", UID:\"c32a5114-86a4-11e7-bff5-005056bd2502\", APIVersion:\"\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'NodeNotReady' Node node1 status is now: NodeNotReady\n","stream":"stderr","time":"2017-08-21T19:27:33.459981155Z"}
{"log":"W0821 19:27:33.553165       1 vsphere.go:611] VM node1, is not in poweredOn state\n","stream":"stderr","time":"2017-08-21T19:27:33.553316921Z"}
{"log":"I0821 19:27:33.553213       1 nodecontroller.go:752] Deleting node (no longer present in cloud provider): node1\n","stream":"stderr","time":"2017-08-21T19:27:33.553369294Z"}
{"log":"I0821 19:27:33.553220       1 controller_utils.go:274] Recording Deleting Node node1 because it's not present according to cloud provider event message for node node1\n","stream":"stderr","time":"2017-08-21T19:27:33.55337406Z"}
{"log":"I0821 19:27:33.553384       1 event.go:217] Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"\", Name:\"node1\", UID:\"c32a5114-86a4-11e7-bff5-005056bd2502\", APIVersion:\"\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'DeletingNode' Node node1 event: Deleting Node node1 because it's not present according to cloud provider\n","stream":"stderr","time":"2017-08-21T19:27:33.553443403Z"}
{"log":"I0821 19:27:38.553459       1 nodecontroller.go:623] NodeController observed a Node deletion: node1\n","stream":"stderr","time":"2017-08-21T19:27:38.553720811Z"}
{"log":"I0821 19:27:38.553518       1 controller_utils.go:274] Recording Removing Node node1 from NodeController event message for node node1\n","stream":"stderr","time":"2017-08-21T19:27:38.553760747Z"}
{"log":"I0821 19:27:38.553811       1 event.go:217] Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"\", Name:\"node1\", UID:\"c32a5114-86a4-11e7-bff5-005056bd2502\", APIVersion:\"\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'RemovingNode' Node node1 event: Removing Node node1 from NodeController\n","stream":"stderr","time":"2017-08-21T19:27:38.554956045Z"}
{"log":"I0821 19:27:53.531262       1 gc_controller.go:157] Found orphaned Pod redis-server-2448813087-m62zw assigned to the Node node1. Deleting.\n","stream":"stderr","time":"2017-08-21T19:27:53.531376237Z"}
{"log":"I0821 19:27:53.531267       1 gc_controller.go:62] PodGC is force deleting Pod: default:redis-server-2448813087-m62zw\n","stream":"stderr","time":"2017-08-21T19:27:53.53138159Z"}
{"log":"I0821 19:27:53.557972       1 gc_controller.go:161] Forced deletion of orphaned Pod redis-server-2448813087-m62zw succeeded\n","stream":"stderr","time":"2017-08-21T19:27:53.558039494Z"}
  2. The controller immediately started attaching the volume on the new node (node3 in this case), and the attach succeeded.
{"log":"I0821 19:27:56.771960       1 reconciler.go:208] Started AttachVolume for volume \"kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-208563c0-86a5-11e7-bff5-005056bd2502.vmdk\" to node \"node3\"\n","stream":"stderr","time":"2017-08-21T19:27:56.772160877Z"}
{"log":"I0821 19:27:57.804622       1 operation_generator.go:290] AttachVolume.Attach succeeded for volume \"kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-208563c0-86a5-11e7-bff5-005056bd2502.vmdk\" (spec.Name: \"pvc-208563c0-86a5-11e7-bff5-005056bd2502\") from node \"node3\".\n","stream":"stderr","time":"2017-08-21T19:27:57.804762946Z"}
{"log":"I0821 19:27:57.880295       1 node_status_updater.go:143] Updating status for node \"node3\" succeeded. patchBytes: \"{\\\"status\\\":{\\\"volumesAttached\\\":[{\\\"devicePath\\\":\\\"/dev/disk/by-id/wwn-0x6000c291fb5e79d81ea1958e888c9d37\\\",\\\"name\\\":\\\"kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-208563c0-86a5-11e7-bff5-005056bd2502.vmdk\\\"}]}}\" VolumesAttached: [{kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-208563c0-86a5-11e7-bff5-005056bd2502.vmdk /dev/disk/by-id/wwn-0x6000c291fb5e79d81ea1958e888c9d37}]\n","stream":"stderr","time":"2017-08-21T19:27:57.880397598Z"}

@BaluDontu
Contributor Author

@jingxu97 : PR still says lgtm label is not present. Can you try it again?

@jingxu97 jingxu97 added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 23, 2017
@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit b75d423 into kubernetes:master Aug 23, 2017
@wojtek-t wojtek-t added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Aug 24, 2017
k8s-github-robot pushed a commit that referenced this pull request Aug 25, 2017
…-upstream-release-1.7

Automatic merge from submit-queue

Automated cherry pick of #51066 upstream release 1.7

Cherry pick of #51066 on release-1.7:

#51066: Allow attach of volumes to multiple nodes for vSphere

@rohitjogvmw  @divyenpatel @tusharnt @luomiao 

```release-note
vSphere: Allow attach of volumes to multiple nodes for vSphere.
```
@k8s-cherrypick-bot

Commit found in the "release-1.7" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

@BaluDontu BaluDontu deleted the MultiAttachVolumeIssueVsphere branch September 5, 2017 23:09
k8s-github-robot pushed a commit that referenced this pull request Aug 29, 2018
Automatic merge from submit-queue (batch tested with PRs 67745, 67432, 67569, 67825, 67943). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Fix VMWare VM freezing bug by reverting #51066

**What this PR does / why we need it**: kube-controller-manager, VSphere specific: When the controller tries to attach a Volume to Node A that is already attached to Node B, Node A freezes until the volume is attached.  Kubernetes continues to try to attach the volume as it thinks that it's 'multi-attachable' when it's not. #51066 is the culprit.


**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes vmware-archive#500 / vmware-archive#502 (same issue)

**Special notes for your reviewer**:

- Repro:

Vsphere installation, any k8s version from 1.8 and above, pod with attached PV/PVC/VMDK:

1. cordon the node which the pod is in
2. `kubectl delete po/[pod] --force --grace-period=0`
3. the pod is immediately rescheduled to a new node. Grab the new node from a `kubectl describe [pod]` and attempt to Ping it or SSH into it.
4. you can see that pings/ssh fail to reach the new node. `kubectl get node` shows it as 'NotReady'. New node is frozen until the volume is attached - usually 1 minute freeze for 1 volume in a low-load cluster, and many minutes more with higher loads and more volumes involved.

- Patch verification:

Tested a custom patched 1.9.10 kube-controller-manager with #51066 reverted and the above bug is resolved - can't repro it anymore. New node doesn't freeze at all, and attaching happens quite quickly, in a few seconds.


**Release note**:

``` 
Fix VSphere VM Freezing bug by reverting #51066 

```
k8s-github-robot pushed a commit that referenced this pull request Sep 7, 2018
…5-upstream-release-1.10

Automatic merge from submit-queue.

Automated cherry pick of #67825: Fix VMWare VM freezing bug by reverting #51066

Cherry pick of #67825 on release-1.10.

#67825: Fix VMWare VM freezing bug by reverting #51066
k8s-github-robot pushed a commit that referenced this pull request Sep 7, 2018
…5-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #67825: Fix VMWare VM freezing bug by reverting #51066

Cherry pick of #67825 on release-1.9.

#67825: Fix VMWare VM freezing bug by reverting #51066
k8s-ci-robot added a commit that referenced this pull request Sep 10, 2018
…5-upstream-release-1.11

Automated cherry pick of #67825: Fix VMWare VM freezing bug by reverting #51066
@redbaron
Contributor

redbaron commented Oct 9, 2018

When this change was applied/discussed, what was the vSphere integration expected to do in the "normal" case? When a pod is restarted on another node, this change makes the volume get attached on the new node first, before the detach on the previous node is executed.

We see kernel panics where the volume is attached to the new node before being detached from the other node, so I am asking to find out whether there is a bug lurking elsewhere, even though this particular change was reverted. Looking at the discussion, it seems the normal case would be handled by the vSphere cloud provider, but what exactly is the expected behaviour here? Should it report an error when asked to attach a volume that is still attached to a healthy node, without attempting the attach? Should it detach as part of the attach?

@alvaroaleman
Member

To clarify this for future users who may stumble on this: The change made here got reverted in #67825. The Kubernetes versions that do not include the change introduced here anymore are:

 - `v1.10.8` (and later)
 - `v1.11.4` (and later)
 - `v1.12+`
