Don't try to attach volumes which are already attached to other nodes #45346

Merged (1 commit) May 20, 2017

Conversation

@codablock (Contributor) commented May 4, 2017

This PR is a replacement for #40148. I was not able to push fixes and rebases to the original branch as I no longer have access to the GitHub organization.

CC @saad-ali: you will probably have to update the PR link in Q2 2017 (v1.7).

I assume this PR will need a new "ok to test".

ORIGINAL PR DESCRIPTION

This PR fixes an issue with the attach/detach volume controller. There are cases where the desiredStateOfWorld contains the same volume for multiple nodes, resulting in the attach/detach controller attaching this volume to multiple nodes. This of course fails for volumes like AWS EBS, Azure Disks, ...

I observed this situation on Azure when using Azure Disks and replication controllers that reschedule pods. When you delete a pod that belongs to an RC, the RC immediately schedules a new pod on another node. This results in a short window (at most a few seconds) where two pods try to attach/mount the same volume on different nodes. As the old pod is still alive, the attach/detach controller does not try to detach the volume and immediately starts attaching it for the new pod.

This behavior probably went unnoticed on other clouds because the bogus attach attempt fails quickly. As the situation with the two pods resolves after a few seconds, a detach for the old pod is initiated and the new pod can then attach successfully.

On Azure, however, attaching and detaching take quite long, so the first bogus attach attempt already eats up a lot of time.
When attaching fails on Azure with a report that the disk is already attached somewhere else, the cloud provider immediately issues a detach call for the same volume+node it tried to attach to, to make sure the failed attach request is aborted right away. You can find this here: https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/azure/azure_storage.go#L74

The complete attach->fail->abort flow eats up valuable time, and the attach/detach controller cannot proceed with other work while this is happening. This means that if the old pod disappears in the meantime, the controller can't even start the detach for the volume, which delays the whole process of rescheduling and reattaching.

Also, I and other people have observed very strange behavior where disks ended up "attached" to multiple VMs at the same time, as reported by the Azure Portal. This causes the controller to fail reattaching forever. It's hard to figure out why and when this happens, and no reproducer is known yet. I can imagine, however, that this correlates with the behavior described above.

I was not sure if there are actually cases where it is perfectly fine to have a volume mounted to multiple pods/nodes. At least technically, this should be possible with network-based volumes, e.g. NFS. Can someone with more knowledge about volumes help me here? I may need to add a check before skipping the attach in reconcile.
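
For reference, a minimal sketch of the guard this change adds to the reconciler's attach loop, reconstructed from the diff hunks quoted in the review below; the final continue that skips the attach is assumed for illustration and is not part of the quoted hunks:

    // Inside reconcile(), before starting an attach operation for volumeToAttach:
    if rc.isMultiAttachForbidden(volumeToAttach.VolumeSpec) {
        // Ask the actual state of the world whether this volume is already
        // attached to any node.
        nodes := rc.actualStateOfWorld.GetNodesForVolume(volumeToAttach.VolumeName)
        if len(nodes) > 0 {
            // A single-attach volume is still attached elsewhere: skip the attach
            // for now and retry on a later reconcile pass, once the detach from
            // the old node has completed.
            glog.V(4).Infof("Volume %q is already exclusively attached to node %q and can't be attached to %q",
                volumeToAttach.VolumeName, nodes, volumeToAttach.NodeName)
            continue
        }
    }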

CC @colemickens @rootfs


Don't try to attach a volume to a new node if it is already attached to another node and the volume does not support multi-attach.

@k8s-reviewable: This change is Reviewable

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 4, 2017
@k8s-ci-robot (Contributor):

Hi @codablock. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with @k8s-bot ok to test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels May 4, 2017
@rootfs (Contributor) commented May 4, 2017

/cc @jingxu97

@k8s-ci-robot k8s-ci-robot requested a review from jingxu97 May 4, 2017 14:20
// Check for volume types which are known to fail slow or cause trouble when trying to multi-attach
if volumeSpec.Volume.AzureDisk != nil ||
volumeSpec.Volume.Cinder != nil {
return true
Contributor:

aws too?

Contributor:

According to this page https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes,
six storage types support multi-writer access while thirteen do not. Maybe we can make a map of them as a constant and check whether the volume type is one of them?
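
As a rough illustration of that suggestion (plugin names and the set below are examples only, not a verified list; the follow-up is tracked later in this thread):

    package sketch

    // multiWriterCapable is an illustrative, deliberately incomplete lookup of
    // volume plugins that can be written from more than one node at a time.
    // A real implementation would derive the full set from the access-modes
    // table linked above.
    var multiWriterCapable = map[string]bool{
        // Network filesystems that can be mounted read-write on several nodes.
        "nfs":       true,
        "glusterfs": true,
        "cephfs":    true,
        // Block-device plugins that must only ever be attached to one node.
        "aws-ebs":    false,
        "gce-pd":     false,
        "azure-disk": false,
        "cinder":     false,
    }

    // isMultiAttachForbiddenForPlugin returns true when the plugin is known to
    // be single-attach only. Unknown plugins default to "not forbidden" so the
    // attacher itself can still fail with a proper error if needed.
    func isMultiAttachForbiddenForPlugin(pluginName string) bool {
        capable, known := multiWriterCapable[pluginName]
        return known && !capable
    }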

Contributor Author:

I'd suggest doing this refactoring after this PR is merged. It's currently hard for me to do extensive testing as I'm working on non-k8s projects at the moment.

Contributor:

Make a note here for follow-up.

Contributor Author:

Done: #45596

@@ -124,6 +125,41 @@ func (rc *reconciler) syncStates() {
rc.attacherDetacher.VerifyVolumesAreAttached(volumesPerNode, rc.actualStateOfWorld)
}

// isMultiAttachForbidden checks if attaching this volume to multiple nodes is definitely not allowed/possible.
// In it's current form, this function can only reliably say for which volumes it's definitely forbidden. If it returns
Contributor:

In its current form

if volumeSpec.PersistentVolume != nil {
if len(volumeSpec.PersistentVolume.Spec.AccessModes) == 0 {
// No access mode specified so we don't know for sure. Let the attacher fail if needed
return false
Contributor:

this shouldn't happen, log an error

Contributor Author:

AccessModes is optional according to

// AccessModes contains all ways the volume can be mounted.
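
To illustrate the direction such a refactoring could take once access modes are used instead of per-plugin checks (a sketch with local stand-in types, not code from this PR or its follow-up):

    package sketch

    // AccessMode is a local stand-in for the v1 PersistentVolumeAccessMode
    // type, just for this illustration.
    type AccessMode string

    const (
        ReadWriteOnce AccessMode = "ReadWriteOnce"
        ReadOnlyMany  AccessMode = "ReadOnlyMany"
        ReadWriteMany AccessMode = "ReadWriteMany"
    )

    // multiAttachForbiddenByAccessModes treats a volume as single-attach only
    // when every declared access mode is ReadWriteOnce. With no access modes
    // declared we can't know for sure, so we fall through and let the attacher
    // fail later if it has to.
    func multiAttachForbiddenByAccessModes(modes []AccessMode) bool {
        if len(modes) == 0 {
            return false
        }
        for _, m := range modes {
            if m == ReadOnlyMany || m == ReadWriteMany {
                return false
            }
        }
        return true
    }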

if rc.isMultiAttachForbidden(volumeToAttach.VolumeSpec) {
nodes := rc.actualStateOfWorld.GetNodesForVolume(volumeToAttach.VolumeName)
if len(nodes) > 0 {
glog.V(4).Infof("Volume %q is already exclusively attached to node %q and can't be attached to %q", volumeToAttach.VolumeName, nodes, volumeToAttach.NodeName)
Contributor:

if it is not spamming, V(3) is more helpful.

Contributor Author:

This one is spamming (10 times per second) when it happens. On Azure it always happened while volumes were being moved to other nodes, as long as the volume was not yet detached from the old node.

@rootfs (Contributor) commented May 4, 2017

Mostly lgtm. Waiting for @jingxu97.

@rootfs (Contributor) commented May 4, 2017

@k8s-bot ok to test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 4, 2017
!exponentialbackoff.IsExponentialBackoff(err) {
// Ignore nestedpendingoperations.IsAlreadyExists && exponentialbackoff.IsExponentialBackoff errors, they are expected.
if err != nil && !exponentialbackoff.IsExponentialBackoff(err) {
// Ignore exponentialbackoff.IsExponentialBackoff errors, they are expected.
Contributor:

why remove isAlreadyExists error?

Contributor Author:

Please see our old discussion at #40148 (review)

Contributor:

Now I am thinking that although the reconciler is currently the only place issuing those volume operations, that might not be true in the future. So it might be better to put the isAlreadyExists error back so we know what the problem is.

Contributor Author:

The same thought was the reason to remove the check. Please note that the check was there to prevent logging of the error, as it was probably considered "normal" that it happened.

Before this PR, the reconciler would try to start the operation every 100ms, succeeding only the first time and erroring on every try that followed. This check was the reason no error message was ever shown.

Now that we have the rc.attacherDetacher.IsOperationPending(attachedVolume.VolumeName, "") check a few lines above, this error should not happen anymore, and if it does for whatever reason, it should be logged.

// may pass while at the same time the volume leaves the pending state, resulting in
// double detach attempts
if rc.attacherDetacher.IsOperationPending(attachedVolume.VolumeName, "") {
glog.V(10).Infof("Operation for volume %q is already running. Can't start detach for %q", attachedVolume.VolumeName, attachedVolume.NodeName)
Contributor:

I might have forgotten: is this also a bug fix or just an improvement?

Contributor Author:

Please see our old discussion at #40148 (review)

Contributor:

got it.

@rootfs (Contributor) commented May 10, 2017

@codablock please address #45346 (comment)
#45346 (comment)

make a note on #45346 (comment)

then lgtm

@codablock (Contributor Author):
@rootfs Unfortunately GitHub no longer shows correct links for comments, so I don't know which comments you actually meant. If you meant the "it's" thing, I just pushed that (for some reason I missed it in my last push).

@rootfs (Contributor) commented May 10, 2017

@codablock commented; please make a note on the follow-up work for the access-mode refactoring.

@rootfs (Contributor) commented May 10, 2017

/lgtm

/assign @saad-ali for approval

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 10, 2017
@rootfs (Contributor) commented May 16, 2017

@codablock this needs a rebase, and please fix this:

k8s.io/kubernetes/pkg/controller/volume/attachdetach/reconciler/reconciler_test.go:435: not enough arguments in call to dsw.AddNode

@saad-ali (Member) left a comment:

Thanks for the PR @codablock.

/lgtm
/approve

@saad-ali (Member):

Please fix the unit tests

@saad-ali (Member):

Something else I was thinking about: this will fundamentally change the way that users receive information about a common error case.

Before this change, if a disk was ReadWriteOnce or ReadOnlyOnce and a user referenced it in more than one pod (scheduled on different nodes), then all but the first pod would fail attach, and an event would be generated (see code) so that users knew subsequent pods were failing to start because the volume could not be attached, as it was already attached to another node.

With this change, the reconciler no longer even calls operation_executor and just silently refuses to attach the subsequent volumes (a V(10) warning in the log file doesn't count). So instead of seeing an event indicating that attach failed for a particular pod because the volume is already attached to another node, users will see a generic "pod failed to start because it timed out waiting for attach" (attach never generates an error event). This is a bit of a regression from the user-experience perspective. We can add an event, but since the reconciler code doesn't handle exponential back-off and executes in a tight loop, we'd need to add logic to handle that as well.

You can address this in this PR or open another issue to address it in a follow-up PR; regardless, I think it's worth fixing before 1.7 ships with this change.
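
One possible way to handle the tight-loop concern if an event is added later (purely an illustration, not code from this PR or from the follow-up issue): remember when a given volume/node pair was last reported and suppress repeats inside a window.

    package sketch

    import (
        "sync"
        "time"
    )

    // reportThrottle suppresses repeated reports for the same key inside a
    // window, so a reconcile loop running every 100ms doesn't flood the API
    // server with identical warning events.
    type reportThrottle struct {
        mu       sync.Mutex
        window   time.Duration
        lastSeen map[string]time.Time
    }

    func newReportThrottle(window time.Duration) *reportThrottle {
        return &reportThrottle{window: window, lastSeen: map[string]time.Time{}}
    }

    // Allow returns true when an event for key may be emitted now, and records
    // the emission time so later calls inside the window are suppressed.
    func (t *reportThrottle) Allow(key string) bool {
        t.mu.Lock()
        defer t.mu.Unlock()
        now := time.Now()
        if last, ok := t.lastSeen[key]; ok && now.Sub(last) < t.window {
            return false
        }
        t.lastSeen[key] = now
        return true
    }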

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 18, 2017
@codablock (Contributor Author):

@saad-ali Unit tests should be fixed now.

I most likely won't have time to add new UX error handling as I can only work on k8s in my spare time at the moment. I added an issue to track it: #46012

@rootfs (Contributor) commented May 18, 2017

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 18, 2017
@saad-ali (Member):

/lgtm
/approve

@k8s-github-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: codablock, rootfs, saad-ali

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 19, 2017
@NickrenREN (Contributor):

@k8s-bot kubemark e2e test this

@k8s-ci-robot (Contributor) commented May 20, 2017

@codablock: The following test(s) failed:

Test name: pull-kubernetes-federation-e2e-gce
Commit: 06baeb3
Rerun command: @k8s-bot pull-kubernetes-federation-e2e-gce test this

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot:

Automatic merge from submit-queue
