
Conversation

@codablock (Contributor) commented Jan 19, 2017

EDIT: REPLACED BY #45346

This PR fixes an issue with the attach/detach volume controller. There are cases where the desiredStateOfWorld contains the same volume for multiple nodes, resulting in the attach/detach controller attaching this volume to multiple nodes. This of course fails for volumes like AWS EBS, Azure Disks, ...

I observed this situation on Azure when using Azure Disks and replication controllers which start to reschedule PODs. When you delete a POD that belongs to a RC, the RC will immediately schedule a new POD on another node. This results in a short time (max a few seconds) where you have 2 PODs which try to attach/mount the same volume on different nodes. As the old POD is still alive, the attach/detach controller does not try to detach the volume and starts to attach the volume to the new POD immediately.

This behavior was probably not noticed before on other clouds, as the bogus attach attempt fails pretty fast and thus goes unnoticed. Once the situation with the 2 PODs disappears after a few seconds, a detach for the old POD is initiated and the new POD can attach successfully.

On Azure however, attaching and detaching take quite long, so the first bogus attach attempt alone eats up a lot of time.
When attaching fails on Azure and reports that it is already attached somewhere else, the cloud provider immediately does a detach call for the same volume+node it tried to attach to. This is done to make sure the failed attach request is aborted immediately. You can find this here: https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/azure/azure_storage.go#L74

The complete flow of attach->fail->abort eats up valuable time, and the attach/detach controller cannot proceed with other work while this is happening. This means that if the old POD disappears in the meantime, the controller can't even start the detach for the volume, which delays the whole process of rescheduling and reattaching.

Also, I and other people have observed very strange behavior where disks ended up being "attached" to multiple VMs at the same time, as reported by the Azure Portal. This results in the controller failing to reattach forever. It's hard to figure out why and when this happens, and there is no known reproducer yet. I can imagine, however, that this correlates with the behavior described above.

I was not sure if there are actually cases where it is perfectly fine to have a volume mounted to multiple PODs/nodes. At least technically, this should be possible with network-based volumes, e.g. NFS. Can someone with more knowledge about volumes help me here? I may need to add a check before skipping attaching in reconcile.
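For illustration only, here is a minimal sketch of the kind of check meant above, using hypothetical simplified stand-in types rather than the real attach/detach controller structures: before attaching, the reconciler would skip any volume that the actual state of the world already reports as attached to a different node.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the attach/detach controller's
// caches; the real Kubernetes types are more involved.
type VolumeName string
type NodeName string

type actualStateOfWorld struct {
	attachments map[VolumeName][]NodeName
}

// GetNodesForVolume returns the nodes the volume is currently attached to.
func (asw *actualStateOfWorld) GetNodesForVolume(v VolumeName) []NodeName {
	return asw.attachments[v]
}

type volumeToAttach struct {
	VolumeName VolumeName
	NodeName   NodeName
}

// reconcile sketches the guard discussed above: before attaching, skip any
// volume that is already reported as attached to a different node.
func reconcile(asw *actualStateOfWorld, desired []volumeToAttach) {
	for _, v := range desired {
		nodes := asw.GetNodesForVolume(v.VolumeName)
		if len(nodes) > 0 && nodes[0] != v.NodeName {
			fmt.Printf("skipping %q: already attached to %q\n", v.VolumeName, nodes[0])
			continue
		}
		fmt.Printf("attaching %q to %q\n", v.VolumeName, v.NodeName)
	}
}

func main() {
	asw := &actualStateOfWorld{attachments: map[VolumeName][]NodeName{"vol-1": {"node-1"}}}
	reconcile(asw, []volumeToAttach{
		{VolumeName: "vol-1", NodeName: "node-2"}, // skipped: attached elsewhere
		{VolumeName: "vol-2", NodeName: "node-2"}, // attached normally
	})
}
```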

CC @colemickens @rootfs


Don't try to attach volumes to nodes if they are already attached to other nodes

@k8s-ci-robot (Contributor)

Hi @codablock. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with @k8s-bot ok to test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-reviewable

This change is Reviewable

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 19, 2017
@k8s-github-robot k8s-github-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jan 19, 2017
@gnufied (Member) commented Jan 19, 2017

@k8s-bot ok to test

glog.V(5).Infof("Volume %q is already attached to node %q and can't be attached to %q", volumeToAttach.VolumeName, nodes[0], volumeToAttach.NodeName)
continue
}

Member:

Can we add a test for this in reconciler_test.go ?

Contributor Author:

I added 2 tests, one which tests the initial issue this PR fixes and one that tests if a volume is reattached to another node when it is detached from the first node.

@rootfs (Contributor) commented Jan 19, 2017

@codablock I hit the same idea before. The problem with this approach is that there can be a race window between detecting that no node has the volume attached and the volume actually being attached.

@codablock (Contributor, Author)

@rootfs Can you describe this in more detail? Not sure if I understand you correctly.

@codablock (Contributor, Author)

I had a short discussion on Slack and it looks like I need to take care of cases where volumes have access modes other than ReadWriteOnce. An alternative would be to check for volume types that don't support multi-node attachment, e.g. all block devices.
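To illustrate that idea, here is a minimal sketch using hypothetical simplified types rather than the real volume.Spec: a check that forbids multi-attach for single-attach block devices and for volumes whose only access mode is ReadWriteOnce.

```go
package main

import "fmt"

// Hypothetical simplified stand-ins; the real check inspects volume.Spec
// and the PersistentVolume's access modes.
type AccessMode string

const (
	ReadWriteOnce AccessMode = "ReadWriteOnce"
	ReadOnlyMany  AccessMode = "ReadOnlyMany"
	ReadWriteMany AccessMode = "ReadWriteMany"
)

type VolumeSpec struct {
	// True for block-device backed sources that can only ever be attached
	// to a single node at a time (e.g. Azure Disk, AWS EBS, Cinder).
	SingleAttachBlockDevice bool
	AccessModes             []AccessMode
}

// isMultiAttachForbidden returns true when the volume must not be attached
// to more than one node: either its source is a single-attach block device,
// or all of its access modes are ReadWriteOnce.
func isMultiAttachForbidden(spec VolumeSpec) bool {
	if spec.SingleAttachBlockDevice {
		return true
	}
	if len(spec.AccessModes) == 0 {
		return false // nothing known about the volume, do not restrict it here
	}
	for _, m := range spec.AccessModes {
		if m != ReadWriteOnce {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isMultiAttachForbidden(VolumeSpec{SingleAttachBlockDevice: true}))            // true
	fmt.Println(isMultiAttachForbidden(VolumeSpec{AccessModes: []AccessMode{ReadWriteOnce}})) // true
	fmt.Println(isMultiAttachForbidden(VolumeSpec{AccessModes: []AccessMode{ReadWriteMany}})) // false
}
```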

@jingxu97 (Contributor)

@codablock I think you can change the function GetNodesForVolume to GetExclusiveNodesForVolume, which only returns nodes that the volume is attached to with the read-write option. It checks whether the readOnly spec is set to true or not, and returns an empty list if it is true.
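A minimal sketch of what such a GetExclusiveNodesForVolume could look like, again with hypothetical simplified stand-in types instead of the real actual-state-of-world structures:

```go
package main

import "fmt"

// Hypothetical simplified stand-ins for the actual-state-of-world cache.
type VolumeName string
type NodeName string

type attachedVolume struct {
	volumeName VolumeName
	nodeName   NodeName
	readOnly   bool
}

type actualStateOfWorld struct {
	attached []attachedVolume
}

// GetExclusiveNodesForVolume returns only the nodes the volume is attached
// to read-write; read-only attachments are skipped, so a volume that is
// merely attached read-only elsewhere does not block a new attach.
func (asw *actualStateOfWorld) GetExclusiveNodesForVolume(v VolumeName) []NodeName {
	var nodes []NodeName
	for _, a := range asw.attached {
		if a.volumeName == v && !a.readOnly {
			nodes = append(nodes, a.nodeName)
		}
	}
	return nodes
}

func main() {
	asw := &actualStateOfWorld{attached: []attachedVolume{
		{volumeName: "vol-1", nodeName: "node-1", readOnly: true},
		{volumeName: "vol-2", nodeName: "node-1", readOnly: false},
	}}
	fmt.Println(asw.GetExclusiveNodesForVolume("vol-1")) // []
	fmt.Println(asw.GetExclusiveNodesForVolume("vol-2")) // [node-1]
}
```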

@codablock (Contributor, Author)

@jingxu97 Thanks for the comment and your offer to help on Slack. As far as I can see, you worked on a few parts of the volume management, so you'll probably be able to answer my questions.

  1. I initially assumed that the described situation, where 2 PODs try to attach/mount the same volume (with ReadWriteOnce), should have been prevented before this even gets into the desiredStateOfWorld, maybe by the scheduler or whatever decides which POD should use which volume. Is it possible that there is another place in the k8s code that needs additional fixing? The fix from this PR (including your suggestion) should still be in the code as a last barrier to prevent double attaching/mounting when it's not allowed/possible.
  2. As I understood the code, the desiredStateOfWorldPopulator is just for initial population of the cache and periodic sync in case some events (podAdd, podDelete, podUpdate) were missed. Is this correct?

@k8s-github-robot

[APPROVALNOTIFIER] Needs approval from an approver in each of these OWNERS Files:

We suggest the following people:
cc @saad-ali
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@colemickens (Contributor)

ping @saad-ali for merge + cherry-pick.

@codablock (Contributor, Author)

Do not merge atm please. I still need to add a check for ReadWriteOnce.

@codablock (Contributor, Author)

@rootfs I think I found the race condition you mentioned. After the reconciler starts attachment for node-1, this is not reported to actualStateOfWorld until the attachment has completed in the background. This means attachment for node-2 will start immediately, in parallel. Did you already think about a solution?

I'm also thinking about checking for pending operations as a possible solution.

@codablock codablock force-pushed the fix_double_attach branch 2 times, most recently from 3e369a8 to b4caac5 on January 25, 2017 18:11
@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 25, 2017
@codablock (Contributor, Author)

I've just pushed a new version with some additional checks and tests.

I had to change the way alreadyExistsError was handled. Instead of letting AttachVolume/DetachVolume return an error and then checking for that error, I now check IsOperationPending before I even try to start the operation. This was needed to get a predictable number of calls to NewAttacher so that the tests can reliably wait for an expected count when multiple nodes are involved.
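As a rough illustration of that pattern (hypothetical types, not the real nestedpendingoperations package), the check-before-start approach could look like this:

```go
package main

import (
	"fmt"
	"sync"
)

// pendingOperations is a hypothetical, much simplified stand-in for the
// controller's pending-operations tracking.
type pendingOperations struct {
	mu      sync.Mutex
	pending map[string]bool
	wg      sync.WaitGroup
}

// IsOperationPending reports whether an operation for the key is running.
func (p *pendingOperations) IsOperationPending(key string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.pending[key]
}

// Run starts op asynchronously and marks the key pending until it finishes.
func (p *pendingOperations) Run(key string, op func()) {
	p.mu.Lock()
	p.pending[key] = true
	p.mu.Unlock()
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		defer func() {
			p.mu.Lock()
			delete(p.pending, key)
			p.mu.Unlock()
		}()
		op()
	}()
}

func main() {
	ops := &pendingOperations{pending: map[string]bool{}}
	key := "attach/vol-1"
	// The pattern described above: check for a pending operation up front
	// instead of starting the operation and inspecting an "already exists"
	// error afterwards.
	if !ops.IsOperationPending(key) {
		ops.Run(key, func() { fmt.Println("attaching vol-1") })
	}
	ops.wg.Wait()
}
```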

@codablock codablock force-pushed the fix_double_attach branch 3 times, most recently from 0f35498 to f4a46c8 on January 26, 2017 07:59
@codablock (Contributor, Author) commented Jan 26, 2017

I'm sorry for all the build failures. Locally everything was working fine, even with make release. CI however kept failing, as I didn't know I had to run hack/update-bazel.sh.

@codablock (Contributor, Author)

@k8s-bot bazel test this

@codablock (Contributor, Author)

@k8s-bot unit test this
@k8s-bot verify test this

@codablock (Contributor, Author)

@k8s-bot unit test this

```diff
-		!exponentialbackoff.IsExponentialBackoff(err) {
-		// Ignore nestedpendingoperations.IsAlreadyExists && exponentialbackoff.IsExponentialBackoff errors, they are expected.
+	if err != nil && !exponentialbackoff.IsExponentialBackoff(err) {
+		// Ignore exponentialbackoff.IsExponentialBackoff errors, they are expected.
```
Contributor:

I think it is not necessary to remove !nestedpendingoperations.IsAlreadyExists(err) since this error might still be returned.

Contributor Author:

As the reconciler loop is the only place where operations might be started, and the loop is a single goroutine, I would expect that it's not possible to have this error returned when a previous IsOperationPending call returned false. If it still happens, there is a bug and I would prefer to see the error in the log.

Or am I missing something?

Contributor:

You are right. It should not happen.

```go
if rc.isMultiAttachForbidden(volumeToAttach.VolumeSpec) {
	nodes := rc.actualStateOfWorld.GetNodesForVolume(volumeToAttach.VolumeName)
	if len(nodes) > 0 {
		glog.V(5).Infof("Volume %q is already exclusively attached to node %q and can't be attached to %q", volumeToAttach.VolumeName, nodes, volumeToAttach.NodeName)
```
Contributor:

I suggest making this a level 4 log.

Contributor Author:

Done.

@jingxu97 (Contributor) commented Feb 1, 2017

lgtm. @saad-ali you want to take a look too?

@codablock (Contributor, Author)

I just found #39055 because someone on Slack posted a link to it. It looks like that PR introduced the issue/question #39791.

If this PR gets merged, #39791 should probably be fixed as well, as the attach/detach controller won't try to attach the volume to the new node before the detach is finished. This would however require adding Cinder volumes to the check in isMultiAttachForbidden the same way Azure Disks are checked. If this is ok, I'll add and push this change.

@NickrenREN (Contributor)

Yes @codablock. If so, Cinder should also be added to isMultiAttachForbidden to forbid multiple attachment.

@codablock (Contributor, Author)

I've added Cinder to isMultiAttachForbidden.

@saad-ali Ping regarding review ;)

@codablock (Contributor, Author)

@k8s-bot kops aws e2e test this

1 similar comment
@codablock (Contributor, Author)

@k8s-bot kops aws e2e test this

@saad-ali (Member)

Taking a look

@codablock (Contributor, Author)

@saad-ali @jingxu97 It turned out there is "volume conflict checking" in the scheduler which was missing for Azure, resulting in all the problems I tried to solve with this PR. I'm not sure how to proceed now. Please see #41398 (comment) for more details.

No matter how we continue with this PR, a fix in the scheduler is needed.

@saad-ali (Member)

@codablock I'd like to better understand the problems that Azure volumes are having. It sounds like some bigger infrastructure changes may be needed to fix the issues. We are about 10 minutes away from code freeze for 1.6. While we can get small bug fixes in during code freeze, the big changes will have to wait for 1.7. But I really want to make sure we don't miss 1.7. So how about we set up a meeting early in the 1.7 dev cycle (I'm thinking 2nd week of April) to discuss what the pain points are, propose some changes, and hopefully get them implemented in 1.7?

CC @kubernetes/sig-storage-misc

@codablock (Contributor, Author)

@saad-ali As I won't have time to work on Kubernetes and/or Azure in the next few months, I would suggest that @colemickens and @khenidak take over the discussion and do a meeting with you if required.

@khenidak is working on Azure managed disks, and as I understood it, many of the problems I encountered with Azure Disks, and which I tried to fix with PRs like this one, are also fixed by his work. So maybe this PR isn't even needed anymore.

@saad-ali (Member) commented Apr 6, 2017

One of the items on the storage backlog for Q2 2017 (v1.7) is improving Azure support. We should consider this PR as part of that effort.

CC @rootfs @chakri-nelluri for review

@codablock (Contributor, Author)

@saad-ali Maybe #40603 is interesting for this as well?

@saad-ali saad-ali added this to the v1.7 milestone Apr 6, 2017
@saad-ali (Member) commented Apr 6, 2017

Ack. Tracking both PRs

@codablock (Contributor, Author)

REPLACED BY #45346

@codablock codablock closed this May 4, 2017
k8s-github-robot pushed a commit that referenced this pull request May 20, 2017
Automatic merge from submit-queue

Don't try to attach volumes which are already attached to other nodes

This PR is a replacement for #40148. I was not able to push fixes and rebases to the original branch as I don't have access to the Github organization anymore.

CC @saad-ali You probably have to update the PR link in [Q2 2017 (v1.7)](https://docs.google.com/spreadsheets/d/1t4z5DYKjX2ZDlkTpCnp18icRAQqOE85C1T1r2gqJVck/edit#gid=14624465)

I assume the PR will need a new "ok to test" 

```release-note
Don't try to attach volume to new node if it is already attached to another node and the volume does not support multi-attach.
```
```go
}

if rc.isMultiAttachForbidden(volumeToAttach.VolumeSpec) {
	nodes := rc.actualStateOfWorld.GetNodesForVolume(volumeToAttach.VolumeName)
```


If a PV is set with accessMode 'ReadWriteOnce' and a pod using it is running on node1, then when node1 goes down and the pod is scheduled to node2, the new pod waits for attach, but the old pod's volume is still not detached as long as node1 stays down. isMultiAttachForbidden and GetNodesForVolume will then produce:
attachdetach-controller Multi-Attach error for volume "pvc-d0fde86c-8661-11e9-b873-0800271c9f15" Volume is already used by pod
