
Conversation

@codablock (Contributor) commented Jan 19, 2017

EDIT: REPLACED BY #45346

This PR fixes an issue with the attach/detach volume controller. There are cases where the desiredStateOfWorld contains the same volume for multiple nodes, resulting in the attach/detach controller attaching this volume to multiple nodes. This of course fails for volumes like AWS EBS, Azure Disks, ...

I observed this situation on Azure when using Azure Disks and replication controllers which start to reschedule PODs. When you delete a POD that belongs to a RC, the RC will immediately schedule a new POD on another node. This results in a short time (max a few seconds) where you have 2 PODs which try to attach/mount the same volume on different nodes. As the old POD is still alive, the attach/detach controller does not try to detach the volume and starts to attach the volume to the new POD immediately.

This behavior was probably not noticed before on other clouds, as the bogus attach attempt fails pretty fast and thus goes unnoticed. Once the situation with the 2 PODs disappears after a few seconds, a detach for the old POD is initiated and the new POD can attach successfully.

On Azure however, attaching and detaching take quite long, so the first bogus attach attempt alone eats up a lot of time.
When attaching fails on Azure and reports that it is already attached somewhere else, the cloud provider immediately does a detach call for the same volume+node it tried to attach to. This is done to make sure the failed attach request is aborted immediately. You can find this here: https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/azure/azure_storage.go#L74

The complete flow of attach->fail->abort eats up valuable time, and the attach/detach controller cannot proceed with other work while this is happening. This means that if the old POD disappears in the meantime, the controller can't even start the detach for the volume, which delays the whole process of rescheduling and reattaching.

Also, I and other people have observed very strange behavior where disks ended up being "attached" to multiple VMs at the same time, as reported by the Azure Portal. This results in the controller failing to reattach forever. It's hard to figure out why and when this happens, and there is no known reproducer yet. I can imagine, however, that this correlates with the behavior described above.

I was not sure if there are actually cases where it is perfectly fine to have a volume mounted to multiple PODs/nodes. At least technically, this should be possible with network-based volumes, e.g. NFS. Can someone with more knowledge about volumes help me here? I may need to add a check before skipping attaching in reconcile.
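For illustration only, here is a minimal sketch of the kind of check meant above, using hypothetical simplified stand-in types rather than the real attach/detach controller structures: before attaching, the reconciler would skip any volume that the actual state of the world already reports as attached to a different node.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the attach/detach controller's
// caches; the real Kubernetes types are more involved.
type VolumeName string
type NodeName string

type actualStateOfWorld struct {
	attachments map[VolumeName][]NodeName
}

// GetNodesForVolume returns the nodes the volume is currently attached to.
func (asw *actualStateOfWorld) GetNodesForVolume(v VolumeName) []NodeName {
	return asw.attachments[v]
}

type volumeToAttach struct {
	VolumeName VolumeName
	NodeName   NodeName
}

// reconcile sketches the guard discussed above: before attaching, skip any
// volume that is already reported as attached to a different node.
func reconcile(asw *actualStateOfWorld, desired []volumeToAttach) {
	for _, v := range desired {
		nodes := asw.GetNodesForVolume(v.VolumeName)
		if len(nodes) > 0 && nodes[0] != v.NodeName {
			fmt.Printf("skipping %q: already attached to %q\n", v.VolumeName, nodes[0])
			continue
		}
		fmt.Printf("attaching %q to %q\n", v.VolumeName, v.NodeName)
	}
}

func main() {
	asw := &actualStateOfWorld{attachments: map[VolumeName][]NodeName{"vol-1": {"node-1"}}}
	reconcile(asw, []volumeToAttach{
		{VolumeName: "vol-1", NodeName: "node-2"}, // skipped: attached elsewhere
		{VolumeName: "vol-2", NodeName: "node-2"}, // attached normally
	})
}
```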

CC @colemickens @rootfs


Don't try to attach volumes to nodes if they are already attached to other nodes

@k8s-ci-robot (Contributor)

Hi @codablock. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with @k8s-bot ok to test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-reviewable

This change is Reviewable

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 19, 2017
@k8s-github-robot k8s-github-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jan 19, 2017
@gnufied (Member) commented Jan 19, 2017

@k8s-bot ok to test

glog.V(5).Infof("Volume %q is already attached to node %q and can't be attached to %q", volumeToAttach.VolumeName, nodes[0], volumeToAttach.NodeName)
continue
}

Member:

Can we add a test for this in reconciler_test.go ?

Contributor Author:

I added 2 tests, one which tests the initial issue this PR fixes and one that tests if a volume is reattached to another node when it is detached from the first node.

@rootfs (Contributor) commented Jan 19, 2017

@codablock I hit the same idea before. The problem with this approach is that there can be a race window between detecting that no node has the volume attached and the volume actually being attached.

@codablock (Contributor, Author)

@rootfs Can you describe this in more detail? Not sure if I understand you correctly.

@codablock (Contributor, Author)

I had a short discussion on Slack and it looks like I need to take care of cases where volumes have access modes other than ReadWriteOnce. An alternative would be to check for volume types that don't support multi-node attachment, e.g. all block devices.
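To illustrate that idea, here is a minimal sketch using hypothetical simplified types rather than the real volume.Spec: a check that forbids multi-attach for single-attach block devices and for volumes whose only access mode is ReadWriteOnce.

```go
package main

import "fmt"

// Hypothetical simplified stand-ins; the real check inspects volume.Spec
// and the PersistentVolume's access modes.
type AccessMode string

const (
	ReadWriteOnce AccessMode = "ReadWriteOnce"
	ReadOnlyMany  AccessMode = "ReadOnlyMany"
	ReadWriteMany AccessMode = "ReadWriteMany"
)

type VolumeSpec struct {
	// True for block-device backed sources that can only ever be attached
	// to a single node at a time (e.g. Azure Disk, AWS EBS, Cinder).
	SingleAttachBlockDevice bool
	AccessModes             []AccessMode
}

// isMultiAttachForbidden returns true when the volume must not be attached
// to more than one node: either its source is a single-attach block device,
// or all of its access modes are ReadWriteOnce.
func isMultiAttachForbidden(spec VolumeSpec) bool {
	if spec.SingleAttachBlockDevice {
		return true
	}
	if len(spec.AccessModes) == 0 {
		return false // nothing known about the volume, do not restrict it here
	}
	for _, m := range spec.AccessModes {
		if m != ReadWriteOnce {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isMultiAttachForbidden(VolumeSpec{SingleAttachBlockDevice: true}))            // true
	fmt.Println(isMultiAttachForbidden(VolumeSpec{AccessModes: []AccessMode{ReadWriteOnce}})) // true
	fmt.Println(isMultiAttachForbidden(VolumeSpec{AccessModes: []AccessMode{ReadWriteMany}})) // false
}
```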

@jingxu97 (Contributor)

@codablock I think you can change the function GetNodesForVolume to GetExclusiveNodesForVolume, which only returns nodes that the volume is attached to with the read-write option. It checks whether the readOnly spec is set to true or not, and returns an empty list if it is true.
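A minimal sketch of what such a GetExclusiveNodesForVolume could look like, again with hypothetical simplified stand-in types instead of the real actual-state-of-world structures:

```go
package main

import "fmt"

// Hypothetical simplified stand-ins for the actual-state-of-world cache.
type VolumeName string
type NodeName string

type attachedVolume struct {
	volumeName VolumeName
	nodeName   NodeName
	readOnly   bool
}

type actualStateOfWorld struct {
	attached []attachedVolume
}

// GetExclusiveNodesForVolume returns only the nodes the volume is attached
// to read-write; read-only attachments are skipped, so a volume that is
// merely attached read-only elsewhere does not block a new attach.
func (asw *actualStateOfWorld) GetExclusiveNodesForVolume(v VolumeName) []NodeName {
	var nodes []NodeName
	for _, a := range asw.attached {
		if a.volumeName == v && !a.readOnly {
			nodes = append(nodes, a.nodeName)
		}
	}
	return nodes
}

func main() {
	asw := &actualStateOfWorld{attached: []attachedVolume{
		{volumeName: "vol-1", nodeName: "node-1", readOnly: true},
		{volumeName: "vol-2", nodeName: "node-1", readOnly: false},
	}}
	fmt.Println(asw.GetExclusiveNodesForVolume("vol-1")) // []
	fmt.Println(asw.GetExclusiveNodesForVolume("vol-2")) // [node-1]
}
```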

@codablock (Contributor, Author)

@jingxu97 Thanks for the comment and your offer to help on Slack. As far as I can see, you worked on a few parts of the volume management, so you'll probably be able to answer my questions.

  1. I initially assumed that the described situation, where 2 PODs try to attach/mount the same volume (with ReadWriteOnce), should have been prevented before this even gets into the desiredStateOfWorld, maybe by the scheduler or whatever decides which POD should use which volume. Is it possible that there is another place in the k8s code that needs additional fixing? The fix from this PR (including your suggestion) should still be in the code as a last barrier to prevent double attaching/mounting when it's not allowed/possible.
  2. As I understood the code, the desiredStateOfWorldPopulator is just for initial population of the cache and periodic sync in case some events (podAdd, podDelete, podUpdate) were missed. Is this correct?

@k8s-github-robot

[APPROVALNOTIFIER] Needs approval from an approver in each of these OWNERS Files:

We suggest the following people:
cc @saad-ali
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@colemickens (Contributor)

ping @saad-ali for merge + cherry-pick.

@codablock (Contributor, Author)

Do not merge atm please. I still need to add a check for ReadWriteOnce.

@codablock (Contributor, Author)

@rootfs I think I found the race condition you mentioned. After the reconciler starts attachment for node-1, this is not reported to actualStateOfWorld until the attachment has completed in the background. This means attachment for node-2 will start immediately, in parallel. Did you already think about a solution?

I'm also thinking about checking for pending operations as a possible solution.

@codablock codablock force-pushed the fix_double_attach branch 2 times, most recently from 3e369a8 to b4caac5 on January 25, 2017 18:11
@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 25, 2017
@codablock (Contributor, Author)

I've just pushed a new version with some additional checks and tests.

I had to change the way alreadyExistsError was handled. Instead of letting AttachVolume/DetachVolume return an error and then checking for that error, I now check IsOperationPending before I even try to start the operation. This was needed to get a predictable number of calls to NewAttacher so that the tests can reliably wait for an expected count when multiple nodes are involved.
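As a rough illustration of that pattern (hypothetical types, not the real nestedpendingoperations package), the check-before-start approach could look like this:

```go
package main

import (
	"fmt"
	"sync"
)

// pendingOperations is a hypothetical, much simplified stand-in for the
// controller's pending-operations tracking.
type pendingOperations struct {
	mu      sync.Mutex
	pending map[string]bool
	wg      sync.WaitGroup
}

// IsOperationPending reports whether an operation for the key is running.
func (p *pendingOperations) IsOperationPending(key string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.pending[key]
}

// Run starts op asynchronously and marks the key pending until it finishes.
func (p *pendingOperations) Run(key string, op func()) {
	p.mu.Lock()
	p.pending[key] = true
	p.mu.Unlock()
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		defer func() {
			p.mu.Lock()
			delete(p.pending, key)
			p.mu.Unlock()
		}()
		op()
	}()
}

func main() {
	ops := &pendingOperations{pending: map[string]bool{}}
	key := "attach/vol-1"
	// The pattern described above: check for a pending operation up front
	// instead of starting the operation and inspecting an "already exists"
	// error afterwards.
	if !ops.IsOperationPending(key) {
		ops.Run(key, func() { fmt.Println("attaching vol-1") })
	}
	ops.wg.Wait()
}
```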

@codablock codablock force-pushed the fix_double_attach branch 3 times, most recently from 0f35498 to f4a46c8 on January 26, 2017 07:59
@codablock (Contributor, Author) commented Jan 26, 2017

I'm sorry for all the build failures. Locally everything was working fine, even with make release. CI however kept failing, as I didn't know I had to run hack/update-bazel.sh.

@codablock (Contributor, Author)

@k8s-bot bazel test this

@codablock (Contributor, Author)

@k8s-bot unit test this
@k8s-bot verify test this

@codablock (Contributor, Author)

@k8s-bot unit test this

```diff
-		!exponentialbackoff.IsExponentialBackoff(err) {
-		// Ignore nestedpendingoperations.IsAlreadyExists && exponentialbackoff.IsExponentialBackoff errors, they are expected.
+	if err != nil && !exponentialbackoff.IsExponentialBackoff(err) {
+		// Ignore exponentialbackoff.IsExponentialBackoff errors, they are expected.
```
Contributor:

I think it is not necessary to remove !nestedpendingoperations.IsAlreadyExists(err) since this error might still be returned.

Contributor Author:

As the reconciler loop is the only place where operations might be started, and the loop is a single goroutine, I would expect that it's not possible to have this error returned when a previous IsOperationPending call returned false. If it still happens, there is a bug and I would prefer to see the error in the log.

Or am I missing something?

Contributor:

You are right. It should not happen.

```go
if rc.isMultiAttachForbidden(volumeToAttach.VolumeSpec) {
	nodes := rc.actualStateOfWorld.GetNodesForVolume(volumeToAttach.VolumeName)
	if len(nodes) > 0 {
		glog.V(5).Infof("Volume %q is already exclusively attached to node %q and can't be attached to %q", volumeToAttach.VolumeName, nodes, volumeToAttach.NodeName)
```
Contributor:

I suggest making this a level 4 log.

Contributor Author:

Done.

@jingxu97 (Contributor) commented Feb 1, 2017

lgtm. @saad-ali you want to take a look too?

@codablock (Contributor, Author)

I just found #39055 because someone on Slack posted a link to it. It looks like that PR introduced the issue/question #39791.

If this PR gets merged, #39791 should probably be fixed as well, as the attach/detach controller won't try to attach the volume to the new node before the detach is finished. This would however require adding Cinder volumes to the check in isMultiAttachForbidden the same way Azure Disks are checked. If this is ok, I'll add and push this change.

@NickrenREN (Contributor)

Yes @codablock. If so, Cinder should also be added to isMultiAttachForbidden to forbid multiple attachment.

@codablock (Contributor, Author)

I've added Cinder to isMultiAttachForbidden.

@saad-ali Ping regarding review ;)

@codablock (Contributor, Author)

@k8s-bot kops aws e2e test this

1 similar comment
@codablock (Contributor, Author)

@k8s-bot kops aws e2e test this

@saad-ali (Member)

Taking a look

@codablock (Contributor, Author)

@saad-ali @jingxu97 It turned out there is "volume conflict checking" in the scheduler which was missing for Azure, resulting in all the problems I tried to solve with this PR. I'm not sure how to proceed now. Please see #41398 (comment) for more details.

No matter how we continue with this PR, a fix in the scheduler is needed.

@saad-ali (Member)

@codablock I'd like to better understand the problems that Azure volumes are having. It sounds like some bigger infrastructure changes may be needed to fix the issues. We are about 10 minutes away from code freeze for 1.6. While we can get small bug fixes in during code freeze, the big changes will have to wait for 1.7. But I really want to make sure we don't miss 1.7. So how about we set up a meeting early in the 1.7 dev cycle (I'm thinking 2nd week of April) to discuss what the pain points are, propose some changes, and hopefully get them implemented in 1.7?

CC @kubernetes/sig-storage-misc

@codablock (Contributor, Author)

@saad-ali As I won't have time to work on Kubernetes and/or Azure in the next few months, I would suggest that @colemickens and @khenidak take over the discussion and do a meeting with you if required.

@khenidak is working on Azure managed disks, and as I understood it, many of the problems I encountered with Azure Disks, and which I tried to fix with PRs like this one, are also fixed by his work. So maybe this PR isn't even needed anymore.

@saad-ali (Member) commented Apr 6, 2017

One of the items on the storage backlog for Q2 2017 (v1.7) is improving Azure support. We should consider this PR as part of that effort.

CC @rootfs @chakri-nelluri for review

@codablock (Contributor, Author)

@saad-ali Maybe #40603 is interesting for this as well?

@saad-ali saad-ali added this to the v1.7 milestone Apr 6, 2017
@saad-ali (Member) commented Apr 6, 2017

Ack. Tracking both PRs

@codablock (Contributor, Author)

REPLACED BY #45346

@codablock codablock closed this May 4, 2017
k8s-github-robot pushed a commit that referenced this pull request May 20, 2017
Automatic merge from submit-queue

Don't try to attach volumes which are already attached to other nodes

This PR is a replacement for #40148. I was not able to push fixes and rebases to the original branch as I don't have access to the Github organization anymore.

CC @saad-ali You probably have to update the PR link in [Q2 2017 (v1.7)](https://docs.google.com/spreadsheets/d/1t4z5DYKjX2ZDlkTpCnp18icRAQqOE85C1T1r2gqJVck/edit#gid=14624465)

I assume the PR will need a new "ok to test" 

```release-note
Don't try to attach volume to new node if it is already attached to another node and the volume does not support multi-attach.
```
```go
}

if rc.isMultiAttachForbidden(volumeToAttach.VolumeSpec) {
	nodes := rc.actualStateOfWorld.GetNodesForVolume(volumeToAttach.VolumeName)
```


If a PV is set with accessMode 'ReadWriteOnce' and a pod using it is running on node1, then when node1 goes down and the pod is scheduled to node2, the new pod waits for attach, but the old pod's volume is still not detached as long as node1 stays down. isMultiAttachForbidden and GetNodesForVolume will then produce:
attachdetach-controller Multi-Attach error for volume "pvc-d0fde86c-8661-11e9-b873-0800271c9f15" Volume is already used by pod
