Don't try to attach volumes which are already attached to other nodes #45346

Merged (1 commit) May 20, 2017

Conversation

@codablock (Contributor) commented May 4, 2017

This PR is a replacement for #40148. I was not able to push fixes and rebases to the original branch as I no longer have access to the GitHub organization.

CC @saad-ali: you will probably have to update the PR link in Q2 2017 (v1.7).

I assume this PR will need a new "ok to test".

ORIGINAL PR DESCRIPTION

This PR fixes an issue with the attach/detach volume controller. There are cases where the desiredStateOfWorld contains the same volume for multiple nodes, resulting in the attach/detach controller attaching this volume to multiple nodes. This of course fails for volumes like AWS EBS, Azure Disks, ...

I observed this situation on Azure when using Azure Disks and replication controllers that reschedule pods. When you delete a pod that belongs to an RC, the RC immediately schedules a new pod on another node. This results in a short window (at most a few seconds) where two pods try to attach/mount the same volume on different nodes. As the old pod is still alive, the attach/detach controller does not try to detach the volume and immediately starts attaching it for the new pod.

This behavior probably went unnoticed on other clouds because the bogus attach attempt fails quickly. As the situation with the two pods resolves after a few seconds, a detach for the old pod is initiated and the new pod can then attach successfully.

On Azure, however, attaching and detaching take quite long, so the first bogus attach attempt already eats up a lot of time.
When attaching fails on Azure with a report that the disk is already attached somewhere else, the cloud provider immediately issues a detach call for the same volume+node it tried to attach to, to make sure the failed attach request is aborted right away. You can find this here: https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/azure/azure_storage.go#L74

The complete attach->fail->abort flow eats up valuable time, and the attach/detach controller cannot proceed with other work while this is happening. This means that if the old pod disappears in the meantime, the controller can't even start the detach for the volume, which delays the whole process of rescheduling and reattaching.

Also, I and other people have observed very strange behavior where disks ended up "attached" to multiple VMs at the same time, as reported by the Azure Portal. This causes the controller to fail reattaching forever. It's hard to figure out why and when this happens, and no reproducer is known yet. I can imagine, however, that this correlates with the behavior described above.

I was not sure if there are actually cases where it is perfectly fine to have a volume mounted to multiple pods/nodes. At least technically, this should be possible with network-based volumes, e.g. NFS. Can someone with more knowledge about volumes help me here? I may need to add a check before skipping the attach in reconcile.
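
For reference, a minimal sketch of the guard this change adds to the reconciler's attach loop, reconstructed from the diff hunks quoted in the review below; the final continue that skips the attach is assumed for illustration and is not part of the quoted hunks:

    // Inside reconcile(), before starting an attach operation for volumeToAttach:
    if rc.isMultiAttachForbidden(volumeToAttach.VolumeSpec) {
        // Ask the actual state of the world whether this volume is already
        // attached to any node.
        nodes := rc.actualStateOfWorld.GetNodesForVolume(volumeToAttach.VolumeName)
        if len(nodes) > 0 {
            // A single-attach volume is still attached elsewhere: skip the attach
            // for now and retry on a later reconcile pass, once the detach from
            // the old node has completed.
            glog.V(4).Infof("Volume %q is already exclusively attached to node %q and can't be attached to %q",
                volumeToAttach.VolumeName, nodes, volumeToAttach.NodeName)
            continue
        }
    }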

CC @colemickens @rootfs


Don't try to attach a volume to a new node if it is already attached to another node and the volume does not support multi-attach.

@k8s-reviewable: This change is Reviewable

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 4, 2017
@k8s-ci-robot (Contributor):

Hi @codablock. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with @k8s-bot ok to test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels May 4, 2017
@rootfs (Contributor) commented May 4, 2017

/cc @jingxu97

@k8s-ci-robot k8s-ci-robot requested a review from jingxu97 May 4, 2017 14:20
// Check for volume types which are known to fail slow or cause trouble when trying to multi-attach
if volumeSpec.Volume.AzureDisk != nil ||
volumeSpec.Volume.Cinder != nil {
return true
Contributor:

aws too?

Contributor:

According to this page https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes,
six storage types support multi-writer access while thirteen do not. Maybe we can make a map of them as a constant and check whether the volume type is one of them?
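
As a rough illustration of that suggestion (plugin names and the set below are examples only, not a verified list; the follow-up is tracked later in this thread):

    package sketch

    // multiWriterCapable is an illustrative, deliberately incomplete lookup of
    // volume plugins that can be written from more than one node at a time.
    // A real implementation would derive the full set from the access-modes
    // table linked above.
    var multiWriterCapable = map[string]bool{
        // Network filesystems that can be mounted read-write on several nodes.
        "nfs":       true,
        "glusterfs": true,
        "cephfs":    true,
        // Block-device plugins that must only ever be attached to one node.
        "aws-ebs":    false,
        "gce-pd":     false,
        "azure-disk": false,
        "cinder":     false,
    }

    // isMultiAttachForbiddenForPlugin returns true when the plugin is known to
    // be single-attach only. Unknown plugins default to "not forbidden" so the
    // attacher itself can still fail with a proper error if needed.
    func isMultiAttachForbiddenForPlugin(pluginName string) bool {
        capable, known := multiWriterCapable[pluginName]
        return known && !capable
    }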

Contributor Author:

I'd suggest doing this refactoring after this PR is merged. It's currently hard for me to do extensive testing as I'm working on non-k8s projects at the moment.

Contributor:

Make a note here for follow-up.

Contributor Author:

Done: #45596

@@ -124,6 +125,41 @@ func (rc *reconciler) syncStates() {
rc.attacherDetacher.VerifyVolumesAreAttached(volumesPerNode, rc.actualStateOfWorld)
}

// isMultiAttachForbidden checks if attaching this volume to multiple nodes is definitely not allowed/possible.
// In it's current form, this function can only reliably say for which volumes it's definitely forbidden. If it returns
Contributor:

In its current form

if volumeSpec.PersistentVolume != nil {
if len(volumeSpec.PersistentVolume.Spec.AccessModes) == 0 {
// No access mode specified so we don't know for sure. Let the attacher fail if needed
return false
Contributor:

this shouldn't happen, log an error

Contributor Author:

AccessModes is optional according to

// AccessModes contains all ways the volume can be mounted.
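
To illustrate the direction such a refactoring could take once access modes are used instead of per-plugin checks (a sketch with local stand-in types, not code from this PR or its follow-up):

    package sketch

    // AccessMode is a local stand-in for the v1 PersistentVolumeAccessMode
    // type, just for this illustration.
    type AccessMode string

    const (
        ReadWriteOnce AccessMode = "ReadWriteOnce"
        ReadOnlyMany  AccessMode = "ReadOnlyMany"
        ReadWriteMany AccessMode = "ReadWriteMany"
    )

    // multiAttachForbiddenByAccessModes treats a volume as single-attach only
    // when every declared access mode is ReadWriteOnce. With no access modes
    // declared we can't know for sure, so we fall through and let the attacher
    // fail later if it has to.
    func multiAttachForbiddenByAccessModes(modes []AccessMode) bool {
        if len(modes) == 0 {
            return false
        }
        for _, m := range modes {
            if m == ReadOnlyMany || m == ReadWriteMany {
                return false
            }
        }
        return true
    }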

if rc.isMultiAttachForbidden(volumeToAttach.VolumeSpec) {
nodes := rc.actualStateOfWorld.GetNodesForVolume(volumeToAttach.VolumeName)
if len(nodes) > 0 {
glog.V(4).Infof("Volume %q is already exclusively attached to node %q and can't be attached to %q", volumeToAttach.VolumeName, nodes, volumeToAttach.NodeName)
Contributor:

if it is not spamming, V(3) is more helpful.

Contributor Author:

This one is spamming (10 times per second) when it happens. On Azure it always happened while volumes were being moved to other nodes, as long as the volume was not yet detached from the old node.

@rootfs (Contributor) commented May 4, 2017

Mostly lgtm. Waiting for @jingxu97.

@rootfs (Contributor) commented May 4, 2017

@k8s-bot ok to test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 4, 2017
!exponentialbackoff.IsExponentialBackoff(err) {
// Ignore nestedpendingoperations.IsAlreadyExists && exponentialbackoff.IsExponentialBackoff errors, they are expected.
if err != nil && !exponentialbackoff.IsExponentialBackoff(err) {
// Ignore exponentialbackoff.IsExponentialBackoff errors, they are expected.
Contributor:

why remove isAlreadyExists error?

Contributor Author:

Please see our old discussion at #40148 (review)

Contributor:

Now I am thinking that although the reconciler is currently the only place issuing those volume operations, that might not be true in the future. So it might be better to put the isAlreadyExists error back so we know what the problem is.

Contributor Author:

The same thought was the reason to remove the check. Please note that the check was there to prevent logging of the error, as it was probably considered "normal" that it happened.

Before this PR, the reconciler would try to start the operation every 100ms, succeeding only the first time and erroring on every try that followed. This check was the reason no error message was ever shown.

Now that we have the rc.attacherDetacher.IsOperationPending(attachedVolume.VolumeName, "") check a few lines above, this error should not happen anymore, and if it does for whatever reason, it should be logged.

// may pass while at the same time the volume leaves the pending state, resulting in
// double detach attempts
if rc.attacherDetacher.IsOperationPending(attachedVolume.VolumeName, "") {
glog.V(10).Infof("Operation for volume %q is already running. Can't start detach for %q", attachedVolume.VolumeName, attachedVolume.NodeName)
Contributor:

I might have forgotten: is this also a bug fix or just an improvement?

Contributor Author:

Please see our old discussion at #40148 (review)

Contributor:

got it.

@rootfs (Contributor) commented May 10, 2017

@codablock please address #45346 (comment)
#45346 (comment)

make a note on #45346 (comment)

then lgtm

@codablock (Contributor Author):
@rootfs Unfortunately GitHub no longer shows correct links for comments, so I don't know which comments you actually meant. If you meant the "it's" thing, I just pushed that (for some reason I missed it in my last push).

@rootfs (Contributor) commented May 10, 2017

@codablock commented; please make a note on the follow-up work for the access-mode refactoring.

@rootfs (Contributor) commented May 10, 2017

/lgtm

/assign @saad-ali for approval

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 10, 2017
@rootfs (Contributor) commented May 16, 2017

@codablock this needs a rebase, and please fix this:

k8s.io/kubernetes/pkg/controller/volume/attachdetach/reconciler/reconciler_test.go:435: not enough arguments in call to dsw.AddNode

@saad-ali (Member) left a comment:

Thanks for the PR @codablock.

/lgtm
/approve

@saad-ali (Member):

Please fix the unit tests

@saad-ali (Member):

Something else I was thinking about: this will fundamentally change the way that users receive information about a common error case.

Before this change, if a disk was ReadWriteOnce or ReadOnlyOnce and a user referenced it in more than one pod (scheduled on different nodes), then all but the first pod would fail attach, and an event would be generated (see code) so that users knew subsequent pods were failing to start because the volume could not be attached, as it was already attached to another node.

With this change, the reconciler no longer even calls operation_executor and just silently refuses to attach the subsequent volumes (a V(10) warning in the log file doesn't count). So instead of seeing an event indicating that attach failed for a particular pod because the volume is already attached to another node, users will see a generic "pod failed to start because it timed out waiting for attach" (attach never generates an error event). This is a bit of a regression from the user-experience perspective. We can add an event, but since the reconciler code doesn't handle exponential back-off and executes in a tight loop, we'd need to add logic to handle that as well.

You can address this in this PR or open another issue to address it in a follow-up PR; regardless, I think it's worth fixing before 1.7 ships with this change.
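
One possible way to handle the tight-loop concern if an event is added later (purely an illustration, not code from this PR or from the follow-up issue): remember when a given volume/node pair was last reported and suppress repeats inside a window.

    package sketch

    import (
        "sync"
        "time"
    )

    // reportThrottle suppresses repeated reports for the same key inside a
    // window, so a reconcile loop running every 100ms doesn't flood the API
    // server with identical warning events.
    type reportThrottle struct {
        mu       sync.Mutex
        window   time.Duration
        lastSeen map[string]time.Time
    }

    func newReportThrottle(window time.Duration) *reportThrottle {
        return &reportThrottle{window: window, lastSeen: map[string]time.Time{}}
    }

    // Allow returns true when an event for key may be emitted now, and records
    // the emission time so later calls inside the window are suppressed.
    func (t *reportThrottle) Allow(key string) bool {
        t.mu.Lock()
        defer t.mu.Unlock()
        now := time.Now()
        if last, ok := t.lastSeen[key]; ok && now.Sub(last) < t.window {
            return false
        }
        t.lastSeen[key] = now
        return true
    }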

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 18, 2017
@codablock (Contributor Author):

@saad-ali Unit tests should be fixed now.

I most likely won't have time to add new UX error handling as I can only work on k8s in my spare time at the moment. I added an issue to track it: #46012

@rootfs (Contributor) commented May 18, 2017

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 18, 2017
@saad-ali (Member):

/lgtm
/approve

@k8s-github-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: codablock, rootfs, saad-ali

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 19, 2017
@NickrenREN (Contributor):

@k8s-bot kubemark e2e test this

@k8s-ci-robot (Contributor) commented May 20, 2017

@codablock: The following test(s) failed:

Test name: pull-kubernetes-federation-e2e-gce
Commit: 06baeb3
Rerun command: @k8s-bot pull-kubernetes-federation-e2e-gce test this

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot:

Automatic merge from submit-queue
