Don't try to attach volumes which are already attached to other nodes #45346
Conversation
Hi @codablock. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with "@k8s-bot ok to test". Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/cc @jingxu97
// Check for volume types which are known to fail slow or cause trouble when trying to multi-attach
if volumeSpec.Volume.AzureDisk != nil ||
	volumeSpec.Volume.Cinder != nil {
	return true
aws too?
according to this page https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes
There are 6 types of storage that support multi-write, while 13 of them do not. Maybe we can make a map of it as a constant, and check whether the volume type is one of them?
I'd suggest doing this refactoring after this PR is merged. Currently it's hard for me to do extensive testing as I'm working on non-k8s projects at the moment.
make a note here for follow up.
Done: #45596
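A rough sketch of what that follow-up refactor could look like, purely for illustration: the plugin names, the map and the helper below are assumptions, not code from this PR or from #45596.

// Hypothetical follow-up refactor (not part of this PR): keep a table of
// volume plugins that are known to support multi-node writes and consult it
// instead of hard-coding individual nil checks per volume type.
var multiWriteCapable = map[string]bool{
	"nfs":        true,
	"glusterfs":  true,
	"cephfs":     true,
	"azure-disk": false,
	"cinder":     false,
	"aws-ebs":    false,
	"gce-pd":     false,
}

// supportsMultiAttach returns true unless the plugin is known to forbid
// multi-attach, matching the "only forbid what we know for sure" approach
// of isMultiAttachForbidden below.
func supportsMultiAttach(pluginName string) bool {
	capable, known := multiWriteCapable[pluginName]
	return !known || capable
}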
@@ -124,6 +125,41 @@ func (rc *reconciler) syncStates() {
	rc.attacherDetacher.VerifyVolumesAreAttached(volumesPerNode, rc.actualStateOfWorld)
}

// isMultiAttachForbidden checks if attaching this volume to multiple nodes is definitely not allowed/possible.
// In it's current form, this function can only reliably say for which volumes it's definitely forbidden. If it returns
In its current form
if volumeSpec.PersistentVolume != nil {
	if len(volumeSpec.PersistentVolume.Spec.AccessModes) == 0 {
		// No access mode specified so we don't know for sure. Let the attacher fail if needed
		return false
this shouldn't happen, log an error
AccessModes is optional according to kubernetes/pkg/api/v1/types.go (line 477 in 17d33ea):
// AccessModes contains all ways the volume can be mounted.
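Putting the hunks quoted in this review together, the overall shape of the new check is roughly the following. This is a reassembled sketch, not a verbatim copy of the PR; in particular the access-mode loop is an assumption about the parts not quoted here.

func (rc *reconciler) isMultiAttachForbidden(volumeSpec *volume.Spec) bool {
	if volumeSpec.Volume != nil {
		// Inline volume types known to fail slow or cause trouble when trying to multi-attach.
		if volumeSpec.Volume.AzureDisk != nil ||
			volumeSpec.Volume.Cinder != nil {
			return true
		}
	}

	if volumeSpec.PersistentVolume != nil {
		if len(volumeSpec.PersistentVolume.Spec.AccessModes) == 0 {
			// No access mode specified so we don't know for sure. Let the attacher fail if needed.
			return false
		}
		// If any access mode allows more than one node, don't forbid multi-attach.
		for _, accessMode := range volumeSpec.PersistentVolume.Spec.AccessModes {
			if accessMode == v1.ReadWriteMany || accessMode == v1.ReadOnlyMany {
				return false
			}
		}
		// Only single-node access modes were specified: multi-attach is definitely forbidden.
		return true
	}

	// We don't know for sure, so let the attacher decide.
	return false
}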
if rc.isMultiAttachForbidden(volumeToAttach.VolumeSpec) {
	nodes := rc.actualStateOfWorld.GetNodesForVolume(volumeToAttach.VolumeName)
	if len(nodes) > 0 {
		glog.V(4).Infof("Volume %q is already exclusively attached to node %q and can't be attached to %q", volumeToAttach.VolumeName, nodes, volumeToAttach.NodeName)
if it is not spamming, V(3) is more helpful.
This one is spamming (10 times per second) if it happens. On Azure it always happened while volumes were being moved to other nodes, as long as the volume was not yet detached from the old node.
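For context, this is roughly where the check sits in the attach loop; the surrounding loop is paraphrased here rather than copied from the PR.

// Paraphrased sketch of the attach loop around the new check. The reconciler
// loop runs roughly every 100ms, so this message fires about 10 times per
// second while the old attachment still exists; hence V(4) rather than V(3).
for _, volumeToAttach := range rc.desiredStateOfWorld.GetVolumesToAttach() {
	if rc.isMultiAttachForbidden(volumeToAttach.VolumeSpec) {
		nodes := rc.actualStateOfWorld.GetNodesForVolume(volumeToAttach.VolumeName)
		if len(nodes) > 0 {
			glog.V(4).Infof("Volume %q is already exclusively attached to node %q and can't be attached to %q",
				volumeToAttach.VolumeName, nodes, volumeToAttach.NodeName)
			continue
		}
	}
	// ...normal attach path continues here...
}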
Mostly lgtm. Wait for @jingxu97
@k8s-bot ok to test
	!exponentialbackoff.IsExponentialBackoff(err) {
	// Ignore nestedpendingoperations.IsAlreadyExists && exponentialbackoff.IsExponentialBackoff errors, they are expected.
if err != nil && !exponentialbackoff.IsExponentialBackoff(err) {
	// Ignore exponentialbackoff.IsExponentialBackoff errors, they are expected.
why remove isAlreadyExists error?
Please see our old discussion at #40148 (review)
Now I am thinking that although the reconciler is currently the only place that issues those volume operations, that might not be true in the future. So it might be better to put the isAlreadyExists error check back so we know what the problem is.
The same thought was the reason to remove the check. Please note that the check was there to actually prevent the logging of the error, as it was probably considered "normal" that it happened.
Before this PR, the reconciler would try to start the operation every 100ms, only succeeding the first time and erroring on every try that followed. This check was the reason that no error message was ever shown.
Now that we have the rc.attacherDetacher.IsOperationPending(attachedVolume.VolumeName, "") check a few lines above, this error should not happen anymore, and if it does for whatever reason, it should be logged.
// may pass while at the same time the volume leaves the pending state, resulting in
// double detach attempts
if rc.attacherDetacher.IsOperationPending(attachedVolume.VolumeName, "") {
	glog.V(10).Infof("Operation for volume %q is already running. Can't start detach for %q", attachedVolume.VolumeName, attachedVolume.NodeName)
I might forget, is this also a bug fix or just an improvement?
Please see our old discussion at #40148 (review)
got it.
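Taken together, the detach path now looks roughly like this. This is a paraphrased sketch of the two pieces discussed above, not verbatim PR code.

// Paraphrased sketch: the IsOperationPending guard keeps the reconciler from
// re-issuing an operation that is already in flight, so an AlreadyExists
// error from the operation executor would now indicate a real problem and is
// no longer filtered out of the logs.
if rc.attacherDetacher.IsOperationPending(attachedVolume.VolumeName, "") {
	glog.V(10).Infof("Operation for volume %q is already running. Can't start detach for %q",
		attachedVolume.VolumeName, attachedVolume.NodeName)
	continue
}
err := rc.attacherDetacher.DetachVolume(attachedVolume.AttachedVolume, verifySafeToDetach, rc.actualStateOfWorld)
if err != nil && !exponentialbackoff.IsExponentialBackoff(err) {
	// Only exponential-backoff errors are expected here; anything else,
	// including AlreadyExists, is worth surfacing.
	glog.Errorf("attacherDetacher.DetachVolume failed to start for volume %q from node %q: %v",
		attachedVolume.VolumeName, attachedVolume.NodeName, err)
}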
Force-pushed a8fe5a9 to 3308935
@codablock please address #45346 (comment) and make a note on #45346 (comment), then lgtm
Force-pushed 3308935 to 3c02910
@rootfs Unfortunately GitHub no longer shows correct links to comments, so I don't know which comments you actually meant. If you meant the "it's" thing... I just pushed that (for some reason I missed it in my last push).
@codablock commented; please make a note on follow-up work for the access mode refactoring.
/lgtm
/assign @saad-ali for approval
@codablock needs a rebase; please also fix this.
Thanks for the PR @codablock.
/lgtm
/approve
Please fix the unit tests
Something else I was thinking about: this will fundamentally change the way that users receive information about a common error case. Before this change, if a disk was already attached elsewhere, the attach attempt would fail and the user would see an error for it. With this change, the reconciler now doesn't even call operation_executor, and just silently refuses to attach the subsequent volumes (only a V(4) log message). You can address this in this PR or open up another issue to address it in a follow-up PR; regardless, I think it's worth fixing before 1.7 ships with this change.
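One way to keep this visible to users would be to emit an event when the attach is skipped. This is only a hypothetical sketch: rc.recorder and volumeToAttach.ScheduledPods are assumed fields and are not part of this PR.

// Hypothetical follow-up (not in this PR): surface the multi-attach conflict
// as a warning event on the affected pods instead of only a V(4) log line.
msg := fmt.Sprintf("Volume %q is already exclusively attached to node %q and can't be attached to %q",
	volumeToAttach.VolumeName, nodes, volumeToAttach.NodeName)
for _, pod := range volumeToAttach.ScheduledPods {
	rc.recorder.Eventf(pod, v1.EventTypeWarning, "FailedAttachVolume", msg)
}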
Force-pushed 3c02910 to dcff890
Force-pushed dcff890 to 06baeb3
/lgtm
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: codablock, rootfs, saad-ali
Needs approval from an approver in each of these OWNERS files:
You can indicate your approval by writing /approve in a comment.
@k8s-bot kubemark e2e test this
@codablock: The following test(s) failed:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Automatic merge from submit-queue
This PR is a replacement for #40148. I was not able to push fixes and rebases to the original branch as I don't have access to the GitHub organization anymore.
CC @saad-ali. You probably have to update the PR link in Q2 2017 (v1.7).
I assume the PR will need a new "ok to test"
ORIGINAL PR DESCRIPTION
This PR fixes an issue with the attach/detach volume controller. There are cases where the desiredStateOfWorld contains the same volume for multiple nodes, resulting in the attach/detach controller attaching this volume to multiple nodes. This of course fails for volumes like AWS EBS, Azure Disks, ...

I observed this situation on Azure when using Azure Disks and replication controllers which start to reschedule PODs. When you delete a POD that belongs to an RC, the RC will immediately schedule a new POD on another node. This results in a short window (at most a few seconds) where you have 2 PODs which try to attach/mount the same volume on different nodes. As the old POD is still alive, the attach/detach controller does not try to detach the volume and starts to attach the volume to the new POD immediately.
This behavior was probably not noticed before on other clouds, as the bogus attach attempt fails pretty fast and thus goes unnoticed. As the situation with the 2 PODs disappears after a few seconds, a detach for the old POD is initiated and the new POD can then attach successfully.
On Azure however, attaching and detaching take quite long, resulting in the first bogus attach attempt already eating up much time.
When attaching fails on Azure and reports that it is already attached somewhere else, the cloud provider immediately does a detach call for the same volume+node it tried to attach to. This is done to make sure the failed attach request is aborted immediately. You can find this here: https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/azure/azure_storage.go#L74
The complete flow of attach -> fail -> abort eats up valuable time, and the attach/detach controller cannot proceed with other work while this is happening. This means that if the old POD disappears in the meantime, the controller can't even start the detach for the volume, which delays the whole process of rescheduling and reattaching.
Also, I and other people have observed very strange behavior where disks ended up being "attached" to multiple VMs at the same time, as reported by the Azure Portal. This results in the controller failing to reattach forever. It's hard to figure out why and when this happens and there is no known reproducer yet. I can imagine, however, that the described behavior correlates with what I described above.
I was not sure if there are actually cases where it is perfectly fine to have a volume mounted to multiple PODs/nodes. At least technically, this should be possible with network-based volumes, e.g. NFS. Can someone with more knowledge about volumes help me here? I may need to add a check before skipping attaching in reconcile.

CC @colemickens @rootfs