Don't try to attach volumes which are already attached to other nodes #40148
Conversation
Hi @codablock. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `@k8s-bot ok to test`. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-bot ok to test
		glog.V(5).Infof("Volume %q is already attached to node %q and can't be attached to %q", volumeToAttach.VolumeName, nodes[0], volumeToAttach.NodeName)
		continue
	}
Can we add a test for this in reconciler_test.go?
I added two tests: one that covers the initial issue this PR fixes, and one that tests whether a volume is reattached to another node once it has been detached from the first node.
@codablock I hit the same idea before. The problem with this approach is that there can be a race window between the moment when no node is detected as having the volume attached and the moment when the volume actually gets attached.

@rootfs Can you describe this in more detail? I'm not sure I understand you correctly.
Force-pushed from 13ba472 to 7f8dca9
Had a short discussion on Slack and it looks like I need to take care of cases where volumes have access modes other than ReadWriteOnce. An alternative would be to check for volume types that don't support multi-node attachment, e.g. all block devices.
@codablock I think you can change the function GetNodesForVolume to GetExclusiveNodesForVolume, which only returns nodes that the volume is attached to in read-write mode. It checks whether the readOnly spec is set to true, and returns an empty list if it is.
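For illustration, a minimal, self-contained Go sketch of the suggested helper; the types, fields, and the `GetExclusiveNodesForVolume` name are stand-ins based on this suggestion, not the real actualStateOfWorld code in the attach/detach controller.

```go
package main

import "fmt"

// Illustrative stand-ins for the controller's cache; the real
// actualStateOfWorld in the attach/detach controller is more involved.
type attachment struct {
	nodeName string
	readOnly bool
}

type actualState struct {
	// volume name -> attachments the controller knows about
	attachments map[string][]attachment
}

// GetExclusiveNodesForVolume returns only the nodes holding a read-write
// attachment of the volume. If every attachment is read-only, multi-attach
// is acceptable and an empty result is returned.
func (s *actualState) GetExclusiveNodesForVolume(volume string) []string {
	var nodes []string
	for _, a := range s.attachments[volume] {
		if !a.readOnly {
			nodes = append(nodes, a.nodeName)
		}
	}
	return nodes
}

func main() {
	state := &actualState{attachments: map[string][]attachment{
		"disk-1":  {{nodeName: "node-1", readOnly: false}},
		"share-1": {{nodeName: "node-1", readOnly: true}, {nodeName: "node-2", readOnly: true}},
	}}
	fmt.Println(state.GetExclusiveNodesForVolume("disk-1"))  // [node-1] -> must not attach elsewhere
	fmt.Println(state.GetExclusiveNodesForVolume("share-1")) // []       -> read-only everywhere, OK
}
```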
@jingxu97 Thanks for the comment and your offer to help on Slack. As far as I can see, you have worked on several parts of the volume management, so you'll probably be able to answer my questions.
[APPROVALNOTIFIER] Needs approval from an approver in each of these OWNERS Files: We suggest the following people:

ping @saad-ali for merge + cherry-pick.
Do not merge atm please. I still need to add a check for ReadWriteOnce.

@rootfs I think I found the race condition you mentioned. After the reconciler starts the attachment for node-1, this is not reported to actualStateOfWorld until the attachment has completed in the background. This means the attachment for node-2 will start immediately in parallel. Did you already think about a solution? I'm thinking about also checking for pending operations as a solution.
Force-pushed from 3e369a8 to b4caac5
I've just pushed a new version with some additional checks and tests. I had to change the way alreadyExistsError was handled. Instead of letting AttachVolume/DetachVolume return an error and checking for that error, I now check IsOperationPending before I even try to start the operation. This was needed to get a predictable number of calls to NewAttacher so that the tests can reliably wait for an expected count when multiple nodes are involved.
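As a rough illustration of that ordering, here is a self-contained Go sketch of checking for a pending operation before starting an attach. The tracker type and function names are stand-ins for the real nestedpendingoperations machinery, not its actual API.

```go
package main

import "fmt"

// Minimal stand-in for the pending-operation tracking done by
// nestedpendingoperations in the real controller; names are illustrative.
type operationTracker struct {
	pending map[string]bool // keyed by volume name
}

func (t *operationTracker) IsOperationPending(volume string) bool {
	return t.pending[volume]
}

func (t *operationTracker) markPending(volume string) {
	t.pending[volume] = true
}

type attachRequest struct {
	volume string
	node   string
}

// reconcileAttach shows the ordering described above: skip a volume while an
// operation for it is still pending, so the attach operation (and therefore
// NewAttacher) is started at most once per volume per pass.
func reconcileAttach(t *operationTracker, wanted []attachRequest) {
	for _, req := range wanted {
		if t.IsOperationPending(req.volume) {
			// An attach/detach for this volume is still in flight; try again
			// on the next reconciler pass instead of raising an error.
			continue
		}
		t.markPending(req.volume)
		fmt.Printf("starting attach of %s to %s\n", req.volume, req.node)
	}
}

func main() {
	t := &operationTracker{pending: map[string]bool{}}
	reconcileAttach(t, []attachRequest{
		{volume: "disk-1", node: "node-1"},
		{volume: "disk-1", node: "node-2"}, // skipped: operation for disk-1 still pending
	})
}
```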
Force-pushed from 0f35498 to f4a46c8
I'm sorry for all the build failures. Locally everything was working fine, even with
@k8s-bot bazel test this

@k8s-bot unit test this
Force-pushed from b90d1d9 to f978727
-		!exponentialbackoff.IsExponentialBackoff(err) {
-		// Ignore nestedpendingoperations.IsAlreadyExists && exponentialbackoff.IsExponentialBackoff errors, they are expected.
+	if err != nil && !exponentialbackoff.IsExponentialBackoff(err) {
+		// Ignore exponentialbackoff.IsExponentialBackoff errors, they are expected.
I think it is not necessary to remove !nestedpendingoperations.IsAlreadyExists(err) since this error might still be returned.
As the reconciler loop is the only place where operations might be started, and the loop is a single goroutine, I would expect that it's not possible to have this error returned when a previous IsOperationPending call returned false. If it still happens, there is a bug and I would prefer to see the error in the log.
Or am I missing something?
You are right. It should not happen.
if rc.isMultiAttachForbidden(volumeToAttach.VolumeSpec) {
	nodes := rc.actualStateOfWorld.GetNodesForVolume(volumeToAttach.VolumeName)
	if len(nodes) > 0 {
		glog.V(5).Infof("Volume %q is already exclusively attached to node %q and can't be attached to %q", volumeToAttach.VolumeName, nodes, volumeToAttach.NodeName)
I suggest making this log level 4.
Done.
Force-pushed from f978727 to 23d7c87
lgtm. @saad-ali you want to take a look too?
I just found #39055 because someone on Slack posted a link to it. It looks like that PR introduced the issue/question #39791. If this PR gets merged, #39791 should probably be fixed as well, because the attach/detach controller won't try to attach the volume to the new node before the detach has finished. This would however require adding Cinder volumes to the check in isMultiAttachForbidden, the same way Azure Disks are checked. If this is ok, I'll add and push this change.
Yes @codablock. If so, Cinder should also be added in isMultiAttachForbidden to forbid multiple attachment.
Force-pushed from 23d7c87 to 58d02c1
I've added Cinder to isMultiAttachForbidden. @saad-ali Ping regarding review ;)
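For readers following along, a hedged, self-contained sketch of what such a check can look like; the volume-type strings, fields, and the simple switch are simplifications of the idea discussed here, not the actual isMultiAttachForbidden implementation, which inspects the Kubernetes volume spec and its access modes.

```go
package main

import "fmt"

// Simplified view of a volume spec; illustrative only.
type volumeSpec struct {
	volumeType  string   // e.g. "azure-disk", "cinder", "aws-ebs", "nfs"
	accessModes []string // e.g. "ReadWriteOnce", "ReadWriteMany"
}

// isMultiAttachForbidden sketches the check discussed in this thread: forbid
// multi-attach for volume plugins that only support a single node at a time
// (AWS EBS, Azure Disk, Cinder), and for volumes whose only access mode is
// ReadWriteOnce.
func isMultiAttachForbidden(spec volumeSpec) bool {
	switch spec.volumeType {
	case "aws-ebs", "azure-disk", "cinder":
		return true
	}
	for _, mode := range spec.accessModes {
		if mode != "ReadWriteOnce" {
			// At least one access mode allows use from more than one node.
			return false
		}
	}
	// Only ReadWriteOnce modes were found; forbid multi-attach. If no access
	// modes are known, err on the side of allowing the attach.
	return len(spec.accessModes) > 0
}

func main() {
	fmt.Println(isMultiAttachForbidden(volumeSpec{volumeType: "cinder"}))                                          // true
	fmt.Println(isMultiAttachForbidden(volumeSpec{volumeType: "nfs", accessModes: []string{"ReadWriteMany"}}))     // false
	fmt.Println(isMultiAttachForbidden(volumeSpec{volumeType: "some-pv", accessModes: []string{"ReadWriteOnce"}})) // true
}
```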
@k8s-bot kops aws e2e test this
Taking a look

@saad-ali @jingxu97 It turned out there is "volume conflict checking" in the scheduler which was missing for Azure, resulting in all the problems I tried to solve with this PR. I'm not sure how to proceed now. Please see #41398 (comment) for more details. No matter how we continue with this PR, a fix in the scheduler is needed.
@codablock I'd like to better understand the problems that Azure volumes are having. It sounds like some bigger infrastructure changes may be needed to fix the issues. We are about 10 minutes away from code freeze for 1.6. While we can get small bug fixes in during code freeze, the big changes will have to wait for 1.7. But I really want to make sure we don't miss 1.7. So how about we set up a meeting early in the 1.7 dev cycle (I'm thinking 2nd week of April) to discuss what the pain points are, propose some changes, and hopefully get them implemented in 1.7? CC @kubernetes/sig-storage-misc
@saad-ali As I won't have time to work on Kubernetes and/or Azure in the next few months, I would suggest that @colemickens and @khenidak take over the discussion and do a meeting with you if required. @khenidak is working on Azure managed disks, and as I understood it, many of the problems I encountered with Azure Disks and tried to fix with PRs like this one are also fixed in his work. So maybe this PR isn't even needed anymore.
One of the items on the storage backlog for Q2 2017 (v1.7) is improving Azure support. We should consider this PR as part of that effort. CC @rootfs @chakri-nelluri for review
Ack. Tracking both PRs

REPLACED BY #45346
Automatic merge from submit-queue

Don't try to attach volumes which are already attached to other nodes

This PR is a replacement for #40148. I was not able to push fixes and rebases to the original branch as I don't have access to the Github organization anymore.

CC @saad-ali You probably have to update the PR link in [Q2 2017 (v1.7)](https://docs.google.com/spreadsheets/d/1t4z5DYKjX2ZDlkTpCnp18icRAQqOE85C1T1r2gqJVck/edit#gid=14624465)

I assume the PR will need a new "ok to test"

**ORIGINAL PR DESCRIPTION**

This PR fixes an issue with the attach/detach volume controller. There are cases where the `desiredStateOfWorld` contains the same volume for multiple nodes, resulting in the attach/detach controller attaching this volume to multiple nodes. This of course fails for volumes like AWS EBS, Azure Disks, ...

I observed this situation on Azure when using Azure Disks and replication controllers which start to reschedule PODs. When you delete a POD that belongs to an RC, the RC will immediately schedule a new POD on another node. This results in a short time (max a few seconds) where you have 2 PODs which try to attach/mount the same volume on different nodes. As the old POD is still alive, the attach/detach controller does not try to detach the volume and starts to attach the volume to the new POD immediately.

This behavior was probably not noticed before on other clouds as the bogus attempt to attach probably fails pretty fast and thus goes unnoticed. As the situation with the 2 PODs disappears after a few seconds, a detach for the old POD is initiated and thus the new POD can attach successfully.

On Azure however, attaching and detaching takes quite long, so the first bogus attach attempt already eats up much time.

When attaching fails on Azure and reports that the volume is already attached somewhere else, the cloud provider immediately does a detach call for the same volume+node it tried to attach to. This is done to make sure the failed attach request is aborted immediately. You can find this here: https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/azure/azure_storage.go#L74

The complete flow of attach->fail->abort eats up valuable time and the attach/detach controller cannot proceed with other work while this is happening. This means that if the old POD disappears in the meantime, the controller can't even start the detach for the volume, which delays the whole process of rescheduling and reattaching.

Also, I and other people have observed very strange behavior where disks ended up being "attached" to multiple VMs at the same time as reported by the Azure Portal. This results in the controller failing to reattach forever. It's hard to figure out why and when this happens and there is no known reproducer yet. I can imagine however that the described behavior correlates with what I described above.

I was not sure if there are actually cases where it is perfectly fine to have a volume mounted to multiple PODs/nodes. At least technically, this should be possible with network based volumes, e.g. nfs. Can someone with more knowledge about volumes help me here? I may need to add a check before skipping attaching in `reconcile`.

CC @colemickens @rootfs

```release-note
Don't try to attach volume to new node if it is already attached to another node and the volume does not support multi-attach.
```
}

if rc.isMultiAttachForbidden(volumeToAttach.VolumeSpec) {
	nodes := rc.actualStateOfWorld.GetNodesForVolume(volumeToAttach.VolumeName)
If a PV is created with accessMode 'ReadWriteOnce' and a pod using that PV is running on node1, then when node1 goes down and the pod is scheduled to node2, the new pod waits for the attach, but the old pod is still not detached as long as node1 stays down. isMultiAttachForbidden and GetNodesForVolume will then produce:
attachdetach-controller Multi-Attach error for volume "pvc-d0fde86c-8661-11e9-b873-0800271c9f15" Volume is already used by pod
EDIT: REPLACED BY #45346
This PR fixes an issue with the attach/detach volume controller. There are cases where the `desiredStateOfWorld` contains the same volume for multiple nodes, resulting in the attach/detach controller attaching this volume to multiple nodes. This of course fails for volumes like AWS EBS, Azure Disks, ...

I observed this situation on Azure when using Azure Disks and replication controllers which start to reschedule PODs. When you delete a POD that belongs to an RC, the RC will immediately schedule a new POD on another node. This results in a short time (max a few seconds) where you have 2 PODs which try to attach/mount the same volume on different nodes. As the old POD is still alive, the attach/detach controller does not try to detach the volume and starts to attach the volume to the new POD immediately.

This behavior was probably not noticed before on other clouds as the bogus attempt to attach probably fails pretty fast and thus goes unnoticed. As the situation with the 2 PODs disappears after a few seconds, a detach for the old POD is initiated and thus the new POD can attach successfully.

On Azure however, attaching and detaching takes quite long, so the first bogus attach attempt already eats up much time.

When attaching fails on Azure and reports that the volume is already attached somewhere else, the cloud provider immediately does a detach call for the same volume+node it tried to attach to. This is done to make sure the failed attach request is aborted immediately. You can find this here: https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/azure/azure_storage.go#L74

The complete flow of attach->fail->abort eats up valuable time and the attach/detach controller cannot proceed with other work while this is happening. This means that if the old POD disappears in the meantime, the controller can't even start the detach for the volume, which delays the whole process of rescheduling and reattaching.

Also, I and other people have observed very strange behavior where disks ended up being "attached" to multiple VMs at the same time as reported by the Azure Portal. This results in the controller failing to reattach forever. It's hard to figure out why and when this happens and there is no known reproducer yet. I can imagine however that the described behavior correlates with what I described above.

I was not sure if there are actually cases where it is perfectly fine to have a volume mounted to multiple PODs/nodes. At least technically, this should be possible with network based volumes, e.g. nfs. Can someone with more knowledge about volumes help me here? I may need to add a check before skipping attaching in `reconcile`.

CC @colemickens @rootfs
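To summarize the behavior this PR describes, here is a simplified, self-contained Go sketch of the attach decision: a volume that does not support multi-attach and is already attached to another node is skipped until a detach frees it. The types and the multiAttachAllowed flag are local stand-ins, not the controller's real desired/actual state-of-world caches.

```go
package main

import "fmt"

// Local stand-ins for the controller state; illustrative only.
type attachWanted struct {
	volume             string
	node               string
	multiAttachAllowed bool // e.g. read-only or NFS-like volumes
}

type attachedNodes map[string][]string // volume -> nodes it is attached to

// attachedElsewhere reports whether the volume is attached to any node other
// than the requested one.
func attachedElsewhere(state attachedNodes, volume, node string) bool {
	for _, n := range state[volume] {
		if n != node {
			return true
		}
	}
	return false
}

// attachLoop sketches the fix: skip doomed attach attempts for single-attach
// volumes instead of issuing them against the cloud provider.
func attachLoop(state attachedNodes, wanted []attachWanted) {
	for _, w := range wanted {
		if !w.multiAttachAllowed && attachedElsewhere(state, w.volume, w.node) {
			fmt.Printf("skip %s -> %s: already attached to %v\n", w.volume, w.node, state[w.volume])
			continue
		}
		fmt.Printf("attach %s -> %s\n", w.volume, w.node)
		state[w.volume] = append(state[w.volume], w.node)
	}
}

func main() {
	state := attachedNodes{"disk-a": {"node-1"}}
	attachLoop(state, []attachWanted{
		{volume: "disk-a", node: "node-2", multiAttachAllowed: false}, // skipped until node-1 detaches
		{volume: "nfs-share", node: "node-2", multiAttachAllowed: true},
	})
}
```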