WIP: Give RWO PVs node affinity once they have attached #47749
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: wongma7. Associated issue: 30085. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these OWNERS files. |
|
We might want to do it this way instead of doing it for all RWO volumes: https://github.com/kubernetes/community/pull/124/files/16f88595883a7461010b6708fb0e0bf1b046cf33#r121376117. In any case, the basic question is whether we can use the node affinity label as the "label" here, since it already accomplishes a lot of what we want: "Attach and detach may …" |
|
@wongma7: The following test failed:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
In the example scenario, shouldn't the attach to node B have failed because it was already attached to node A? When you detach, you have to remove the node affinity, right? My main concern is that we would be mixing a user-specified attribute with a system-managed attribute. We wouldn't want to delete the node affinity if it was user specified. What happens if the user changes a system-specified attribute? There's also the situation where, if the user specifies more than one node in their node affinity, they would not be protected from this scenario. |
|
In the example scenario the PV is iSCSI. I don't know much about iSCSI tbh, and maybe this problem is in fact unique to iSCSI; I just know there is nothing preventing multiple writers from mounting a LUN and corrupting the filesystem. The admin has declared that a PV is RWO because they know better than kube about the underlying infra, and the ask here is that kube should enforce it for them to prevent unintentional corruption. Yes, we should remove the label after detach, but only if we set it in the first place; maybe we can track that with yet another annotation. And I agree we must make sure not to clobber existing annotations. As for the case where they have already specified more than one node on the PV: should we remove the other nodes and add them back at detach time? I'm leaning toward no; in general we should tend to do nothing in these edge cases, as we do today, and just let the majority case (blank RWO PV) reap the benefits. (#30085 is a far more difficult problem about when the user of a PVC declares they want RWO... somehow the PV needs to "become" exclusively RWO before attach time in order to benefit from this PR.) |
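Roughly, that bookkeeping could look like the sketch below. This is a minimal illustration only: both annotation keys are placeholders rather than names from this PR, and the alpha local-storage key in particular is an assumption.

```go
package pinsketch

import v1 "k8s.io/api/core/v1"

const (
	nodeAffinityAnnotation = "volume.alpha.kubernetes.io/node-affinity"   // assumed local-storage alpha key
	systemPinnedAnnotation = "example.kubernetes.io/affinity-set-by-kube" // hypothetical marker
)

// markPinned records the node affinity for an attached RWO PV, but only if the
// user has not already set one, and leaves a marker so we know we own it.
func markPinned(pv *v1.PersistentVolume, affinityJSON string) {
	if _, userSet := pv.Annotations[nodeAffinityAnnotation]; userSet {
		return // never clobber a user-specified affinity
	}
	if pv.Annotations == nil {
		pv.Annotations = map[string]string{}
	}
	pv.Annotations[nodeAffinityAnnotation] = affinityJSON
	pv.Annotations[systemPinnedAnnotation] = "true"
}

// unpinOnDetach removes the affinity only when it was added by the system.
func unpinOnDetach(pv *v1.PersistentVolume) {
	if pv.Annotations[systemPinnedAnnotation] != "true" {
		return
	}
	delete(pv.Annotations, nodeAffinityAnnotation)
	delete(pv.Annotations, systemPinnedAnnotation)
}
```

Tracking system ownership in a separate marker is what would let detach undo the pin without ever touching a user-specified affinity.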
|
Hmm, on second thought I think we might be able to solve #30085 too, since we have the PV's claimRef and can check whether it asked specifically for RWO. |
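As a rough illustration of that check (a sketch only: fetching the bound PVC via the claimRef and all error handling are left to the caller, and the helper names are made up):

```go
package rwosketch

import v1 "k8s.io/api/core/v1"

// onlyRWO reports whether the given access modes are exactly ReadWriteOnce.
func onlyRWO(modes []v1.PersistentVolumeAccessMode) bool {
	if len(modes) == 0 {
		return false
	}
	for _, m := range modes {
		if m != v1.ReadWriteOnce {
			return false
		}
	}
	return true
}

// shouldEnforceRWO decides whether to pin this PV to a single node: either the
// PV itself is RWO-only, or the claim it is bound to (looked up by the caller
// via pv.Spec.ClaimRef) asked only for RWO.
func shouldEnforceRWO(pv *v1.PersistentVolume, boundClaim *v1.PersistentVolumeClaim) bool {
	if onlyRWO(pv.Spec.AccessModes) {
		return true
	}
	return pv.Spec.ClaimRef != nil && boundClaim != nil && onlyRWO(boundClaim.Spec.AccessModes)
}
```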
|
I think the bug we were debugging was using Fibre Channel, but it's the same story as iSCSI. In that case, a user was reliably able to corrupt an xfs filesystem by draining a node. The user was using a replication controller and hence may have had the volume mounted on two or more nodes at once. This is a pretty serious problem that I think we should address. I think we can get about 80% of the way there by using node affinity (at mount time) and preventing the multi-attach operation via the check introduced here: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L139 Alternatively, we discussed the possibility of adding a field (or annotation) on PVCs that gets populated at mount time and is checked before scheduling a pod using the same PVC. cc @smarterclayton @thockin for input because you opened the original issue. |
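For context, the guard being referenced works along these lines. This is a much-simplified sketch with stand-in types, not the actual reconciler code:

```go
package attachsketch

import "fmt"

// attachedNodes maps a volume name to the nodes it is currently attached to,
// standing in for the attach/detach controller's actual state of world.
type attachedNodes map[string][]string

// checkMultiAttach refuses to attach a volume that does not support
// multi-attach when it is already attached to a different node.
func checkMultiAttach(state attachedNodes, volumeName, nodeName string, supportsMultiAttach bool) error {
	for _, attached := range state[volumeName] {
		if attached == nodeName {
			return nil // already attached to this node; nothing to do
		}
		if !supportsMultiAttach {
			return fmt.Errorf("volume %q is already exclusively attached to node %q", volumeName, attached)
		}
	}
	return nil
}
```

The point is just that a second attach of a non-multi-attachable volume to a different node is refused before any mount can happen.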
|
Also ref - #47333 |
|
Also, if we are going to update the PV (or PVC) on mount (as this PR does), we may be able to use that information as a single source of truth for whether a PVC is being used in a pod. That would also help us with #45143. |
|
Do you want to also solve the issue of multiple pods on the same node writing to the same volume? Node affinity will not help for that case. Instead, you'll probably need some sort of refcounting |
|
No, imo we should not change that behaviour. |
|
Scratch that, maybe we should, but we can't use access modes to do it and it needs to be optional somehow. The RWO access mode is the only mechanism we've given users to express what their storage can tolerate, but we have documented RWO to mean per-node. So we would need some other new way for users to specify stricter single-pod-writer-only requirements (more access modes? This was proposed in a discussion somewhere).
I mean, in my example scenario, what if the user simply deleted the RC pod, or more likely scaled it down and up a bit too fast? Corruption could still occur; their storage can only tolerate one writer.
So imo we have to take small steps, starting with something like this that leverages existing node affinity. Strictly speaking, as described in the pod safety proposal, we need fencing for 100% safety.
|
If our end goal is to enforce mount RWO, then the solution for that could also solve attach RWO. |
|
@msau42 not entirely, right? Currently in this PR we are checking for affinity at mount time. So if the underlying storage provider can't safeguard against multiple attach, the volume would already be attached by that time. So we would need a separate check to prevent a RWO attach, right? We already kind of do (in the code I linked above), but it relies on the Actual State of World to check the state of the PVC. |
|
@msau42 Right, so depending on how you look at it :) maybe we should design & fix them together rather than fix the attach case first here and do the mount case later. BTW, @rootfs mentioned today that we already have a "NoDiskConflict" scheduler predicate! But it looks to me like it hasn't been updated in a while: it only looks at pod inline volume specs, not PVs, though it does check that e.g. the same iSCSI LUN path is not used by two pods. It even carries a "TODO: migrate this into some per-volume specific code?" comment.
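For reference, the kind of inline-volume comparison that predicate performs looks roughly like this (a simplified sketch, not the actual predicate code):

```go
package diskconflictsketch

import v1 "k8s.io/api/core/v1"

// iscsiConflict reports whether two inline volumes point at the same iSCSI
// target/LUN with at least one of them writable.
func iscsiConflict(a, b v1.Volume) bool {
	if a.ISCSI == nil || b.ISCSI == nil {
		return false
	}
	same := a.ISCSI.TargetPortal == b.ISCSI.TargetPortal &&
		a.ISCSI.IQN == b.ISCSI.IQN &&
		a.ISCSI.Lun == b.ISCSI.Lun
	return same && !(a.ISCSI.ReadOnly && b.ISCSI.ReadOnly)
}

// podsConflict compares every inline volume of the incoming pod against every
// inline volume of an already-scheduled pod. Volumes referenced through a
// PersistentVolumeClaim never appear here, which is the gap discussed above.
func podsConflict(incoming, existing *v1.Pod) bool {
	for _, a := range incoming.Spec.Volumes {
		for _, b := range existing.Spec.Volumes {
			if iscsiConflict(a, b) {
				return true
			}
		}
	}
	return false
}
```

Because it only inspects the inline sources on the pod spec, a pod that reaches the same LUN through a PVC/PV is invisible to it.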
|
I wonder whether we should come up with a PV affinity scheduler predicate similar to the anti-affinity predicate NoDiskConflict. |
|
@rootfs I am leaning heavily towards no, simply because there is too much overlap with the new PV node affinity predicate. To me, the PV node affinity label is the closest thing to what Clayton described in the proposal and the easiest to reason about. Using it doesn't preclude us from doing the plugin-specific checks done in NoDiskConflict; we will just do them in a more reasonable location at a more reasonable time. |
|
@wongma7 how do we deal with the conflict if the anti-affinity predicate is turned on? |
|
adding #45346 |
|
The anti-affinity predicate only ever looks at the pod's inline volume spec, never volume.PersistentVolumeClaim, so a conflict can't occur since here we only look at PVCs.
|
Ok, I see it now. |
|
@wongma7 it doesn't seem to me that the node has permission to update the PV.
I don't recall that. Are you referring to the PV spec or status? |
Spoke with @liggitt offline. It is generally fine to allow updates of the status field of a PV/PVC that is strongly connected to the node. But since PV and PVC specs don't have a node field in them, the node authorizer can't determine the relationship between a node and a PV/PVC, and hence can't allow the node to update the PV/PVC. :( |
|
I'm investigating the possibility of adding some sort of scheduled node annotation to the pv, as part of the local storage scheduling and binding integration. I'll be doing some prototyping to see if it could work. |
|
So the thinking now is that we can't piggyback off the PV node annotations added for local storage like I want; we need to fix this "properly", more closely following kubernetes/community#124. The basic idea is that preventing a pod from scheduling is not sufficient: we need locking of RWO PVs*, via the attach/detach controller (adding a "logical attach" for plugins that are currently not … |
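A hypothetical illustration of what controller-side "logical attach" locking could mean. The annotation key, and the idea of storing the lock as an annotation at all, are assumptions for this sketch; the actual proposal in kubernetes/community#124 may differ, and persisting the change through the API server (with conflict retries) is elided:

```go
package lockingsketch

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// attachedToAnnotation is a hypothetical key used only for this sketch.
const attachedToAnnotation = "example.kubernetes.io/logically-attached-to"

// logicalAttach claims the PV for nodeName, or fails if another node already
// holds the claim, even for plugins that have no real attach operation.
func logicalAttach(pv *v1.PersistentVolume, nodeName string) error {
	if owner, ok := pv.Annotations[attachedToAnnotation]; ok && owner != nodeName {
		return fmt.Errorf("PV %s is logically attached to node %s", pv.Name, owner)
	}
	if pv.Annotations == nil {
		pv.Annotations = map[string]string{}
	}
	pv.Annotations[attachedToAnnotation] = nodeName
	return nil
}

// logicalDetach releases the claim once the volume is no longer mounted.
func logicalDetach(pv *v1.PersistentVolume, nodeName string) {
	if pv.Annotations[attachedToAnnotation] == nodeName {
		delete(pv.Annotations, attachedToAnnotation)
	}
}
```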
|
@wongma7 PR needs rebase |
|
Summing up conversations I've eavesdropped on:
|
Super WIP. It is a bug for kube not to enforce the RWO property of PVs; the most relevant discussion is, I think, #30085, but it is quite old. Situation recently encountered: an RC has a pod with a PV on node A; node A is drained; a new pod is started on node B and simultaneously writes to the PV while the old pod is still running & Terminating; data is corrupted.
AFAICS the node affinity added along with local storage can give us the enforcement we want, for free. I've created this PR just to see if it's feasible to "co-opt" the new node affinity feature and get some feedback/discussion. The idea is to automatically add the node affinity label, from somewhere in the kubelet volume manager, to prevent the RWO PV from being used by pods on another node (see the sketch after the link below).
see also: kubernetes/community#124
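A minimal sketch of what "pinning" an attached RWO PV to its node could look like, assuming the alpha node-affinity annotation that local volumes used at the time (the key name and JSON shape are assumptions, and the API update itself is elided):

```go
package pinaffinity

import (
	"encoding/json"

	v1 "k8s.io/api/core/v1"
)

// nodeAffinityAnnotation is assumed to match the alpha annotation used by
// local volumes at the time; the key name is an assumption for this sketch.
const nodeAffinityAnnotation = "volume.alpha.kubernetes.io/node-affinity"

// pinPVToNode records "this PV may only be used on nodeName" on the PV object.
// Persisting the change through the API server is left to the caller.
func pinPVToNode(pv *v1.PersistentVolume, nodeName string) error {
	affinity := v1.NodeAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
			NodeSelectorTerms: []v1.NodeSelectorTerm{{
				MatchExpressions: []v1.NodeSelectorRequirement{{
					Key:      "kubernetes.io/hostname",
					Operator: v1.NodeSelectorOpIn,
					Values:   []string{nodeName},
				}},
			}},
		},
	}
	raw, err := json.Marshal(&affinity)
	if err != nil {
		return err
	}
	if pv.Annotations == nil {
		pv.Annotations = map[string]string{}
	}
	pv.Annotations[nodeAffinityAnnotation] = string(raw)
	return nil
}
```

A follow-up patch of the PV through the API server, plus removing the pin on detach only when we set it (as discussed earlier in the thread), would complete the loop.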