
Conversation

@wongma7 (Contributor) commented Jun 19, 2017

Super WIP. It is a bug that kube does not enforce the RWO property of PVs; the most relevant discussion is, I think, #30085, but it is quite old. Situation recently encountered: an RC has a pod with a PV on node A; node A is drained; a new pod is started on node B and simultaneously writes to the PV while the old pod is still running & Terminating; data is corrupted.

AFAICS the node affinity support added along with local storage can give us the enforcement we want, for free. I've created this PR just to see whether it's feasible to "co-opt" the new node affinity feature and to get some feedback/discussion. The idea is to, from somewhere in the kubelet volume manager, automatically add the node affinity label to prevent the RWO PV from being used by pods on another node.

see also: kubernetes/community#124
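
For concreteness, here is a minimal sketch of what the kubelet-side pinning described above could look like, assuming the alpha `volume.alpha.kubernetes.io/node-affinity` annotation that the local storage work used at the time; the annotation key, the `pinPVToNode` helper, and the hostname-label assumption are illustrative rather than taken from this PR:

```go
package main

import (
	"encoding/json"
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// Assumed alpha annotation key for PV node affinity; the real key and format
// should come from the local-storage design, not this sketch.
const nodeAffinityAnnotation = "volume.alpha.kubernetes.io/node-affinity"

// pinPVToNode serializes a NodeAffinity matching only nodeName onto the PV's
// annotations, so the PV node-affinity check would keep pods that use this
// RWO volume off every other node. It is a no-op if an affinity is already
// set (e.g. by the user).
func pinPVToNode(pv *v1.PersistentVolume, nodeName string) (bool, error) {
	if pv.Annotations == nil {
		pv.Annotations = map[string]string{}
	}
	if _, exists := pv.Annotations[nodeAffinityAnnotation]; exists {
		return false, nil
	}
	affinity := v1.NodeAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
			NodeSelectorTerms: []v1.NodeSelectorTerm{{
				MatchExpressions: []v1.NodeSelectorRequirement{{
					// Assumes the kubernetes.io/hostname label value equals the node name.
					Key:      "kubernetes.io/hostname",
					Operator: v1.NodeSelectorOpIn,
					Values:   []string{nodeName},
				}},
			}},
		},
	}
	raw, err := json.Marshal(affinity)
	if err != nil {
		return false, err
	}
	pv.Annotations[nodeAffinityAnnotation] = string(raw)
	return true, nil
}

func main() {
	pv := &v1.PersistentVolume{}
	pinned, err := pinPVToNode(pv, "node-a")
	fmt.Println(pinned, err, pv.Annotations[nodeAffinityAnnotation])
}
```

The kubelet would still need to persist the updated PV with an API update, which is where the node-authorizer concern raised later in this thread comes in.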

@k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) label Jun 19, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wongma7
We suggest the following additional approver: dchen1107

Assign the PR to them by writing /assign @dchen1107 in a comment when ready.

Associated issue: 30085

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot added the size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.) and release-note-label-needed labels Jun 19, 2017
@wongma7 (Contributor, Author) commented Jun 19, 2017

@gnufied @rootfs @msau42 idk who else to ping, not sure who was involved in design of local storage & PV node affinity. This code works as far as I can tell, but WIP!!

@wongma7 (Contributor, Author) commented Jun 19, 2017

We might want to do it this way instead of doing it for all RWO volumes: https://github.com/kubernetes/community/pull/124/files/16f88595883a7461010b6708fb0e0bf1b046cf33#r121376117

In any case, the basic question is whether we can use the node affinity label as the "label" here, since it already accomplishes a lot of what we want: "Attach and detach may be recorded on the PVC or PV as a new field or materialized via the selection labels." https://github.com/smarterclayton/community/blob/16f88595883a7461010b6708fb0e0bf1b046cf33/contributors/design-proposals/pod-safety.md#storage-consistency

@k8s-ci-robot (Contributor) commented Jun 19, 2017

@wongma7: The following test failed, say /retest to rerun them all:

Test name: pull-kubernetes-verify
Commit: 8c01466
Details: link
Rerun command: /test pull-kubernetes-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@msau42 (Member) commented Jun 19, 2017

In the example scenario, shouldn't the attach to node B have failed because it was already attached to node A?

When you detach, you have to remove the node affinity right? My main concern is that we would be mixing a user-specified attribute with a system-managed attribute. We wouldn't want to delete the node affinity if it was user specified. What happens if the user changes a system-specified attribute?

There's also the situation where, if the user specifies more than one node in their node affinity, they would not be protected in this scenario.

@wongma7 (Contributor, Author) commented Jun 19, 2017

In the example scenario the PV is iSCSI. I don't know much about iSCSI tbh, and maybe this problem is in fact unique to iSCSI; I just know there is nothing preventing multiple writers from mounting a LUN and corrupting the filesystem. The admin has declared that a PV is RWO because they know better than kube about the underlying infra, and the ask here is that kube should enforce it for them to prevent unintentional corruption.

Yes, we should remove the label after detach, but only if we set it in the first place. Maybe we can track that with yet another annotation. And I agree we must make sure not to clobber existing annotations.

As for the case where they have already specified more than one node on the PV: should we remove the other nodes and add them back at detach time? I'm leaning toward no; in general we should tend to do nothing in these edge cases, as we do today, and just let the majority case (blank RWO PV) reap the benefits.

(#30085 is a far more difficult problem about when the user of a PVC declares they want RWO...somehow the PV needs to "become" exclusively RWO before attach time in order to benefit from this PR.)
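
A small sketch of the "track it with yet another annotation" idea from the comment above: a hypothetical marker annotation records that kubelet set the affinity, so detach only removes what kubelet added. Both annotation keys here are assumptions, not code from this PR:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

const (
	// Assumed alpha node-affinity annotation key (same assumption as the earlier sketch).
	nodeAffinityAnnotation = "volume.alpha.kubernetes.io/node-affinity"
	// Hypothetical marker recording that kubelet, not the user, set the affinity.
	affinityManagedByKubelet = "volume.alpha.kubernetes.io/node-affinity-managed-by-kubelet"
)

// unpinPVFromNode clears the node affinity on detach, but only when the marker
// shows kubelet added it; a user-specified affinity is left untouched.
func unpinPVFromNode(pv *v1.PersistentVolume) {
	if pv.Annotations[affinityManagedByKubelet] != "true" {
		return
	}
	delete(pv.Annotations, nodeAffinityAnnotation)
	delete(pv.Annotations, affinityManagedByKubelet)
}

func main() {
	pv := &v1.PersistentVolume{}
	pv.Annotations = map[string]string{
		nodeAffinityAnnotation: "(user-specified affinity)", // no marker: kubelet did not set this
	}
	unpinPVFromNode(pv)
	fmt.Println(pv.Annotations) // the user's affinity survives detach
}
```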

@wongma7 (Contributor, Author) commented Jun 19, 2017

Hmm, on second thought I think we might be able to solve #30085 too, since we have the PV's claimRef and can check whether the claim asked specifically for RWO. The one other main issue with all this is that it may be crossing a line regarding what responsibility the kubelet volume manager has.
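
A rough sketch of the claimRef check suggested here, using current client-go signatures; the `claimWantsExclusiveRWO` helper is illustrative only:

```go
package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// claimWantsExclusiveRWO follows the PV's claimRef to the bound PVC and
// reports whether the claim asked for ReadWriteOnce and nothing broader,
// i.e. the case where automatically pinning the PV to a node would match
// the user's stated intent.
func claimWantsExclusiveRWO(ctx context.Context, client kubernetes.Interface, pv *v1.PersistentVolume) (bool, error) {
	ref := pv.Spec.ClaimRef
	if ref == nil {
		return false, nil // unbound PV; nothing to enforce yet
	}
	pvc, err := client.CoreV1().PersistentVolumeClaims(ref.Namespace).Get(ctx, ref.Name, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, mode := range pvc.Spec.AccessModes {
		if mode != v1.ReadWriteOnce {
			return false, nil // RWX or ROX also requested; don't pin
		}
	}
	return len(pvc.Spec.AccessModes) > 0, nil
}

func main() {
	// Wiring up a real clientset is omitted here; this file only shows the check itself.
}
```

Gating the pinning on what the bound claim requested keeps the behaviour tied to user intent rather than to the PV's capabilities.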

@gnufied (Member) commented Jun 20, 2017

I think the bug we were debugging was using Fibre Channel, but it's the same story as iSCSI. In that case, a user was able to reliably corrupt an XFS file system by draining a node. The user was using a replication controller and hence may have had the volume mounted on two or more nodes at once. This is a pretty serious problem that I think we should address.

I think we can get about 80% of the way there by using node affinity (at mount time) and preventing the multi-attach operation via the check introduced here - https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L139

Alternatively, we discussed the possibility of adding a field (or annotation) on PVCs that gets populated at mount time and is checked before scheduling a pod that uses the same PVC.

cc @smarterclayton @thockin for input because you opened the original issue.
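
For reference, a plugin-agnostic sketch of the kind of multi-attach guard linked above; the `canAttach` helper and its `attachedNodes` bookkeeping are stand-ins for whatever the attach/detach controller's actual state of world provides, not the real reconciler code:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// canAttach refuses to attach a ReadWriteOnce-only volume to a node when it is
// already attached elsewhere. attachedNodes is whatever bookkeeping the
// controller keeps (actual state of world, a status field, etc.).
func canAttach(pv *v1.PersistentVolume, attachedNodes []string, targetNode string) error {
	rwoOnly := len(pv.Spec.AccessModes) > 0
	for _, m := range pv.Spec.AccessModes {
		if m != v1.ReadWriteOnce {
			rwoOnly = false
		}
	}
	if !rwoOnly {
		return nil // shared or unspecified access modes; multi-attach is allowed
	}
	for _, node := range attachedNodes {
		if node != targetNode {
			return fmt.Errorf("volume %s is ReadWriteOnce and already attached to node %s", pv.Name, node)
		}
	}
	return nil
}

func main() {
	// Example: an RWO PV already attached to node-a must not attach to node-b.
	pv := &v1.PersistentVolume{}
	pv.Name = "pv-1"
	pv.Spec.AccessModes = []v1.PersistentVolumeAccessMode{v1.ReadWriteOnce}
	fmt.Println(canAttach(pv, []string{"node-a"}, "node-b"))
}
```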

@gnufied (Member) commented Jun 20, 2017

Also ref - #47333

@gnufied (Member) commented Jun 20, 2017

Also, if we are going to update the PV (or PVC) on mount (as this PR does), we may be able to use that information as a single source of truth for whether a PVC is being used by a pod. That would also help us with issue #45143.

@msau42 (Member) commented Jun 20, 2017

Do you want to also solve the issue of multiple pods on the same node writing to the same volume? Node affinity will not help for that case. Instead, you'll probably need some sort of refcounting.

@wongma7 (Contributor, Author) commented Jun 20, 2017

No, imo we should not change that behaviour.

@wongma7 (Contributor, Author) commented Jun 20, 2017 via email

@msau42 (Member) commented Jun 20, 2017

If our end goal is to enforce mount RWO, then the solution for that could also solve attach RWO.

@gnufied (Member) commented Jun 20, 2017

@msau42 not entirely, right? Currently in this PR we are checking for affinity at mount time. So if the underlying storage provider can't safeguard against multiple attach, the volume would already be attached by that time.

So we would need a separate check to prevent an RWO attach, right? We already kinda have one (in the code I linked above), but it relies on the Actual State of World to check the state of the PVC.

@wongma7 (Contributor, Author) commented Jun 20, 2017

@msau42 Right, so depending on how you look at it :) maybe we should design & fix them together rather than fix the attach case first here and do the mount case later.

BTW, @rootfs mentioned today that we already have a "NoDiskConflict" scheduler predicate! But it looks to me like it hasn't been updated in a while: it only looks at pods' inline volume specs, not PVs, though it does check that e.g. the same iSCSI LUN path is not used by two pods. Its "TODO: migrate this into some per-volume specific code?" comment on

func NoDiskConflict(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {

sounds a lot like https://github.com/kubernetes/community/pull/124/files/16f88595883a7461010b6708fb0e0bf1b046cf33#r121376117. So it's yet another thing to consider :/
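
For illustration, a scheduler-framework-agnostic sketch of what a PVC-aware NoDiskConflict-style check could look like; `pvcConflict` and the `inUseElsewhere` map are hypothetical, and populating that map reliably is exactly the hard part being debated in this thread:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// pvcConflict reports true when the pod references a PVC (by namespace/name)
// that is already in use by pods running on other nodes. This is the piece
// the existing predicate lacks, since it only inspects inline volume specs.
func pvcConflict(pod *v1.Pod, inUseElsewhere map[string]bool) bool {
	for _, vol := range pod.Spec.Volumes {
		if vol.PersistentVolumeClaim == nil {
			continue
		}
		key := pod.Namespace + "/" + vol.PersistentVolumeClaim.ClaimName
		if inUseElsewhere[key] {
			return true
		}
	}
	return false
}

func main() {
	pod := &v1.Pod{}
	pod.Namespace = "default"
	pod.Spec.Volumes = []v1.Volume{{
		Name: "data",
		VolumeSource: v1.VolumeSource{
			PersistentVolumeClaim: &v1.PersistentVolumeClaimVolumeSource{ClaimName: "my-claim"},
		},
	}}
	fmt.Println(pvcConflict(pod, map[string]bool{"default/my-claim": true})) // true
}
```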

@rootfs (Contributor) commented Jun 20, 2017

I wonder whether we should come up with a PV affinity scheduler predicate similar to the anti-affinity predicate NoDiskConflict.

@wongma7 (Contributor, Author) commented Jun 20, 2017

@rootfs I am leaning heavily towards no, simply because there is too much overlap with the new PV node affinity predicate. To me, the PV node affinity label is the closest thing to what Clayton described in the proposal and the easiest to reason about. Using it doesn't preclude us from doing the plugin-specific checks done in NoDiskConflict; we would just do them in a more reasonable location at a more reasonable time.

@rootfs (Contributor) commented Jun 20, 2017

@wongma7 how do we deal with the conflict if the anti-affinity predicate is turned on?

@rootfs (Contributor) commented Jun 20, 2017

adding #45346
@codablock

@wongma7 (Contributor, Author) commented Jun 20, 2017 via email

@rootfs (Contributor) commented Jun 20, 2017

> The anti affinity predicate only ever looks at pod volume spec, never volume.PersistentVolumeClaim so conflict can't occur since here we only look at PVCs.

Ok, I see it now.

@rootfs (Contributor) commented Jun 20, 2017

@gnufied (Member) commented Jun 20, 2017

@rootfs that can be changed. @liggitt agreed to allow the node to update PV/PVCs that are attached/mounted on it.

@liggitt (Member) commented Jun 20, 2017

> @liggitt agreed to allow node to update PV/PVCs that are attached/mounted on it.

I don't recall that. Are you referring to the PV spec or status?

@gnufied (Member) commented Jun 20, 2017

> I don't recall that. Are you referring to the PV spec or status?

Spoke with @liggitt offline. It is generally fine to allow updates to the status field of a PV/PVC that is strongly connected to the node. But since PV and PVC specs don't have a node field in them, the node authorizer can't determine the relationship between a node and a PV/PVC, and hence can't allow the node to update the PV/PVC. :(

@msau42 (Member) commented Jun 20, 2017

I'm investigating the possibility of adding some sort of scheduled-node annotation to the PV, as part of the local storage scheduling and binding integration. I'll be doing some prototyping to see if it could work.

@wongma7 (Contributor, Author) commented Jun 20, 2017

So the thinking now is that we can't piggyback off the PV node annotations added for local storage like I want; we need to fix this "properly", more closely following kubernetes/community#124. The basic idea is that preventing a pod from scheduling is not sufficient; we need locking of RWO PVs*, via the attach/detach controller (adding "logical attach" for plugins that are currently not attachable) or a new controller. @gnufied can elaborate if needed. Let's keep this open for now.

@k8s-github-robot

@wongma7 PR needs rebase

@k8s-github-robot added the needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) label Jun 22, 2017
@wongma7 (Contributor, Author) commented Jun 29, 2017

Summing up conversations I've eavesdropped on:

  • we need a "logical attach" operation for currently non-attachable and problematic plugins. This is being implemented as part of the existing attach/detach controller (roughly as described in the pod safety proposal) but could be an all-new controller instead
  • to do the above, we need a way to associate a PV with a node. We already have one in node.status.volumesAttached, but it points in the wrong direction (node → volumes), and having a pv.status.node would be nice for various reasons (see the sketch after this list)
  • the scheduler is not the place to enforce the safety guarantee, but if we really wanted to we could maybe update the existing NoDiskConflict predicate to achieve roughly the same thing as this PR (imo)
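
As a rough illustration of the second bullet, here is how awkward the node.status.volumesAttached direction is to consume today; the `nodesWithVolumeAttached` helper and its substring match on the plugin-specific unique volume name are illustrative only:

```go
package main

import (
	"fmt"
	"strings"

	v1 "k8s.io/api/core/v1"
)

// nodesWithVolumeAttached scans every node's status.volumesAttached for an
// entry mentioning the PV's name. The unique volume name format is
// plugin-specific (typically "kubernetes.io/<plugin>/<volume id>"), so the
// substring match is only a rough stand-in; a direct pv.status.node field
// would make this lookup trivial and exact.
func nodesWithVolumeAttached(nodes []v1.Node, pvName string) []string {
	var result []string
	for _, node := range nodes {
		for _, attached := range node.Status.VolumesAttached {
			if strings.Contains(string(attached.Name), pvName) {
				result = append(result, node.Name)
				break
			}
		}
	}
	return result
}

func main() {
	node := v1.Node{}
	node.Name = "node-a"
	node.Status.VolumesAttached = []v1.AttachedVolume{{Name: "kubernetes.io/iscsi/pv-1", DevicePath: "/dev/sdb"}}
	fmt.Println(nodesWithVolumeAttached([]v1.Node{node}, "pv-1")) // [node-a]
}
```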

@wongma7 closed this Jun 29, 2017

Labels

cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.)


9 participants