
Conversation

@wongma7 (Contributor) commented Jun 19, 2017

Super WIP. It is a bug that kube does not enforce the RWO property of PVs; the most relevant discussion is, I think, #30085, but it is quite old. Situation recently encountered: an RC has a pod with a PV on node A; node A is drained; a new pod is started on node B and simultaneously writes to the PV while the old pod is still running & Terminating; data is corrupted.

AFAICS the node affinity support added along with local storage can give us the enforcement we want, for free. I've created this PR just to see whether it's feasible to "co-opt" the new node affinity feature and to get some feedback/discussion. The idea is to, from somewhere in the kubelet volume manager, automatically add the node affinity label to prevent the RWO PV from being used by pods on another node.

see also: kubernetes/community#124
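
For concreteness, here is a minimal sketch of what the kubelet-side pinning described above could look like, assuming the alpha `volume.alpha.kubernetes.io/node-affinity` annotation that the local storage work used at the time; the annotation key, the `pinPVToNode` helper, and the hostname-label assumption are illustrative rather than taken from this PR:

```go
package main

import (
	"encoding/json"
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// Assumed alpha annotation key for PV node affinity; the real key and format
// should come from the local-storage design, not this sketch.
const nodeAffinityAnnotation = "volume.alpha.kubernetes.io/node-affinity"

// pinPVToNode serializes a NodeAffinity matching only nodeName onto the PV's
// annotations, so the PV node-affinity check would keep pods that use this
// RWO volume off every other node. It is a no-op if an affinity is already
// set (e.g. by the user).
func pinPVToNode(pv *v1.PersistentVolume, nodeName string) (bool, error) {
	if pv.Annotations == nil {
		pv.Annotations = map[string]string{}
	}
	if _, exists := pv.Annotations[nodeAffinityAnnotation]; exists {
		return false, nil
	}
	affinity := v1.NodeAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
			NodeSelectorTerms: []v1.NodeSelectorTerm{{
				MatchExpressions: []v1.NodeSelectorRequirement{{
					// Assumes the kubernetes.io/hostname label value equals the node name.
					Key:      "kubernetes.io/hostname",
					Operator: v1.NodeSelectorOpIn,
					Values:   []string{nodeName},
				}},
			}},
		},
	}
	raw, err := json.Marshal(affinity)
	if err != nil {
		return false, err
	}
	pv.Annotations[nodeAffinityAnnotation] = string(raw)
	return true, nil
}

func main() {
	pv := &v1.PersistentVolume{}
	pinned, err := pinPVToNode(pv, "node-a")
	fmt.Println(pinned, err, pv.Annotations[nodeAffinityAnnotation])
}
```

The kubelet would still need to persist the updated PV with an API update, which is where the node-authorizer concern raised later in this thread comes in.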

@k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) label Jun 19, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wongma7
We suggest the following additional approver: dchen1107

Assign the PR to them by writing /assign @dchen1107 in a comment when ready.

Associated issue: 30085

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot added the size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.) and release-note-label-needed labels Jun 19, 2017
@wongma7 (Contributor, Author) commented Jun 19, 2017

@gnufied @rootfs @msau42 idk who else to ping, not sure who was involved in design of local storage & PV node affinity. This code works as far as I can tell, but WIP!!

@wongma7 (Contributor, Author) commented Jun 19, 2017

We might want to do it this way instead of doing it for all RWO volumes: https://github.com/kubernetes/community/pull/124/files/16f88595883a7461010b6708fb0e0bf1b046cf33#r121376117

In any case, the basic question is whether we can use the node affinity label as the "label" here, since it already accomplishes a lot of what we want: "Attach and detach may be recorded on the PVC or PV as a new field or materialized via the selection labels." https://github.com/smarterclayton/community/blob/16f88595883a7461010b6708fb0e0bf1b046cf33/contributors/design-proposals/pod-safety.md#storage-consistency

@k8s-ci-robot (Contributor) commented Jun 19, 2017

@wongma7: The following test failed, say /retest to rerun them all:

Test name: pull-kubernetes-verify
Commit: 8c01466
Details: link
Rerun command: /test pull-kubernetes-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@msau42 (Member) commented Jun 19, 2017

In the example scenario, shouldn't the attach to node B have failed because it was already attached to node A?

When you detach, you have to remove the node affinity right? My main concern is that we would be mixing a user-specified attribute with a system-managed attribute. We wouldn't want to delete the node affinity if it was user specified. What happens if the user changes a system-specified attribute?

There's also the situation where, if the user specifies more than one node in their node affinity, they would not be protected in this scenario.

@wongma7 (Contributor, Author) commented Jun 19, 2017

In the example scenario the PV is iSCSI. I don't know much about iSCSI tbh, and maybe this problem is in fact unique to iSCSI; I just know there is nothing preventing multiple writers from mounting a LUN and corrupting the filesystem. The admin has declared that a PV is RWO because they know better than kube about the underlying infra, and the ask here is that kube should enforce it for them to prevent unintentional corruption.

Yes, we should remove the label after detach, but only if we set it in the first place. Maybe we can track that with yet another annotation. And I agree we must make sure not to clobber existing annotations.

As for the case where they have already specified more than one node on the PV: should we remove the other nodes and add them back at detach time? I'm leaning toward no; in general we should tend to do nothing in these edge cases, as we do today, and just let the majority case (blank RWO PV) reap the benefits.

(#30085 is a far more difficult problem about when the user of a PVC declares they want RWO...somehow the PV needs to "become" exclusively RWO before attach time in order to benefit from this PR.)
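
A small sketch of the "track it with yet another annotation" idea from the comment above: a hypothetical marker annotation records that kubelet set the affinity, so detach only removes what kubelet added. Both annotation keys here are assumptions, not code from this PR:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

const (
	// Assumed alpha node-affinity annotation key (same assumption as the earlier sketch).
	nodeAffinityAnnotation = "volume.alpha.kubernetes.io/node-affinity"
	// Hypothetical marker recording that kubelet, not the user, set the affinity.
	affinityManagedByKubelet = "volume.alpha.kubernetes.io/node-affinity-managed-by-kubelet"
)

// unpinPVFromNode clears the node affinity on detach, but only when the marker
// shows kubelet added it; a user-specified affinity is left untouched.
func unpinPVFromNode(pv *v1.PersistentVolume) {
	if pv.Annotations[affinityManagedByKubelet] != "true" {
		return
	}
	delete(pv.Annotations, nodeAffinityAnnotation)
	delete(pv.Annotations, affinityManagedByKubelet)
}

func main() {
	pv := &v1.PersistentVolume{}
	pv.Annotations = map[string]string{
		nodeAffinityAnnotation: "(user-specified affinity)", // no marker: kubelet did not set this
	}
	unpinPVFromNode(pv)
	fmt.Println(pv.Annotations) // the user's affinity survives detach
}
```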

@wongma7 (Contributor, Author) commented Jun 19, 2017

Hmm, on second thought I think we might be able to solve #30085 too, since we have the PV's claimRef and can check whether the claim asked specifically for RWO. The one other main issue with all this is that it may be crossing a line regarding what responsibility the kubelet volume manager has.
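
A rough sketch of the claimRef check suggested here, using current client-go signatures; the `claimWantsExclusiveRWO` helper is illustrative only:

```go
package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// claimWantsExclusiveRWO follows the PV's claimRef to the bound PVC and
// reports whether the claim asked for ReadWriteOnce and nothing broader,
// i.e. the case where automatically pinning the PV to a node would match
// the user's stated intent.
func claimWantsExclusiveRWO(ctx context.Context, client kubernetes.Interface, pv *v1.PersistentVolume) (bool, error) {
	ref := pv.Spec.ClaimRef
	if ref == nil {
		return false, nil // unbound PV; nothing to enforce yet
	}
	pvc, err := client.CoreV1().PersistentVolumeClaims(ref.Namespace).Get(ctx, ref.Name, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, mode := range pvc.Spec.AccessModes {
		if mode != v1.ReadWriteOnce {
			return false, nil // RWX or ROX also requested; don't pin
		}
	}
	return len(pvc.Spec.AccessModes) > 0, nil
}

func main() {
	// Wiring up a real clientset is omitted here; this file only shows the check itself.
}
```

Gating the pinning on what the bound claim requested keeps the behaviour tied to user intent rather than to the PV's capabilities.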

@gnufied (Member) commented Jun 20, 2017

I think the bug we were debugging was using Fibre Channel, but it's the same story as iSCSI. In that case, a user was able to reliably corrupt an XFS file system by draining a node. The user was using a replication controller and hence may have had the volume mounted on two or more nodes at once. This is a pretty serious problem that I think we should address.

I think we can get about 80% of the way there by using node affinity (at mount time) and preventing the multi-attach operation via the check introduced here - https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L139

Alternatively, we discussed the possibility of adding a field (or annotation) on PVCs that gets populated at mount time and is checked before scheduling a pod that uses the same PVC.

cc @smarterclayton @thockin for input because you opened the original issue.
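
For reference, a plugin-agnostic sketch of the kind of multi-attach guard linked above; the `canAttach` helper and its `attachedNodes` bookkeeping are stand-ins for whatever the attach/detach controller's actual state of world provides, not the real reconciler code:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// canAttach refuses to attach a ReadWriteOnce-only volume to a node when it is
// already attached elsewhere. attachedNodes is whatever bookkeeping the
// controller keeps (actual state of world, a status field, etc.).
func canAttach(pv *v1.PersistentVolume, attachedNodes []string, targetNode string) error {
	rwoOnly := len(pv.Spec.AccessModes) > 0
	for _, m := range pv.Spec.AccessModes {
		if m != v1.ReadWriteOnce {
			rwoOnly = false
		}
	}
	if !rwoOnly {
		return nil // shared or unspecified access modes; multi-attach is allowed
	}
	for _, node := range attachedNodes {
		if node != targetNode {
			return fmt.Errorf("volume %s is ReadWriteOnce and already attached to node %s", pv.Name, node)
		}
	}
	return nil
}

func main() {
	// Example: an RWO PV already attached to node-a must not attach to node-b.
	pv := &v1.PersistentVolume{}
	pv.Name = "pv-1"
	pv.Spec.AccessModes = []v1.PersistentVolumeAccessMode{v1.ReadWriteOnce}
	fmt.Println(canAttach(pv, []string{"node-a"}, "node-b"))
}
```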

@gnufied (Member) commented Jun 20, 2017

Also ref - #47333

@gnufied (Member) commented Jun 20, 2017

Also, if we are going to update the PV (or PVC) on mount (as this PR does), we may be able to use that information as a single source of truth for whether a PVC is being used by a pod. That would also help us with issue #45143.

@msau42 (Member) commented Jun 20, 2017

Do you want to also solve the issue of multiple pods on the same node writing to the same volume? Node affinity will not help for that case. Instead, you'll probably need some sort of refcounting.

@wongma7 (Contributor, Author) commented Jun 20, 2017

No, imo we should not change that behaviour.

@wongma7 (Contributor, Author) commented Jun 20, 2017 via email

@msau42 (Member) commented Jun 20, 2017

If our end goal is to enforce mount RWO, then the solution for that could also solve attach RWO.

@gnufied (Member) commented Jun 20, 2017

@msau42 not entirely, right? Currently in this PR we are checking for affinity at mount time. So if the underlying storage provider can't safeguard against multiple attach, the volume would already be attached by that time.

So we would need a separate check to prevent an RWO attach, right? We already kinda have one (in the code I linked above), but it relies on the Actual State of World to check the state of the PVC.

@wongma7 (Contributor, Author) commented Jun 20, 2017

@msau42 Right, so depending on how you look at it :) maybe we should design & fix them together rather than fix the attach case first here and do the mount case later.

BTW, @rootfs mentioned today that we already have a "NoDiskConflict" scheduler predicate! But it looks to me like it hasn't been updated in a while: it only looks at pods' inline volume specs, not PVs, though it does check that e.g. the same iSCSI LUN path is not used by two pods. Its "TODO: migrate this into some per-volume specific code?" comment on

func NoDiskConflict(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {

sounds a lot like https://github.com/kubernetes/community/pull/124/files/16f88595883a7461010b6708fb0e0bf1b046cf33#r121376117. So it's yet another thing to consider :/
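
For illustration, a scheduler-framework-agnostic sketch of what a PVC-aware NoDiskConflict-style check could look like; `pvcConflict` and the `inUseElsewhere` map are hypothetical, and populating that map reliably is exactly the hard part being debated in this thread:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// pvcConflict reports true when the pod references a PVC (by namespace/name)
// that is already in use by pods running on other nodes. This is the piece
// the existing predicate lacks, since it only inspects inline volume specs.
func pvcConflict(pod *v1.Pod, inUseElsewhere map[string]bool) bool {
	for _, vol := range pod.Spec.Volumes {
		if vol.PersistentVolumeClaim == nil {
			continue
		}
		key := pod.Namespace + "/" + vol.PersistentVolumeClaim.ClaimName
		if inUseElsewhere[key] {
			return true
		}
	}
	return false
}

func main() {
	pod := &v1.Pod{}
	pod.Namespace = "default"
	pod.Spec.Volumes = []v1.Volume{{
		Name: "data",
		VolumeSource: v1.VolumeSource{
			PersistentVolumeClaim: &v1.PersistentVolumeClaimVolumeSource{ClaimName: "my-claim"},
		},
	}}
	fmt.Println(pvcConflict(pod, map[string]bool{"default/my-claim": true})) // true
}
```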

@rootfs (Contributor) commented Jun 20, 2017

I wonder whether we should come up with a PV affinity scheduler predicate similar to the anti-affinity predicate NoDiskConflict.

@wongma7 (Contributor, Author) commented Jun 20, 2017

@rootfs I am leaning heavily towards no, simply because there is too much overlap with the new PV node affinity predicate. To me, the PV node affinity label is the closest thing to what Clayton described in the proposal and the easiest to reason about. Using it doesn't preclude us from doing the plugin-specific checks done in NoDiskConflict; we would just do them in a more reasonable location at a more reasonable time.

@rootfs (Contributor) commented Jun 20, 2017

@wongma7 how do we deal with the conflict if the anti-affinity predicate is turned on?

@rootfs (Contributor) commented Jun 20, 2017

adding #45346
@codablock

@wongma7 (Contributor, Author) commented Jun 20, 2017 via email

@rootfs (Contributor) commented Jun 20, 2017

> The anti affinity predicate only ever looks at pod volume spec, never volume.PersistentVolumeClaim so conflict can't occur since here we only look at PVCs.

Ok, I see it now.

@rootfs (Contributor) commented Jun 20, 2017

@gnufied (Member) commented Jun 20, 2017

@rootfs that can be changed. @liggitt agreed to allow the node to update PV/PVCs that are attached/mounted on it.

@liggitt (Member) commented Jun 20, 2017

> @liggitt agreed to allow node to update PV/PVCs that are attached/mounted on it.

I don't recall that. Are you referring to the PV spec or status?

@gnufied (Member) commented Jun 20, 2017

> I don't recall that. Are you referring to the PV spec or status?

Spoke with @liggitt offline. It is generally fine to allow updates to the status field of a PV/PVC that is strongly connected to the node. But since PV and PVC specs don't have a node field in them, the node authorizer can't determine the relationship between a node and a PV/PVC, and hence can't allow the node to update the PV/PVC. :(

@msau42 (Member) commented Jun 20, 2017

I'm investigating the possibility of adding some sort of scheduled-node annotation to the PV, as part of the local storage scheduling and binding integration. I'll be doing some prototyping to see if it could work.

@wongma7 (Contributor, Author) commented Jun 20, 2017

So the thinking now is that we can't piggyback off the PV node annotations added for local storage like I want; we need to fix this "properly", more closely following kubernetes/community#124. The basic idea is that preventing a pod from scheduling is not sufficient; we need locking of RWO PVs*, via the attach/detach controller (adding "logical attach" for plugins that are currently not attachable) or a new controller. @gnufied can elaborate if needed. Let's keep this open for now.

@k8s-github-robot

@wongma7 PR needs rebase

@k8s-github-robot added the needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) label Jun 22, 2017
@wongma7 (Contributor, Author) commented Jun 29, 2017

Summing up conversations I've eavesdropped on:

  • we need a "logical attach" operation for currently non-attachable and problematic plugins. This is being implemented as part of the existing attach/detach controller (roughly as described in the pod safety proposal) but could be an all-new controller instead
  • to do the above, we need a way to associate a PV with a node. We already have one in node.status.volumesAttached, but it points in the wrong direction (node → volumes), and having a pv.status.node would be nice for various reasons (see the sketch after this list)
  • the scheduler is not the place to enforce the safety guarantee, but if we really wanted to we could maybe update the existing NoDiskConflict predicate to achieve roughly the same thing as this PR (imo)
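
As a rough illustration of the second bullet, here is how awkward the node.status.volumesAttached direction is to consume today; the `nodesWithVolumeAttached` helper and its substring match on the plugin-specific unique volume name are illustrative only:

```go
package main

import (
	"fmt"
	"strings"

	v1 "k8s.io/api/core/v1"
)

// nodesWithVolumeAttached scans every node's status.volumesAttached for an
// entry mentioning the PV's name. The unique volume name format is
// plugin-specific (typically "kubernetes.io/<plugin>/<volume id>"), so the
// substring match is only a rough stand-in; a direct pv.status.node field
// would make this lookup trivial and exact.
func nodesWithVolumeAttached(nodes []v1.Node, pvName string) []string {
	var result []string
	for _, node := range nodes {
		for _, attached := range node.Status.VolumesAttached {
			if strings.Contains(string(attached.Name), pvName) {
				result = append(result, node.Name)
				break
			}
		}
	}
	return result
}

func main() {
	node := v1.Node{}
	node.Name = "node-a"
	node.Status.VolumesAttached = []v1.AttachedVolume{{Name: "kubernetes.io/iscsi/pv-1", DevicePath: "/dev/sdb"}}
	fmt.Println(nodesWithVolumeAttached([]v1.Node{node}, "pv-1")) // [node-a]
}
```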

@wongma7 closed this Jun 29, 2017

Labels

cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.)


9 participants