
OwnerReferencesPermissionEnforcement ignores pods/status #45747

Merged

Conversation

derekwaynecarr
Member

@derekwaynecarr derekwaynecarr commented May 12, 2017

What this PR does / why we need it:
It makes the OwnerReferencesPermissionEnforcement admission plug-in ignore pods/status operations. If a cluster has version skew with its master (1.6) and nodes (1.6-N), the kubelet is doing a pods/status operation and that code path is causing an error because ownerRef has new fields unknown to kubelet.

Release note:

OwnerReferencesPermissionEnforcement admission plugin ignores pods/status.
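
For context, the eventual fix takes the form of a resource/subresource whitelist checked early in the plugin's admission path. A minimal sketch of that shape, with identifiers that are assumptions rather than the exact names in the merged change:

```go
// Illustrative sketch of a resource/subresource whitelist inside an
// admission plugin; whiteListItem and isWhiteListed are assumed names,
// not necessarily the identifiers used in the merged code.
package gcadmission

import "k8s.io/apimachinery/pkg/runtime/schema"

// whiteListItem describes one resource/subresource pair the plugin skips.
type whiteListItem struct {
	groupResource schema.GroupResource
	subresource   string
}

// whiteList holds pods/status, so kubelet status writes bypass the
// ownerReferences check.
var whiteList = []whiteListItem{
	{groupResource: schema.GroupResource{Resource: "pods"}, subresource: "status"},
}

// isWhiteListed reports whether the incoming request targets a
// whitelisted resource/subresource pair.
func isWhiteListed(groupResource schema.GroupResource, subresource string) bool {
	for _, item := range whiteList {
		if item.groupResource == groupResource && item.subresource == subresource {
			return true
		}
	}
	return false
}
```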

@k8s-github-robot k8s-github-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels May 12, 2017
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 12, 2017
@derekwaynecarr derekwaynecarr changed the title WIP: OwnerReferencesPermissionEnforcement ignores pods marked for deletion OwnerReferencesPermissionEnforcement ignores pods marked for deletion May 12, 2017
@derekwaynecarr derekwaynecarr changed the title OwnerReferencesPermissionEnforcement ignores pods marked for deletion WIP: OwnerReferencesPermissionEnforcement ignores pods marked for deletion May 12, 2017
@derekwaynecarr
Member Author

/cc @deads2k @smarterclayton @eparis -- I think this should stop the bleeding we are experiencing right now until we understand the root cause better.

@derekwaynecarr derekwaynecarr changed the title WIP: OwnerReferencesPermissionEnforcement ignores pods marked for deletion OwnerReferencesPermissionEnforcement ignores pods marked for deletion May 12, 2017
@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 12, 2017
@deads2k
Contributor

deads2k commented May 15, 2017

> /cc @deads2k @smarterclayton @eparis -- I think this should stop the bleeding we are experiencing right now until we understand the root cause better.

@derekwaynecarr I'm against this. We haven't determined that the kubelet isn't really mutating it. How about a reproducer that uses node permissions with a different user and then updates status without any change, via YAML, to show it?

Example of a simple story where kubelet would be modifying the ownerRefs.

  1. An old kubelet does a GET, then decodes, dropping new fields like blockOwnerDeletion.
  2. The old kubelet updates status and doesn't send back blockOwnerDeletion.
  3. The new API server sees a difference and (properly) rejects it in admission, thus saving our metadata from corruption.

Dropping new metadata fields on status updates just because the pod is marked for deletion practically guarantees GC problems.
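
To make step 1 concrete, here is a small Go sketch (with invented type names) of how decoding into an older struct silently drops a field the client does not know about:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// oldOwnerReference models a client compiled before blockOwnerDeletion
// existed: the field is simply absent from the struct.
type oldOwnerReference struct {
	Name string `json:"name"`
	UID  string `json:"uid"`
}

func main() {
	// What a new API server returns on GET.
	fromServer := []byte(`{"name":"rs-1","uid":"abc","blockOwnerDeletion":true}`)

	var ref oldOwnerReference
	if err := json.Unmarshal(fromServer, &ref); err != nil {
		panic(err)
	}

	// The old client re-encodes and PUTs this back: blockOwnerDeletion is
	// gone, which the server sees as a mutation of ownerReferences.
	back, _ := json.Marshal(ref)
	fmt.Println(string(back)) // {"name":"rs-1","uid":"abc"}
}
```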

@deads2k
Contributor

deads2k commented May 15, 2017

> /cc @deads2k @smarterclayton @eparis -- I think this should stop the bleeding we are experiencing right now until we understand the root cause better.

Also, if you just want to clear them out (though getting a reproducing REST request seems really important), simply create a role with the permission to delete pods/status and bind it to the kubelets. That should clear them out, but I'd expect it to do bad things to GC, since it will allow whatever mutation is happening.
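
For reference, such a role and binding might look like the following sketch; the object names, the API version, and the choice of the system:nodes group are assumptions for illustration:

```yaml
# Hypothetical role granting delete on pods/status; binding it to the
# kubelets would let the mutation through, with the GC risk noted above.
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: pods-status-deleter
rules:
- apiGroups: [""]
  resources: ["pods/status"]
  verbs: ["delete"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: kubelet-pods-status-deleter
subjects:
- kind: Group
  name: system:nodes
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: pods-status-deleter
  apiGroup: rbac.authorization.k8s.io
```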

@derekwaynecarr
Member Author

@deads2k -- I was under the impression that all nodes had been upgraded in this environment. For your scenario, are you basically saying that old kubelets can no longer communicate with new API servers if this plug-in is enabled? Isn't that a problem?

@deads2k
Contributor

deads2k commented May 15, 2017

> @deads2k -- I was under the impression that all nodes had been upgraded in this environment. For your scenario, are you basically saying that old kubelets can no longer communicate with new API servers if this plug-in is enabled? Isn't that a problem?

It's another reason why #45552 is a good idea. @lavalamp

You could try patching pod/status from the kubelet.
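
A hedged sketch of what such a PATCH could look like against the status subresource, using plain net/http; the server address, pod name, and token are placeholders, and real kubelet auth and CA setup are elided:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Placeholders: API server address, namespace, pod name, and token
	// are all assumptions for illustration.
	url := "https://apiserver:6443/api/v1/namespaces/default/pods/mypod/status"
	patch := []byte(`{"status":{"phase":"Running"}}`)

	req, err := http.NewRequest("PATCH", url, bytes.NewReader(patch))
	if err != nil {
		panic(err)
	}
	// A strategic merge patch sends only the changed fields, so the
	// client never round-trips (and never truncates) ownerReferences.
	req.Header.Set("Content-Type", "application/strategic-merge-patch+json")
	req.Header.Set("Authorization", "Bearer <kubelet-token>")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```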

@derekwaynecarr
Member Author

@deads2k -- this does appear tied to upgrade scenarios. I think it's worth asking whether users should expect their 1.5 kubelets to stop working when upgrading to 1.6 masters with N-2 version skew. To properly fix this, we would need patches to 1.4, 1.5, and 1.6 kubelets. I think we should evaluate taking this change in addition to getting work lined up to change how the kubelet sends status.

@deads2k
Contributor

deads2k commented May 15, 2017

> I think we should evaluate taking this change in addition to getting work lined up to change how the kubelet sends status.

This change allows data corruption. This admission plugin caught a data corruption problem by accident. "Fixing" the admission plugin to allow the current actors to corrupt data is not a reasonable solution. Something like #45826 would prevent the kubelet from accidentally destroying data like this, but the kubelet should be updated as well, because you may not be able to fix the next one as easily.
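
One plausible shape for that kind of server-side protection (a sketch under assumptions, not necessarily how #45826 is implemented) is a status-update strategy that carries everything except status over from the stored object:

```go
package main

import "fmt"

// Minimal stand-ins for the real API types; illustrative only.
type ObjectMeta struct {
	OwnerReferences []string
}

type PodStatus struct {
	Phase string
}

type Pod struct {
	ObjectMeta ObjectMeta
	Status     PodStatus
}

// prepareStatusUpdate sketches the protection: on a status update, keep
// everything except Status from the stored object, so a stale client
// cannot drop metadata fields it never decoded.
func prepareStatusUpdate(incoming, stored *Pod) {
	status := incoming.Status
	*incoming = *stored
	incoming.Status = status
}

func main() {
	stored := &Pod{
		ObjectMeta: ObjectMeta{OwnerReferences: []string{"rs-1"}},
		Status:     PodStatus{Phase: "Pending"},
	}
	// An old client that never decoded ownerReferences sends them back empty.
	incoming := &Pod{Status: PodStatus{Phase: "Running"}}

	prepareStatusUpdate(incoming, stored)
	fmt.Println(incoming.ObjectMeta.OwnerReferences, incoming.Status.Phase) // [rs-1] Running
}
```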

@derekwaynecarr
Member Author

@deads2k -- telling users to upgrade old kubelets before upgrading masters seems like an untenable solution. I agree we need to fix future kubelets, but requiring users to upgrade old kubelets first appears to me to be a non-starter.

@derekwaynecarr
Member Author

@kubernetes/sig-node-bugs -- I added a discussion of the underlying problem this PR exposed to the agenda for our next SIG meeting. It appears that kubelets cannot safely update status without doing a PATCH.

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label May 15, 2017
@k8s-github-robot k8s-github-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 15, 2017
@derekwaynecarr derekwaynecarr changed the title OwnerReferencesPermissionEnforcement ignores pods marked for deletion OwnerReferencesPermissionEnforcement ignores pods/status May 15, 2017
@derekwaynecarr
Member Author

@deads2k -- I updated this to use the whitelist as discussed.

@deads2k
Contributor

deads2k commented May 15, 2017

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 15, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, derekwaynecarr

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 15, 2017
@caesarxuchao
Member

@derekwaynecarr could you link to the bug that's being fixed?

IIUC, this change can be reverted once all kubelets are updated to use PATCH when updating status, right? Can you open an issue to remind ourselves to remove the whitelist in a future release?

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 45826, 45747, 45548, 45606, 41766)

@k8s-github-robot k8s-github-robot merged commit b3031c2 into kubernetes:master May 15, 2017
@eparis
Contributor

eparis commented May 15, 2017

It was found in https://bugzilla.redhat.com/show_bug.cgi?id=1450461, which I think has basically no public data. I'll see what I can open up.

@k8s-cherrypick-bot

Commit found in the "release-1.6" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error, find help to get your PR picked.
