-
Notifications
You must be signed in to change notification settings - Fork 39.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DaemonSet controller doesn't handle NoSchedule taints correctly #48190
Comments
Summary: DaemonSet controller evicts running pods when a NoSchedule taint is added to the node. In this case the pod should instead continue to run. |
/cc |
I am still processing all the information being linked, and trying to summarize what I understand so far:
@gmarek @mikedanese Am I understanding this correctly? If my understanding above is correct, can someone help me to understand the following issues, so that we can decide how urgent it is and should we include this in v1.7.0:
I am inclined to punt this to v1.7.1 so that we can have a better understanding on the issue and proper fix with more test coverage. But I can be convinced on this by the severity in the production today. cc/ @davidopp @erictune @kubernetes/kubernetes-release-managers |
The problem is DS treats NoSchedule like it is NoExecute. The minute difference between the effects is NoSchedule taint should allow already running pods that are intolerant to continue running. Today it deletes them.
That was reverted in GCE before 1.6 due to backwards compatibility. The master in GKE doesn't even register.
Poor test coverage for a new feature (NoSchedule in DS introduced in 1.6, NoExecute in 1.7). This came from a few separate user reports.
GKE masters don't register. Taints are only relevant to nodes that register. |
The change I link includes the missing test coverage. |
This definitely needs to be fixed in 1.7. It's a serious bug. |
And we should cherrypick it into the next 1.6 patch release. |
So this one is making it to 1.7 ? |
I agreed this is a serious bug, but I am not still clear on why this should have such high severity in production, and we even want to delay the new release v1.7.0 for this last-minute-found regression introduced in v1.6. Basically the change / fix might have deep impact we might ignore for now under the release pressure. I would rather we speed time on a clear solution with a better test coverage and longer soaking time. |
Would it be possible to mark the Pods in question as tolerating the NoSchedule taint to prevent the eviction of Running critical Pods when a Node is tainted with NoSchedule? |
Given that this is in 1.6, IIUC, then deferring the fix to 1.7.1 seems reasonable to me. |
sig-cluster-lifecycle owns the DaemonSet controller and should make the decision (together with the release team). @kow3ns We could change the fluentd config to toleration NoSchedule taint, but that would only affect new pods. Not sure if it is worth it if we are going to fix it in 1.7.1. |
I thought that was sig-apps (or at least a joint effort between the two sigs). |
We need to have every controller owned by a single SIG (of course other SIGs can be "consultants"). I'm going to create a spreadsheet right now and send it out to kubernetes-dev for discussion. But that is for the long-term. For right now, if sig-apps owns the DaemonSet controller, then they should make the decision. |
(Actually I think we may have had that in the community repo somewhere... I'll look) |
Anyway it seems that most people seem to think we should not put the fix in 1.7, but put it in 1.7.1 and cherrypick to the next 1.6 release. I guess that is the right thing to do, but let's make sure the release notes are very clear about this problem. |
As for 1.6, I have been asked to cut 1.6.7 potentially as early as next week to fix another issue (#48108). Keeping in mind the holidays, do you think the fix for this DS issue will be ready to cherrypick by the end of this week? If not, please let me know if you consider this issue severe enough that I should hold the 1.6.7 train for it. |
Well, we need to find a reviewer for #48189. But since we are releasing 1.7 without this fix, I guess there's no need to hold 1.6.7 for it either. |
I think we should put it into 1.6.7 - this is an issue people are actually complaining about (I heard about it from multiple sources) |
xref: #46733 @mikedanese @gmarek can you put together a paragraph to describe this known issue for 1.7.0 and pre-1.6.7 release. Thanks! |
That's the most painful part, so I think that should be suffice. Assuming that we'll cut 1.6.8 in not too distant future. |
Automatic merge from submit-queue support NoSchedule taints correctly in DaemonSet controller Fixes #48190 ```release-note Support NoSchedule taints correctly in DaemonSet controller. ``` cc @kubernetes/sig-apps-pr-reviews
Will close after cherrypicks are in |
@mikedanese are we good to close this? |
Do we want to cherrypick to 1.6? I'd say no... |
Details in #44445 (comment)
@dchen1107 consider adding this to the 1.7 release although it appears to have been broken in 1.6. I'll add 1.7 milestone for you to remove if this doesn't make the cut.
Fix is here #48189
cc @kubernetes/sig-apps-bugs
The text was updated successfully, but these errors were encountered: