Add policy ConfigMap monitoring and restart to the scheduler. #44805
Conversation
/unassign erictune /unassign enisoc
(force-pushed from 97efff2 to 0d5682c)
@kubernetes/sig-scheduling-pr-reviews
	sharedIndexInformer.AddEventHandlerWithResyncPeriod(
		cache.ResourceEventHandlerFuncs{
			AddFunc: nil, // scheduler aborts if its policy ConfigMap does not exist. So, we do not need and "add" handler.
s/and/an/
also, maybe say "if its policy ConfigMap does not exist and --<whatever the flag to enable legacy mode is> is enabled"
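For context, the registration under discussion looks roughly like the sketch below; the update/delete handler names and the zero resync period are my assumptions, not necessarily the PR's exact code:

	sharedIndexInformer.AddEventHandlerWithResyncPeriod(
		cache.ResourceEventHandlerFuncs{
			AddFunc:    nil,                      // scheduler aborts at startup if the policy ConfigMap is missing, so no add handler
			UpdateFunc: sc.updatePolicyConfigMap, // hypothetical name: exit so the restarted scheduler loads the new policy
			DeleteFunc: sc.deletePolicyConfigMap, // hypothetical name: exit (behavior on deletion depends on the legacy/provider flags)
		},
		0, // per-handler resync period; 0 requests no periodic resync
	)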
}

func (sc *schedulerConfigurator) addPolicyConfigMap(obj interface{}) {
	glog.Info("Received a request to add a scheduler policy config.")
Please make sure the policy is logged on scheduler startup. (I didn't check; it may be already.)
getSchedulerPolicyConfig is logging the contents at V(5) level, but I am going to change it to glog.Info.
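For illustration, assuming the loaded policy is already serialized in a policyJSON string (a hypothetical variable), the change would be along these lines:

	// before: the policy contents are only visible when running with -v=5
	glog.V(5).Infof("Scheduler policy config: %s", policyJSON)
	// after: always log the policy at startup
	glog.Infof("Scheduler policy config: %s", policyJSON)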
		SchedulerKillFunc()
	} else {
		if sc.schedulerConfig != nil {
			glog.Infof("Scheduler is going to die (and restarted) in order to update its policy.")
add something like "(ConfigMap %s/%s changed)", configmap namespace, configmap name
Added these to the handler logs.
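A minimal sketch of what such a handler log could look like, assuming the method and field names below (they may differ from the actual PR):

func (sc *schedulerConfigurator) updatePolicyConfigMap(oldObj, newObj interface{}) {
	glog.Infof("Scheduler is going to die (and be restarted) to update its policy (ConfigMap %s/%s changed).",
		sc.policyConfigMapNamespace, sc.policyConfigMap)
	SchedulerKillFunc() // same kill path shown in the snippet above
}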
| "k8s.io/kubernetes/test/integration/framework" | ||
| ) | ||
|
|
||
| const defaultSchedulerPolicyConfigMap string = "scheduler-custom-policy-config" |
nit: I realize this is just a test, but this is a bit long and I think more Kubernetes-standard would be camel-case, so something like schedulerCustomPolicy
	)
}

// verifyNewPolicyConfig verifies that the new object received by the ConfigMap watch is a valid policy config.
Verifying the ConfigMap is an interesting idea (and clearly took some thought and work) but I actually think you shouldn't do it. More specifically, I think it will be confusing to the user if the scheduler is using an old ConfigMap that is cached in memory after the ConfigMap in etcd has changed. Someone who walks up to the cluster won't be able to look at the ConfigMap to know what the current scheduler behavior is. A related concern is that I think it's better to give the error about the invalid ConfigMap and die right when the ConfigMap is changed (to something invalid) and there is probably a human involved. If you let them get out of sync when the validation fails, then if the scheduler later crashes, when it restarts it will give an error but there may not be anyone around to deal with it.
Of course, on startup we should verify the ConfigMap and exit if it doesn't validate. (I assume we're doing that already; I didn't check.)
We had verified the ConfigMap on startup; if that failed, the scheduler exited. One purpose of this check is to avoid repeatedly restarting the scheduler when the ConfigMap is invalid after an update.
Anyway, I think we should add an event (and a log entry) to show whether the new config was accepted.
As Klaus mentioned, if we don't check the validity before restarting, scheduler will crash-loop, which may be acceptable if we assume there is a human watching the cluster.
I think crash looping when a human pushes a bad config is better than crash looping later if the scheduler crashes on its own due to a bug or machine problem or whatever, when there may be no human around.
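For concreteness, the validate-before-restart variant being debated would look roughly like this sketch; loadPolicy is a hypothetical parse/validate helper, not an existing function:

func (sc *schedulerConfigurator) onPolicyConfigMapUpdate(oldObj, newObj interface{}) {
	cm, ok := newObj.(*v1.ConfigMap)
	if !ok {
		glog.Errorf("expected a *v1.ConfigMap, got %T", newObj)
		return
	}
	if _, err := loadPolicy(cm); err != nil { // hypothetical: parse and validate the new policy
		glog.Errorf("ignoring invalid policy in ConfigMap %s/%s: %v", cm.Namespace, cm.Name, err)
		return // keep running with the old policy instead of crash-looping
	}
	SchedulerKillFunc() // restart only for a config that validates
}

The alternative argued for above drops the validation and always exits, so a bad config fails loudly while a human is likely still watching.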
			glog.Infof("Scheduler is going to die (and restarted) in order to update its policy.")
			close(sc.schedulerConfig.StopEverything)
			// The sleep is only to allow cleanups to happen.
			time.Sleep(2 * time.Second)
Write a comment about how you chose 2 seconds? (Even if it was just chosen randomly...)
			glog.Flush()
			os.Exit(0)
		}
		glog.Infof("Scheduler is not going to exit, as it doesn't seem to be initialized yet.")
I'm curious about this case. Did you observe this in reality? Would it be better to just exit in this case too?
I didn't observe this for updates, but I had initially added a handler for the addition of the ConfigMap as well. Kube-scheduler was sometimes getting the add notification before the scheduler was up. In that scenario, the addition event was being received asynchronously and killing the scheduler for that scenario wouldn't work, but now that we don't monitor the addition of the ConfigMap, I guess we can kill the scheduler regardless.
		t.Error("Scheduler killer function is expected to be called upon update to its policy ConfigMap.")
	}

	// Scheduler should be killed when its policy ConfigMap is deleted.
for completeness, can you test both deletion and update? (even if you just copy the first half of this test into a new test and change it to update)
The test tests both update and restart. Well, the update part needs a comment, but I can split the two.
		informerFactory.Core().V1().Services(),
		eventBroadcaster.NewRecorder(api.Scheme, clientv1.EventSource{Component: v1.DefaultSchedulerName}),
	)
	sched, err := app.CreateScheduler(ss, clientSet, informerFactory, eventBroadcaster.NewRecorder(api.Scheme, clientv1.EventSource{Component: v1.DefaultSchedulerName}))
you left some extra whitespace in the middle of the line
I don't see the extra whitespace. What is the word before the space?
Oh, I guess you are referring to what is shown here before eventBroadcaster. It is not in the code. Maybe it is a github issue in showing diffs.
(force-pushed from e519f57 to ce1b9e3)
/cc @aveshagarwal @jayunit100 Any thoughts on killing the scheduler process when the configMap changes?
func (sc *schedulerConfigurator) SetupPolicyConfigMapEventHandlers(client clientset.Interface, informerFactory informers.SharedInformerFactory) {
	// selector targets only the scheduler's policy ConfigMap.
	selector := cache.NewListWatchFromClient(client.CoreV1().RESTClient(), "configmaps", sc.policyConfigMapNamespace, fields.OneTermEqualSelector(api.ObjectNameField, string(sc.policyConfigMap)))
it seems very odd to go to all the trouble to make the scheduler watch the API, only as a signal to kill itself. why not just poll the specified config file if all it is doing is looking for changes in order to exit?
There is no config file. The new, preferred way of configuring scheduler policies is to use a ConfigMap object. Use of a ConfigMap object lets us use an API call to update scheduler configuration.
There is no config file. The new, preferred way of configuring scheduler policies is to use a ConfigMap object.
ConfigMaps are designed to be able to be consumed as files. Tightly coupling the config source to the API limits flexibility.
I tried to go over the advantages and disadvantages of using a file vs. a ConfigMap in this design doc. Given that we still support legacy config files as well, we thought using a ConfigMap would be simpler to implement and won't sacrifice flexibility. That part is already merged as a different PR, but if there are concerns which are not addressed, I will be happy to have a meeting to discuss the options.
Rather than making the kill-and-restart approach a fundamental part of the scheduler, we should start working toward the scheduler being able to reload from config cleanly. This is a good opportunity to refactor the piece that depends on the changeable config into an isolated function that either takes a stop channel (and can therefore be cleanly stopped/reconstructed) or takes a config channel (and can therefore be cleanly fed a replacement config).
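As a rough illustration of that refactor (the types and functions below are stand-ins, not the real scheduler API), a config-channel-driven run loop might look like:

// Policy and scheduleLoop are placeholders for the real scheduler policy type and scheduling loop.
type Policy struct{ Name string }

func scheduleLoop(p *Policy, stop <-chan struct{}) {
	// run scheduling with policy p until stop is closed
	<-stop
}

func runScheduler(stopCh <-chan struct{}, configCh <-chan *Policy) {
	var innerStop chan struct{}
	for {
		select {
		case <-stopCh:
			if innerStop != nil {
				close(innerStop)
			}
			return
		case p := <-configCh:
			if innerStop != nil {
				close(innerStop) // tear down the loop built for the previous policy
			}
			innerStop = make(chan struct{})
			go scheduleLoop(p, innerStop)
		}
	}
}

Shared informers could be constructed outside this loop and reused across policies, which is what would make each rebuild cheap.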
So I know there are a bunch of dynamic configuration mechanisms in flight right now - I thought (@thockin, @mtaufen correct me if I'm wrong) that some of the emerging consensus from work on kube-dns and others was that we would use the config file on disk, driven by kubernetes, to read content dynamically. If we are not all in alignment, when can we sync back up and get on the same page?
@smarterclayton I have been aware of those efforts and talked with mtaufen and Brian Grant at least a couple of times about the right approach to use for the scheduler. I also wrote this document to explain the pros and cons of using a config file (or ConfigMap backed by a file) vs. using a ConfigMap object directly. We eventually decided to use the ConfigMap object directly for two main reasons: 1. it does not require a sidecar in hosted solutions, 2. monitoring a ConfigMap object can be done with the existing kubernetes informers, instead of inotify, etc. More details are provided in the document.
This seems like a reasonable "future work" that is compatible with kill-and-restart as the first implementation, to let us get this useful feature out in the wild more quickly. Also, having components reload their config cleanly is something that is needed by many components, not just the scheduler, and thus is likely to result in a long debate, desire to share code, etc. So, I would advocate that this PR, with the kill-and-restart approach, is a reasonable first step.
I really dislike baking in assumptions about process exiting, even temporarily. It makes assumptions about how the scheduler is being deployed that are not universally true. Paying down the technical debt of a component that wants to be able to reload config but currently can't seems more responsible.
Is there any assumption besides that something will restart the scheduler if it fails? It seems like any reasonable deployment would have that property.
(force-pushed from ce1b9e3 to 94693d1)
It assumes scheduler process failure (and subsequent restart/refill of cached state) is cheap and expected, that config changes are rare, and that no other components are impacted by scheduler process death.
Fair enough. (Though I'd phrase the third one as "no other components are impacted in a significantly negative way by scheduler process death.") Are you aware of any current or expected deployment environments where these assumptions would not be true?
@liggitt I should also add that the approach that you suggested (taking a stop channel and reconstructing the scheduler with the new config) also assumes that scheduler recreation is cheap, and that config changes are rare.
That method allows it to be cheap by letting the informers reflecting cached cluster state be reused between scheduler configs (which they should be).
That's a fair point. The state of the cluster can be shared, but other scheduler caches, such as the equivalence cache, etc., will have to be invalidated.
Probably every dense cluster I'm aware of today would find frequent controller failover undesirable.
My problem with this approach is that we have designed this tool that works cleanly on and off cluster - config is delivered by the master cleanly to anything running in a pod. I have yet to hear a reason why the direct API approach is *better* than letting Kube deliver the file. The default assumption I'd have is "prove the API approach is sufficiently better to justify reimplementing something the kubelet already does for you".
fwiw, the setup you have in the integration test wouldn't work out of the box in a secured cluster (like the ones GCE, GKE, or kubeadm set up)... the scheduler doesn't have list/watch access to all configmaps (nor does that seem appropriate)
@liggitt The scheduler has been accessing configmaps in the integration tests, even before this PR. Well, more accurately, the scheduler has been accessing the configmap that the test creates. I don't understand why the new changes cause a problem.
@bsalamat: The following tests failed.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Right now it has kube-system scoped access to configmaps, but not *.
The API calls work in the integration test because it is not running a secured server. I was just pointing out that this is changing the access pattern to a list/watch, which won't work with default policy.
Only for purposes of accessing its leader-lock configmap. It does not have list access.
@liggitt Hmm I wonder what is the best solution to that problem. Let's say we decided to hard-code the name of the ConfigMap for scheduler policy configuration, and by default gave the scheduler GET and WATCH permission on that particular ConfigMap. That seems safe. Could we do a GET followed by a WATCH, or is LIST the only way to get the latest version number (and thus this approach wouldn't work)? It would be nice if we could solve this problem without telling people they have to add non-default scheduler permissions in the RBAC rules in order to use this feature... Will bulk watch (kubernetes/community#443) help with this? I never read that proposal in detail but skimming it now, isn't it pretty insecure because it matches on object labels, which we still don't have a great security story for (other than denying mutation of an entire object)?
@bsalamat PR needs rebase
I don't think that's a good idea, it limits custom schedulers and config rollout. List is entangled in getting the latest resource version. I really dislike exposing all configmaps for this purpose.
It should help, and should not require object labels. It will let you watch a single object without needing list permissions (the primary driver is letting nodes watch individual secrets)
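To illustrate the get-then-watch idea (not what this PR does, which uses an informer), a sketch using the client-go signatures of this era follows; the variable names and error handling are assumptions:

cm, err := client.CoreV1().ConfigMaps(ns).Get(name, metav1.GetOptions{})
if err != nil {
	glog.Fatalf("cannot read policy ConfigMap %s/%s: %v", ns, name, err)
}
w, err := client.CoreV1().ConfigMaps(ns).Watch(metav1.ListOptions{
	FieldSelector:   fields.OneTermEqualSelector("metadata.name", name).String(),
	ResourceVersion: cm.ResourceVersion,
})
if err != nil {
	glog.Fatalf("cannot watch policy ConfigMap %s/%s: %v", ns, name, err)
}
for ev := range w.ResultChan() {
	// any modification or deletion of the single watched object is a signal to exit and be restarted
	glog.Infof("policy ConfigMap event %v; exiting to reload policy", ev.Type)
	os.Exit(0)
}

This only needs get and watch on that one ConfigMap, at the cost of handling watch expiry and re-establishment by hand instead of relying on an informer.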
So the only solution right now if we want this feature to work out of the box as-is is to modify the deployment scripts to give LIST, WATCH, and GET for all ConfigMaps to the scheduler?
It would need list/watch/get within the namespace containing the configmap in order to use an informer as written in this PR. We had a similar issue with kube-dns and rather than add overly broad permissions to the default role, we worked to let kube-dns read config from mounted files. It still had the option to read directly from the API, but that wasn't the default and there was no hard-coded configmap it looks for.
Typically, a particular deployment (e.g. GCE or kubeadm, etc) would grant any additional permissions to specific resources if required for their particular configuration of a component
Right right, I meant for the kube-system namespace. (Though even that is configurable, so really for it to work out of the box we'd have to do what I said (all namespaces)...)
If we convert the scheduler to follow @mikedanese's component config proposal, follow componentconfig -> configmap, and modify the leader-lock to detect changes to the configmap, you get everything that this set out to do.
@timothysc I haven't seen that proposal yet. If I get some time, I will take a look.
Link to @mikedanese's doc: https://docs.google.com/document/d/1arP4T9Qkp2SovlJZ_y790sBeiWXDO6SG10pZ_UUU-Lc/
@bsalamat Could you please rebase and update the PR? Or shall we keep it open for a while?
@resouer Given the recent discussions on the PR, I am not sure if we are going to keep it in its current form. I will look into it soon when I get some time.
		s,
		kubecli,
		informerFactory.Core().V1().Nodes(),
		informerFactory,
giving the scheduler access to all informers for all objects seems like an antipattern. enumerating the objects it cares about lets us bound the inputs to the scheduler
		informerFactory.Extensions().V1beta1().ReplicaSets(),
		informerFactory.Apps().V1beta1().StatefulSets(),
		informerFactory.Core().V1().Services(),
		configMapInformer,
I wouldn't expect the configmap watcher to leak into the scheduler here... the scheduler shouldn't care about the source of its config struct
@liggitt I don't think we really want to move forward with this PR. I haven't had the chance to see how we can do better here. Given my commitments, I am not so sure if I can get to this for 1.9 either. If others have better ideas to address the config monitoring in the scheduler, please feel free to pick it up.
xref: #52562
Closing per bot and cleanup. Please reopen should we wish to rebase and handle.
What this PR does / why we need it: This PR adds policy ConfigMap monitoring to the scheduler. With this change, the scheduler watches its policy ConfigMap and, upon receiving an update or deletion of that ConfigMap, it aborts and exits. In usual deployments, the scheduler is expected to be recreated after it dies. When the scheduler is recreated, it picks up its new ConfigMap.
Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #41600
Special notes for your reviewer:
Release note: