
Conversation

@bsalamat
Member

What this PR does / why we need it: This PR adds policy ConfigMap monitoring to the scheduler. With this change, the scheduler watches its policy ConfigMap and, upon receiving an update or deletion of that ConfigMap, aborts and exits. In usual deployments, the scheduler is expected to be recreated after it dies; when it is recreated, it picks up the new ConfigMap.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #41600

Special notes for your reviewer:

Release note:
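For readers skimming the thread, the following is a minimal sketch of the approach described above, built from client-go's single-object list/watch; the package and function names and the exit-on-change handlers are illustrative placeholders, not the exact code in this PR.

```go
package policywatch

import (
	"os"

	"github.com/golang/glog"
	"k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchPolicyConfigMap registers handlers that make the scheduler exit when its
// policy ConfigMap is updated or deleted, so the restarted process picks up the
// new policy. (Sketch only; names and import paths are assumptions.)
func watchPolicyConfigMap(client kubernetes.Interface, namespace, name string, stopCh <-chan struct{}) {
	// List/watch only the single ConfigMap that holds the scheduler policy.
	lw := cache.NewListWatchFromClient(
		client.CoreV1().RESTClient(), "configmaps", namespace,
		fields.OneTermEqualSelector("metadata.name", name))

	informer := cache.NewSharedIndexInformer(lw, &v1.ConfigMap{}, 0, cache.Indexers{})
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		// No AddFunc: the scheduler already read (or failed on) the ConfigMap at startup.
		UpdateFunc: func(oldObj, newObj interface{}) {
			glog.Info("policy ConfigMap changed; exiting so the recreated scheduler picks it up")
			glog.Flush()
			os.Exit(0)
		},
		DeleteFunc: func(obj interface{}) {
			glog.Info("policy ConfigMap deleted; exiting")
			glog.Flush()
			os.Exit(0)
		},
	})
	go informer.Run(stopCh)
}
```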

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 22, 2017
@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Apr 22, 2017
@k8s-reviewable

This change is Reviewable

@bsalamat
Member Author

bsalamat commented Apr 22, 2017

/unassign erictune

/unassign enisoc

@bsalamat
Member Author

@kubernetes/sig-scheduling-pr-reviews


sharedIndexInformer.AddEventHandlerWithResyncPeriod(
cache.ResourceEventHandlerFuncs {
AddFunc: nil, // scheduler aborts if its policy ConfigMap does not exist. So, we do not need and "add" handler.
Contributor

s/and/an/

Contributor

also, maybe say "if its policy ConfigMap does not exist and --<whatever the flag to enable legacy mode is> is enabled"

}

func (sc *schedulerConfigurator) addPolicyConfigMap(obj interface{}) {
glog.Info("Received a request to add a scheduler policy config.")
Contributor

Please make sure the policy is logged on scheduler startup. (I didn't check; it may be already.)

Member Author

getSchedulerPolicyConfig is logging the contents at V(5) level, but I am going to change it to glog.Info.

SchedulerKillFunc()
} else {
if sc.schedulerConfig != nil {
glog.Infof("Scheduler is going to die (and restarted) in order to update its policy.")
Contributor

add something like "(ConfigMap %s/%s changed)", configmap namespace, configmap name

Member Author

Added these to the handler logs.

"k8s.io/kubernetes/test/integration/framework"
)

const defaultSchedulerPolicyConfigMap string = "scheduler-custom-policy-config"
Contributor

nit: I realize this is just a test, but this is a bit long and I think more Kubernetes-standard would be camel-case, so something like schedulerCustomPolicy

)
}

// verifyNewPolicyConfig verifies that the new object received by the ConfigMap watch is a valid policy config.
Contributor

Verifying the ConfigMap is an interesting idea (and clearly took some thought and work) but I actually think you shouldn't do it. More specifically, I think it will be confusing to the user if the scheduler is using an old ConfigMap that is cached in memory after the ConfigMap in etcd has changed. Someone who walks up to the cluster won't be able to look at the ConfigMap to know what the current scheduler behavior is. A related concern is that I think it's better to give the error about the invalid ConfigMap and die right when the ConfigMap is changed (to something invalid) and there is probably a human involved. If you let them get out of sync when the validation fails, then if the scheduler later crashes, when it restarts it will give an error but there may not be anyone around to deal with it.

Of course, on startup we should verify the ConfigMap and exit if it doesn't validate. (I assume we're doing that already; I didn't check.)

Contributor

We already verify the ConfigMap on startup; if that fails, the scheduler exits. One purpose of this check is to avoid repeatedly restarting the scheduler when the ConfigMap is invalid after an update.

Anyway, I think we should add an event (and a log) to show whether the new config was accepted.

Member Author

As Klaus mentioned, if we don't check the validity before restarting, scheduler will crash-loop, which may be acceptable if we assume there is a human watching the cluster.

Contributor

I think crash looping when a human pushes a bad config is better than crash looping later if the scheduler crashes on its own due to a bug or machine problem or whatever, when there may be no human around.
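For context on what such a check amounts to, the validation step could be as small as the sketch below; the policySpec struct and the "policy.cfg" data key are assumptions made for illustration, not the PR's actual types.

```go
package policycheck

import (
	"encoding/json"
	"fmt"

	"k8s.io/api/core/v1"
)

// policySpec is a pared-down stand-in for the scheduler's Policy type, used
// only to show the shape of a pre-restart validity check.
type policySpec struct {
	Kind       string            `json:"kind"`
	Predicates []json.RawMessage `json:"predicates"`
	Priorities []json.RawMessage `json:"priorities"`
}

// validatePolicyConfigMap reports whether the ConfigMap carries parseable
// policy JSON under the assumed "policy.cfg" data key.
func validatePolicyConfigMap(cm *v1.ConfigMap) error {
	raw, ok := cm.Data["policy.cfg"]
	if !ok {
		return fmt.Errorf("ConfigMap %s/%s has no policy.cfg key", cm.Namespace, cm.Name)
	}
	var p policySpec
	if err := json.Unmarshal([]byte(raw), &p); err != nil {
		return fmt.Errorf("invalid scheduler policy in %s/%s: %v", cm.Namespace, cm.Name, err)
	}
	return nil
}
```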

glog.Infof("Scheduler is going to die (and restarted) in order to update its policy.")
close(sc.schedulerConfig.StopEverything)
// The sleep is only to allow cleanups to happen.
time.Sleep(2 * time.Second)
Contributor

Write a comment about how you chose 2 seconds? (Even if it was just chosen randomly...)

glog.Flush()
os.Exit(0)
}
glog.Infof("Scheduler is not going to exit, as it doesn't seem to be initialized yet.")
Contributor

I'm curious about this case. Did you observe this in reality? Would it be better to just exit in this case too?

Member Author

I didn't observe this for updates, but I had initially added a handler for the addition of the ConfigMap as well. Kube-scheduler was sometimes getting the add notification before the scheduler was up. In that scenario, the addition event was being received asynchronously and killing the scheduler for that scenario wouldn't work, but now that we don't monitor the addition of the ConfigMap, I guess we can kill the scheduler regardless.

t.Error("Scheduler killer function is expected to be called upon update to its policy ConfigMap.")
}

// Scheduler should be killed when its policy ConfigMap is deleted.
Contributor

for completeness, can you test both deletion and update? (even if you just copy the first half of this test into a new test and change it to update)

Member Author

The test tests both update and restart. Well, the update part needs a comment, but I can split the two.

informerFactory.Core().V1().Services(),
eventBroadcaster.NewRecorder(api.Scheme, clientv1.EventSource{Component: v1.DefaultSchedulerName}),
)
sched, err := app.CreateScheduler(ss, clientSet, informerFactory, eventBroadcaster.NewRecorder(api.Scheme, clientv1.EventSource{Component: v1.DefaultSchedulerName}))
Contributor

you left some extra whitespace in the middle of the line

Member Author

I don't see the extra whitespace. What is the word before the space?

Member Author

Oh, I guess you are referring to what is shown here before eventBroadcaster. It is not in the code. Maybe it is a GitHub issue with how diffs are shown.

@davidopp davidopp assigned davidopp and unassigned enisoc and erictune Apr 23, 2017
@bsalamat bsalamat force-pushed the config_monitor branch 2 times, most recently from e519f57 to ce1b9e3 on April 25, 2017 00:52
@wanghaoran1988
Contributor

/cc @aveshagarwal @jayunit100 Any thoughts on killing the scheduler process when the ConfigMap changes?


func (sc *schedulerConfigurator) SetupPolicyConfigMapEventHandlers(client clientset.Interface, informerFactory informers.SharedInformerFactory) {
// selector targets only the scheduler's policy ConfigMap.
selector := cache.NewListWatchFromClient(client.CoreV1().RESTClient(), "configmaps", sc.policyConfigMapNamespace, fields.OneTermEqualSelector(api.ObjectNameField, string(sc.policyConfigMap)))
Member

it seems very odd to go to all the trouble to make the scheduler watch the API, only as a signal to kill itself. why not just poll the specified config file if all it is doing is looking for changes in order to exit?

Member Author

@bsalamat bsalamat Apr 25, 2017

There is no config file. The new, preferred way of configuring scheduler policies is to use a ConfigMap object. Use of a ConfigMap object lets us use an API call to update scheduler configuration.

Member

There is no config file. The new, preferred way of configuring scheduler policies is to use a ConfigMap object.

ConfigMaps are designed to be able to be consumed as files. Tightly coupling the config source to the API limits flexibility.

Member Author

I tried to go over the advantages and disadvantages of using a file vs. a ConfigMap in this design doc. Given that we still support legacy config files as well, we thought using a ConfigMap would be simpler to implement and would not sacrifice flexibility. That part is already merged as a different PR, but if there are concerns that are not addressed, I will be happy to have a meeting to discuss the options.

@liggitt
Member

liggitt commented Apr 25, 2017

rather than making the kill-and-restart approach a fundamental part of the scheduler, we should start working toward the scheduler being able to reload from config cleanly.

This is a good opportunity to refactor the piece that depends on the changeable config into an isolated function that either takes a stop channel (and can therefore be cleanly stopped/reconstructed) or takes a config channel (and can therefore be cleanly fed a replacement config)
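A rough sketch of the channel-driven shape being suggested; Config, runScheduler, and runWithReload are placeholders rather than existing scheduler APIs.

```go
package reload

// Config stands in for whatever struct would hold the scheduler's
// policy-derived configuration (placeholder, not an existing type).
type Config struct{}

// runScheduler stands in for constructing and running one scheduler instance
// until its stop channel is closed (placeholder, not an existing function).
func runScheduler(cfg *Config, stop <-chan struct{}) { <-stop }

// runWithReload rebuilds the scheduler in-process whenever a new config
// arrives, instead of exiting; informers created outside this loop keep their
// caches across config changes.
func runWithReload(initial *Config, configCh <-chan *Config, stopCh <-chan struct{}) {
	cfg := initial
	for {
		instanceStop := make(chan struct{})
		go runScheduler(cfg, instanceStop)

		select {
		case newCfg := <-configCh:
			// A replacement policy arrived: tear down only this scheduler
			// instance and loop to start one with the new config.
			close(instanceStop)
			cfg = newCfg
		case <-stopCh:
			close(instanceStop)
			return
		}
	}
}
```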

@smarterclayton
Contributor

So I know there are a bunch of dynamic configuration mechanisms in flight right now - I thought (@thockin, @mtaufen correct me if I'm wrong) that some of the emerging consensus from work on kube-dns and others was that we would use the config file on disk, driven by kubernetes, to read content dynamically. If we are not all in alignment, when can we sync back up and get on the same page?

@bsalamat
Member Author

@smarterclayton I have been aware of those efforts and talked with mtaufen and Brian Grant at least a couple of times about the right approach to use for the scheduler. I also wrote this document to explain the pros and cons of using a config file (or a ConfigMap backed by a file) vs. using a ConfigMap object directly. We eventually decided to use the ConfigMap object directly for two main reasons: 1. it does not require a sidecar in hosted solutions, 2. monitoring a ConfigMap object can be done with the existing Kubernetes informers, instead of inotify, etc. More details are provided in the document.

@davidopp
Contributor

rather than making the kill-and-restart approach a fundamental part of the scheduler, we should start working toward the scheduler being able to reload from config cleanly.

This seems like a reasonable "future work" that is compatible with kill-and-restart as the first implementation, to let us get this useful feature out in the wild more quickly. Also, having components reload their config cleanly is something that is needed by many components, not just the scheduler, and thus is likely to result in a long debate, desire to share code, etc. So, I would advocate that this PR, with the kill-and-restart approach, is a reasonable first step.

@liggitt
Member

liggitt commented Apr 26, 2017

I really dislike baking in assumptions about process exiting, even temporarily. It makes assumptions about how the scheduler is being deployed that are not universally true. Paying down the technical debt of a component that wants to be able to reload config but currently can't seems more responsible.

@davidopp
Contributor

It makes assumptions about how the scheduler is being deployed that are not universally true.

Is there any assumption besides that something will restart the scheduler if it fails? It seems like any reasonable deployment would have that property.

@liggitt
Member

liggitt commented Apr 26, 2017

It assumes scheduler process failure (and subsequent restart/refill of cached state) is cheap and expected, that config changes are rare, and that no other components are impacted by scheduler process death.

@davidopp
Contributor

It assumes scheduler process failure is cheap and expected, that config changes are rare, and that no other components are impacted by scheduler process death.

Fair enough. (Though I'd phrase the third one as "no other components are impacted in a significantly negative way by scheduler process death.")

Are you aware of any current or expected deployment environments where these assumptions would not be true?

@bsalamat
Member Author

bsalamat commented Apr 26, 2017

@liggitt I should also add that the approach that you suggested (taking a stop channel and reconstructing the scheduler with the new config) also assumes that scheduler recreation is cheap, and that config changes are rare.

@liggitt
Member

liggitt commented Apr 26, 2017

I should also add that the approach that you suggested (taking a stop channel and reconstructing the scheduler with the new config) also assumes that scheduler recreation is cheap, and that config changes are rare.

That method allows it to be cheap by letting the informers reflecting cached cluster state be reused between scheduler configs (which they should be)

@bsalamat
Member Author

bsalamat commented Apr 26, 2017

That method allows it to be cheap by letting the informers reflecting cached cluster state be reused between schedulers (which they should be)

That's a fair point. The state of the cluster can be shared, but other scheduler caches, such as equivalence cache, etc., will have to be invalidated.
I will take another look to see how hard it is to reconstruct the scheduler without killing the process.

@smarterclayton
Contributor

smarterclayton commented Apr 26, 2017 via email

@bsalamat
Member Author

@davidopp @liggitt
Thanks again for your comments! I fixed the code and added a new informer for the policy config. PTAL.

@liggitt
Member

liggitt commented Jul 18, 2017

fwiw, the setup you have in the integration test wouldn't work out of the box in a secured cluster (like the ones GCE, GKE, or kubeadm set up)... the scheduler doesn't have list/watch access to all configmaps (nor does that seem appropriate)

@bsalamat
Member Author

@liggitt The scheduler has been accessing ConfigMaps in the integration tests, even before this PR. Well, more accurately, the scheduler has been accessing the ConfigMap that the test creates. I don't understand why the new changes cause a problem.

@k8s-ci-robot
Contributor

k8s-ci-robot commented Jul 18, 2017

@bsalamat: The following tests failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
pull-kubernetes-kubemark-e2e-gce | 9ba2d2f | link | /test pull-kubernetes-kubemark-e2e-gce
pull-kubernetes-e2e-gce-etcd3 | 9ba2d2f | link | /test pull-kubernetes-e2e-gce-etcd3

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@timothysc
Contributor

timothysc commented Jul 18, 2017

Right now it has kube-system scoped access to configmaps, but not *.
1fb55a5

@liggitt
Member

liggitt commented Jul 18, 2017

The API calls work in the integration test because it is not running a secured server. I was just pointing out that this is changing the access pattern to a list/watch, which won't work with default policy.

Right now it has kube-system scoped access to configmaps, but not *.

Only for purposes of accessing its leader-lock configmap. It does not have list access.

@davidopp
Contributor

@liggitt Hmm I wonder what is the best solution to that problem.

Let's say we decided to hard-code the name of the ConfigMap for scheduler policy configuration, and by default gave the scheduler GET and WATCH permission on that particular ConfigMap. That seems safe. Could we do a GET followed by a WATCH, or is LIST the only way to get the latest version number (and thus this approach wouldn't work)?

It would be nice if we could solve this problem without telling people they have to add non-default scheduler permissions in the RBAC rules in order to use this feature...

Will bulk watch (kubernetes/community#443) help with this? I never read that proposal in detail but skimming it now, isn't it pretty insecure because it matches on object labels, which we still don't have a great security story for (other than denying mutation of an entire object)?
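Mechanically, the GET-followed-by-WATCH idea looks like the sketch below (written against a recent client-go signature); whether default RBAC policy allows the watch without LIST is the separate question being discussed here.

```go
package singlewatch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
)

// watchSingleConfigMap gets one ConfigMap to learn its resourceVersion, then
// watches only that object via a field selector, so no LIST call is made.
func watchSingleConfigMap(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	cm, err := client.CoreV1().ConfigMaps(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	w, err := client.CoreV1().ConfigMaps(namespace).Watch(ctx, metav1.ListOptions{
		FieldSelector:   fields.OneTermEqualSelector("metadata.name", name).String(),
		ResourceVersion: cm.ResourceVersion,
	})
	if err != nil {
		return err
	}
	defer w.Stop()
	for event := range w.ResultChan() {
		fmt.Printf("configmap %s/%s: %s event\n", namespace, name, event.Type)
	}
	return nil
}
```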

@k8s-github-robot

@bsalamat PR needs rebase

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 18, 2017
@liggitt
Member

liggitt commented Jul 18, 2017

Let's say we decided to hard-code the name of the ConfigMap for scheduler policy configuration

I don't think that's a good idea; it limits custom schedulers and config rollout.

List is entangled in getting the latest resource version. I really dislike exposing all configmaps for this purpose.

Will bulk watch (kubernetes/community#443) help with this? I never read that proposal in detail but skimming it now, isn't it pretty insecure because it matches on object labels

It should help, and should not require object labels. It will let you watch a single object without needing list permissions (the primary driver is letting nodes watch individual secrets)

@davidopp
Contributor

So the only solution right now if we want this feature to work out of the box as-is is to modify the deployment scripts to give LIST, WATCH, and GET for all ConfigMaps to the scheduler?

@liggitt
Member

liggitt commented Jul 18, 2017

So the only solution right now if we want this feature to work out of the box as-is is to modify the deployment scripts to give LIST, WATCH, and GET for all ConfigMaps to the scheduler?

It would need list/watch/get within the namespace containing the configmap in order to use an informer as written in this PR.

We had a similar issue with kube-dns and rather than add overly broad permissions to the default role, we worked to let kube-dns read config from mounted files. It still had the option to read directly from the API, but that wasn't the default and there was no hard-coded configmap it looks for.
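As a point of comparison, the mounted-file pattern mentioned above for kube-dns can be as small as hashing the projected file and reacting when it changes; the path, polling interval, and exit-on-change behavior below are illustrative only.

```go
package policyfile

import (
	"bytes"
	"crypto/sha256"
	"io/ioutil"
	"log"
	"os"
	"time"
)

// pollMountedPolicy watches a policy file projected from a ConfigMap volume
// and exits when its contents change, mirroring the kill-and-restart idea
// without requiring any API-server watch permissions.
func pollMountedPolicy(path string, interval time.Duration, stopCh <-chan struct{}) {
	hash := func() []byte {
		data, err := ioutil.ReadFile(path)
		if err != nil {
			log.Fatalf("cannot read policy file %s: %v", path, err)
		}
		sum := sha256.Sum256(data)
		return sum[:]
	}
	last := hash()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if cur := hash(); !bytes.Equal(cur, last) {
				log.Printf("policy file %s changed; exiting for restart", path)
				os.Exit(0)
			}
		case <-stopCh:
			return
		}
	}
}
```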

@liggitt
Member

liggitt commented Jul 18, 2017

Typically, a particular deployment (e.g. GCE or kubeadm, etc) would grant any additional permissions to specific resources if required for their particular configuration of a component

@davidopp
Contributor

It would need list/watch/get within the namespace containing the configmap in order to use an informer as written in this PR.

Right right, I meant for kube-system namespace. (Though even that is configurable, so really for it to work out of the box we'd have to do what I said (all namespaces)...)

@timothysc
Contributor

If we convert the scheduler to follow @mikedanese's component config proposal, follow componentconfig -> configmap ->, and modify the leader lock to detect changes to the ConfigMap, you get everything that this set out to do.

@bsalamat
Member Author

@timothysc I haven't seen that proposal yet. If I get some time, I will take a look.

@mtaufen
Contributor

mtaufen commented Jul 19, 2017 via email

@resouer
Contributor

resouer commented Aug 19, 2017

@bsalamat Could you please rebase and update the PR? Or shall we keep it open for a while?

@bsalamat
Member Author

@resouer Given the recent discussions on the PR, I am not sure if we are going to keep it in its current form. I will look into it soon when I get some time.

@calebamiles calebamiles modified the milestone: v1.8 Sep 2, 2017
s,
kubecli,
informerFactory.Core().V1().Nodes(),
informerFactory,
Member

Giving the scheduler access to all informers for all objects seems like an antipattern. Enumerating the objects it cares about lets us bound the inputs to the scheduler.

informerFactory.Extensions().V1beta1().ReplicaSets(),
informerFactory.Apps().V1beta1().StatefulSets(),
informerFactory.Core().V1().Services(),
configMapInformer,
Member

I wouldn't expect the configmap watcher to leak into the scheduler here... the scheduler shouldn't care about the source of its config struct

@bsalamat
Member Author

@liggitt I don't think we really want to move forward with this PR. I haven't had the chance to see how we can do better here. Given my commitments, I am not so sure if I can get to this for 1.9 either.

If others have better ideas to address the config monitoring in scheduler, please feel free to pick it up.

@timothysc
Contributor

xref: #52562

@k8s-github-robot

This PR hasn't been active in 61 days. It will be closed in 28 days (Dec 17, 2017).

cc @bsalamat @davidopp

You can add 'keep-open' label to prevent this from happening, or add a comment to keep it open another 90 days

@timothysc
Contributor

Closing per bot and cleanup. Please reopen should we wish to rebase and handle.

@timothysc timothysc closed this Jan 3, 2018

Labels

approved - Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
needs-rebase - Indicates a PR cannot be merged because it has merge conflicts with HEAD.
release-note - Denotes a PR that will be considered when it comes time to generate release notes.
sig/scheduling - Categorizes an issue or PR as relevant to SIG Scheduling.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow dynamic change of scheduler's policy configuration