
Conversation

@bsalamat
Member

What this PR does / why we need it: This PR adds policy ConfigMap monitoring to the scheduler. With this change, the scheduler watches its policy ConfigMap and, upon receiving an update or deletion of that ConfigMap, aborts and exits. In usual deployments, the scheduler is expected to be recreated after it dies; when it is recreated, it picks up the new ConfigMap.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #41600

Special notes for your reviewer:

Release note:
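For readers skimming the thread, the following is a minimal sketch of the approach described above, built from client-go's single-object list/watch; the package and function names and the exit-on-change handlers are illustrative placeholders, not the exact code in this PR.

```go
package policywatch

import (
	"os"

	"github.com/golang/glog"
	"k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchPolicyConfigMap registers handlers that make the scheduler exit when its
// policy ConfigMap is updated or deleted, so the restarted process picks up the
// new policy. (Sketch only; names and import paths are assumptions.)
func watchPolicyConfigMap(client kubernetes.Interface, namespace, name string, stopCh <-chan struct{}) {
	// List/watch only the single ConfigMap that holds the scheduler policy.
	lw := cache.NewListWatchFromClient(
		client.CoreV1().RESTClient(), "configmaps", namespace,
		fields.OneTermEqualSelector("metadata.name", name))

	informer := cache.NewSharedIndexInformer(lw, &v1.ConfigMap{}, 0, cache.Indexers{})
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		// No AddFunc: the scheduler already read (or failed on) the ConfigMap at startup.
		UpdateFunc: func(oldObj, newObj interface{}) {
			glog.Info("policy ConfigMap changed; exiting so the recreated scheduler picks it up")
			glog.Flush()
			os.Exit(0)
		},
		DeleteFunc: func(obj interface{}) {
			glog.Info("policy ConfigMap deleted; exiting")
			glog.Flush()
			os.Exit(0)
		},
	})
	go informer.Run(stopCh)
}
```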

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 22, 2017
@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Apr 22, 2017
@k8s-reviewable

This change is Reviewable

@bsalamat
Member Author

bsalamat commented Apr 22, 2017

/unassign erictune

/unassign enisoc

@bsalamat
Member Author

@kubernetes/sig-scheduling-pr-reviews


sharedIndexInformer.AddEventHandlerWithResyncPeriod(
cache.ResourceEventHandlerFuncs {
AddFunc: nil, // scheduler aborts if its policy ConfigMap does not exist. So, we do not need and "add" handler.
Contributor

s/and/an/

Contributor

also, maybe say "if its policy ConfigMap does not exist and --<whatever the flag to enable legacy mode is> is enabled"

}

func (sc *schedulerConfigurator) addPolicyConfigMap(obj interface{}) {
glog.Info("Received a request to add a scheduler policy config.")
Contributor

Please make sure the policy is logged on scheduler startup. (I didn't check; it may be already.)

Member Author

getSchedulerPolicyConfig is logging the contents at V(5) level, but I am going to change it to glog.Info.

SchedulerKillFunc()
} else {
if sc.schedulerConfig != nil {
glog.Infof("Scheduler is going to die (and restarted) in order to update its policy.")
Contributor

add something like "(ConfigMap %s/%s changed)", configmap namespace, configmap name

Member Author

Added these to the handler logs.

"k8s.io/kubernetes/test/integration/framework"
)

const defaultSchedulerPolicyConfigMap string = "scheduler-custom-policy-config"
Contributor

nit: I realize this is just a test, but this is a bit long and I think more Kubernetes-standard would be camel-case, so something like schedulerCustomPolicy

)
}

// verifyNewPolicyConfig verifies that the new object received by the ConfigMap watch is a valid policy config.
Contributor

Verifying the ConfigMap is an interesting idea (and clearly took some thought and work) but I actually think you shouldn't do it. More specifically, I think it will be confusing to the user if the scheduler is using an old ConfigMap that is cached in memory after the ConfigMap in etcd has changed. Someone who walks up to the cluster won't be able to look at the ConfigMap to know what the current scheduler behavior is. A related concern is that I think it's better to give the error about the invalid ConfigMap and die right when the ConfigMap is changed (to something invalid) and there is probably a human involved. If you let them get out of sync when the validation fails, then if the scheduler later crashes, when it restarts it will give an error but there may not be anyone around to deal with it.

Of course, on startup we should verify the ConfigMap and exit if it doesn't validate. (I assume we're doing that already; I didn't check.)

Contributor

We already verify the ConfigMap on startup; if that fails, the scheduler exits. One purpose of this check is to avoid repeatedly restarting the scheduler when the ConfigMap is invalid after an update.

Anyway, I think we should add an event (and a log) to show whether the new config was accepted.

Member Author

As Klaus mentioned, if we don't check the validity before restarting, scheduler will crash-loop, which may be acceptable if we assume there is a human watching the cluster.

Contributor

I think crash looping when a human pushes a bad config is better than crash looping later if the scheduler crashes on its own due to a bug or machine problem or whatever, when there may be no human around.
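For context on what such a check amounts to, the validation step could be as small as the sketch below; the policySpec struct and the "policy.cfg" data key are assumptions made for illustration, not the PR's actual types.

```go
package policycheck

import (
	"encoding/json"
	"fmt"

	"k8s.io/api/core/v1"
)

// policySpec is a pared-down stand-in for the scheduler's Policy type, used
// only to show the shape of a pre-restart validity check.
type policySpec struct {
	Kind       string            `json:"kind"`
	Predicates []json.RawMessage `json:"predicates"`
	Priorities []json.RawMessage `json:"priorities"`
}

// validatePolicyConfigMap reports whether the ConfigMap carries parseable
// policy JSON under the assumed "policy.cfg" data key.
func validatePolicyConfigMap(cm *v1.ConfigMap) error {
	raw, ok := cm.Data["policy.cfg"]
	if !ok {
		return fmt.Errorf("ConfigMap %s/%s has no policy.cfg key", cm.Namespace, cm.Name)
	}
	var p policySpec
	if err := json.Unmarshal([]byte(raw), &p); err != nil {
		return fmt.Errorf("invalid scheduler policy in %s/%s: %v", cm.Namespace, cm.Name, err)
	}
	return nil
}
```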

glog.Infof("Scheduler is going to die (and restarted) in order to update its policy.")
close(sc.schedulerConfig.StopEverything)
// The sleep is only to allow cleanups to happen.
time.Sleep(2 * time.Second)
Contributor

Write a comment about how you chose 2 seconds? (Even if it was just chosen randomly...)

glog.Flush()
os.Exit(0)
}
glog.Infof("Scheduler is not going to exit, as it doesn't seem to be initialized yet.")
Contributor

I'm curious about this case. Did you observe this in reality? Would it be better to just exit in this case too?

Member Author

I didn't observe this for updates, but I had initially added a handler for the addition of the ConfigMap as well. Kube-scheduler was sometimes getting the add notification before the scheduler was up. In that scenario, the addition event was being received asynchronously and killing the scheduler for that scenario wouldn't work, but now that we don't monitor the addition of the ConfigMap, I guess we can kill the scheduler regardless.

t.Error("Scheduler killer function is expected to be called upon update to its policy ConfigMap.")
}

// Scheduler should be killed when its policy ConfigMap is deleted.
Contributor

for completeness, can you test both deletion and update? (even if you just copy the first half of this test into a new test and change it to update)

Member Author

The test tests both update and restart. Well, the update part needs a comment, but I can split the two.

informerFactory.Core().V1().Services(),
eventBroadcaster.NewRecorder(api.Scheme, clientv1.EventSource{Component: v1.DefaultSchedulerName}),
)
sched, err := app.CreateScheduler(ss, clientSet, informerFactory, eventBroadcaster.NewRecorder(api.Scheme, clientv1.EventSource{Component: v1.DefaultSchedulerName}))
Contributor

you left some extra whitespace in the middle of the line

Member Author

I don't see the extra whitespace. What is the word before the space?

Member Author

Oh, I guess you are referring to what is shown here before eventBroadcaster. It is not in the code. Maybe it is a GitHub issue with how diffs are shown.

@davidopp davidopp assigned davidopp and unassigned enisoc and erictune Apr 23, 2017
@bsalamat bsalamat force-pushed the config_monitor branch 2 times, most recently from e519f57 to ce1b9e3 on April 25, 2017 00:52
@wanghaoran1988
Contributor

/cc @aveshagarwal @jayunit100 Any thoughts on killing the scheduler process when the ConfigMap changes?


func (sc *schedulerConfigurator) SetupPolicyConfigMapEventHandlers(client clientset.Interface, informerFactory informers.SharedInformerFactory) {
// selector targets only the scheduler's policy ConfigMap.
selector := cache.NewListWatchFromClient(client.CoreV1().RESTClient(), "configmaps", sc.policyConfigMapNamespace, fields.OneTermEqualSelector(api.ObjectNameField, string(sc.policyConfigMap)))
Member

it seems very odd to go to all the trouble to make the scheduler watch the API, only as a signal to kill itself. why not just poll the specified config file if all it is doing is looking for changes in order to exit?

Member Author

@bsalamat bsalamat Apr 25, 2017

There is no config file. The new, preferred way of configuring scheduler policies is to use a ConfigMap object. Use of a ConfigMap object lets us use an API call to update scheduler configuration.

Member

There is no config file. The new, preferred way of configuring scheduler policies is to use a ConfigMap object.

ConfigMaps are designed to be able to be consumed as files. Tightly coupling the config source to the API limits flexibility.

Member Author

I tried to go over the advantages and disadvantages of using a file vs. a ConfigMap in this design doc. Given that we still support legacy config files as well, we thought using a ConfigMap would be simpler to implement and would not sacrifice flexibility. That part is already merged as a different PR, but if there are concerns that are not addressed, I will be happy to have a meeting to discuss the options.

@liggitt
Member

liggitt commented Apr 25, 2017

rather than making the kill-and-restart approach a fundamental part of the scheduler, we should start working toward the scheduler being able to reload from config cleanly.

This is a good opportunity to refactor the piece that depends on the changeable config into an isolated function that either takes a stop channel (and can therefore be cleanly stopped/reconstructed) or takes a config channel (and can therefore be cleanly fed a replacement config)
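A rough sketch of the channel-driven shape being suggested; Config, runScheduler, and runWithReload are placeholders rather than existing scheduler APIs.

```go
package reload

// Config stands in for whatever struct would hold the scheduler's
// policy-derived configuration (placeholder, not an existing type).
type Config struct{}

// runScheduler stands in for constructing and running one scheduler instance
// until its stop channel is closed (placeholder, not an existing function).
func runScheduler(cfg *Config, stop <-chan struct{}) { <-stop }

// runWithReload rebuilds the scheduler in-process whenever a new config
// arrives, instead of exiting; informers created outside this loop keep their
// caches across config changes.
func runWithReload(initial *Config, configCh <-chan *Config, stopCh <-chan struct{}) {
	cfg := initial
	for {
		instanceStop := make(chan struct{})
		go runScheduler(cfg, instanceStop)

		select {
		case newCfg := <-configCh:
			// A replacement policy arrived: tear down only this scheduler
			// instance and loop to start one with the new config.
			close(instanceStop)
			cfg = newCfg
		case <-stopCh:
			close(instanceStop)
			return
		}
	}
}
```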

@smarterclayton
Contributor

So I know there are a bunch of dynamic configuration mechanisms in flight right now - I thought (@thockin, @mtaufen correct me if I'm wrong) that some of the emerging consensus from work on kube-dns and others was that we would use the config file on disk, driven by kubernetes, to read content dynamically. If we are not all in alignment, when can we sync back up and get on the same page?

@bsalamat
Member Author

@smarterclayton I have been aware of those efforts and talked with mtaufen and Brian Grant at least a couple of times about the right approach to use for the scheduler. I also wrote this document to explain the pros and cons of using a config file (or a ConfigMap backed by a file) vs. using a ConfigMap object directly. We eventually decided to use the ConfigMap object directly for two main reasons: 1. it does not require a sidecar in hosted solutions, 2. monitoring a ConfigMap object can be done with the existing Kubernetes informers, instead of inotify, etc. More details are provided in the document.

@davidopp
Contributor

rather than making the kill-and-restart approach a fundamental part of the scheduler, we should start working toward the scheduler being able to reload from config cleanly.

This seems like a reasonable "future work" that is compatible with kill-and-restart as the first implementation, to let us get this useful feature out in the wild more quickly. Also, having components reload their config cleanly is something that is needed by many components, not just the scheduler, and thus is likely to result in a long debate, desire to share code, etc. So, I would advocate that this PR, with the kill-and-restart approach, is a reasonable first step.

@liggitt
Member

liggitt commented Apr 26, 2017

I really dislike baking in assumptions about process exiting, even temporarily. It makes assumptions about how the scheduler is being deployed that are not universally true. Paying down the technical debt of a component that wants to be able to reload config but currently can't seems more responsible.

@davidopp
Contributor

It makes assumptions about how the scheduler is being deployed that are not universally true.

Is there any assumption besides that something will restart the scheduler if it fails? It seems like any reasonable deployment would have that property.

@liggitt
Member

liggitt commented Apr 26, 2017

It assumes scheduler process failure (and subsequent restart/refill of cached state) is cheap and expected, that config changes are rare, and that no other components are impacted by scheduler process death.

@davidopp
Contributor

It assumes scheduler process failure is cheap and expected, that config changes are rare, and that no other components are impacted by scheduler process death.

Fair enough. (Though I'd phrase the third one as "no other components are impacted in a significantly negative way by scheduler process death.")

Are you aware of any current or expected deployment environments where these assumptions would not be true?

@bsalamat
Member Author

bsalamat commented Apr 26, 2017

@liggitt I should also add that the approach that you suggested (taking a stop channel and reconstructing the scheduler with the new config) also assumes that scheduler recreation is cheap, and that config changes are rare.

@liggitt
Member

liggitt commented Apr 26, 2017

I should also add that the approach that you suggested (taking a stop channel and reconstructing the scheduler with the new config) also assumes that scheduler recreation is cheap, and that config changes are rare.

That method allows it to be cheap by letting the informers reflecting cached cluster state be reused between scheduler configs (which they should be)

@bsalamat
Member Author

bsalamat commented Apr 26, 2017

That method allows it to be cheap by letting the informers reflecting cached cluster state be reused between schedulers (which they should be)

That's a fair point. The state of the cluster can be shared, but other scheduler caches, such as equivalence cache, etc., will have to be invalidated.
I will take another look to see how hard it is to reconstruct the scheduler without killing the process.

@smarterclayton
Contributor

smarterclayton commented Apr 26, 2017 via email

@bsalamat
Member Author

@davidopp @liggitt
Thanks again for your comments! I fixed the code and added a new informer for the policy config. PTAL.

@liggitt
Member

liggitt commented Jul 18, 2017

fwiw, the setup you have in the integration test wouldn't work out of the box in a secured cluster (like the ones GCE, GKE, or kubeadm set up)... the scheduler doesn't have list/watch access to all configmaps (nor does that seem appropriate)

@bsalamat
Member Author

@liggitt The scheduler has been accessing ConfigMaps in the integration tests, even before this PR. Well, more accurately, the scheduler has been accessing the ConfigMap that the test creates. I don't understand why the new changes cause a problem.

@k8s-ci-robot
Contributor

k8s-ci-robot commented Jul 18, 2017

@bsalamat: The following tests failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
pull-kubernetes-kubemark-e2e-gce | 9ba2d2f | link | /test pull-kubernetes-kubemark-e2e-gce
pull-kubernetes-e2e-gce-etcd3 | 9ba2d2f | link | /test pull-kubernetes-e2e-gce-etcd3

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@timothysc
Contributor

timothysc commented Jul 18, 2017

Right now it has kube-system scoped access to configmaps, but not *.
1fb55a5

@liggitt
Member

liggitt commented Jul 18, 2017

The API calls work in the integration test because it is not running a secured server. I was just pointing out that this is changing the access pattern to a list/watch, which won't work with default policy.

Right now it has kube-system scoped access to configmaps, but not *.

Only for purposes of accessing its leader-lock configmap. It does not have list access.

@davidopp
Contributor

@liggitt Hmm I wonder what is the best solution to that problem.

Let's say we decided to hard-code the name of the ConfigMap for scheduler policy configuration, and by default gave the scheduler GET and WATCH permission on that particular ConfigMap. That seems safe. Could we do a GET followed by a WATCH, or is LIST the only way to get the latest version number (and thus this approach wouldn't work)?

It would be nice if we could solve this problem without telling people they have to add non-default scheduler permissions in the RBAC rules in order to use this feature...

Will bulk watch (kubernetes/community#443) help with this? I never read that proposal in detail but skimming it now, isn't it pretty insecure because it matches on object labels, which we still don't have a great security story for (other than denying mutation of an entire object)?
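Mechanically, the GET-followed-by-WATCH idea looks like the sketch below (written against a recent client-go signature); whether default RBAC policy allows the watch without LIST is the separate question being discussed here.

```go
package singlewatch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
)

// watchSingleConfigMap gets one ConfigMap to learn its resourceVersion, then
// watches only that object via a field selector, so no LIST call is made.
func watchSingleConfigMap(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	cm, err := client.CoreV1().ConfigMaps(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	w, err := client.CoreV1().ConfigMaps(namespace).Watch(ctx, metav1.ListOptions{
		FieldSelector:   fields.OneTermEqualSelector("metadata.name", name).String(),
		ResourceVersion: cm.ResourceVersion,
	})
	if err != nil {
		return err
	}
	defer w.Stop()
	for event := range w.ResultChan() {
		fmt.Printf("configmap %s/%s: %s event\n", namespace, name, event.Type)
	}
	return nil
}
```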

@k8s-github-robot

@bsalamat PR needs rebase

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 18, 2017
@liggitt
Member

liggitt commented Jul 18, 2017

Let's say we decided to hard-code the name of the ConfigMap for scheduler policy configuration

I don't think that's a good idea; it limits custom schedulers and config rollout.

List is entangled in getting the latest resource version. I really dislike exposing all configmaps for this purpose.

Will bulk watch (kubernetes/community#443) help with this? I never read that proposal in detail but skimming it now, isn't it pretty insecure because it matches on object labels

It should help, and should not require object labels. It will let you watch a single object without needing list permissions (the primary driver is letting nodes watch individual secrets)

@davidopp
Contributor

So the only solution right now if we want this feature to work out of the box as-is is to modify the deployment scripts to give LIST, WATCH, and GET for all ConfigMaps to the scheduler?

@liggitt
Member

liggitt commented Jul 18, 2017

So the only solution right now if we want this feature to work out of the box as-is is to modify the deployment scripts to give LIST, WATCH, and GET for all ConfigMaps to the scheduler?

It would need list/watch/get within the namespace containing the configmap in order to use an informer as written in this PR.

We had a similar issue with kube-dns and rather than add overly broad permissions to the default role, we worked to let kube-dns read config from mounted files. It still had the option to read directly from the API, but that wasn't the default and there was no hard-coded configmap it looks for.
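As a point of comparison, the mounted-file pattern mentioned above for kube-dns can be as small as hashing the projected file and reacting when it changes; the path, polling interval, and exit-on-change behavior below are illustrative only.

```go
package policyfile

import (
	"bytes"
	"crypto/sha256"
	"io/ioutil"
	"log"
	"os"
	"time"
)

// pollMountedPolicy watches a policy file projected from a ConfigMap volume
// and exits when its contents change, mirroring the kill-and-restart idea
// without requiring any API-server watch permissions.
func pollMountedPolicy(path string, interval time.Duration, stopCh <-chan struct{}) {
	hash := func() []byte {
		data, err := ioutil.ReadFile(path)
		if err != nil {
			log.Fatalf("cannot read policy file %s: %v", path, err)
		}
		sum := sha256.Sum256(data)
		return sum[:]
	}
	last := hash()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if cur := hash(); !bytes.Equal(cur, last) {
				log.Printf("policy file %s changed; exiting for restart", path)
				os.Exit(0)
			}
		case <-stopCh:
			return
		}
	}
}
```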

@liggitt
Member

liggitt commented Jul 18, 2017

Typically, a particular deployment (e.g. GCE or kubeadm, etc) would grant any additional permissions to specific resources if required for their particular configuration of a component

@davidopp
Contributor

It would need list/watch/get within the namespace containing the configmap in order to use an informer as written in this PR.

Right right, I meant for kube-system namespace. (Though even that is configurable, so really for it to work out of the box we'd have to do what I said (all namespaces)...)

@timothysc
Contributor

If we convert the scheduler to follow @mikedanese's component config proposal, follow componentconfig -> configmap ->, and modify the leader lock to detect changes to the ConfigMap, you get everything that this set out to do.

@bsalamat
Member Author

@timothysc I haven't seen that proposal yet. If I get some time, I will take a look.

@mtaufen
Contributor

mtaufen commented Jul 19, 2017 via email

@resouer
Contributor

resouer commented Aug 19, 2017

@bsalamat Could you please rebase and update the PR? Or shall we keep it open for a while?

@bsalamat
Member Author

@resouer Given the recent discussions on the PR, I am not sure if we are going to keep it in its current form. I will look into it soon when I get some time.

@calebamiles calebamiles modified the milestone: v1.8 Sep 2, 2017
s,
kubecli,
informerFactory.Core().V1().Nodes(),
informerFactory,
Member

Giving the scheduler access to all informers for all objects seems like an antipattern. Enumerating the objects it cares about lets us bound the inputs to the scheduler.

informerFactory.Extensions().V1beta1().ReplicaSets(),
informerFactory.Apps().V1beta1().StatefulSets(),
informerFactory.Core().V1().Services(),
configMapInformer,
Member

I wouldn't expect the configmap watcher to leak into the scheduler here... the scheduler shouldn't care about the source of its config struct

@bsalamat
Member Author

@liggitt I don't think we really want to move forward with this PR. I haven't had the chance to see how we can do better here. Given my commitments, I am not so sure if I can get to this for 1.9 either.

If others have better ideas to address the config monitoring in scheduler, please feel free to pick it up.

@timothysc
Contributor

xref: #52562

@k8s-github-robot

This PR hasn't been active in 61 days. It will be closed in 28 days (Dec 17, 2017).

cc @bsalamat @davidopp

You can add 'keep-open' label to prevent this from happening, or add a comment to keep it open another 90 days

@timothysc
Contributor

Closing per bot and cleanup. Please reopen should we wish to rebase and handle.

@timothysc timothysc closed this Jan 3, 2018

Labels

approved - Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
needs-rebase - Indicates a PR cannot be merged because it has merge conflicts with HEAD.
release-note - Denotes a PR that will be considered when it comes time to generate release notes.
sig/scheduling - Categorizes an issue or PR as relevant to SIG Scheduling.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow dynamic change of scheduler's policy configuration