-
Notifications
You must be signed in to change notification settings - Fork 39.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rotate kubelet client certificate. #41912
Rotate kubelet client certificate. #41912
Conversation
cc @mikedanese |
e1f4b55
to
d64a5a1
Compare
d64a5a1
to
5782d15
Compare
@liggitt, @mikedanese, @timstclair Anyone got some time before the code freeze? |
cmd/kubelet/app/options/options.go
Outdated
@@ -131,6 +131,7 @@ func (s *KubeletServer) AddFlags(fs *pflag.FlagSet) { | |||
fs.StringVar(&s.TLSPrivateKeyFile, "tls-private-key-file", s.TLSPrivateKeyFile, "File containing x509 private key matching --tls-cert-file.") | |||
fs.StringVar(&s.CertDirectory, "cert-dir", s.CertDirectory, "The directory where the TLS certs are located (by default /var/run/kubernetes). "+ | |||
"If --tls-cert-file and --tls-private-key-file are provided, this flag will be ignored.") | |||
fs.Int32Var(&s.CertRotation, "experimental-cert-rotation", s.CertRotation, "<Warning: Experimental feature> The percentage threshold of certificate duration remaining before a new key should be generated and signed using the certificate signing request API on the api server. New cert/key pairs generated by this process will be stored in the directory specified by --cert-dir. If less than 1, rotation is disabled. [default=0]") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How important is it to give the admin this knob? It seems like something that should be tuned on our end, not by the user. If that's the case, I suggest using a feature_gate instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for feature gate intially (default to off this release)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not familiar with feature_gate. This flag sets the amount of remaining duration of the certificate before it gets rotated. Does a feature_gate allow some value to be set? Or hard code this to something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so far they're just binary flags, and let you tag features as alpha/beta. My suggestion above was that we hardcode a value for this, unless you think it's an important knob to provide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having this knob is pretty useful for testing, and when restarts are no longer part of rotation we can turn this up and rotate frequently during end to end tests.
cmd/kubelet/app/server.go
Outdated
CertificateRotationPercent: rotation, | ||
Rotation: func() { | ||
glog.Info("The client certificate was rotated, restarting so changes take effect.") | ||
os.Exit(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is too disruptive. My biggest concern about restarting the Kubelet is it would disrupt any open connections (watching logs, exec/attach/portforward connections). It would also stress the runtime, and may have other performance impacts.
Fortunately, the ListenAndServeTLS function is not too complicated, and should be pretty easy to unroll and swap in our own Accept
implementation:
ListenAndServeTLS
just sets up a tls listener, and callsserver.Serve(...)
on it.tls.NewListener
just wraps atls.listener
, which simlpy wraps theConn
returned by the inner listener.
I think all we'd need to do is create our own listener wrapper which calls tls.Server
with the tls.Config
with the latest certs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
I think @sttts already did an implementation to pass SNI certs to apiserver
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func runServer(server *http.Server, network string, stopCh <-chan struct{}) (int, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's more about being able to stop the server with a channel. Now with Go 1.8 we will have an official API for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid stopping the server in this case? I believe stopping the server will have the same problems mentioned above (or are existing connections kept alive when the server stops?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agreed with @timstclair on this. Restarting Kubelet is very disruptive. In addition to the issues pointed out by @timstclair, there are other concerns:
- In GKE, we might agree upon O(months) certification rotation, but how we ensure this behavior for OSS users?
- Any misbehaviour operation on certification rotation might lead to a node readiness flapping. But from the users' viewpoint, they only see node status is changed between ready and notReady unless they view the log. This adds more debuggability and introspection difficulties.
- We encountered a lot of issues with mystery kubelet / docker restarts due to various reasons in prod. This feature only makes things worse.
- What about other standalone daemons running on the node, which are talking to API server too, for example, NPD in GKE. Based on this design, are we going to force NPD restart too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Continue with above comments:
5. The worst concern I had based on this design is that all kubelets in the cluster might start at the same time and pulling the information from API server. Can API server handle that load? Shouldn't this is another potential API server scalability issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Answering 5 from above comment - I think that restating all kubelets at once can overload apiserver for quite some time (in fact I did such experiment in 2k-node cluster few months ago and it recovered from backlog after 5minutes or so IIRC).
I didn't read the PR, but if this is going to restart all kubelets at the same time, it's not only bad from scalability perspective, it just sounds bad to me.
We should try spreading the restarts over some time. If the time is long enough, I wouldn't worry about scalability (I'm not sure what long enough means, but from the top of my head, 1hour/1000nodes should be enough).
[And sorry if it's already spread - as I said, I didn't read the PR, just read the comments above.]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're not restarting all the kubelets at the same time on purpose, but they will restart at the same time period after they were added to the cluster, so if a lot of kubelets are added at approximately the same time, they could clobber the certificate request signing api at the same time 10 months later. I'll add some jitter to the rotation timing.
It looks like this PR won't make 1.6 code freeze. It also looks like transitioning to Go 1.8 is in progress for Kubernetes 1.7. Once that's done we should be able to rotate client certificates without a restart.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Proposed an adjustment to the certificate manager to spread out the certificate rotation requests over here #42498
hack/verify-flags/known-flags.txt
Outdated
log-lines-total | ||
run-duration | ||
attach-detach-reconcile-sync-period | ||
disable-attach-detach-reconcile-sync |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why were all these other flags added? Is it just a rebase issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not my flags. Must be a rebase issue. Will remove.
if err != nil { | ||
return nil, err | ||
} | ||
|
||
if certRotationPercent > 100 { | ||
certRotationPercent = 100 | ||
if config.CertificateRotationPercent > 100 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: just return an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
func getCurrentCertificateOrBootstrap( | ||
store Store, | ||
bootstrapCertificatePEM []byte, | ||
bootstrapKeyPEM []byte) (*CertKeyData, bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: name the return values (what is the bool?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
if err != nil { | ||
return nil, false, fmt.Errorf("unable to parse certificate data: %v", err) | ||
} | ||
cert.Leaf = certs[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is len(certs)
guaranteed to be > 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Length of the output will equal the length of the input, so in this case, always 1, as long as there is no err.
notAfter := m.cert.Leaf.NotAfter | ||
total := notAfter.Sub(m.cert.Leaf.NotBefore) | ||
notAfter := m.certKeyData.Certificate.Leaf.NotAfter | ||
total := notAfter.Sub(m.certKeyData.Certificate.Leaf.NotBefore) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does the notAfter value come from? Shouldn't we be able to calculate exactly when the next rotation needs to occur, and wait until that moment, removing the need for the sync period polling & shouldRotate()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
notAfter is certificate data. It's the expiration of the certificate. We can use it as you suggest. A follow up PR ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
follow up is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue tracking follow up: #42089
// certificate, if there is no current certificate available in the | ||
// CertificateStore. If there is a current certificate available, this will | ||
// be ignored. | ||
BootstrapKeyPEM []byte |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be a BootstrapClient and a certificatesclient.CertificateSigningRequestInterface? Why shouldn't I be able to use any form of authn to bootstrap my client TLS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, you can use any authn to initialize your TLS client, but initializing your TLS client doesn't happen in the certificate manager. The certificate manager is the consumer of an already initialized client, only responsible for rotating as expiration approaches and returning the current cert/key pair. If you aren't going to use certs and keys, you wouldn't be using the CertificateManager, which is only designed for managing cert/key pairs.
There is already a CertificateSigningRequestInterface
in this struct, it's the first member.
I see the comments for these two members are incorrect and may have mislead you. I'll update it. The bootstrap cert/key pair are the data returned from the certificate manager if the Store
doesn't currently have any cert/key pairs to return.
@@ -46,85 +46,142 @@ const ( | |||
// manager. In the background it communicates with the API server to get new | |||
// certificates for certificates about to expire. | |||
type Manager interface { | |||
// CertificateSigningRequestClient sets the client interface that is used for | |||
// signing new certificates generated as part of rotation. | |||
CertificateSigningRequestClient(certificatesclient.CertificateSigningRequestInterface) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Public setters leave a bad taste in my mouth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's an awkward situation alright. Since the manager uses the REST client to do rotations, I need the REST client to initialize the manager, and I need the manager to supply the keys to initialize the REST client.
cmd/kubelet/app/server.go
Outdated
CertificateRotationPercent: rotation, | ||
Rotation: func() { | ||
glog.Info("The client certificate was rotated, restarting so changes take effect.") | ||
os.Exit(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Client GetCertificate is not supported until go1.8. Any suggestions pretaining to how to do this with the kube client? I'd like to point out that this is an experimantal flag and rotations event frequency is very low (O(months)).
@smarterclayton Do you have some time to approve this? |
@k8s-bot cvm gce e2e test this |
Removing PR from 1.6 milestone since code freeze has begun:
/cc @kubernetes/release-team @kubernetes/kubernetes-release-managers |
We are waiting on go 1.8.1 |
1aa5835
to
7961438
Compare
cmd/kubelet/app/server.go
Outdated
if err != nil { | ||
return fmt.Errorf("unable to configure TLS for the rest client: %v", err) | ||
} | ||
tlsConfig.RootCAs = pool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tlsConfig can be nil if no customizations were required. in that case you should initialize to an empty one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cmd/kubelet/app/server.go
Outdated
} | ||
return cert, nil | ||
} | ||
clientConfig.Transport = utilnet.SetTransportDefaults(&http.Transport{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are different than what you would have gotten otherwise: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/transport/cache.go#L67
it's hard to tell how important the differences are
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a TODO to reconcile the defaults here with the defaults you'd get from calling restclient.TransportFor(config)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to match.
clientConfig.KeyData = nil | ||
clientConfig.CertFile = "" | ||
clientConfig.KeyFile = "" | ||
clientConfig.CAData = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
empty CAFile as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -597,6 +622,94 @@ func run(s *options.KubeletServer, kubeDeps *kubelet.KubeletDeps) (err error) { | |||
return nil | |||
} | |||
|
|||
func updateTransport(clientConfig *restclient.Config, clientCertificateManager certificate.Manager) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if clientConfig already has a Transport set, return an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
7f93650
to
3944ba1
Compare
cmd/kubelet/app/server.go
Outdated
tlsConfig.GetClientCertificate = func(requestInfo *tls.CertificateRequestInfo) (*tls.Certificate, error) { | ||
cert := clientCertificateManager.Current() | ||
if cert == nil { | ||
return nil, fmt.Errorf("no certificate available") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should return &tls.Certificate{Certificate:nil}, nil
in this case, right? we don't want to abort the handshake, we just don't have a client certificate to return
From the godoc:
If GetClientCertificate returns an error, the handshake will be aborted and that error will be returned. Otherwise GetClientCertificate must return a non-nil Certificate. If Certificate.Certificate is empty then no certificate will be sent to the server.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. Done.
f401f13
to
b773ae6
Compare
/lgtm |
/assign @smarterclayton |
/approve |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jcbsmpsn, liggitt, mikedanese, smarterclayton The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
Changes the kubelet so it bootstraps off the cert/key specified in the config file and uses those to request new cert/key pairs from the Certificate Signing Request API, as well as rotating client certificates when they approach expiration.
b773ae6
to
1519bb9
Compare
Automatic merge from submit-queue (batch tested with PRs 46726, 41912, 46695, 46034, 46551) |
Changes the kubelet so it bootstraps off the cert/key specified in the
config file and uses those to request new cert/key pairs from the
Certificate Signing Request API, as well as rotating client certificates
when they approach expiration.
Default behavior is for client certificate rotation to be disabled. If enabled
using a command line flag, the kubelet exits each time the certificate is
rotated. I tried to use
GetCertificate
in tls.Config but it is only calledon the server side of connections. Then I tried
GetClientCertificate
,but it is new in 1.8.
Release note