Description
What would you like to be added:
We would like/need to offer, for all spec changes, a way to update cluster nodes in-place (all except those where we have no other technical alternative, e.g. switching out physical machines will always require node rolling).
For these in-place node updates, we need to understand what exactly should be done:
- Roll the nodes if an in-place update is technically impossible (e.g. a physical machine is no longer operational or goes into maintenance)
- Update the node (with or without drain; each option has its pros and cons) and reboot it (e.g. OS update)
- Drain the node before updating it and restart the `kubelet` (e.g. Kubernetes minor update, see ref)
- Just update the node (without drain) and restart the `kubelet` (e.g. Kubernetes patch update, see ref)
All of the above (and maybe more cases) need to be detected and should trigger different forms of in-place node updates.
Note: Some stakeholders may want to avoid an actual drain that would move the pods to another machine. In this case, we might get away with a "local cleanup" (files/containers) after stopping the kubelet; when it restarts, it won't know that it used to run these containers. Would that be of any help to some users?
Until now, Gardener supported only the very first and last use case. Now we also need to support all the other use cases, trying to hold on to the machines (whether physical or virtual) and their Kubernetes Node resources indefinitely, while keeping the impact during updates as minimal as possible.
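To make this concrete, the choice between rolling and in-place updates could surface as a per-worker-pool setting in the Shoot spec. The sketch below follows the GEP-31 direction; the `updateStrategy` field and its values (`AutoRollingUpdate`, `AutoInPlaceUpdate`, `ManualInPlaceUpdate`) as shown here are illustrative, not the final API:

```yaml
# Sketch of a Shoot worker pool opting into in-place updates (illustrative).
spec:
  provider:
    workers:
    - name: bare-metal-pool
      machine:
        type: metal-large
        image:
          name: gardenlinux
          version: 1592.1.0   # bumping this would trigger an in-place OS update
      # AutoInPlaceUpdate: nodes are updated in place automatically (e.g. during maintenance);
      # ManualInPlaceUpdate: the operator selects which nodes to update and when;
      # AutoRollingUpdate: nodes are rolled as today (the default behavior).
      updateStrategy: AutoInPlaceUpdate
```

With such a setting, a Kubernetes patch or OS update on this pool would be applied to the existing machines instead of replacing them.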
Why is this needed:
Physical machines/bare-metal nodes (whether manually joined or programmatically provisioned) cannot be rolled as easily as we roll virtual machines today. Physical machines/bare-metal nodes are rather "pets", not "cattle" (like virtual machines). Usually:
- They have long boot times, unless they have already fully booted and can be (UEFI) fast-booted without running certain checks like POST (Power-On Self-Test, e.g. memory tests)
- They have long cleanup/sanitization times to erase all data before returning them to a pool
- They have locally attached disks that cannot be "moved" like network-attached volumes that can be "re-attached" easily
- They are rather large in size and limited in number and not as interchangeable and sliceable as virtual machines
This is why we need to find a way to preserve them/never roll this kind of node. In some cases, this may also be beneficial for brownfield/singleton applications/services that are rather sensitive to node rolling/disruptions in general.
Tasks
- Proof of Concept and Design Considerations
- Perform a small PoC to validate the design ideas and the overall approach
- [GEP-31] In-Place Node Updates of Shoot Clusters #10828
- Gardener
- API changes
- `CloudProfile`: [GEP 31] Introduce API changes for supporting InPlaceUpdate #11191
- `Shoot`: [GEP 31] Introduce API changes for supporting InPlaceUpdate #11191
- `Worker` Extension: [GEP 31] Introduce API changes for supporting InPlaceUpdate #11191
- `OperatingSystemConfig` Extension: [GEP 31] Introduce API changes for supporting InPlaceUpdate #11191
- [GEP-31] Adapt `OperatingSystemConfig` reconciler and extension to populate fields related to in-place update #11393
- [GEP 31] Adapt API changes for `machine-controller-manager` #11631
- [GEP-31] Adapt `Worker` reconciler for in-place update #11713
- [GEP-31] Adapt `gardener-node-agent` to handle in-place node updates #11718
- [GEP-31] Adapt `Shoot` reconciler for in-place update #11843
- [GEP-31] Introduce `Shoot` status reconciler for maintaining in-place update status #11844
- Vendor MCM with in-place support: Update machine-controller-manager to v0.58.0 (minor) #11963
- [GEP-31] Controller optimizations and Add e2e tests #11953
- [GEP-31] Support inplace update for shoot with OSC version `1` #12055
- [GEP-31] Support enablement of `NewWorkerPoolHash` feature gate for Shoots with InPlace updates #12118
- [GEP-31] Add usage doc for in-place update #12005
- Add TestMachinery test [GEP-31] Add TM test for worker pools with InPlace update strategy #11990
- API changes
- Provider-Extensions: Vendor Gardener and adapt provider extensions to support in-place update.
- provider-alicloud: [GEP-31] Adapt extension to support InPlace updates gardener-extension-provider-alicloud#795
- provider-aws: [GEP-31] Adapt extension to support in-place updates gardener-extension-provider-aws#1276
- provider-gcp: [GEP-31] Adapt extension to support InPlace update gardener-extension-provider-gcp#1069
- provider-azure: [GEP-31] Adapt extension to support InPlace update gardener-extension-provider-azure#1181
- provider-openstack: [GEP-31] Adapt extension to support InPlace updates gardener-extension-provider-openstack#1054
- OS-Extensions: Vendor Gardener and adapt the OS extensions of operating systems that support in-place update.
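For context on the `CloudProfile` side of the API changes above: in-place updates are only possible between image versions that support them, so the machine image metadata needs to declare this. The following sketch follows the GEP-31 proposal, but the field names (`inPlaceUpdates`, `supported`, `minVersionForUpdate`) are illustrative assumptions, not the authoritative API:

```yaml
# Sketch of CloudProfile machine image metadata for in-place updates (illustrative).
spec:
  machineImages:
  - name: gardenlinux
    versions:
    - version: 1592.1.0
      inPlaceUpdates:
        supported: true               # this version can be reached without rolling the node
        minVersionForUpdate: 1443.0.0 # oldest version from which an in-place update is possible
```

If a worker pool's current image version is older than the declared minimum, the update would have to fall back to rolling the nodes.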