
☂️ [GEP-31] Support for In-Place Node Updates #10219

Open
1 of 3 issues completed

Description

@vlerenc

What would you like to be added:
We would like/need to offer, for all spec changes, a way to update cluster nodes in-place (all except those for which there is no technical alternative, e.g. switching out physical machines will always require node rolling).

For these in-place node updates, we need to understand what exactly needs to be done:

  • Roll the nodes if an in-place update is otherwise technically impossible (e.g. a physical machine is no longer operational or goes into maintenance)
  • Update the node (with or without drain; both have their pros and cons) and reboot it (e.g. OS update)
  • Drain the node before we update it and restart the kubelet (e.g. K8s minor update, see ref)
  • Just update the node (w/o drain) and restart the kubelet (e.g. K8s patch update, see ref)

All of the above (and maybe more cases) need to be detected and should trigger different forms of in-place node updates.
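
To make this concrete, here is a minimal Go sketch of what such a detection/classification could look like; all type, constant, and function names are hypothetical illustrations, not actual Gardener API types:

```go
package inplace

// UpdateStrategy enumerates the forms of node updates discussed above.
// These names are hypothetical, not Gardener API types.
type UpdateStrategy int

const (
	RollNode                  UpdateStrategy = iota // replace the machine entirely
	UpdateAndReboot                                 // e.g. OS update
	DrainUpdateRestartKubelet                       // e.g. K8s minor update
	UpdateRestartKubelet                            // e.g. K8s patch update
)

// ChangeKind describes which part of the spec changed (also hypothetical).
type ChangeKind int

const (
	MachineTypeChange ChangeKind = iota
	OSVersionChange
	KubernetesMinorChange
	KubernetesPatchChange
)

// ClassifyUpdate maps a detected spec change to an update strategy.
func ClassifyUpdate(c ChangeKind) UpdateStrategy {
	switch c {
	case MachineTypeChange:
		return RollNode // no technical alternative, e.g. different hardware
	case OSVersionChange:
		return UpdateAndReboot
	case KubernetesMinorChange:
		return DrainUpdateRestartKubelet
	case KubernetesPatchChange:
		return UpdateRestartKubelet
	default:
		return RollNode // safest fallback for unknown kinds of changes
	}
}
```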

Note: Maybe some stakeholders would want to avoid an actual drain that would move the pods to another machine. In this case, we might get away with "local cleaning" (files/containers) after we stop the kubelet. When it restarts, it won't know that it used to run these containers. Would that be of any help to some users?
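
For illustration, a rough Go sketch of that "local cleaning" idea, assuming a systemd-managed kubelet and a CRI runtime reachable via crictl; the commands, paths, and cleanup scope are assumptions, not a definitive procedure:

```go
package inplace

import (
	"fmt"
	"os"
	"os/exec"
)

// LocalClean sketches the "local cleaning" alternative to a drain:
// stop the kubelet, remove the containers and the kubelet's local pod
// state, so that a restarted kubelet no longer knows it ran these pods.
func LocalClean() error {
	// Stop the kubelet first so it does not restart the containers we remove.
	if err := exec.Command("systemctl", "stop", "kubelet").Run(); err != nil {
		return fmt.Errorf("stopping kubelet: %w", err)
	}
	// Force-remove all containers known to the container runtime.
	if err := exec.Command("crictl", "rm", "--all", "--force").Run(); err != nil {
		return fmt.Errorf("removing containers: %w", err)
	}
	// Wipe the kubelet's local pod state (assumed cleanup scope).
	if err := os.RemoveAll("/var/lib/kubelet/pods"); err != nil {
		return fmt.Errorf("cleaning kubelet pod state: %w", err)
	}
	// ... perform the actual node update here, then restart the kubelet.
	return exec.Command("systemctl", "start", "kubelet").Run()
}
```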

Until now, Gardener only supported the very first and last use cases. Now we need to also support all the other use cases, trying to hold on to the machines (whether physical or virtual) and their Kubernetes Node resources indefinitely, while keeping the impact of an update as minimal as possible.

Why is this needed:
Physical machines/bare-metal nodes (whether manually joined or programmatically provisioned) cannot be rolled as easily as we roll virtual machines today. Physical machines/bare-metal nodes are rather "pets", not "cattle" (like virtual machines). Usually:

  • They have long boot times, unless they have already fully booted and can be (UEFI) fast-booted without running certain checks like POST (Power-On Self-Test, e.g. memory tests)
  • They have long cleanup/sanitization times to erase all data before returning them to a pool
  • They have locally attached disks that cannot be "moved" like network-attached volumes that can be "re-attached" easily
  • They are rather large in size, limited in number, and not as interchangeable and sliceable as virtual machines

This is why we need to find a way to preserve them/never roll these kinds of nodes. In some cases, this may also be beneficial to brownfield/singleton applications/services that are rather sensitive to node rolling/disruptions in general.
