
[GEP-31] In-Place Node Updates of Shoot Clusters #10828


Merged
merged 9 commits into gardener:master from gep/inplace-node-update on
Jan 30, 2025

Conversation

acumino
Member

@acumino acumino commented Nov 11, 2024

Co-authored-by: Ashish Ranjan Yadav [email protected]
Co-authored-by: Shafeeque E S [email protected]

How to categorize this PR?

/area documentation
/kind enhancement

What this PR does / why we need it:
This PR proposes GEP-31 for in-place node updates, allowing seamless updates to Kubernetes and OS versions without node replacement.

Which issue(s) this PR fixes:
Part of #10219

Special notes for your reviewer:

Release note:

NONE

@gardener-prow gardener-prow bot added area/documentation Documentation related kind/enhancement Enhancement, improvement, extension cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 11, 2024
Contributor

@majst01 majst01 left a comment


I still cannot see the benefits of all this implementation effort if the node must be drained anyway and rebooted after the OS update.

This will take the same amount of time as the rolling update, if not more, because the in-place OS update is much more error-prone and might fail.

I would rather concentrate on the in-place Kubernetes minor version update capability alone instead of combining these two features into one GEP, where the in-place update of the OS is the far more complex and dangerous one.

@acumino
Member Author

acumino commented Nov 11, 2024

I would rather concentrate on the in-place Kubernetes minor version update capability alone instead of combining these two features into one GEP, where the in-place update of the OS is the far more complex and dangerous one.

From Gardener's POV, supporting in-place updates for only the Kubernetes minor version and supporting in-place updates for both the Kubernetes minor and OS versions are almost the same, so I don't see a reason why they should be worked on separately.

Any OS that wants to support in-place updates can provide a way to do this.
For example, Garden Linux will support in-place updates of the OS in the future (ref).

@unmarshall
Contributor

unmarshall commented Nov 14, 2024

I still cannot see the benefits of all this implementation effort if the node must be drained anyway and rebooted after the OS update.
This will take the same amount of time as the rolling update, if not more, because the in-place OS update is much more error-prone and might fail.

A rolling update is done in cases where you have the luxury of getting rid of the existing node and replacing it with a new one. This is not the case for certain consumers (bare metal), where the only option is an in-place update, as they have locally attached storage that cannot be reattached to another node. Secondly, rolling updates also have other side effects which can potentially cause massive delays:

  1. Availability of sufficient quota for the instance type that is under rolling update.
  2. Different providers have different instance launch times, and this can take up to several minutes (e.g. recently there was a case where OpenStack took 10 minutes just to launch a VM; Azure generally takes around 8-10 minutes, and AWS and GCP take around 2-3 minutes - of course these are rough numbers and indicative at best). This is not counting the time it takes to drain the node, which is the same whether you do in-place + drain or rolling update + drain.

So the overall time it takes to roll all nodes can be much higher compared to in-place. However, w.r.t. downtime for the workloads that are deployed on the nodes, the time would be the same, considering there is a drain in both cases.

@vlerenc had also proposed considering an in-place update without the drain, but there were no takers for that option. So while this is not an option today (in this version of the GEP), if there are takers tomorrow then that can also be looked at.

in-place OS update is much more error-prone and might fail.

Well, this feature is going to be provided by the Garden Linux team and we will have to gain our own experience with it. But if you already have some experience doing in-place OS updates and would like to make concrete points on why it is error-prone, then this can already be considered valid input for the GEP and for the Garden Linux team as well.

@majst01
Contributor

majst01 commented Nov 14, 2024

I still cannot see the benefits of all this implementation effort if the node must be drained anyway and rebooted after the OS update.
This will take the same amount of time as the rolling update, if not more, because the in-place OS update is much more error-prone and might fail.

A rolling update is done in cases where you have the luxury of getting rid of the existing node and replacing it with a new one. This is not the case for certain consumers (bare metal), where the only option is an in-place update, as they have locally attached storage that cannot be reattached to another node. Secondly, rolling updates also have other side effects which can potentially cause massive delays:

If you have stateful workloads which are running on local storage but you have no spare machines of this machine type, what is the actual use case? I mean, if the hardware breaks for whatever reason, you have nothing left. I can hardly imagine a production workload with such a risky setup.

  1. Availability of sufficient quota for the instance type that is under rolling update.

granted

  1. Different providers have different instance launch times, and this can take up to several minutes (e.g. recently there was a case where OpenStack took 10 minutes just to launch a VM; Azure generally takes around 8-10 minutes, and AWS and GCP take around 2-3 minutes - of course these are rough numbers and indicative at best). This is not counting the time it takes to drain the node, which is the same whether you do in-place + drain or rolling update + drain.

We at metal-stack.io create a new node in about the same time as AWS/GCP (2-3 minutes to NodeReady) on bare metal. I thought doing in-place for VMs was not in scope.

So the overall time it takes to roll all nodes can be much higher compared to in-place. However, w.r.t. downtime for the workloads that are deployed on the nodes, the time would be the same, considering there is a drain in both cases.

@vlerenc had also proposed considering an in-place update without the drain, but there were no takers for that option. So while this is not an option today (in this version of the GEP), if there are takers tomorrow then that can also be looked at.

in-place OS update is much more error-prone and might fail.

Well, this feature is going to be provided by the Garden Linux team and we will have to gain our own experience with it. But if you already have some experience doing in-place OS updates and would like to make concrete points on why it is error-prone, then this can already be considered valid input for the GEP and for the Garden Linux team as well.

This was already done in the past by the folks from CoreOS, later Flatcar, with an A/B update. I am not sure if they were able to deal with local storage at all; this was hard work they did, and it is not easy to implement for a new OS like Garden Linux.

I am not a member of the Gardener team and you make the decision; I just want to give my $0.02 based on my experience with bare-metal provisioning and OS maintainership.

@vlerenc
Member

vlerenc commented Nov 14, 2024

@majst01 I am not yet done with my review, but since you asked:

If you have stateful workloads which are running on local storage but you have no spare machines of this machine type, what is the actual use case? I mean, if the hardware breaks for whatever reason, you have nothing left. I can hardly imagine a production workload with such a risky setup.

It is not really a risky setup. It is what it is in the physical world - unless you have fully automated control over it like with MetalStack or IronCore or OpenStack Ironic. The workload itself is replicated and can survive, to a user-defined degree, "breaking hardware". RAID, PDBs, etc. work with that assumption. It's not that the first machine that goes under means the end of the world/workload for you.

And in fact, the first main use case for this is indeed operating a hardware cluster of bare metal nodes providing Ceph to end users on top. That brought us into this topic and then more such scenarios entered the picture.

@vlerenc
Member

vlerenc commented Nov 14, 2024

We at metal-stack.io create a new node in about the same time as AWS/GCP (2-3 minutes to NodeReady) on bare metal. I thought doing in-place for VMs was not in scope.

@majst01 With certain parameters (maxSurge and maxUnavailable) that's 2-3 minutes per machine extra, e.g. with maxSurge=1 and maxUnavailable=0 (our default for rolling updates today), a machine gets created, pods get drained, the machine gets terminated, and then the flow repeats, always adding 2-3 minutes. Only by increasing maxSurge can you speed things up, but then the process may still hang because of individual pods and their PDBs.

In comparison, if you can run with maxSurge=0 and maxUnavailable=1, you never need to wait these 2-3 minutes, and if you can only run with maxSurge=1 and maxUnavailable=0 (our default for rolling updates today), you often get away with waiting these 2-3 minutes only once, because once the additional machine is available, it can take on the evicted pods (rather, their replacements), and when you drain the next VM, the other one is updated in-place and ready again (an in-place update is much faster than deprovisioning and provisioning again).
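
To put rough numbers on that comparison, here is a back-of-envelope sketch (purely illustrative, not Gardener code; the machine creation time is the indicative figure mentioned above, and the drain and reboot durations are arbitrary assumptions):

```go
package main

import (
	"fmt"
	"time"
)

// Rough, illustrative estimate of the total wall-clock time to update all
// nodes of a pool, ignoring PDB hangs and other real-world effects.
// With a serialized rolling update (maxSurge=1, maxUnavailable=0) every node
// pays the machine creation time; with an in-place update (maxSurge=0,
// maxUnavailable=1) only drain + update/reboot is paid per node.
func rollingUpdate(nodes int, create, drain time.Duration) time.Duration {
	return time.Duration(nodes) * (create + drain)
}

func inPlaceUpdate(nodes int, drain, reboot time.Duration) time.Duration {
	return time.Duration(nodes) * (drain + reboot)
}

func main() {
	const nodes = 50
	create := 3 * time.Minute // indicative VM launch time (AWS/GCP-like)
	drain := 5 * time.Minute  // assumed drain time, identical in both cases
	reboot := 2 * time.Minute // assumed in-place update + reboot time

	fmt.Println("rolling update:", rollingUpdate(nodes, create, drain)) // prints 6h40m0s
	fmt.Println("in-place:      ", inPlaceUpdate(nodes, drain, reboot)) // prints 5h50m0s
}
```

The exact numbers do not matter; the point is that with in-place updates the machine creation time drops out of the per-node cost entirely.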

But the main reason is not that. I am still doing my review, but one of the suggested changes is to add the following sentence, which addresses why VMs also benefit from in-place updates:

In addition, [in-place updates also have advantages for VMs], e.g. if the VM type is scarce (in general or because of an ongoing capacity crunch) or if certain scarce resources are connected/attached, e.g. GPUs that may not be available for additional or even replaced machines once released (as other customers are waiting for them).

@majst01
Contributor

majst01 commented Nov 14, 2024

We at metal-stack.io create a new node in about the same time as AWS/GCP (2-3 minutes to NodeReady) on bare metal. I thought doing in-place for VMs was not in scope.

@majst01 With certain parameters (maxSurge and maxUnavailable) that's 2-3 minutes per machine extra, e.g. with maxSurge=1 and maxUnavailable=0 (our default for rolling updates today), a machine gets created, pods get drained, the machine gets terminated, and then the flow repeats, always adding 2-3 minutes. Only by increasing maxSurge can you speed things up, but then the process may still hang because of individual pods and their PDBs.

In comparison, if you can run with maxSurge=0 and maxUnavailable=1, you never need to wait these 2-3 minutes, and if you can only run with maxSurge=1 and maxUnavailable=0 (our default for rolling updates today), you often get away with waiting these 2-3 minutes only once, because once the additional machine is available, it can take on the evicted pods (rather, their replacements), and when you drain the next VM, the other one is updated in-place and ready again (an in-place update is much faster than deprovisioning and provisioning again).

How do you want to skip deprovisioning? Doing a hard reset, risking damaged filesystems, is not what you want in the given scenario with local storage.

On the other hand, rebooting a node with lots of pods and mounts will take some time, with big filesystems even longer because the OS must flush all filesystem caches before rebooting.

I am keen to see the real benefit of this approach!

But the main reason is not that. I am still doing my review, but one of the suggested changes is to add the following sentence, which addresses why VMs also benefit from in-place updates:

In addition, [in-place updates also have advantages for VMs], e.g. if the VM type is scarce (in general or because of an ongoing capacity crunch) or if certain scarce resources are connected/attached, e.g. GPUs that may not be available for additional or even replaced machines once released (as other customers are waiting for them).

@vlerenc
Member

vlerenc commented Nov 14, 2024

This was already done in the past by the folks from CoreOS, later Flatcar, with an A/B update. I am not sure if they were able to deal with local storage at all; this was hard work they did, and it is not easy to implement for a new OS like Garden Linux.

That was the first thing the Garden Linux team did and we saw it working some months back. Garden Linux can directly boot into a new UKI, so it's actually working and impressively fast, too. The new image is downloaded while everything is still up and running. Then the pods get drained/evicted and then the reboot happens. If it succeeds, the node is ready pretty fast. If not, it may retry a configurable number of times, after which it will give up and reboot into the LKG (last known good) version, which is still sitting on the disk as a file (UKI), like all versions are. It will be configurable whether to purge the file system or not. And yes, we know that CoreOS used to do that, because we used to deploy CoreOS before CoreOS was swallowed by Red Hat, which was swallowed by IBM. When Red Hat bought them and the license changed/they wanted to get rid of their competition, we did Garden Linux and can now leverage the deep innovation advantages we have by controlling the OS nearly completely.

I am not a member of the Gardener team and you make the decision; I just want to give my $0.02 based on my experience with bare-metal provisioning and OS maintainership.

Please don't get me wrong either, and thank you for your feedback @majst01. It's not that we "want" to build "in-place updates" (OK, yes we do, but that was not the trigger), but that we are confronted with cases where terminating and recreating machines isn't an option or brings too many disadvantages (scarce resources) that block the classical rolling updates again and again. So, we actually "want" to solve real-world issues, not fictitious theoretical issues.

@vlerenc
Member

vlerenc commented Nov 14, 2024

How do you want to skip deprovisioning? Doing a hard reset, risking damaged filesystems, is not what you want in the given scenario with local storage.

Hard reset? You mean via the cloud provider API? No, Garden Linux / the kernel reboots itself.

On the other hand, rebooting a node with lots of pods and mounts will take some time, with big filesystems even longer because the OS must flush all filesystem caches before rebooting.

But there are no pods anymore with the drain option. And should we implement the non-drain option, we would still SIGTERM/SIGKILL the containers first (without draining them). SIGKILL may create problems when the container restarts later, but that's a.) unlikely and b.) at least the OS itself is cleanly rebooting.

In the end, that (SIGKILL-like effects) or worse (e.g. kernel panic, power outage) can also happen today for whatever reason, and then the "regular means" apply. When the machine comes back and the pods/containers run again, all is good. If not, depending on pod/container or node issues, pods will be evicted or the node will go away as a whole (bare-metal machines without a programmable API will never go away and will require human interaction, but that's nothing we can solve or aim to solve - that's a side effect of manual bare-metal machines).

I am keen to see the real benefit of this approach!

Sure, we will/should do a community demo to show/discuss that openly. You are right.

cc @gehoern

@majst01
Contributor

majst01 commented Nov 14, 2024

How do you want to skip deprovisioning? Doing a hard reset, risking damaged filesystems, is not what you want in the given scenario with local storage.

Hard reset? You mean via the cloud provider API? No, Garden Linux / the kernel reboots itself.

Yes, a reboot will take tens of minutes, if not more, in the given scenario with hundreds of TB locally in use. Doing a hard reset will risk losing data in the worst case.

On the other hand, rebooting a node with lots of pods and mounts will take some time, with big filesystems even longer because the OS must flush all filesystem caches before rebooting.

But there are no pods anymore with the drain option. And should we implement the non-drain option, we would still SIGTERM/SIGKILL the containers first (without draining them). SIGKILL may create problems when the container restarts later, but that's a.) unlikely and b.) at least the OS itself is cleanly rebooting.

In the end, that (SIGKILL-like effects) or worse (e.g. kernel panic, power outage) can also happen today for whatever reason, and then the "regular means" apply. When the machine comes back and the pods/containers run again, all is good. If not, depending on pod/container or node issues, pods will be evicted or the node will go away as a whole (bare-metal machines without a programmable API will never go away and will require human interaction, but that's nothing we can solve or aim to solve - that's a side effect of manual bare-metal machines).

I am keen to see the real benefit of this approach!

Sure, we will/should do a community demo to show/discuss that openly. You are right.

cc @gehoern

@vlerenc
Member

vlerenc commented Nov 14, 2024

Yes, a reboot will take tens of minutes, if not more, in the given scenario with hundreds of TB locally in use. Doing a hard reset will risk losing data in the worst case.

Maybe it's obvious to you, but can you please explain what you mean here, @majst01? What TBs do you have in mind, and why tens of minutes? I am completely lost as to what you believe will happen or what I do not understand will happen (and why local data will prolong the reboot).

@majst01
Contributor

majst01 commented Nov 14, 2024

Yes, a reboot will take tens of minutes, if not more, in the given scenario with hundreds of TB locally in use. Doing a hard reset will risk losing data in the worst case.

Maybe it's obvious to you, but can you please explain what you mean here, @majst01? What TBs do you have in mind, and why tens of minutes? I am completely lost as to what you believe will happen or what I do not understand will happen (and why local data will prolong the reboot).

You mentioned Ceph as one of the use cases. Ceph nodes typically have hundreds of TB of disks locally, paired with several hundreds of GB if not several TB of RAM; that's the whole point. The VFS layer in the Linux kernel caches writes in order to speed up writes and reads. These caches must be flushed before power-off. This might take tens of minutes, depending on the speed of your drives. Maybe if you go for an NVMe-only solution, this might not be that much.

We have had Ceph in production on top of k8s for ~5 years and it's not fun.

@vlerenc
Member

vlerenc commented Nov 14, 2024

You mentioned Ceph as one of the use cases. Ceph nodes typically have hundreds of TB of disks locally, paired with several hundreds of GB if not several TB of RAM; that's the whole point.

Yes, indeed, that's the whole point. The point is to avoid replicating any of it.

The VFS layer in the Linux kernel caches writes in order to speed up writes and reads. These caches must be flushed before power-off. This might take tens of minutes, depending on the speed of your drives. Maybe if you go for an NVMe-only solution, this might not be that much.

Yes, in that case that will be necessary. Much, much better than replicating all data to a newly (re)created machine.

We have had Ceph in production on top of k8s for ~5 years and it's not fun.

I am not pretending to be the expert here, but this request comes from a team that has similar experience and currently uses a non-Gardener Kubernetes setup, but exactly this update strategy. That's why they want Gardener to support the very same strategy: in-place, also preserving the node's identity, which is also relevant for the ceph-operator somehow. Rolling is not an option. These physical boxes and their physical disks cannot move, and the data cannot be replicated because the data volume is enormous. Only if hardware breaks does the system provide the resilience to replace the individual failed hardware, but that's an exception that isn't/shouldn't/may not be invoked by a regular Gardener cluster upgrade.

Member

@vlerenc vlerenc left a comment


Awesome, thank you very much for the GEP.

However, I have quite a few recommendations.

@rfranzke rfranzke changed the title GEP-31: In-Place Node Updates of Shoot Clusters [GEP-31] In-Place Node Updates of Shoot Clusters Nov 18, 2024
Member

@rfranzke rfranzke left a comment


Can you also talk about how this affects the worker pool hash calculations?

@ary1992
Contributor

ary1992 commented Nov 22, 2024

/retest

@nkraetzschmar

One thing to consider/that would still need to be addressed is how to handle updates to config files modified by the node agent.

Let's say a config file like /etc/containerd/config.toml gets modified by the node agent. Now if a new GardenLinux version ships an updated version of the default config, this part of the update would be shadowed by the modified version.
So even after the update you would keep the modified version of the old default config, not the similar modifications applied to the new default.

Therefore no updates to config files are possible once the node agent has modified them.
You might be able to do things like a 3-way merge, but those are not guaranteed to always apply without manual intervention on merge conflicts, which would obviously not be feasible.

With the current design/implementation of in-place updates on the GL side, this shadowing of updated files would go unnoticed.

At the moment, all modifications to /etc are done via an overlay backed by /var/etc.overlay. The way overlayfs works, once a file gets modified, all of its contents get copied to the upperdir (/var/etc.overlay in this case). If the same file is later changed in the lowerdir, this has zero noticeable effect.

The first step would probably be to make such shadowing detectable in the first place, e.g. by also providing access to the lowerdir at some special mount point and, on first boot when the overlay gets initialised, storing a hash of all original files.
Then, after a system update, we could check if there are any files in the overlay where the stored hash does not match the current value in the lowerdir; this means the file was updated both by the in-place update and by the node agent => shadowing of the OS update occurred. (Ideally this would just be done by overlayfs directly, but modifying the kernel just for this seems overkill.)
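
A minimal sketch of this detection idea, assuming (hypothetically) that the pristine lowerdir is exposed at /run/etc.lower, the upperdir is /var/etc.overlay, and the hashes recorded on first boot are stored as a JSON map at /var/lib/etc-baseline.json — none of these paths exist today, they are only placeholders for illustration:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// baseline maps a path relative to /etc to the sha256 of the original
// (lowerdir) file content, recorded when the overlay was first initialised.
type baseline map[string]string

func hashFile(path string) (string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]), nil
}

// findShadowed returns all files that exist in the overlay upperdir (i.e. were
// modified by the node agent) and whose lowerdir content changed since first
// boot (i.e. were also updated by the in-place OS update) — the shadowing case.
func findShadowed(lowerDir, upperDir string, base baseline) ([]string, error) {
	var shadowed []string
	err := filepath.WalkDir(upperDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		rel, _ := filepath.Rel(upperDir, path)
		current, err := hashFile(filepath.Join(lowerDir, rel))
		if err != nil {
			return nil // file does not exist in the lowerdir, nothing to shadow
		}
		if original, ok := base[rel]; ok && original != current {
			shadowed = append(shadowed, rel)
		}
		return nil
	})
	return shadowed, err
}

func main() {
	raw, err := os.ReadFile("/var/lib/etc-baseline.json") // hypothetical location
	if err != nil {
		panic(err)
	}
	var base baseline
	if err := json.Unmarshal(raw, &base); err != nil {
		panic(err)
	}
	shadowed, err := findShadowed("/run/etc.lower", "/var/etc.overlay", base)
	if err != nil {
		panic(err)
	}
	for _, f := range shadowed {
		fmt.Println("OS update shadowed by node-agent modification:", f)
	}
}
```

This only makes the conflict visible; resolving it is a separate problem, as noted below.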

Now, how the error could be resolved upon such a detection remains an open question and can only really be addressed by gardener / gardener-node-agent, as understanding which modifications are required needs knowledge that is not available to the OS.

@rfranzke
Member

Couldn't we create our own containerd config file at some other place and start containerd with this instead of the default config provided via the OS?

Member

@vlerenc vlerenc left a comment


Thank you for reading through my comments and making changes. I replied in-line to a few follow-up questions that were raised.

@acumino acumino requested a review from unmarshall January 24, 2025 03:22
@shafeeqes shafeeqes requested a review from unmarshall January 28, 2025 11:26
@gardener-prow gardener-prow bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed lgtm Indicates that a PR is ready to be merged. labels Jan 28, 2025
@shafeeqes shafeeqes force-pushed the gep/inplace-node-update branch from bbfaa4b to 2d611d1 Compare January 29, 2025 02:58
@gardener-prow gardener-prow bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 29, 2025
@shafeeqes shafeeqes force-pushed the gep/inplace-node-update branch from 2017e18 to 41cb6e6 Compare January 29, 2025 12:34
Contributor

@unmarshall unmarshall left a comment


/lgtm

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Jan 30, 2025
Contributor

gardener-prow bot commented Jan 30, 2025

LGTM label has been added.

Git tree hash: 3861b4ca17b3898c3652b81721bee66e26515c48

@acumino
Member Author

acumino commented Jan 30, 2025

/approve
Thank you everyone for reviewing.

Contributor

gardener-prow bot commented Jan 30, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: acumino

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 30, 2025
@gardener-prow gardener-prow bot merged commit 63a4023 into gardener:master Jan 30, 2025
18 checks passed
@shafeeqes shafeeqes deleted the gep/inplace-node-update branch January 31, 2025 04:46
Roncossek pushed a commit to Roncossek/gardener that referenced this pull request Jan 31, 2025
* GEP-31: In-Place Node Updates of Shoot Clusters

Co-authored-by: Ashish Ranjan Yadav  <[email protected]>
Co-authored-by: Shafeeque E S <[email protected]>

* Address PR review feedback

* Address PR review feedback

* Apply suggestions from code review

* Minor cosmetics

* Address PR review feedback

* Add reviewers

* Address PR review feedback

* Rename `AutoReplaceUpdate` to `AutoRollingUpdate`

---------

Co-authored-by: Ashish Ranjan Yadav <[email protected]>
Co-authored-by: Shafeeque E S <[email protected]>