[GEP-31] In-Place Node Updates of Shoot Clusters #10828
Conversation
I still cannot see the benefits of all this implementation effort if the node must be drained anyway and rebooted after the OS update.
This will take the same amount of time as the rolling update, if not more, because the in-place OS update is much more error-prone and might fail.
I would rather concentrate on the in-place Kubernetes minor version update capability alone instead of combining these two features into one GEP, where the in-place update of the OS is the far more complex and dangerous one.
From Gardener's POV, supporting in-place updates for only the … . Any OS that wants to support in-place updates can provide a way to do this.
A rolling update is done for cases where you have the luxury to get rid of the existing node and replace it with a new one. This is not the case for certain consumers (bare metal), where the only option is an in-place update, as they have locally attached storage which cannot be reattached to another node. Secondly, rolling updates also have other side effects which can potentially cause massive delays: …

So the overall time it takes to roll all nodes can be much longer compared to in-place. However, w.r.t. downtime for the workloads that are deployed on the nodes, the time would be the same, considering there is a drain in both cases. @vlerenc had also proposed considering an in-place update without the drain, and there were no takers for that option. So while this is not an option today (in this version of the GEP), if there are takers tomorrow, that can also be looked at.
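To make the distinction concrete, the following is a purely illustrative Go sketch of the decision the discussed strategies imply; the strategy names follow this GEP's discussion, but the type, constants, and helper function are assumptions, not Gardener's actual API:

```go
// Illustrative only: the strategy names follow the GEP discussion; the exact
// types and the helper below are assumptions, not the final implementation.
package main

import "fmt"

type MachineUpdateStrategy string

const (
	// AutoRollingUpdate replaces nodes by deleting and recreating machines.
	AutoRollingUpdate MachineUpdateStrategy = "AutoRollingUpdate"
	// AutoInPlaceUpdate updates Kubernetes/OS on the existing node automatically
	// (drain, update, reboot).
	AutoInPlaceUpdate MachineUpdateStrategy = "AutoInPlaceUpdate"
	// ManualInPlaceUpdate lets the operator choose which nodes to update when.
	ManualInPlaceUpdate MachineUpdateStrategy = "ManualInPlaceUpdate"
)

// requiresNodeReplacement captures the core difference: only the rolling
// strategy provisions new machines; the in-place strategies keep the node,
// its identity, and its locally attached storage.
func requiresNodeReplacement(s MachineUpdateStrategy) bool {
	return s == AutoRollingUpdate
}

func main() {
	for _, s := range []MachineUpdateStrategy{AutoRollingUpdate, AutoInPlaceUpdate, ManualInPlaceUpdate} {
		fmt.Printf("%-20s -> replace node: %v\n", s, requiresNodeReplacement(s))
	}
}
```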
Well, this feature is going to be provided by the Garden Linux team and we will have to make our experience with it. But if you already have some experience doing in-place OS updates and would like to make concrete points on why it is error-prone, then this can already be considered valid input for the GEP and for the Garden Linux team as well.
If you have stateful workloads which are running on local storage but you have no spare machines of this machine type, what is the actual use case? I mean, if the hardware breaks for whatever reason, you have nothing left. I can hardly imagine a production workload with such a risky setup.
granted
We at metal-stack.io create a new node in about the same time as AWS/GCP (2-3 mins to NodeReady) on bare metal. I thought doing in-place for VMs is not in scope.
This was done already in the past by the folks from CoreOS, later Flatcar, by doing an A/B update; I am not sure if they were able to deal with local storage at all. This was hard work they did and is not easy to implement for a new OS like Garden Linux. I am not a member of the Gardener team and you make the decision, I just want to give my $0.02 based on my experience with bare metal provisioning and OS maintainership.
@majst01 I am not yet done with my review, but since you asked:
It is not really a risky setup. It is what it is in the physical world - unless you have fully automated control over it like with MetalStack or IronCore or OpenStack Ironic. The workload itself is replicated and can survive, to a user-defined degree, "breaking hardware". RAID, PDBs, etc. work with that assumption. It's not that the first machine that goes under means the end of the world/workload for you. And in fact, the first main use case for this is indeed operating a hardware cluster of bare metal nodes providing Ceph to end users on top. That brought us into this topic and then more such scenarios entered the picture.
@majst01 With certain parameters (…) … In comparison, if you can run with … But the main reason is not that. I am still doing my review, but one of the suggested changes is to add the following sentence, which addresses why VMs also benefit from in-place updates: …
How do you want to skip deprovisioning? Doing a hard reset, risking damaged filesystems, is not what you want in the given scenario with local storage. On the other hand, rebooting a node with lots of pods and mounts will take some time, with big filesystems even longer, because the OS must flush all filesystem caches before rebooting. I am keen to see the real benefit of this approach!
That was the first thing the Garden Linux team did and we saw it working some months back. Garden Linux can directly boot into a new UKI, so it's actually working and impressively fast, too. The new image is downloaded while everything is still up and running. Then the pods get drained/evicted and then the reboot happens. If it succeeds, the node is ready pretty fast. If not, it may retry a configurable number of times, after which it will give up and reboot into the LKG (last known good) version, which is still sitting on the disk as a file (UKI), like all versions are. It will be configurable whether to purge the file system or not. And yes, we know that CoreOS used to do that, because we used to deploy CoreOS before CoreOS was swallowed by Red Hat, which was swallowed by IBM. When Red Hat bought them and the license changed/they wanted to get rid of their competition, we did Garden Linux and can now leverage the deep innovation advantages we have by controlling the OS nearly completely.
Please don't get me wrong either, and thank you for your feedback @majst01. It's not that we "want" to build "in-place updates" (OK, yes we do, but that was not the trigger), but that we are confronted with cases where terminating and recreating machines isn't an option or brings too many disadvantages (scarce resources) that block the classical rolling updates again and again. So, we actually "want" to solve real-world issues, not fictive theoretical issues.
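For readers unfamiliar with how such a bounded retry with fallback can be realized: one established scheme is systemd-boot's automatic boot assessment, where the boot entry carries a "+tries-left[-tries-done]" counter in its file name that the boot loader decrements on every attempt; once no tries are left, the previous (last known good) entry boots. Whether Garden Linux uses exactly this scheme is an assumption here; the small Go sketch below only simulates the counter handling for illustration.

```go
// Hypothetical illustration of systemd-boot-style boot counting; not Garden
// Linux or Gardener code. File names such as "uki-1592.2+3.efi" are made up.
package bootcount

import (
	"fmt"
	"regexp"
	"strconv"
)

var counterRe = regexp.MustCompile(`^(.*)\+(\d+)(?:-(\d+))?(\.efi)$`)

// decrement simulates one boot attempt: tries-left goes down by one and
// tries-done goes up by one, e.g. "uki-1592.2+3.efi" -> "uki-1592.2+2-1.efi".
// Once no tries are left, the boot loader would skip this entry and fall back
// to the previous (last known good) one.
func decrement(name string) (string, error) {
	m := counterRe.FindStringSubmatch(name)
	if m == nil {
		return "", fmt.Errorf("no boot counter in %q", name)
	}
	left, _ := strconv.Atoi(m[2])
	done := 0
	if m[3] != "" {
		done, _ = strconv.Atoi(m[3])
	}
	if left == 0 {
		return "", fmt.Errorf("%q has no tries left, falling back to the previous entry", name)
	}
	return fmt.Sprintf("%s+%d-%d%s", m[1], left-1, done+1, m[4]), nil
}
```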
Hard reset? You mean via the cloud provider API? No, Garden Linux / the kernel reboots itself.
But there are no pods anymore with the drain option. And should we implement the non-drain option, we would still SIGTERM/SIGKILL the containers first (without draining them). SIGKILL may create problems when the container restarts later, but (a) that's unlikely and (b) at least the OS itself is cleanly rebooting. In the end, that (SIGKILL-like effects) or worse (e.g. kernel panic, power outage) can happen for whatever reason today as well, and then the "regular means" apply. When the machine comes back and the pods/containers run again, all is good. If not, depending on pod/container or node issues, pods will be evicted or the node will go away as a whole (bare metal machines without a programmable API will never go away and will require human interaction, but that's nothing we can solve or aim to solve - that's a side effect of manual bare metal machines).
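As a rough, purely hypothetical sketch of such a no-drain shutdown path (not Gardener or Garden Linux code; the function, the way the PID is obtained, and the grace period are all assumptions), the escalation from SIGTERM to SIGKILL could look like this:

```go
// Hypothetical sketch: ask a container process to stop via SIGTERM and
// escalate to SIGKILL after a grace period, similar to what a no-drain reboot
// path might do before handing over to the OS reboot.
package shutdown

import (
	"os"
	"syscall"
	"time"
)

func terminateWithGrace(pid int, grace time.Duration) error {
	proc, err := os.FindProcess(pid)
	if err != nil {
		return err
	}
	// Ask the process to shut down cleanly first.
	if err := proc.Signal(syscall.SIGTERM); err != nil {
		return err
	}
	deadline := time.Now().Add(grace)
	for time.Now().Before(deadline) {
		// Signal 0 does not deliver anything; it only checks existence.
		if err := proc.Signal(syscall.Signal(0)); err != nil {
			return nil // process exited within the grace period
		}
		time.Sleep(200 * time.Millisecond)
	}
	// Grace period expired: force termination (this is the SIGKILL case
	// discussed above, which may leave the container's state less clean).
	return proc.Signal(syscall.SIGKILL)
}
```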
Sure, we will/should do a community demo to show/discuss that openly. You are right. cc @gehoern
Yes, a reboot will take tens of minutes, if not more, in the given scenario with hundreds of TB locally in use. Doing a hard reset will risk losing data in the worst case.
Maybe it's obvious to you, but can you please explain what you mean here @majst01? What TBs do you have in mind and why tens of minutes? I am completely lost as to what you believe will happen or what I do not understand will happen (and why local data will prolong the reboot).
You mentioned Ceph as one of the use cases. Ceph nodes typically have hundreds of TB of disks locally, paired with several hundred GB if not several TB of RAM; that's the whole point. The VFS layer in the Linux kernel caches writes in order to speed up writes and reads. These caches must be flushed before power-off. This might take tens of minutes, depending on the speed of your drives. Maybe if you go for an NVMe-only solution, this might not be that much. We have had Ceph in production on top of k8s for ~5 years and it's not fun.
Yes, indeed, that's the whole point. The point is to avoid replicating any of it.
Yes, in that case that will be necessary. Much, much better than replicating all data to a newly (re)created machine.
I am not pretending to be the expert here, but this request comes from a team that has similar experience and currently uses a non-Gardener Kubernetes setup, but exactly this update strategy. That's why they want Gardener to support the very same strategy: in-place, preserving also the node's identity, which is somehow also relevant for the ceph-operator. Rolling is not an option. These physical boxes and their physical disks cannot move, and the data cannot be replicated because the data volume is enormous. Only if hardware breaks does the system provide the resilience to replace individual failed hardware, but that's an exception that isn't/shouldn't/may not be invoked with a regular Gardener cluster upgrade.
Awesome, thank you very much for the GEP.
However, I have quite a few recommendations.
Can you also talk about how this affects the worker pool hash calculations?
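To illustrate what is behind this question: Gardener decides whether to roll a worker pool by hashing pool-relevant properties, so for in-place updates, version changes would need to be kept out of that hash. The following Go sketch is only an illustration of that idea; the field names, the hashing details, and the `InPlaceUpdates` toggle are assumptions, not the actual worker pool hash implementation:

```go
// Illustration only: a toy worker pool hash. For rolling-update pools, version
// fields are part of the hash, so bumping them rolls the nodes; for in-place
// pools, they are deliberately excluded. Field names are assumptions.
package main

import (
	"crypto/sha256"
	"fmt"
	"strings"
)

type workerPool struct {
	Name              string
	MachineType       string
	KubernetesVersion string
	MachineImage      string
	InPlaceUpdates    bool
}

func poolHash(p workerPool) string {
	parts := []string{p.Name, p.MachineType}
	if !p.InPlaceUpdates {
		// Only rolling-update pools let version changes alter the hash
		// (and therefore trigger node replacement).
		parts = append(parts, p.KubernetesVersion, p.MachineImage)
	}
	sum := sha256.Sum256([]byte(strings.Join(parts, "/")))
	return fmt.Sprintf("%x", sum[:8])
}

func main() {
	p := workerPool{Name: "pool-a", MachineType: "m5.large", KubernetesVersion: "1.30.0", MachineImage: "gardenlinux-1592", InPlaceUpdates: true}
	before := poolHash(p)
	p.KubernetesVersion = "1.31.0"
	fmt.Println(before == poolHash(p)) // true: the in-place pool is not rolled
}
```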
/retest
One thing to consider / that would still need to be addressed is how to handle updates to config files modified by the node agent. Let's say a config file like … . Therefore, no updates to config files are possible once the node agent has modified them. With the current design/implementation of in-place updates on the GL side, this shadowing of updated files would go unnoticed. At the moment, all modifications to … .

The first step would probably be to make such shadowing detectable in the first place, e.g. by also providing access to the … .

Now, how the error could be resolved upon such a detection remains an open question and can only really be addressed by gardener / gardener-node-agent, as understanding which modifications are required requires knowledge that is not available to the OS.
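A minimal sketch of what the detection step could look like, under the assumption that the pristine (OS-shipped) content of a config file is accessible somewhere for comparison; all paths and the helper below are hypothetical and not the GL or gardener-node-agent implementation:

```go
// Hypothetical sketch: detect whether an OS update to a config file would be
// "shadowed" by a local modification made by the node agent. The pristine-copy
// paths are made up for illustration.
package main

import (
	"crypto/sha256"
	"fmt"
	"os"
)

func fileDigest(path string) ([32]byte, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return [32]byte{}, err
	}
	return sha256.Sum256(data), nil
}

// detectShadowing reports whether the live file was locally modified (differs
// from the old pristine copy) while the OS update actually changed that file
// (new pristine copy differs from the old one), i.e. the update would be hidden.
func detectShadowing(live, oldPristine, newPristine string) (bool, error) {
	l, err := fileDigest(live)
	if err != nil {
		return false, err
	}
	o, err := fileDigest(oldPristine)
	if err != nil {
		return false, err
	}
	n, err := fileDigest(newPristine)
	if err != nil {
		return false, err
	}
	return l != o && n != o, nil
}

func main() {
	shadowed, err := detectShadowing(
		"/etc/containerd/config.toml",                               // live file on the node
		"/usr/share/factory/etc/containerd/config.toml",             // old image's pristine copy (hypothetical)
		"/run/next-os/usr/share/factory/etc/containerd/config.toml", // new image's pristine copy (hypothetical)
	)
	if err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println("update would be shadowed:", shadowed)
}
```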
Couldn't we create our own …?
Thank you for reading through my comments and making changes. I replied in-line to a few follow-up questions that were raised.
/lgtm
LGTM label has been added. Git tree hash: 3861b4ca17b3898c3652b81721bee66e26515c48
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: acumino. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
* GEP-31: In-Place Node Updates of Shoot Clusters
  Co-authored-by: Ashish Ranjan Yadav <[email protected]>
  Co-authored-by: Shafeeque E S <[email protected]>
* Address PR review feedback
* Address PR review feedback
* Apply suggestions from code review
* Minor cosmetics
* Address PR review feedback
* Add reviewers
* Address PR review feedback
* Rename `AutoReplaceUpdate` to `AutoRollingUpdate`

Co-authored-by: Ashish Ranjan Yadav <[email protected]>
Co-authored-by: Shafeeque E S <[email protected]>
Co-authored-by: Ashish Ranjan Yadav [email protected]
Co-authored-by: Shafeeque E S [email protected]
How to categorize this PR?
/area documentation
/kind enhancement
What this PR does / why we need it:
This PR proposes GEP-31 for in-place node updates, allowing seamless updates to Kubernetes and OS versions without node replacement.
Which issue(s) this PR fixes:
Part of #10219
Special notes for your reviewer:
Release note: