[node-agent] Delete systemd unit files and drop-ins only if the unit has been created by node-agent #10918

Closed

Conversation

oliver-goetz
Member

How to categorize this PR?

/area robustness|
/kind bug

What this PR does / why we need it:
Currently gardener-node-agent always deletes a systemd unit file and the drop-in folder when the unit was deleted from OperatingSystemConfig.
While this is correct for units which have been created by gardener-node-agent, it incorrectly deletes the *.service file and the drop-in folder for units created by other parties (like default OS units). This can be the case if we use the unit to add additional drop-ins or to change the start command.

With this PR, unit file and drop-in folder are deleted only if Unit.Content of the old unit is not nil.
Otherwise, it keeps the unit file and deletes the drop-in files only.
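
A minimal Go sketch of the rule described above, for illustration only. It is not the code in this PR: the `Unit` type, the `pathsToDelete` helper, and the `/etc/systemd/system` base directory are simplified assumptions, and the function only computes the paths rather than touching the file system.

```go
package main

import (
	"fmt"
	"path"
)

// Unit is a simplified, hypothetical stand-in for a unit entry in the
// OperatingSystemConfig. It is not the real Gardener API type.
type Unit struct {
	Name    string
	Content *string  // nil: the unit file itself was not written by the agent
	DropIns []string // names of drop-in files the agent added for this unit
}

// systemdDir is an assumed location for unit files and drop-in folders.
const systemdDir = "/etc/systemd/system"

// pathsToDelete returns the paths that would be removed once the old unit
// no longer appears in the configuration.
func pathsToDelete(old Unit) []string {
	dropInDir := path.Join(systemdDir, old.Name+".d")

	if old.Content != nil {
		// The agent created the unit: remove the unit file and the whole drop-in folder.
		return []string{path.Join(systemdDir, old.Name), dropInDir}
	}

	// The unit belongs to someone else (for example a default OS unit):
	// keep the unit file and remove only the drop-in files the agent added.
	paths := make([]string, 0, len(old.DropIns))
	for _, name := range old.DropIns {
		paths = append(paths, path.Join(dropInDir, name))
	}
	return paths
}

func main() {
	content := "[Service]\nExecStart=/bin/true\n"
	fmt.Println(pathsToDelete(Unit{Name: "custom.service", Content: &content}))
	fmt.Println(pathsToDelete(Unit{Name: "containerd.service", DropIns: []string{"10-extra.conf"}}))
}
```

In the second call the unit has no content, so only the single drop-in file is listed and the OS-provided unit file and any foreign drop-ins stay in place.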

Which issue(s) this PR fixes:
Fixes #10809

Special notes for your reviewer:
/cc @rfranzke

Release note:

`gardener-node-agent` deletes unit files and drop-ins only if it created them previously.

@gardener-prow gardener-prow bot requested a review from rfranzke November 25, 2024 18:46
@gardener-prow gardener-prow bot added the kind/bug Bug label Nov 25, 2024
Contributor

gardener-prow bot commented Nov 25, 2024

@oliver-goetz: The label(s) area/robustness| cannot be applied, because the repository doesn't have them.


@gardener-prow gardener-prow bot added cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 25, 2024
@oliver-goetz
Member Author

/area robustness

@gardener-prow gardener-prow bot added the area/robustness Robustness, reliability, resilience related label Nov 25, 2024
@oliver-goetz oliver-goetz force-pushed the fix/node-agent-unit-deletion branch from c0c2950 to 801339b Compare November 26, 2024 06:43
@gardener-prow gardener-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 26, 2024
Member

@LucaBernstein LucaBernstein left a comment

/lgtm

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Nov 26, 2024
Contributor

gardener-prow bot commented Nov 26, 2024

LGTM label has been added.

Git tree hash: f12bbf3d3b34aff1e90b0ad118f6ce774fb21442

@rfranzke
Member

/assign

@oliver-goetz oliver-goetz force-pushed the fix/node-agent-unit-deletion branch from 801339b to 1f5d562 Compare November 28, 2024 13:33
@gardener-prow gardener-prow bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 28, 2024
Contributor

gardener-prow bot commented Nov 28, 2024

New changes are detected. LGTM label has been removed.

Contributor

gardener-prow bot commented Nov 28, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from lucabernstein. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 28, 2024
@oliver-goetz oliver-goetz requested a review from rfranzke December 1, 2024 16:06
@rfranzke
Member

rfranzke commented Dec 9, 2024

@maboehm Since you recently looked into a similar topic, do you want to take a look at this one?

Contributor

@maboehm maboehm left a comment

I think I got the gist of the implementation, in general it looks good to me. Just have one suggestion for better documentation.

Comment on lines +361 to +362
} else if ok {
return nil, nil
Contributor

nit: I spent a couple of minutes understanding why you return nil if an operation exists in the map. Maybe the function name isn't the best, or we can add a comment explaining the behavior here.
Because here returning nil means "I have already determined the state of this file", which is a surprising result for a function called getState().

@oliver-goetz
Member Author

In this PR there is one open point, which is the initial state. When the machine is booted for the first time, we apply the initial configuration via user data or ignition.
This configuration might include systemd units like sshd-ensurer when SSH access is enabled. We have to add this unit to the state so that GNA can deactivate it later if SSH access is disabled.
At this point things become complicated, because GNA is not running yet and there are different ways to start it. Additionally, sshd-ensurer is just one example of the problem. Extensions might add their own systemd units for bootstrapping too.

@rfranzke and I had a chat about this open point. We came to the conclusion that the node-agent file system is a good solution which addresses the wrong problem 😅

Thus, we go back to an adapted version of the original idea. I opened #11015 for it. We can keep this PR in case we find the right problem for it.

/close

@gardener-prow gardener-prow bot closed this Dec 10, 2024
Contributor

gardener-prow bot commented Dec 10, 2024

@oliver-goetz: Closed this PR.


Labels
area/robustness Robustness, reliability, resilience related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/bug Bug size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Gardener Node Agent deletes containerd drop-in directory
4 participants