
Only reap zombie stunnel processes #514


Merged 1 commit on Jul 15, 2021

Conversation

@wongma7 (Contributor) commented on Jul 15, 2021

Is this a bug fix or a new feature?

What is this PR about? / Why do we need it?

I suspect our reaper is (rarely) causing mounts to error with 'wait: no child processes', because it might be catching SIGCHLD and waiting for mount child processes before os/exec can wait for them: https://cs.opensource.google/go/go/+/refs/tags/go1.16.6:src/os/exec_unix.go;l=43;drc=refs%2Ftags%2Fgo1.16.6.
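
For illustration only (this is not code from this repo; the shape of the reaper goroutine is my assumption): a catch-all reaper that calls wait(-1) on every SIGCHLD can collect the exit status of a mount helper started via os/exec before cmd.Wait does, which surfaces exactly as 'wait: no child processes'.

// Hypothetical illustration of the race (not code from this driver).
package main

import (
	"fmt"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGCHLD)

	// Catch-all reaper: waits for *any* child, including ones that
	// os/exec still intends to wait for itself.
	go func() {
		for range sigs {
			var ws syscall.WaitStatus
			// May steal the exit status of the mount child.
			syscall.Wait4(-1, &ws, syscall.WNOHANG, nil)
		}
	}()

	// Stand-in for the driver shelling out to mount.
	cmd := exec.Command("true")
	if err := cmd.Run(); err != nil {
		// If the reaper wins the race, this prints
		// "wait: no child processes".
		fmt.Println(err)
	}
}

Restricting what the reaper is allowed to wait on, as this PR does, avoids the race because the mount children are never candidates for reaping.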

Since the intention of the reaper is to clean up zombie stunnel processes, which, as explained in the code comment, "become zombies because their parent is the driver process but they get terminated by the efs-utils watchdog process", this PR limits our reaper to cleaning up only those.
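
Roughly, the restricted check looks like the sketch below. This is only an illustration under my assumptions: the helper name, the /proc state parsing, and the exact "stunnel" executable match are stand-ins, and the real logic lives in the driver's reaper.go.

// Hypothetical sketch: only wait on a process if it is (a) a zombie,
// (b) a child of this driver process, and (c) stunnel.
package main

import (
	"fmt"
	"os"
	"strings"
	"syscall"

	ps "github.com/mitchellh/go-ps"
)

// isZombie reports whether /proc/<pid>/stat lists state "Z".
func isZombie(pid int) bool {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return false
	}
	stat := string(data)
	// The state field comes right after the parenthesized command name.
	fields := strings.Fields(stat[strings.LastIndexByte(stat, ')')+1:])
	return len(fields) > 0 && fields[0] == "Z"
}

// waitIfZombieStunnel reaps p only if it is a zombie stunnel child of ours.
func waitIfZombieStunnel(p ps.Process) bool {
	if p.PPid() != os.Getpid() || p.Executable() != "stunnel" || !isZombie(p.Pid()) {
		return false
	}
	var ws syscall.WaitStatus
	_, err := syscall.Wait4(p.Pid(), &ws, syscall.WNOHANG, nil)
	return err == nil
}

func main() {
	procs, _ := ps.Processes()
	for _, p := range procs {
		if waitIfZombieStunnel(p) {
			break // reap at most one zombie per pass
		}
	}
}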

Arguably, maintaining this reaper and having it constantly read the process table is overkill. In most cases it should be fine to simply let the zombie processes be (the pid limit is 4194303), but in the interest of not changing too much too quickly, I'd rather we try this reaper change before we remove the reaper altogether.

(previous attempt to simply remove the reaper: #509)

What testing is done?
Build and deploy the image.
Use the dynamic provisioning example (with my own fs, of course).

$ k apply -f examples/kubernetes/dynamic_provisioning_test_cluster_1/specs/
$ k exec -n kube-system efs-csi-node-8xgsw -it -- sh
sh-4.2# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.4 119932 17296 ?        Ssl  01:10   0:00 /bin/aws-efs-csi-driver --endpoint=unix:/csi/csi.sock --logtostderr --v=4
root          17  0.1  0.4 110812 18876 ?        S    01:10   0:00 python3 /usr/bin/amazon-efs-mount-watchdog
root          31  0.0  0.1 180072  5888 ?        Ss   01:11   0:00 /usr/bin/stunnel /var/run/efs/stunnel-config.fs-176f8111.var.lib.kubelet.pods.bad00a46
root          55  0.0  0.0  11672  2836 pts/0    Ss   01:12   0:00 sh
root          74  0.0  0.0  51848  3456 pts/0    R+   01:13   0:00 ps aux

Delete the pod from the example and then check the logs.

$ k delete po efs-app
$ k logs -n kube-system efs-csi-node-8xgsw efs-plugin
I0715 01:13:21.821427       1 reaper.go:103] reaper: waited for process &{31 1 90 31 31 stunnel}
$ k exec -n kube-system efs-csi-node-8xgsw -it -- sh
sh-4.2# cat /var/log/amazon/efs/mount-watchdog.log 
2021-07-15 01:12:51,759 - INFO - No mount found for "fs-176f8111.var.lib.kubelet.pods.bad00a46-fb71-4e48-a653-a533c267d349.volumes.kubernetes.io~csi.pvc-5ce2203f-7dbf-469a-b05d-7d89a8e7122c.mount.20195"
2021-07-15 01:13:21,820 - INFO - Unmount grace period expired for fs-176f8111.var.lib.kubelet.pods.bad00a46-fb71-4e48-a653-a533c267d349.volumes.kubernetes.io~csi.pvc-5ce2203f-7dbf-469a-b05d-7d89a8e7122c.mount.20195
2021-07-15 01:13:21,820 - INFO - Terminating running TLS tunnel - PID: 31, group ID: 31
2021-07-15 01:13:21,820 - INFO - TLS tunnel: 31 is still running, will retry termination
2021-07-15 01:13:22,822 - INFO - Unmount grace period expired for fs-176f8111.var.lib.kubelet.pods.bad00a46-fb71-4e48-a653-a533c267d349.volumes.kubernetes.io~csi.pvc-5ce2203f-7dbf-469a-b05d-7d89a8e7122c.mount.20195
2021-07-15 01:13:22,822 - INFO - TLS tunnel: 31 is no longer running, cleaning up state
sh-4.2# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.4 119932 17296 ?        Ssl  01:10   0:00 /bin/aws-efs-csi-driver --endpoint=unix:/csi/csi.sock --logtostderr --v=4
root          17  0.1  0.4 111068 19028 ?        S    01:10   0:00 python3 /usr/bin/amazon-efs-mount-watchdog
root          55  0.0  0.0  11672  2856 pts/0    Ss+  01:12   0:00 sh
root          76  0.0  0.0  11672  2852 pts/1    Ss   01:22   0:00 sh
root          82  0.0  0.0  51848  3392 pts/1    R+   01:22   0:00 ps aux

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wongma7

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.), and size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.) labels on Jul 15, 2021.
klog.V(4).Infof("Waited for child process %d", childPid)
for _, p := range procs {
    reaped := waitIfZombieStunnel(p)
    if reaped {

Review comment: This will stop after encountering the first zombie process. What happens if there are more?

@wongma7 (Contributor, Author) replied:

Each SIGCHLD received from the channel should result in just one process being reaped. It doesn't matter whether the process we reap is the same one whose termination sent us the SIGCHLD, as long as we always clean up one zombie per SIGCHLD. I'll add a comment that reaping one process per signal received over the channel is intentional.
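
A sketch of that shape (again only illustrative; waitIfZombieStunnel is the hypothetical helper sketched in the description, stubbed here so the snippet stands alone):

// Sketch: reap at most one zombie stunnel per SIGCHLD delivered on the channel.
package main

import (
	"os"
	"os/signal"
	"syscall"

	ps "github.com/mitchellh/go-ps"
)

// Stub standing in for the helper sketched above.
func waitIfZombieStunnel(p ps.Process) bool { return false }

func main() {
	sigs := make(chan os.Signal, 128) // buffered; the signal package never blocks sending
	signal.Notify(sigs, syscall.SIGCHLD)

	for range sigs {
		procs, err := ps.Processes()
		if err != nil {
			continue
		}
		for _, p := range procs {
			if waitIfZombieStunnel(p) {
				// One reap per signal: it need not be the exact child
				// whose exit produced this SIGCHLD, as long as every
				// SIGCHLD retires at most one zombie stunnel.
				break
			}
		}
	}
}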

Comment on lines -7 to +8 (go.mod)
github.com/google/uuid v1.1.1
github.com/kubernetes-csi/csi-test v1.1.1
github.com/kubernetes-csi/csi-test/v3 v3.1.1 // indirect
github.com/mitchellh/go-ps v0.0.0-20170309133038-4fdf99ab2936

Review comment: Thanks for cleaning up.

@kbasv commented on Jul 15, 2021

/lgtm

Labels
approved - Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
lgtm - "Looks good to me", indicates that a PR is ready to be merged.
size/XXL - Denotes a PR that changes 1000+ lines, ignoring generated files.
3 participants