
Only reap zombie stunnel processes #514


Merged 1 commit on Jul 15, 2021

Conversation

@wongma7 (Contributor) commented on Jul 15, 2021

Is this a bug fix or a new feature?

What is this PR about? / Why do we need it?

I suspect our reaper is (rarely) causing mounts to error with 'wait: no child processes', because it might be catching SIGCHLD and waiting for mount child processes before os/exec can wait for them: https://cs.opensource.google/go/go/+/refs/tags/go1.16.6:src/os/exec_unix.go;l=43;drc=refs%2Ftags%2Fgo1.16.6.
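
For illustration only (this is not code from this repo; the shape of the reaper goroutine is my assumption): a catch-all reaper that calls wait(-1) on every SIGCHLD can collect the exit status of a mount helper started via os/exec before cmd.Wait does, which surfaces exactly as 'wait: no child processes'.

// Hypothetical illustration of the race (not code from this driver).
package main

import (
	"fmt"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGCHLD)

	// Catch-all reaper: waits for *any* child, including ones that
	// os/exec still intends to wait for itself.
	go func() {
		for range sigs {
			var ws syscall.WaitStatus
			// May steal the exit status of the mount child.
			syscall.Wait4(-1, &ws, syscall.WNOHANG, nil)
		}
	}()

	// Stand-in for the driver shelling out to mount.
	cmd := exec.Command("true")
	if err := cmd.Run(); err != nil {
		// If the reaper wins the race, this prints
		// "wait: no child processes".
		fmt.Println(err)
	}
}

Restricting what the reaper is allowed to wait on, as this PR does, avoids the race because the mount children are never candidates for reaping.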

Since the intention of the reaper is to clean up zombie stunnel processes, which, as explained in the code comment, "become zombies because their parent is the driver process but they get terminated by the efs-utils watchdog process", this PR limits our reaper to cleaning up only those.
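
Roughly, the restricted check looks like the sketch below. This is only an illustration under my assumptions: the helper name, the /proc state parsing, and the exact "stunnel" executable match are stand-ins, and the real logic lives in the driver's reaper.go.

// Hypothetical sketch: only wait on a process if it is (a) a zombie,
// (b) a child of this driver process, and (c) stunnel.
package main

import (
	"fmt"
	"os"
	"strings"
	"syscall"

	ps "github.com/mitchellh/go-ps"
)

// isZombie reports whether /proc/<pid>/stat lists state "Z".
func isZombie(pid int) bool {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return false
	}
	stat := string(data)
	// The state field comes right after the parenthesized command name.
	fields := strings.Fields(stat[strings.LastIndexByte(stat, ')')+1:])
	return len(fields) > 0 && fields[0] == "Z"
}

// waitIfZombieStunnel reaps p only if it is a zombie stunnel child of ours.
func waitIfZombieStunnel(p ps.Process) bool {
	if p.PPid() != os.Getpid() || p.Executable() != "stunnel" || !isZombie(p.Pid()) {
		return false
	}
	var ws syscall.WaitStatus
	_, err := syscall.Wait4(p.Pid(), &ws, syscall.WNOHANG, nil)
	return err == nil
}

func main() {
	procs, _ := ps.Processes()
	for _, p := range procs {
		if waitIfZombieStunnel(p) {
			break // reap at most one zombie per pass
		}
	}
}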

Arguably, maintaining this reaper and having it constantly read the process table is overkill. In most cases it should be fine to simply let the zombie processes be (the pid limit is 4194303), but in the interest of not changing too much too quickly, I'd rather we try this reaper change before we remove the reaper altogether.

(previous attempt to simply remove the reaper: #509)

What testing is done?
Build and deploy the image.
Use the dynamic provisioning example (with my own fs, of course).

$ k apply -f examples/kubernetes/dynamic_provisioning_test_cluster_1/specs/
$ k exec -n kube-system efs-csi-node-8xgsw -it -- sh
sh-4.2# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.4 119932 17296 ?        Ssl  01:10   0:00 /bin/aws-efs-csi-driver --endpoint=unix:/csi/csi.sock --logtostderr --v=4
root          17  0.1  0.4 110812 18876 ?        S    01:10   0:00 python3 /usr/bin/amazon-efs-mount-watchdog
root          31  0.0  0.1 180072  5888 ?        Ss   01:11   0:00 /usr/bin/stunnel /var/run/efs/stunnel-config.fs-176f8111.var.lib.kubelet.pods.bad00a46
root          55  0.0  0.0  11672  2836 pts/0    Ss   01:12   0:00 sh
root          74  0.0  0.0  51848  3456 pts/0    R+   01:13   0:00 ps aux

Delete the pod from the example and then check the logs.

$ k delete po efs-app
$ k logs -n kube-system efs-csi-node-8xgsw efs-plugin
I0715 01:13:21.821427       1 reaper.go:103] reaper: waited for process &{31 1 90 31 31 stunnel}
$ k exec -n kube-system efs-csi-node-8xgsw -it -- sh
sh-4.2# cat /var/log/amazon/efs/mount-watchdog.log 
2021-07-15 01:12:51,759 - INFO - No mount found for "fs-176f8111.var.lib.kubelet.pods.bad00a46-fb71-4e48-a653-a533c267d349.volumes.kubernetes.io~csi.pvc-5ce2203f-7dbf-469a-b05d-7d89a8e7122c.mount.20195"
2021-07-15 01:13:21,820 - INFO - Unmount grace period expired for fs-176f8111.var.lib.kubelet.pods.bad00a46-fb71-4e48-a653-a533c267d349.volumes.kubernetes.io~csi.pvc-5ce2203f-7dbf-469a-b05d-7d89a8e7122c.mount.20195
2021-07-15 01:13:21,820 - INFO - Terminating running TLS tunnel - PID: 31, group ID: 31
2021-07-15 01:13:21,820 - INFO - TLS tunnel: 31 is still running, will retry termination
2021-07-15 01:13:22,822 - INFO - Unmount grace period expired for fs-176f8111.var.lib.kubelet.pods.bad00a46-fb71-4e48-a653-a533c267d349.volumes.kubernetes.io~csi.pvc-5ce2203f-7dbf-469a-b05d-7d89a8e7122c.mount.20195
2021-07-15 01:13:22,822 - INFO - TLS tunnel: 31 is no longer running, cleaning up state
sh-4.2# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.4 119932 17296 ?        Ssl  01:10   0:00 /bin/aws-efs-csi-driver --endpoint=unix:/csi/csi.sock --logtostderr --v=4
root          17  0.1  0.4 111068 19028 ?        S    01:10   0:00 python3 /usr/bin/amazon-efs-mount-watchdog
root          55  0.0  0.0  11672  2856 pts/0    Ss+  01:12   0:00 sh
root          76  0.0  0.0  11672  2852 pts/1    Ss   01:22   0:00 sh
root          82  0.0  0.0  51848  3392 pts/1    R+   01:22   0:00 ps aux

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wongma7

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.), and size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.) labels on Jul 15, 2021.
klog.V(4).Infof("Waited for child process %d", childPid)
for _, p := range procs {
    reaped := waitIfZombieStunnel(p)
    if reaped {

Review comment: This will stop after encountering the first zombie process. What happens if there are more?

@wongma7 (Contributor, Author) replied:

Each SIGCHLD received from the channel should result in just one process being reaped. It doesn't matter whether the process we reap is the same one whose termination sent us the SIGCHLD, as long as we always clean up one zombie per SIGCHLD. I'll add a comment that reaping one process per signal received over the channel is intentional.
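
A sketch of that shape (again only illustrative; waitIfZombieStunnel is the hypothetical helper sketched in the description, stubbed here so the snippet stands alone):

// Sketch: reap at most one zombie stunnel per SIGCHLD delivered on the channel.
package main

import (
	"os"
	"os/signal"
	"syscall"

	ps "github.com/mitchellh/go-ps"
)

// Stub standing in for the helper sketched above.
func waitIfZombieStunnel(p ps.Process) bool { return false }

func main() {
	sigs := make(chan os.Signal, 128) // buffered; the signal package never blocks sending
	signal.Notify(sigs, syscall.SIGCHLD)

	for range sigs {
		procs, err := ps.Processes()
		if err != nil {
			continue
		}
		for _, p := range procs {
			if waitIfZombieStunnel(p) {
				// One reap per signal: it need not be the exact child
				// whose exit produced this SIGCHLD, as long as every
				// SIGCHLD retires at most one zombie stunnel.
				break
			}
		}
	}
}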

Comment on lines -7 to +8 (go.mod)
github.com/google/uuid v1.1.1
github.com/kubernetes-csi/csi-test v1.1.1
github.com/kubernetes-csi/csi-test/v3 v3.1.1 // indirect
github.com/mitchellh/go-ps v0.0.0-20170309133038-4fdf99ab2936

Review comment: Thanks for cleaning up.

@kbasv commented on Jul 15, 2021

/lgtm

Labels
approved - Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
lgtm - "Looks good to me", indicates that a PR is ready to be merged.
size/XXL - Denotes a PR that changes 1000+ lines, ignoring generated files.
3 participants