rr --wait stuck not reaping zombie process

I've got a very occasional issue where `rr --wait` is not properly reaping some child processes that have exited, and thus rr itself is stuck forever and not exiting.

The situation looks like this. There's an rr process, pid 1566, which has no traced children except for the defunct zombie process 1675.

```
[root@2afe6f2c30c3 /]# ps auxf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        1759  0.0  0.0   4984  3760 ?        S    06:27   0:00 -bash
root        1788  0.0  0.0   5796  3648 ?        R+   06:27   0:00  \_ ps auxf
minipc-+       1  0.0  0.0   2416  1240 ?        Ss   Nov13   0:00 /usr/bin/dumb-init -- ../build-ruby.rb --test-tool --rr --asan --pernosco
minipc-+       7  0.0  0.3 1889412 198044 ?      Ssl  Nov13   0:53 /usr/bin/ruby-mri ../build-ruby.rb --test-tool --rr --asan --pernosco
minipc-+    1566  0.0  0.1 933848 126316 ?       tl   Nov13   0:08  \_ rr record --output-trace-dir test_output_dir/tool__test__testunit__test_parallel/rr_trace --asan --wait --disable-avx-512 --disable-cpuid-features 0x80050440,0x40140400 --disable-cpuid-features-ext 0xc405814,0xe73fa021,0x3eff8ef --disable-cpuid-
minipc-+    1675  0.0  0.0      0     0 ?        Zl   Nov13   0:00 [ruby] <defunct>
```

Process 1675 has two threads. The thread-group-leader is in a zombie state, but the second thread is in a ptrace-stop:

```
[root@2afe6f2c30c3 /]# cat /proc/1675/status
Name:   ruby
State:  Z (zombie)
...
[root@2afe6f2c30c3 /]# cat /proc/1675/task/1681/status 
Name:   parallel.rb:40
Umask:  0022
State:  t (tracing stop)
```

I attached gdb to rr itself, to investigate why it got stuck into this state. rr is tracking both of these two threads still:

```
(gdb) frame 5
#5  0x00000000004dab3c in rr::RecordSession::record_step (this=0x24892fa0) at /root/rr/src/RecordSession.cc:2608
warning: 2608   /root/rr/src/RecordSession.cc: No such file or directory
(gdb) print this->task_map
$1 = std::map with 2 elements = {[1675] = 0x24918580, [1681] = 0x24936c50}
```

The thread-group-leader cannot be reaped because rr needs to reap the other threads first (https://github.com/rr-debugger/rr/blob/f7067f1c208556526a1a6f4eeeac200bebb2955a/src/RecordTask.cc#L2146-L2151)

```
(gdb) print ((rr::RecordTask*)(0x24918580))->waiting_for_reap
$7 = true
(gdb) print ((rr::RecordTask*)(0x24918580))->may_reap()
$8 = false
```

so the question we need to investigate is, why is the thread 1681 not yet reaped? We've seen it's PTRACE_EVENT_EXIT - that's the tracing-stop we're currently in. `(394623 >> 16) & 0xFF == 6` which is `PTRACE_EVENT_EXIT` - https://elixir.bootlin.com/linux/v6.11.6/source/include/uapi/linux/ptrace.h#L161

```
(gdb) print ((rr::RecordTask*)(0x24936c50))->wait_status
$13 = {status = 394623}
```

If we look at some of its instance variables:

```
(gdb) print ((rr::RecordTask*)(0x24936c50))->seen_ptrace_exit_event_
$18 = true
(gdb) print ((rr::RecordTask*)(0x24936c50))->handled_ptrace_exit_event_
$19 = false
(gdb) print ((rr::RecordTask*)(0x24936c50))->was_reaped_
$20 = false
(gdb) print ((rr::RecordTask*)(0x24936c50))->in_unexpected_exit
$21 = false
(gdb) print (('rr::RecordTask'*)(0x24936c50))->emulated_stop_type
$3 = rr::NOT_STOPPED
(gdb) print (('rr::RecordTask'*)(0x24936c50))->is_stopped_
$4 = true
(gdb) print (('rr::RecordTask'*)(0x24936c50))->waiting_for_ptrace_exit
$5 = true
```
There are only two places that `seen_ptrace_exit_event_` can be set:

* https://github.com/rr-debugger/rr/blob/f7067f1c208556526a1a6f4eeeac200bebb2955a/src/Task.cc#L483 - but if this line had been hit, then `handled_ptrace_exit_event_` would also be true, which it's not
* https://github.com/rr-debugger/rr/blob/f7067f1c208556526a1a6f4eeeac200bebb2955a/src/Task.cc#L2420 - so it must have been set here (and hitting this line implies `was_reaped_` is false too, which it is).

rr itself is blocked in this call to `waitid(-1)`:

```
(gdb) bt
#0  0x00007fa2c585807d in __waitid (idtype=P_ALL, id=4294967295, infop=0x7ffd00c63eb0, options=1073741830) at ../sysdeps/unix/sysv/linux/waitid.c:29
#1  0x00000000005bafef in rr::WaitState::do_wait (this=this@entry=0x77c2c0 <_ZZN2rrL10wait_stateEvE12static_state.lto_priv.0>, tid=-1, consume=<optimized out>, type=type@entry=6, block_seconds=<optimized out>, status=...) at /root/rr/src/WaitManager.cc:58
#2  0x00000000005bb3b3 in rr::WaitState::wait (this=0x77c2c0 <_ZZN2rrL10wait_stateEvE12static_state.lto_priv.0>, options=..., type=6) at /root/rr/src/WaitManager.cc:114
#3  0x000000000055f597 in rr::wait_any (tid=@0x7ffd00c640b0: 1095914576, status=..., timeout=<optimized out>) at /root/rr/src/Scheduler.cc:569
#4  0x0000000000560665 in rr::Scheduler::reschedule (this=0x248931f0, switchable=<optimized out>) at /root/rr/src/Scheduler.cc:926
#5  0x00000000004dab3c in rr::RecordSession::record_step (this=0x24892fa0) at /root/rr/src/RecordSession.cc:2608
#6  0x00000000004ccd33 in rr::record (args=..., flags=...) at /root/rr/src/RecordCommand.cc:721
#7  rr::RecordCommand::run (this=<optimized out>, args=...) at /root/rr/src/RecordCommand.cc:884
#8  0x000000000041cf90 in main (argc=<optimized out>, argv=<optimized out>) at /root/rr/src/main.cc:278
```

This can't make any forward progress, because what needs to happen (I think) is that we need to continue thread 1681 so it leaves the ptrace-stop and makes it to a zombie state, which will generate another wait notification with the status and allow everything to be cleaned up.

I'm not entirely sure how we got into this state, but i suspect the fact that `waiting_for_ptrace_exit == true` is part of the story - that gets set when a core-dumping signal is sent to the process (https://github.com/rr-debugger/rr/blob/f7067f1c208556526a1a6f4eeeac200bebb2955a/src/RecordSession.cc#L1884).

I'll try and debug this a bit further this weekend but I wondered if this triggered any spidy-senses about what to look for here :thinking: 


	for (auto it : thread_group()->task_set()) {
	if (&*it != this) {
	LOG(debug) << "Declining to reap " << tid << "; leader of non-empty thread-group with active thread " << it->tid;
	return false;
	}
	}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

rr --wait stuck not reaping zombie process #3882

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

rr --wait stuck not reaping zombie process #3882

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions