Skip to content

rr --wait stuck not reaping zombie process #3882

@KJTsanaktsidis

Description

@KJTsanaktsidis

I've got a very occasional issue where rr --wait is not properly reaping some child processes that have exited, and thus rr itself is stuck forever and not exiting.

The situation looks like this. There's an rr process, pid 1566, which has no traced children except for the defunct zombie process 1675.

[root@2afe6f2c30c3 /]# ps auxf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        1759  0.0  0.0   4984  3760 ?        S    06:27   0:00 -bash
root        1788  0.0  0.0   5796  3648 ?        R+   06:27   0:00  \_ ps auxf
minipc-+       1  0.0  0.0   2416  1240 ?        Ss   Nov13   0:00 /usr/bin/dumb-init -- ../build-ruby.rb --test-tool --rr --asan --pernosco
minipc-+       7  0.0  0.3 1889412 198044 ?      Ssl  Nov13   0:53 /usr/bin/ruby-mri ../build-ruby.rb --test-tool --rr --asan --pernosco
minipc-+    1566  0.0  0.1 933848 126316 ?       tl   Nov13   0:08  \_ rr record --output-trace-dir test_output_dir/tool__test__testunit__test_parallel/rr_trace --asan --wait --disable-avx-512 --disable-cpuid-features 0x80050440,0x40140400 --disable-cpuid-features-ext 0xc405814,0xe73fa021,0x3eff8ef --disable-cpuid-
minipc-+    1675  0.0  0.0      0     0 ?        Zl   Nov13   0:00 [ruby] <defunct>

Process 1675 has two threads. The thread-group-leader is in a zombie state, but the second thread is in a ptrace-stop:

[root@2afe6f2c30c3 /]# cat /proc/1675/status
Name:   ruby
State:  Z (zombie)
...
[root@2afe6f2c30c3 /]# cat /proc/1675/task/1681/status 
Name:   parallel.rb:40
Umask:  0022
State:  t (tracing stop)

I attached gdb to rr itself, to investigate why it got stuck into this state. rr is tracking both of these two threads still:

(gdb) frame 5
#5  0x00000000004dab3c in rr::RecordSession::record_step (this=0x24892fa0) at /root/rr/src/RecordSession.cc:2608
warning: 2608   /root/rr/src/RecordSession.cc: No such file or directory
(gdb) print this->task_map
$1 = std::map with 2 elements = {[1675] = 0x24918580, [1681] = 0x24936c50}

The thread-group-leader cannot be reaped because rr needs to reap the other threads first (

rr/src/RecordTask.cc

Lines 2146 to 2151 in f7067f1

for (auto it : thread_group()->task_set()) {
if (&*it != this) {
LOG(debug) << "Declining to reap " << tid << "; leader of non-empty thread-group with active thread " << it->tid;
return false;
}
}
)

(gdb) print ((rr::RecordTask*)(0x24918580))->waiting_for_reap
$7 = true
(gdb) print ((rr::RecordTask*)(0x24918580))->may_reap()
$8 = false

so the question we need to investigate is, why is the thread 1681 not yet reaped? We've seen it's PTRACE_EVENT_EXIT - that's the tracing-stop we're currently in. (394623 >> 16) & 0xFF == 6 which is PTRACE_EVENT_EXIT - https://elixir.bootlin.com/linux/v6.11.6/source/include/uapi/linux/ptrace.h#L161

(gdb) print ((rr::RecordTask*)(0x24936c50))->wait_status
$13 = {status = 394623}

If we look at some of its instance variables:

(gdb) print ((rr::RecordTask*)(0x24936c50))->seen_ptrace_exit_event_
$18 = true
(gdb) print ((rr::RecordTask*)(0x24936c50))->handled_ptrace_exit_event_
$19 = false
(gdb) print ((rr::RecordTask*)(0x24936c50))->was_reaped_
$20 = false
(gdb) print ((rr::RecordTask*)(0x24936c50))->in_unexpected_exit
$21 = false
(gdb) print (('rr::RecordTask'*)(0x24936c50))->emulated_stop_type
$3 = rr::NOT_STOPPED
(gdb) print (('rr::RecordTask'*)(0x24936c50))->is_stopped_
$4 = true
(gdb) print (('rr::RecordTask'*)(0x24936c50))->waiting_for_ptrace_exit
$5 = true

There are only two places that seen_ptrace_exit_event_ can be set:

  • rr/src/Task.cc

    Line 483 in f7067f1

    seen_ptrace_exit_event_ = true;
    - but if this line had been hit, then handled_ptrace_exit_event_ would also be true, which it's not
  • rr/src/Task.cc

    Line 2420 in f7067f1

    seen_ptrace_exit_event_ = true;
    - so it must have been set here (and hitting this line implies was_reaped_ is false too, which it is).

rr itself is blocked in this call to waitid(-1):

(gdb) bt
#0  0x00007fa2c585807d in __waitid (idtype=P_ALL, id=4294967295, infop=0x7ffd00c63eb0, options=1073741830) at ../sysdeps/unix/sysv/linux/waitid.c:29
#1  0x00000000005bafef in rr::WaitState::do_wait (this=this@entry=0x77c2c0 <_ZZN2rrL10wait_stateEvE12static_state.lto_priv.0>, tid=-1, consume=<optimized out>, type=type@entry=6, block_seconds=<optimized out>, status=...) at /root/rr/src/WaitManager.cc:58
#2  0x00000000005bb3b3 in rr::WaitState::wait (this=0x77c2c0 <_ZZN2rrL10wait_stateEvE12static_state.lto_priv.0>, options=..., type=6) at /root/rr/src/WaitManager.cc:114
#3  0x000000000055f597 in rr::wait_any (tid=@0x7ffd00c640b0: 1095914576, status=..., timeout=<optimized out>) at /root/rr/src/Scheduler.cc:569
#4  0x0000000000560665 in rr::Scheduler::reschedule (this=0x248931f0, switchable=<optimized out>) at /root/rr/src/Scheduler.cc:926
#5  0x00000000004dab3c in rr::RecordSession::record_step (this=0x24892fa0) at /root/rr/src/RecordSession.cc:2608
#6  0x00000000004ccd33 in rr::record (args=..., flags=...) at /root/rr/src/RecordCommand.cc:721
#7  rr::RecordCommand::run (this=<optimized out>, args=...) at /root/rr/src/RecordCommand.cc:884
#8  0x000000000041cf90 in main (argc=<optimized out>, argv=<optimized out>) at /root/rr/src/main.cc:278

This can't make any forward progress, because what needs to happen (I think) is that we need to continue thread 1681 so it leaves the ptrace-stop and makes it to a zombie state, which will generate another wait notification with the status and allow everything to be cleaned up.

I'm not entirely sure how we got into this state, but i suspect the fact that waiting_for_ptrace_exit == true is part of the story - that gets set when a core-dumping signal is sent to the process (

((RecordTask *)ot)->waiting_for_ptrace_exit = true;
).

I'll try and debug this a bit further this weekend but I wondered if this triggered any spidy-senses about what to look for here 🤔

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions