-
Notifications
You must be signed in to change notification settings - Fork 641
Description
I've got a very occasional issue where rr --wait is not properly reaping some child processes that have exited, and thus rr itself is stuck forever and not exiting.
The situation looks like this. There's an rr process, pid 1566, which has no traced children except for the defunct zombie process 1675.
[root@2afe6f2c30c3 /]# ps auxf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1759 0.0 0.0 4984 3760 ? S 06:27 0:00 -bash
root 1788 0.0 0.0 5796 3648 ? R+ 06:27 0:00 \_ ps auxf
minipc-+ 1 0.0 0.0 2416 1240 ? Ss Nov13 0:00 /usr/bin/dumb-init -- ../build-ruby.rb --test-tool --rr --asan --pernosco
minipc-+ 7 0.0 0.3 1889412 198044 ? Ssl Nov13 0:53 /usr/bin/ruby-mri ../build-ruby.rb --test-tool --rr --asan --pernosco
minipc-+ 1566 0.0 0.1 933848 126316 ? tl Nov13 0:08 \_ rr record --output-trace-dir test_output_dir/tool__test__testunit__test_parallel/rr_trace --asan --wait --disable-avx-512 --disable-cpuid-features 0x80050440,0x40140400 --disable-cpuid-features-ext 0xc405814,0xe73fa021,0x3eff8ef --disable-cpuid-
minipc-+ 1675 0.0 0.0 0 0 ? Zl Nov13 0:00 [ruby] <defunct>
Process 1675 has two threads. The thread-group-leader is in a zombie state, but the second thread is in a ptrace-stop:
[root@2afe6f2c30c3 /]# cat /proc/1675/status
Name: ruby
State: Z (zombie)
...
[root@2afe6f2c30c3 /]# cat /proc/1675/task/1681/status
Name: parallel.rb:40
Umask: 0022
State: t (tracing stop)
I attached gdb to rr itself, to investigate why it got stuck into this state. rr is tracking both of these two threads still:
(gdb) frame 5
#5 0x00000000004dab3c in rr::RecordSession::record_step (this=0x24892fa0) at /root/rr/src/RecordSession.cc:2608
warning: 2608 /root/rr/src/RecordSession.cc: No such file or directory
(gdb) print this->task_map
$1 = std::map with 2 elements = {[1675] = 0x24918580, [1681] = 0x24936c50}
The thread-group-leader cannot be reaped because rr needs to reap the other threads first (
Lines 2146 to 2151 in f7067f1
| for (auto it : thread_group()->task_set()) { | |
| if (&*it != this) { | |
| LOG(debug) << "Declining to reap " << tid << "; leader of non-empty thread-group with active thread " << it->tid; | |
| return false; | |
| } | |
| } |
(gdb) print ((rr::RecordTask*)(0x24918580))->waiting_for_reap
$7 = true
(gdb) print ((rr::RecordTask*)(0x24918580))->may_reap()
$8 = false
so the question we need to investigate is, why is the thread 1681 not yet reaped? We've seen it's PTRACE_EVENT_EXIT - that's the tracing-stop we're currently in. (394623 >> 16) & 0xFF == 6 which is PTRACE_EVENT_EXIT - https://elixir.bootlin.com/linux/v6.11.6/source/include/uapi/linux/ptrace.h#L161
(gdb) print ((rr::RecordTask*)(0x24936c50))->wait_status
$13 = {status = 394623}
If we look at some of its instance variables:
(gdb) print ((rr::RecordTask*)(0x24936c50))->seen_ptrace_exit_event_
$18 = true
(gdb) print ((rr::RecordTask*)(0x24936c50))->handled_ptrace_exit_event_
$19 = false
(gdb) print ((rr::RecordTask*)(0x24936c50))->was_reaped_
$20 = false
(gdb) print ((rr::RecordTask*)(0x24936c50))->in_unexpected_exit
$21 = false
(gdb) print (('rr::RecordTask'*)(0x24936c50))->emulated_stop_type
$3 = rr::NOT_STOPPED
(gdb) print (('rr::RecordTask'*)(0x24936c50))->is_stopped_
$4 = true
(gdb) print (('rr::RecordTask'*)(0x24936c50))->waiting_for_ptrace_exit
$5 = true
There are only two places that seen_ptrace_exit_event_ can be set:
- - but if this line had been hit, then
Line 483 in f7067f1
seen_ptrace_exit_event_ = true; handled_ptrace_exit_event_would also be true, which it's not - - so it must have been set here (and hitting this line implies
Line 2420 in f7067f1
seen_ptrace_exit_event_ = true; was_reaped_is false too, which it is).
rr itself is blocked in this call to waitid(-1):
(gdb) bt
#0 0x00007fa2c585807d in __waitid (idtype=P_ALL, id=4294967295, infop=0x7ffd00c63eb0, options=1073741830) at ../sysdeps/unix/sysv/linux/waitid.c:29
#1 0x00000000005bafef in rr::WaitState::do_wait (this=this@entry=0x77c2c0 <_ZZN2rrL10wait_stateEvE12static_state.lto_priv.0>, tid=-1, consume=<optimized out>, type=type@entry=6, block_seconds=<optimized out>, status=...) at /root/rr/src/WaitManager.cc:58
#2 0x00000000005bb3b3 in rr::WaitState::wait (this=0x77c2c0 <_ZZN2rrL10wait_stateEvE12static_state.lto_priv.0>, options=..., type=6) at /root/rr/src/WaitManager.cc:114
#3 0x000000000055f597 in rr::wait_any (tid=@0x7ffd00c640b0: 1095914576, status=..., timeout=<optimized out>) at /root/rr/src/Scheduler.cc:569
#4 0x0000000000560665 in rr::Scheduler::reschedule (this=0x248931f0, switchable=<optimized out>) at /root/rr/src/Scheduler.cc:926
#5 0x00000000004dab3c in rr::RecordSession::record_step (this=0x24892fa0) at /root/rr/src/RecordSession.cc:2608
#6 0x00000000004ccd33 in rr::record (args=..., flags=...) at /root/rr/src/RecordCommand.cc:721
#7 rr::RecordCommand::run (this=<optimized out>, args=...) at /root/rr/src/RecordCommand.cc:884
#8 0x000000000041cf90 in main (argc=<optimized out>, argv=<optimized out>) at /root/rr/src/main.cc:278
This can't make any forward progress, because what needs to happen (I think) is that we need to continue thread 1681 so it leaves the ptrace-stop and makes it to a zombie state, which will generate another wait notification with the status and allow everything to be cleaned up.
I'm not entirely sure how we got into this state, but i suspect the fact that waiting_for_ptrace_exit == true is part of the story - that gets set when a core-dumping signal is sent to the process (
Line 1884 in f7067f1
| ((RecordTask *)ot)->waiting_for_ptrace_exit = true; |
I'll try and debug this a bit further this weekend but I wondered if this triggered any spidy-senses about what to look for here 🤔