-
Notifications
You must be signed in to change notification settings - Fork 21.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix hanging test due to interlocking(?) #53215
base: main
Are you sure you want to change the base?
Conversation
c7c3aa7
to
7a6d555
Compare
I was under the impression that buildkite/ci was using a tty and therefor these tests were running. @yahonda can probably verify that. |
No you're right. I made a guess but I was wrong. For example on the CI build for this branch, I can see the test/engine/commands_test.rb test file has 6 runs, which matches.(here) Now the question is, why do those tests do not hang on CI? |
I'm unable to reproduce it locally:
I wonder if it has something to do with the
|
@zzak I just rebooted my laptop to be sure. Haven't got any Rails app running. (macOS Sequoia, Ruby 3.3.5) While the test is hanging, I can see ruby listening to the port 3000, which disappears as soon as I kill the hanging test: $ lsof -i -P | grep LISTEN | grep ruby
ruby 4656 sto 7u IPv4 0x3ba087057954dfb8 0t0 TCP localhost:3000 (LISTEN)
ruby 4656 sto 8u IPv6 0xc7ccd4d0b84fd561 0t0 TCP localhost:3000 (LISTEN)
ruby 4656 sto 9u IPv4 0x4593af10ba1eeabc 0t0 TCP localhost:3000 (LISTEN)
ruby 4656 sto 10u IPv4 0x405f2a704fdff54e 0t0 TCP connectivity-check.warp-svc:3000 (LISTEN)
ruby 4656 sto 11u IPv4 0xe41de747218cb552 0t0 TCP connectivity-check.warp-svc:3000 (LISTEN) Update: tried on a different Mac. Same. Will try on a Linux machine soon. Update again: took a while to set up and run, but yeah, this test passes on Ubuntu 24.04 with Ruby 3.3.5. Takes something like 13 seconds though. |
I'm on linux, but I remember there being some differences in the native threads they used. |
Yep, seems to be a macOS issue, since I managed to get it to work on Linux too. |
@davidstosik - if you add a signal handler that dumps the thread backtraces, it might help track down where the test is hanging. Something like: Signal.trap "SIGPROF" do
Thread.list.each do |thread|
puts thread.name;
puts thread.backtrace;
puts
end
end Then send a PROF signal to the test process when it is hanging. |
Thanks @djmb! Signal.trap "SIGPROF" do
Thread.list.each do |thread|
puts "🌊" * 10
puts thread.name
puts thread.backtrace
puts
end
end Ran the test until it hands, then executed
(Most threads don't seem to have a name., but at least I have traces.) Looking at the first thread's backtrace:
It does look like the test is stuck on |
I wonder if just adding a sleep after spawning the process also solves it for you, like the process may have output "Listening on..." but it's not accepting connections yet causing the request to hang. |
@zzak I tried adding I'm still not pretending I understand whatever's happening with I'm not suggesting we should use it, but it might help understand where the problem is? diff --git i/railties/test/console_helpers.rb w/railties/test/console_helpers.rb
index 10d3a54602..b871051aad 100644
--- i/railties/test/console_helpers.rb
+++ w/railties/test/console_helpers.rb
@@ -12,7 +12,7 @@ def assert_output(expected, io, timeout = 10)
output = +""
until output.include?(expected) || Time.now > timeout
if IO.select([io], [], [], 0.1)
- output << io.read(1)
+ output << (io.read(1) || "")
end
end
diff --git i/railties/test/engine/commands_test.rb w/railties/test/engine/commands_test.rb
index 3e888ce0d8..1289759de3 100644
--- i/railties/test/engine/commands_test.rb
+++ w/railties/test/engine/commands_test.rb
@@ -60,9 +60,23 @@ def test_server_command_work_inside_engine
end
def test_server_command_broadcast_logs
primary, replica = PTY.open
- pid = spawn_command("server", replica, env: { "RAILS_ENV" => "development" })
- assert_output("Listening on", primary)
+ out_io = File.open("test_out", "w")
+ pid = spawn_command(
+ "server", replica, env: { "RAILS_ENV" => "development" },
+ out_io: out_io
+ )
+
+ assert_output("Listening on", File.open("test_out", "r"))
Net::HTTP.new("127.0.0.1", 3000).tap do |net|
net.get("/")
@@ -73,7 +87,7 @@ def test_server_command_broadcast_logs
assert_match("Processing by Rails::WelcomeController", logs)
end
- assert_output("Processing by Rails::WelcomeController", primary)
+ assert_output("Processing by Rails::WelcomeController", File.open("test_out", "r"))
ensure
kill(pid)
end
@@ -84,9 +98,10 @@ def plugin_path
"#{@destination_root}/bukkits"
end
- def spawn_command(command, fd, env: {})
+ def spawn_command(command, fd, env: {}, out_io: nil)
+ out_io ||= fd
in_plugin_context(plugin_path) do
- Process.spawn(env, "bin/rails #{command}", in: fd, out: fd, err: fd)
+ Process.spawn(env, "bin/rails #{command}", in: fd, out: out_io, err: fd)
end
end As you can see, if I spawn the process to write into a file instead of the |
I wrote a lot more than what's below, but now I've identified the issue more clearly, what's next should suffice. If you want to see the process that got me here, you can expand this.Here's another experiment: on diff --git i/railties/test/engine/commands_test.rb w/railties/test/engine/commands_test.rb
index 3e888ce0d8..a740054139 100644
--- i/railties/test/engine/commands_test.rb
+++ w/railties/test/engine/commands_test.rb
@@ -60,9 +60,18 @@ def test_server_command_work_inside_engine
end
def test_server_command_broadcast_logs
+ in_plugin_context(plugin_path) do
+ puts
+ puts `pwd`.strip
+ puts "Press Enter to continue"
+ $stdin.gets
+ end
+
primary, replica = PTY.open
pid = spawn_command("server", replica, env: { "RAILS_ENV" => "development" })
+
assert_output("Listening on", primary)
+ puts "Press Enter to continue"
Net::HTTP.new("127.0.0.1", 3000).tap do |net|
net.get("/") It outputs the path to the dummy app used in the test, then pauses the test in two places:
This is how it goes:
I don't know what it is, but it appears that the Rails server receives the request, renders its response, but is unable to complete? If, instead of resuming the test execution after the second pause, I browse to Important TL;DR: macOS's pseudo terminal internal buffer size (that's a mouthful) is only 1kB, and that can't even accommodate the output generated by rails server booting and serving a single request. Here's one last experiment. diff --git c/railties/test/engine/commands_test.rb i/railties/test/engine/commands_test.rb
index 3e888ce0d8..1781650704 100644
--- c/railties/test/engine/commands_test.rb
+++ i/railties/test/engine/commands_test.rb
@@ -64,6 +64,8 @@ def test_server_command_broadcast_logs
pid = spawn_command("server", replica, env: { "RAILS_ENV" => "development" })
assert_output("Listening on", primary)
+ primary.read(196)
+
Net::HTTP.new("127.0.0.1", 3000).tap do |net|
net.get("/")
end This is another one of those things that are not a proper fix, but help understanding what's happening. I empties primary buffer from all the stuff Rails outputs at startup, and apparently, that's all Rails needs to be able to continue serving the request and completing the response, making the test pass successfully. Understanding a bit more about the issue, I wrote this simple Ruby script: require "pty"
(500..).each do |size|
size = size*10
puts "Spawning a process that outputs #{size} characters."
primary, replica = PTY.open
pid = Process.spawn("echo -n #{'a' * size}", in: replica, out: replica, err: replica)
Process.wait(pid)
sleep 0.1
end Running this script will keep spawning an So PTY has a 1kB buffer and hangs when that buffer is full? Is this a macOS limitation? What would be that number on Linux? Can we change it? I found this superuser post: Ptys (Pseudo terminals) internal buffer size, which as script that confirms my assumptions:
Going back to our test, here's what the Rails server outputs:
Calling So what can we do?
In my situation however, the size of the Rails server output is so big (it depends on the paths where the partials are), that even not including I tried cloning the Rails repository in a shorter path, to see if my long paths were to blame: gh repo clone rails/rails /tmp/rails
cd /tmp/rails/railties
bin/test -v test/engine/commands_test.rb:62 Unfortunately, the output is still too long without a fix, and the test hangs. It looks like all the fix ideas I could come up with above are limited, and not guaranteed to continue to work in the future:
In conclusion, I think the fix I'm suggesting in this PR is reasonable: run the HTTP request in a separate thread so we can keep consuming the rails server process' output as it is produced. |
@davidstosik Thanks for your detailed investigation. I was able to reproduce the issue using a macos runner on github actions, and indeed your fix to use a thread to make the http request works well: Here is my config for reference:name: Debug railtie test on macos
on:
push:
workflow_dispatch:
jobs:
test:
runs-on: macos-latest
steps:
- name: checkout
uses: actions/checkout@v3
- name: setup-ruby
uses: ruby/setup-ruby@v1
with:
ruby-version: 3.3
bundler-cache: true
env:
BUNDLE_WITHOUT: "db:job"
- name: run test
env:
BUNDLE_WITHOUT: "db:job"
run: |
cd railties
bin/test test/engine/commands_test.rb:62 It indeed seems like this issue is due to pty, and there are few tests which rely on it, none of them are making a consecutive http request. I was thinking this might affect other tests, or that we should try to make it re-useable in case this comes up again. But after running the majority of the tests on that runner, and checking |
We can generalize if/when we find other hanging tests later. 👍🏻 |
Net::HTTP.new("127.0.0.1", 3000).tap do |net| | ||
net.get("/") | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh by the way, why was this not simply?
Net::HTTP.get("127.0.0.1", "/", 3000)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure but @Edouard-chin added it in #49417, so maybe he has more context on this particular test. I'm surprised he did't run into this, but maybe was not using a mac at the time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the great investigation, TIL about pty and the limitation on various platform ! Sorry I introduced the bug 😭 , I didn't realise as I didn't encounter it, strangely, even now on a mac.
Oh by the way, why was this not simply? [...]
No particular reason, feel free to change it!
For a suggestion: while your fix is great, I realise now that I don't think this test needs PTY (and the rest of this test suit neither I believe), it just needs a pair of connected IO. I tried a few dozen times, and it works correctly, but please try on your machine
diff --git a/railties/test/engine/commands_test.rb b/railties/test/engine/commands_test.rb
index 3e888ce0d8..97c7a6d1f0 100644
--- a/railties/test/engine/commands_test.rb
+++ b/railties/test/engine/commands_test.rb
@@ -60,9 +60,10 @@ def test_server_command_work_inside_engine
end
def test_server_command_broadcast_logs
- primary, replica = PTY.open
- pid = spawn_command("server", replica, env: { "RAILS_ENV" => "development" })
- assert_output("Listening on", primary)
+ read, write = IO.pipe
+ pid = spawn_command("server", write, env: { "RAILS_ENV" => "development" })
+
+ assert_output("Listening on", read)
Net::HTTP.new("127.0.0.1", 3000).tap do |net|
net.get("/")
@@ -73,7 +74,7 @@ def test_server_command_broadcast_logs
assert_match("Processing by Rails::WelcomeController", logs)
end
- assert_output("Processing by Rails::WelcomeController", primary)
+ assert_output("Processing by Rails::WelcomeController", read)
ensure
kill(pid)
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Edouard-chin Thanks for your comment!
I didn't encounter it, strangely, even now on a mac.
That is indeed strange! Have you tried running this ptsbufsize.py
script locally?
Here's a Ruby version too if that's easier to run.
#!/usr/bin/env ruby
require "pty"
require "timeout"
primary, replica = PTY.open
puts "Using devices #{primary.path} ; #{replica.path} as PTY."
begin
bytes_written = 0
loop do
Timeout.timeout(1) do
replica.write("a")
end
bytes_written += 1
end
rescue Timeout::Error
puts "Timed out after writing #{bytes_written} bytes."
end
On my machine it stops and outputs this:
Using devices masterpty:/dev/ttys010 ; /dev/ttys010 as PTY.
Timed out after writing 1024 bytes.
I tested in Terminal.app, iTerm2 and Alacritty, in and out of tmux, and always got the same result.
(Supposedly, those don't make a difference anyway, since PTY
would be using macOS' TTY/PTY devices...)
I'll try out your suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Edouard-chin Thanks again for the suggestion!
I applied the changes you suggested and it looks like the test still works. That also means we don't need to check available_pty?
for this specific test.
After looking at all tests in the file, it looks like only the dbconsole
one actually needs to run in a pseudo-terminal.
All the others seem to be doing just fine with IO.pipe
, so I changed them to do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the script, I can reproduce with it 100% of the time with it.
I also wrote a script to run the test in a loop and got it to hang after the test ran 4 times. I suppose there is a race condition which makes the test succeed if the buffer reads the "Listening on" (and thus add available space to the buffer) before the get request is hit.
7a6d555
to
086cb0b
Compare
086cb0b
to
2da5499
Compare
primary.puts "quit" | ||
end | ||
def test_console_command_work_inside_engine | ||
read, write = IO.pipe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we want to keep using PTY
for this test too, since it's opening an interactive Rails console?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it's probably more agnostic to IRB's behaviour (and ultimately a closer representation of user experience) to keep it here... though honestly pretty unlikely to ever come up as a concern either way.
Great investigation on this! I do think we should still go with a thread wrapping the |
2da5499
to
c3a49a6
Compare
@matthewd Thank you for the review! Happy to restore the earlier threaded approach if that still feels better. |
Update: I removed the use of `PTY` for the tests that don't need it. The original issue was caused by the buffer size of a pseudo terminal being too small on macOS (1kB). Note that the `dbconsole` test still use `PTY`, as it appears SQLite requires a proper terminal to run correctly. Other tests seem to fare well with `IO.pipe` IOs, reducing complexity. --- Alright, I'm not going to pretend that I understand exactly what's happening, but the fact is that this seems to fix something, so I'll try to explain what I understand. ## The problem Running the following test hangs forever: ```sh-session $ bin/test test/engine/commands_test.rb:62 Run options: -v --seed 4349 # Running: Rails::Engine::CommandsTest#test_server_command_broadcast_logs = 💥 Hangs here ``` Somehow I seem to have reproduced it with the command that would run on CI: ```sh-session $ TEST_DIR=engine TESTOPTS="-v -n test_server_command_broadcast_logs" bundle exec rake test --- test/engine/commands_test.rb /Users/sto/.rubies/ruby-3.3.5/bin/ruby -w -Itest -Ilib -I../activesupport/lib -I../actionpack/lib -I../actionview/lib -I../activemodel/lib test/engine/commands_test.rb Run options: -v -n test_server_command_broadcast_logs --seed 59220 # Running: Rails::Engine::CommandsTest#test_server_command_broadcast_logs = 💥 Hangs here ``` I'm gonna venture a guess and assume that maybe this test (and 3 others) do not run in CI because of this condition: https://github.com/rails/rails/blob/aacbb5c0f5bdd11f0dee78da03bd6859f0cabeba/railties/test/engine/commands_test.rb#L36 https://github.com/rails/rails/blob/aacbb5c0f5bdd11f0dee78da03bd6859f0cabeba/railties/test/console_helpers.rb#L23-L25 Is there a way we can check if these tests ever ran on CI? ## What seems to be happening I do not understand much of how `PTY` works, but what I have identified is that 1. The test hangs at `net.get("/")`. Using a bit of debugging, I confirmed that I was not able to `curl http://localhost:3000` after spawning `rails server`. 2. It appears the process is entering a lock where: - it will not continue processing and complete the request until `rails server`'s output has been "consumed" (`read`, `readlines`) out of the `IO` object returned by `PTY.new`. - we won't be reading more output out of `primary` since `net.get("/")` is blocked ## The workaround I have no idea whether this is an acceptable fix, but parallelizing, and allowing our test thread to try and read (blocking) the PTY's output, while a different thread makes the HTTP request (blocking too), seems to be fixing the problem. You can try running any of the two commands I shared above and confirm the test will succeed now. ## Questions - Is there a way we can check if these tests ever ran on CI? - Is my fix an acceptable approach? - Has this test ever worked in the past? (Locally/On CI) - Has anything changed (maybe in Puma), making Puma's output blocking, and preventing it to continue processing a request until all its output has been consumed?
c3a49a6
to
54775e8
Compare
Alright, I'm not going to pretend that I understand exactly what's happening, but the fact is that this seems to fix something, so I'll try to explain what I understand.
The problem
Running the following test hangs forever:
Somehow I seem to have reproduced it with the command that would run on
CI:
I'm gonna venture a guess and assume that maybe this test (and 3 others) do not run in CI because of this condition:
rails/railties/test/engine/commands_test.rb
Line 36 in aacbb5c
rails/railties/test/console_helpers.rb
Lines 23 to 25 in aacbb5c
Is there a way we can check if these tests ever ran on CI?
What seems to be happening
I do not understand much of how
PTY
works, but what I have identified is that:net.get("/")
. Using a bit of debugging, I confirmed that I was not able tocurl http://localhost:3000
after spawningrails server
.rails server
's output has been "consumed" (read
,readlines
) out of theIO
object returned byPTY.new
.primary
sincenet.get("/")
is blockedThe workaround
I have no idea whether this is an acceptable fix, but parallelizing, and allowing our test thread to try and read (blocking) the PTY's output, while a different thread makes the HTTP request (blocking too), seems to be fixing the problem.
You can try running any of the two commands I shared above and confirm the test will succeed now.
Questions
Motivation / Background
This Pull Request has been created because [REPLACE ME]
Checklist
Before submitting the PR make sure the following are checked:
[Fix #issue-number]