Fix hanging test due to interlocking(?) #53215

davidstosik · 2024-10-07T08:55:46Z

Alright, I'm not going to pretend that I understand exactly what's happening, but the fact is that this seems to fix something, so I'll try to explain what I understand.

The problem

Running the following test hangs forever:

$ bin/test test/engine/commands_test.rb:62
Run options: -v --seed 4349

# Running:

Rails::Engine::CommandsTest#test_server_command_broadcast_logs = 💥 Hangs here

Somehow I seem to have reproduced it with the command that would run on
CI:

$ TEST_DIR=engine TESTOPTS="-v -n test_server_command_broadcast_logs"  bundle exec rake test

--- test/engine/commands_test.rb
/Users/sto/.rubies/ruby-3.3.5/bin/ruby -w -Itest -Ilib -I../activesupport/lib -I../actionpack/lib -I../actionview/lib -I../activemodel/lib test/engine/commands_test.rb
Run options: -v -n test_server_command_broadcast_logs --seed 59220

# Running:

Rails::Engine::CommandsTest#test_server_command_broadcast_logs = 💥 Hangs here

I'm gonna venture a guess and assume that maybe this test (and 3 others) do not run in CI because of this condition:

rails/railties/test/engine/commands_test.rb

Line 36 in aacbb5c

if available_pty?

rails/railties/test/console_helpers.rb

Lines 23 to 25 in aacbb5c

    
           def available_pty? 
        
             defined?(PTY) && PTY.respond_to?(:open) 
        
           end

Is there a way we can check if these tests ever ran on CI?

What seems to be happening

I do not understand much of how PTY works, but what I have identified is that:

The test hangs at net.get("/"). Using a bit of debugging, I confirmed that I was not able to curl http://localhost:3000 after spawning rails server.
It appears the process is entering a lock where:
- it will not continue processing and complete the request until rails server's output has been "consumed" (read, readlines) out of the IO object returned by PTY.new.
- we won't be reading more output out of primary since net.get("/") is blocked

The workaround

I have no idea whether this is an acceptable fix, but parallelizing, and allowing our test thread to try and read (blocking) the PTY's output, while a different thread makes the HTTP request (blocking too), seems to be fixing the problem.

You can try running any of the two commands I shared above and confirm the test will succeed now.

Questions

Is there a way we can check if these tests ever ran on CI?
Is my fix an acceptable approach?
Has this test ever worked in the past? (Locally/On CI)
Has anything changed (maybe in Puma), making Puma's output blocking, and preventing it to continue processing a request until all its output has been consumed?

Motivation / Background

This Pull Request has been created because [REPLACE ME]

Checklist

Before submitting the PR make sure the following are checked:

This Pull Request is related to one change. Unrelated changes should be opened in separate PRs.
Commit message has a detailed description of what changed and why. If this PR fixes a related issue include it in the commit message. Ex: [Fix #issue-number]
Tests are added or updated if you fix a bug or add a feature.
CHANGELOG files are updated for the changed libraries if there is a behavior change or additional feature. Minor bug fixes and documentation changes should not be included.

zzak · 2024-10-19T03:57:21Z

I was under the impression that buildkite/ci was using a tty and therefor these tests were running. @yahonda can probably verify that.

davidstosik · 2024-10-19T13:08:49Z

I was under the impression that buildkite/ci was using a tty and therefor these tests were running. @yahonda can probably verify that.

No you're right. I made a guess but I was wrong. For example on the CI build for this branch, I can see the test/engine/commands_test.rb test file has 6 runs, which matches.(here)

Now the question is, why do those tests do not hang on CI?
@zzak I'm sure you've you tried running them locally before, do they still run on your machine?

zzak · 2024-10-19T23:16:20Z

I'm unable to reproduce it locally:

bin/test -v test/engine/commands_test.rb:62
Run options: -v --seed 6467

# Running:

Rails::Engine::CommandsTest#test_server_command_broadcast_logs = 2.14 s = .

Finished in 2.145408s, 0.4661 runs/s, 2.7967 assertions/s.
1 runs, 6 assertions, 0 failures, 0 errors, 0 skips

I wonder if it has something to do with the spawn_command in that file being a unique case, compared to the other tests. That is just a hunch though.

rails/railties/test/isolation/abstract_unit.rb

Line 345 in b8720e7

def rails(*args, allow_failure: false, stderr: false)

davidstosik · 2024-10-21T07:16:37Z

@zzak I just rebooted my laptop to be sure. Haven't got any Rails app running.
The test still hangs.

(macOS Sequoia, Ruby 3.3.5)

While the test is hanging, I can see ruby listening to the port 3000, which disappears as soon as I kill the hanging test:

$  lsof -i -P | grep LISTEN | grep ruby
ruby      4656  sto    7u  IPv4 0x3ba087057954dfb8      0t0  TCP localhost:3000 (LISTEN)
ruby      4656  sto    8u  IPv6 0xc7ccd4d0b84fd561      0t0  TCP localhost:3000 (LISTEN)
ruby      4656  sto    9u  IPv4 0x4593af10ba1eeabc      0t0  TCP localhost:3000 (LISTEN)
ruby      4656  sto   10u  IPv4 0x405f2a704fdff54e      0t0  TCP connectivity-check.warp-svc:3000 (LISTEN)
ruby      4656  sto   11u  IPv4 0xe41de747218cb552      0t0  TCP connectivity-check.warp-svc:3000 (LISTEN)

Update: tried on a different Mac. Same. Will try on a Linux machine soon.

Update again: took a while to set up and run, but yeah, this test passes on Ubuntu 24.04 with Ruby 3.3.5. Takes something like 13 seconds though.

zzak · 2024-10-21T11:29:21Z

I'm on linux, but I remember there being some differences in the native threads they used.
Worst case, we can try to spike on a macos runner on GH actions to get it to repro.

davidstosik · 2024-10-21T12:13:50Z

Yep, seems to be a macOS issue, since I managed to get it to work on Linux too.

djmb · 2024-10-22T10:05:12Z

@davidstosik - if you add a signal handler that dumps the thread backtraces, it might help track down where the test is hanging.

Something like:

Signal.trap "SIGPROF" do
  Thread.list.each do |thread| 
    puts thread.name; 
    puts thread.backtrace; 
    puts
  end
end

Then send a PROF signal to the test process when it is hanging.

davidstosik · 2024-10-29T01:01:17Z

Thanks @djmb!
I added this to the beginning of the hanging test:

      Signal.trap "SIGPROF" do
        Thread.list.each do |thread| 
          puts "🌊" * 10
          puts thread.name
          puts thread.backtrace
          puts
        end
      end

Ran the test until it hands, then executed kill -s SIGPROF {the hanged process pid}. Results are:


Rails::Engine::CommandsTest#test_server_command_broadcast_logs = 🌊🌊🌊🌊🌊🌊🌊🌊🌊🌊

/Users/sto/src/github.com/rails/rails/railties/test/engine/commands_test.rb:67:in `backtrace'
/Users/sto/src/github.com/rails/rails/railties/test/engine/commands_test.rb:67:in `block (2 levels) in test_server_command_broadcast_logs'
/Users/sto/src/github.com/rails/rails/railties/test/engine/commands_test.rb:64:in `each'
/Users/sto/src/github.com/rails/rails/railties/test/engine/commands_test.rb:64:in `block in test_server_command_broadcast_logs'
/Users/sto/.gem/ruby/3.3.1/gems/net-protocol-0.2.2/lib/net/protocol.rb:229:in `wait_readable'
/Users/sto/.gem/ruby/3.3.1/gems/net-protocol-0.2.2/lib/net/protocol.rb:229:in `rbuf_fill'
/Users/sto/.gem/ruby/3.3.1/gems/net-protocol-0.2.2/lib/net/protocol.rb:199:in `readuntil'
/Users/sto/.gem/ruby/3.3.1/gems/net-protocol-0.2.2/lib/net/protocol.rb:209:in `readline'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http/response.rb:158:in `read_status_line'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http/response.rb:147:in `read_new'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:2342:in `block in transport_request'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:2333:in `catch'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:2333:in `transport_request'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:2306:in `request'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:2299:in `block in request'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:1570:in `start'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:2297:in `request'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:1917:in `get'
/Users/sto/src/github.com/rails/rails/railties/test/engine/commands_test.rb:77:in `block in test_server_command_broadcast_logs'
<internal:kernel>:90:in `tap'
/Users/sto/src/github.com/rails/rails/railties/test/engine/commands_test.rb:76:in `test_server_command_broadcast_logs'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest/test.rb:94:in `block (2 levels) in run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest/test.rb:190:in `capture_exceptions'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest/test.rb:89:in `block in run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:367:in `time_it'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest/test.rb:88:in `run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:1207:in `run_one_method'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:446:in `run_one_method'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:433:in `block (2 levels) in run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:429:in `each'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:429:in `block in run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:471:in `on_signal'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:458:in `with_info_handler'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:428:in `run'
/Users/sto/src/github.com/rails/rails/railties/lib/rails/test_unit/line_filtering.rb:10:in `run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:331:in `block in __run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:331:in `map'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:331:in `__run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:287:in `run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:85:in `block in autorun'

🌊🌊🌊🌊🌊🌊🌊🌊🌊🌊

<internal:thread_sync>:18:in `pop'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest/parallel.rb:30:in `block (2 levels) in start'

# (abridging: the backtrace above appears 11 times) 

🌊🌊🌊🌊🌊🌊🌊🌊🌊🌊
Timeout stdlib thread
/Users/sto/.gem/ruby/3.3.1/gems/timeout-0.4.1/lib/timeout.rb:110:in `sleep'
/Users/sto/.gem/ruby/3.3.1/gems/timeout-0.4.1/lib/timeout.rb:110:in `wait'
/Users/sto/.gem/ruby/3.3.1/gems/timeout-0.4.1/lib/timeout.rb:110:in `block (2 levels) in create_timeout_thread'
/Users/sto/.gem/ruby/3.3.1/gems/timeout-0.4.1/lib/timeout.rb:108:in `synchronize'
/Users/sto/.gem/ruby/3.3.1/gems/timeout-0.4.1/lib/timeout.rb:108:in `block in create_timeout_thread'

(Most threads don't seem to have a name., but at least I have traces.)

Looking at the first thread's backtrace:

...
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http/response.rb:158:in `read_status_line'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http/response.rb:147:in `read_new'
...
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:1570:in `start'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:2297:in `request'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:1917:in `get'
/Users/sto/src/github.com/rails/rails/railties/test/engine/commands_test.rb:77:in `block in test_server_command_broadcast_logs'
...

It does look like the test is stuck on net.get("/").

zzak · 2024-10-29T02:30:48Z

I wonder if just adding a sleep after spawning the process also solves it for you, like the process may have output "Listening on..." but it's not accepting connections yet causing the request to hang.

davidstosik · 2024-10-30T13:10:24Z

@zzak I tried adding sleep 5 after assert_output("Listening on", primary), before Net::HTTP.new, but the test is still hanging.

I'm still not pretending I understand whatever's happening with PTY, but for the record, the change below would work as well.

I'm not suggesting we should use it, but it might help understand where the problem is?

diff --git i/railties/test/console_helpers.rb w/railties/test/console_helpers.rb
index 10d3a54602..b871051aad 100644
--- i/railties/test/console_helpers.rb
+++ w/railties/test/console_helpers.rb
@@ -12,7 +12,7 @@ def assert_output(expected, io, timeout = 10)
     output = +""
     until output.include?(expected) || Time.now > timeout
       if IO.select([io], [], [], 0.1)
-        output << io.read(1)
+        output << (io.read(1) || "")
       end
     end
 
diff --git i/railties/test/engine/commands_test.rb w/railties/test/engine/commands_test.rb
index 3e888ce0d8..1289759de3 100644
--- i/railties/test/engine/commands_test.rb
+++ w/railties/test/engine/commands_test.rb
@@ -60,9 +60,23 @@ def test_server_command_work_inside_engine
     end
 
     def test_server_command_broadcast_logs
       primary, replica = PTY.open
-      pid = spawn_command("server", replica, env: { "RAILS_ENV" => "development" })
-      assert_output("Listening on", primary)
+      out_io = File.open("test_out", "w")
+      pid = spawn_command(
+        "server", replica, env: { "RAILS_ENV" => "development" },
+        out_io: out_io
+      )
+
+      assert_output("Listening on", File.open("test_out", "r"))
 
       Net::HTTP.new("127.0.0.1", 3000).tap do |net|
         net.get("/")
@@ -73,7 +87,7 @@ def test_server_command_broadcast_logs
         assert_match("Processing by Rails::WelcomeController", logs)
       end
 
-      assert_output("Processing by Rails::WelcomeController", primary)
+      assert_output("Processing by Rails::WelcomeController", File.open("test_out", "r"))
     ensure
       kill(pid)
     end
@@ -84,9 +98,10 @@ def plugin_path
       "#{@destination_root}/bukkits"
     end
 
-    def spawn_command(command, fd, env: {})
+    def spawn_command(command, fd, env: {}, out_io: nil)
+      out_io ||= fd
       in_plugin_context(plugin_path) do
-        Process.spawn(env, "bin/rails #{command}", in: fd, out: fd, err: fd)
+        Process.spawn(env, "bin/rails #{command}", in: fd, out: out_io, err: fd)
       end
     end

As you can see, if I spawn the process to write into a file instead of the replica file descriptor returned by PTY.open, and if I read to that file instead of reading from the primary file descriptor, then the test works. (Whereas, I checked, it still does not work for me on macOS.)

davidstosik · 2024-10-30T14:09:01Z

I wrote a lot more than what's below, but now I've identified the issue more clearly, what's next should suffice.

If you want to see the process that got me here, you can expand this.

Here's another experiment: on main, apply this diff:

diff --git i/railties/test/engine/commands_test.rb w/railties/test/engine/commands_test.rb
index 3e888ce0d8..a740054139 100644
--- i/railties/test/engine/commands_test.rb
+++ w/railties/test/engine/commands_test.rb
@@ -60,9 +60,18 @@ def test_server_command_work_inside_engine
     end
 
     def test_server_command_broadcast_logs
+      in_plugin_context(plugin_path) do
+        puts
+        puts `pwd`.strip
+        puts "Press Enter to continue"
+        $stdin.gets
+      end
+
       primary, replica = PTY.open
       pid = spawn_command("server", replica, env: { "RAILS_ENV" => "development" })
+
       assert_output("Listening on", primary)
+      puts "Press Enter to continue"
 
       Net::HTTP.new("127.0.0.1", 3000).tap do |net|
         net.get("/")

It outputs the path to the dummy app used in the test, then pauses the test in two places:

Before starting the Rails server.
Before issuing the HTTP request to /.

This is how it goes:

Run the test.
At the first pause, I can open a second terminal and run tail -F {dummy app path}/test/dummy/log/development.log to see what the Rails app outputs in real time.
The file starts empty.
Then I hit Enter in the test process, to proceed to the Rails application start, and stop before making the HTTP request. The log file stays empty.

Then I press Enter again, and the tail -F process watching the log file now outputs this:

Started GET "/" for 127.0.0.1 at 2024-10-30 22:55:38 +0900
   (0.8ms)  CREATE TABLE "schema_migrations" ("version" varchar NOT NULL PRIMARY KEY) /*application='Dummy'*/
   (0.1ms)  CREATE TABLE "ar_internal_metadata" ("key" varchar NOT NULL PRIMARY KEY, "value" varchar, "created_at" datetime(6) NOT NULL, "updated_at" datetime(6) NOT NULL) /*application='Dummy'*/
  ActiveRecord::SchemaMigration Load (0.0ms)  SELECT "schema_migrations"."version" FROM "schema_migrations" ORDER BY "schema_migrations"."version" ASC /*application='Dummy'*/
Processing by Rails::WelcomeController#index as */*
  Rendering /Users/sto/src/github.com/rails/rails/railties/lib/rails/templates/rails/welcome/index.html.erb
  Rendered /Users/sto/src/github.com/rails/rails/railties/lib/rails/templates/rails/welcome/index.html.erb (Duration: 3.6ms | GC: 3.2ms)

Then the test process hangs.
Finally, I hit Ctrl-C on the test process.
Once does nothing, so I do a second time.
The test process is killed.

At the same time, this line gets added at the bottom of the log file:

Completed 200 OK in 71931ms (Views: 71890.0ms | ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 3.2ms)

I don't know what it is, but it appears that the Rails server receives the request, renders its response, but is unable to complete?

If, instead of resuming the test execution after the second pause, I browse to http://localhost:3000, I get the same behaviour: Rails renders the page, but does not complete. On the browser side, spinner never stops. As soon as I kill the test, my browser shows the expected Rails Welcome page.

Important

TL;DR: macOS's pseudo terminal internal buffer size (that's a mouthful) is only 1kB, and that can't even accommodate the output generated by rails server booting and serving a single request.
In other common OSes, for example Debian-based Linux, it'll be way more than enough, like 19kB, 131kB...
Since we can't guarantee that buffer size will not decrease in the future, or that Rails server's output will not outgrow that size, I think the fix I originally suggested (use a thread so we can consume the output as it is produced) is justified.

Here's one last experiment.
Just add primary.read(196) before sending the HTTP request:

diff --git c/railties/test/engine/commands_test.rb i/railties/test/engine/commands_test.rb
index 3e888ce0d8..1781650704 100644
--- c/railties/test/engine/commands_test.rb
+++ i/railties/test/engine/commands_test.rb
@@ -64,6 +64,8 @@ def test_server_command_broadcast_logs
       pid = spawn_command("server", replica, env: { "RAILS_ENV" => "development" })
       assert_output("Listening on", primary)
 
+      primary.read(196)
+
       Net::HTTP.new("127.0.0.1", 3000).tap do |net|
         net.get("/")
       end

This is another one of those things that are not a proper fix, but help understanding what's happening.

I empties primary buffer from all the stuff Rails outputs at startup, and apparently, that's all Rails needs to be able to continue serving the request and completing the response, making the test pass successfully.

Understanding a bit more about the issue, I wrote this simple Ruby script:

require "pty"

(500..).each do |size|
  size = size*10
  puts "Spawning a process that outputs #{size} characters."
  primary, replica = PTY.open
  pid = Process.spawn("echo -n #{'a' * size}", in: replica, out: replica, err: replica)
  Process.wait(pid)
  sleep 0.1
end

Running this script will keep spawning an echo aaaa command with more as on a new PTY, until it hangs. It hangs at 1030 characters. I dialed it down and identified that spawning a process that echoes 1024 characters works, but spawning a process that echoes 1025 characters will hang!

So PTY has a 1kB buffer and hangs when that buffer is full? Is this a macOS limitation? What would be that number on Linux? Can we change it?

I found this superuser post: Ptys (Pseudo terminals) internal buffer size, which as script that confirms my assumptions:

On macOS, the pseudo terminal internal buffer size is 1kB (confirmed by running ptsbufsize.py).
On Debian, it varies depending on the version, but seems to be way higher (19kB, 131kB...)
Overall, I guess we can't really trust it to be big enough?

Going back to our test, here's what the Rails server outputs:

When it starts:

=> Booting Puma
=> Rails 8.1.0.alpha application starting in development
=> Run `bin/rails server --help` for more startup options
Puma starting in single mode...
* Puma version: 6.4.3 (ruby 3.3.1-p55) ("The Eagle of Durango")
*  Min threads: 3
*  Max threads: 3
*  Environment: development
*          PID: 76984
* Listening on http://127.0.0.1:3000
* Listening on http://[::1]:3000
* Listening on http://127.255.255.0:3000
* Listening on http://127.0.2.2:3000
* Listening on http://127.0.2.3:3000
Use Ctrl-C to stop

That's 533 characters.

When it receives the request to /:

Started GET "/" for 127.0.0.1 at 2024-10-30 23:58:29 +0900
   (0.9ms)  CREATE TABLE "schema_migrations" ("version" varchar NOT NULL PRIMARY KEY) /*application='Dummy'*/
   (0.1ms)  CREATE TABLE "ar_internal_metadata" ("key" varchar NOT NULL PRIMARY KEY, "value" varchar, "created_at" datetime(6) NOT NULL, "updated_at" datetime(6) NOT NULL) /*application='Dummy'*/
  ActiveRecord::SchemaMigration Load (0.0ms)  SELECT "schema_migrations"."version" FROM "schema_migrations" ORDER BY "schema_migrations"."version" ASC /*application='Dummy'*/
Processing by Rails::WelcomeController#index as */*
  Rendering /Users/sto/src/github.com/rails/rails/railties/lib/rails/templates/rails/welcome/index.html.erb
  Rendered /Users/sto/src/github.com/rails/rails/railties/lib/rails/templates/rails/welcome/index.html.erb (Duration: 0.5ms | GC: 0.1ms)
Completed 200 OK in 67ms (Views: 2.8ms | ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 3.7ms)

That's 1023 characters (including color escaping codes). It's a freaky coincidence that that number would be just below the 1kB limit!

Calling assert_output "eats" all characters in the buffer until it can find the string it looked for. So assert_output("Listening to") will eat 327 characters, which explains why I can fix the test with primary.read(196)(533-327 = 196).

So what can we do?

Add primary.read(196) to consume as much output as possible before we launch the HTTP request. I've confirmed above that it works.
Use an assert_output that will "eat" more of the early output. Ideally, we can assert_output("Use Ctrl-C to stop\r" to consume all the Rails server's boot output.

In my situation however, the size of the Rails server output is so big (it depends on the paths where the partials are), that even not including \r from the assert_output above would make the output bigger than 1024 characters long, hanging the process.

I tried cloning the Rails repository in a shorter path, to see if my long paths were to blame:

gh repo clone rails/rails /tmp/rails
cd /tmp/rails/railties
bin/test -v test/engine/commands_test.rb:62

Unfortunately, the output is still too long without a fix, and the test hangs.

It looks like all the fix ideas I could come up with above are limited, and not guaranteed to continue to work in the future:

what if the pseudo terminal buffer size is decreased again in the future?
what if Rails server outputs more stuff to the screen when serving a request?

In conclusion, I think the fix I'm suggesting in this PR is reasonable: run the HTTP request in a separate thread so we can keep consuming the rails server process' output as it is produced.

zzak · 2024-11-03T00:21:33Z

@davidstosik Thanks for your detailed investigation.

I was able to reproduce the issue using a macos runner on github actions, and indeed your fix to use a thread to make the http request works well:
https://github.com/zzak/rails/actions/runs/11639390507

Here is my config for reference:

name: Debug railtie test on macos

on:
  push:
  workflow_dispatch:

jobs:
  test:
    runs-on: macos-latest
    steps:
      - name: checkout
        uses: actions/checkout@v3

      - name: setup-ruby
        uses: ruby/setup-ruby@v1
        with:
          ruby-version: 3.3
          bundler-cache: true
        env:
          BUNDLE_WITHOUT: "db:job"

      - name: run test
        env:
          BUNDLE_WITHOUT: "db:job"
        run: |
          cd railties
          bin/test test/engine/commands_test.rb:62

It indeed seems like this issue is due to pty, and there are few tests which rely on it, none of them are making a consecutive http request. I was thinking this might affect other tests, or that we should try to make it re-useable in case this comes up again. But after running the majority of the tests on that runner, and checking server_test, console_test, and dbconsole_test, this seems like a rare case and this patch is probably safe to merge.

davidstosik · 2024-11-03T06:37:11Z

We can generalize if/when we find other hanging tests later. 👍🏻

davidstosik · 2024-11-03T06:38:38Z

railties/test/engine/commands_test.rb

+        Net::HTTP.new("127.0.0.1", 3000).tap do |net|
+          net.get("/")
+        end


Oh by the way, why was this not simply?

Net::HTTP.get("127.0.0.1", "/", 3000)

I'm not sure but @Edouard-chin added it in #49417, so maybe he has more context on this particular test. I'm surprised he did't run into this, but maybe was not using a mac at the time?

Thank you for the great investigation, TIL about pty and the limitation on various platform ! Sorry I introduced the bug 😭 , I didn't realise as I didn't encounter it, strangely, even now on a mac.

Oh by the way, why was this not simply? [...]

No particular reason, feel free to change it!

For a suggestion: while your fix is great, I realise now that I don't think this test needs PTY (and the rest of this test suit neither I believe), it just needs a pair of connected IO. I tried a few dozen times, and it works correctly, but please try on your machine

diff --git a/railties/test/engine/commands_test.rb b/railties/test/engine/commands_test.rb index 3e888ce0d8..97c7a6d1f0 100644 --- a/railties/test/engine/commands_test.rb +++ b/railties/test/engine/commands_test.rb @@ -60,9 +60,10 @@ def test_server_command_work_inside_engine end def test_server_command_broadcast_logs - primary, replica = PTY.open - pid = spawn_command("server", replica, env: { "RAILS_ENV" => "development" }) - assert_output("Listening on", primary) + read, write = IO.pipe + pid = spawn_command("server", write, env: { "RAILS_ENV" => "development" }) + + assert_output("Listening on", read) Net::HTTP.new("127.0.0.1", 3000).tap do |net| net.get("/") @@ -73,7 +74,7 @@ def test_server_command_broadcast_logs assert_match("Processing by Rails::WelcomeController", logs) end - assert_output("Processing by Rails::WelcomeController", primary) + assert_output("Processing by Rails::WelcomeController", read) ensure kill(pid) end

@Edouard-chin Thanks for your comment!

I didn't encounter it, strangely, even now on a mac.

That is indeed strange! Have you tried running this ptsbufsize.py script locally?

Here's a Ruby version too if that's easier to run.

#!/usr/bin/env ruby require "pty" require "timeout" primary, replica = PTY.open puts "Using devices #{primary.path} ; #{replica.path} as PTY." begin bytes_written = 0 loop do Timeout.timeout(1) do replica.write("a") end bytes_written += 1 end rescue Timeout::Error puts "Timed out after writing #{bytes_written} bytes." end

On my machine it stops and outputs this:

Using devices masterpty:/dev/ttys010 ; /dev/ttys010 as PTY. Timed out after writing 1024 bytes.

I tested in Terminal.app, iTerm2 and Alacritty, in and out of tmux, and always got the same result.
(Supposedly, those don't make a difference anyway, since PTY would be using macOS' TTY/PTY devices...)

I'll try out your suggestion.

@Edouard-chin Thanks again for the suggestion!

I applied the changes you suggested and it looks like the test still works. That also means we don't need to check available_pty? for this specific test.

After looking at all tests in the file, it looks like only the dbconsole one actually needs to run in a pseudo-terminal.
All the others seem to be doing just fine with IO.pipe, so I changed them to do so.

Thanks for the script, I can reproduce with it 100% of the time with it.
I also wrote a script to run the test in a loop and got it to hang after the test ran 4 times. I suppose there is a race condition which makes the test succeed if the buffer reads the "Listening on" (and thus add available space to the buffer) before the get request is hit.

davidstosik · 2024-11-04T05:10:54Z

railties/test/engine/commands_test.rb

-      primary.puts "quit"
-    end
+  def test_console_command_work_inside_engine
+    read, write = IO.pipe


Maybe we want to keep using PTY for this test too, since it's opening an interactive Rails console?

Yeah, I think it's probably more agnostic to IRB's behaviour (and ultimately a closer representation of user experience) to keep it here... though honestly pretty unlikely to ever come up as a concern either way.

matthewd · 2024-11-04T07:12:23Z

Great investigation on this!

I do think we should still go with a thread wrapping the get, even with the pipe: we're in less danger of platform variance etc, but I think your first instinct is still correct, in that it's "wrong" to perform a blocking operation and just hope that there's enough buffer space to handle whatever it produces before we start reading.

davidstosik · 2024-11-04T08:18:32Z

@matthewd Thank you for the review!
I checked IO.pipe, and indeed, it also has a buffer size limit, although way bigger than PTY: it's 64kB, which feels a lot more comfortable than PTY's 1kB!

Happy to restore the earlier threaded approach if that still feels better.

Update: I removed the use of `PTY` for the tests that don't need it. The original issue was caused by the buffer size of a pseudo terminal being too small on macOS (1kB). Note that the `dbconsole` test still use `PTY`, as it appears SQLite requires a proper terminal to run correctly. Other tests seem to fare well with `IO.pipe` IOs, reducing complexity. --- Alright, I'm not going to pretend that I understand exactly what's happening, but the fact is that this seems to fix something, so I'll try to explain what I understand. ## The problem Running the following test hangs forever: ```sh-session $ bin/test test/engine/commands_test.rb:62 Run options: -v --seed 4349 # Running: Rails::Engine::CommandsTest#test_server_command_broadcast_logs = 💥 Hangs here ``` Somehow I seem to have reproduced it with the command that would run on CI: ```sh-session $ TEST_DIR=engine TESTOPTS="-v -n test_server_command_broadcast_logs" bundle exec rake test --- test/engine/commands_test.rb /Users/sto/.rubies/ruby-3.3.5/bin/ruby -w -Itest -Ilib -I../activesupport/lib -I../actionpack/lib -I../actionview/lib -I../activemodel/lib test/engine/commands_test.rb Run options: -v -n test_server_command_broadcast_logs --seed 59220 # Running: Rails::Engine::CommandsTest#test_server_command_broadcast_logs = 💥 Hangs here ``` I'm gonna venture a guess and assume that maybe this test (and 3 others) do not run in CI because of this condition: https://github.com/rails/rails/blob/aacbb5c0f5bdd11f0dee78da03bd6859f0cabeba/railties/test/engine/commands_test.rb#L36 https://github.com/rails/rails/blob/aacbb5c0f5bdd11f0dee78da03bd6859f0cabeba/railties/test/console_helpers.rb#L23-L25 Is there a way we can check if these tests ever ran on CI? ## What seems to be happening I do not understand much of how `PTY` works, but what I have identified is that 1. The test hangs at `net.get("/")`. Using a bit of debugging, I confirmed that I was not able to `curl http://localhost:3000` after spawning `rails server`. 2. It appears the process is entering a lock where: - it will not continue processing and complete the request until `rails server`'s output has been "consumed" (`read`, `readlines`) out of the `IO` object returned by `PTY.new`. - we won't be reading more output out of `primary` since `net.get("/")` is blocked ## The workaround I have no idea whether this is an acceptable fix, but parallelizing, and allowing our test thread to try and read (blocking) the PTY's output, while a different thread makes the HTTP request (blocking too), seems to be fixing the problem. You can try running any of the two commands I shared above and confirm the test will succeed now. ## Questions - Is there a way we can check if these tests ever ran on CI? - Is my fix an acceptable approach? - Has this test ever worked in the past? (Locally/On CI) - Has anything changed (maybe in Puma), making Puma's output blocking, and preventing it to continue processing a request until all its output has been consumed?

rails-bot bot added the railties label Oct 7, 2024

davidstosik force-pushed the sto/fix-hanging-test branch 2 times, most recently from c7c3aa7 to 7a6d555 Compare October 8, 2024 00:20

zzak approved these changes Nov 3, 2024

View reviewed changes

davidstosik commented Nov 3, 2024

View reviewed changes

davidstosik force-pushed the sto/fix-hanging-test branch from 7a6d555 to 086cb0b Compare November 4, 2024 05:06

davidstosik requested review from zzak and Edouard-chin November 4, 2024 05:07

davidstosik force-pushed the sto/fix-hanging-test branch from 086cb0b to 2da5499 Compare November 4, 2024 05:10

davidstosik commented Nov 4, 2024

View reviewed changes

davidstosik force-pushed the sto/fix-hanging-test branch from 2da5499 to c3a49a6 Compare November 4, 2024 08:11

davidstosik force-pushed the sto/fix-hanging-test branch from c3a49a6 to 54775e8 Compare November 4, 2024 08:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix hanging test due to interlocking(?) #53215

Fix hanging test due to interlocking(?) #53215

davidstosik commented Oct 7, 2024 •

edited

Loading

zzak commented Oct 19, 2024

davidstosik commented Oct 19, 2024

zzak commented Oct 19, 2024

davidstosik commented Oct 21, 2024 •

edited

Loading

zzak commented Oct 21, 2024

davidstosik commented Oct 21, 2024

djmb commented Oct 22, 2024 •

edited

Loading

davidstosik commented Oct 29, 2024 •

edited

Loading

zzak commented Oct 29, 2024

davidstosik commented Oct 30, 2024 •

edited

Loading

davidstosik commented Oct 30, 2024 •

edited

Loading

zzak commented Nov 3, 2024

davidstosik commented Nov 3, 2024 •

edited

Loading

davidstosik Nov 3, 2024

zzak Nov 3, 2024

Edouard-chin Nov 3, 2024

davidstosik Nov 4, 2024 •

edited

Loading

davidstosik Nov 4, 2024

Edouard-chin Nov 4, 2024

davidstosik Nov 4, 2024

matthewd Nov 4, 2024

matthewd commented Nov 4, 2024

davidstosik commented Nov 4, 2024

	def available_pty?
	defined?(PTY) && PTY.respond_to?(:open)
	end

Fix hanging test due to interlocking(?) #53215

Are you sure you want to change the base?

Fix hanging test due to interlocking(?) #53215

Conversation

davidstosik commented Oct 7, 2024 • edited Loading

The problem

What seems to be happening

The workaround

Questions

Motivation / Background

Checklist

zzak commented Oct 19, 2024

davidstosik commented Oct 19, 2024

zzak commented Oct 19, 2024

davidstosik commented Oct 21, 2024 • edited Loading

zzak commented Oct 21, 2024

davidstosik commented Oct 21, 2024

djmb commented Oct 22, 2024 • edited Loading

davidstosik commented Oct 29, 2024 • edited Loading

zzak commented Oct 29, 2024

davidstosik commented Oct 30, 2024 • edited Loading

davidstosik commented Oct 30, 2024 • edited Loading

zzak commented Nov 3, 2024

davidstosik commented Nov 3, 2024 • edited Loading

davidstosik Nov 3, 2024

Choose a reason for hiding this comment

zzak Nov 3, 2024

Choose a reason for hiding this comment

Edouard-chin Nov 3, 2024

Choose a reason for hiding this comment

davidstosik Nov 4, 2024 • edited Loading

Choose a reason for hiding this comment

davidstosik Nov 4, 2024

Choose a reason for hiding this comment

Edouard-chin Nov 4, 2024

Choose a reason for hiding this comment

davidstosik Nov 4, 2024

Choose a reason for hiding this comment

matthewd Nov 4, 2024

Choose a reason for hiding this comment

matthewd commented Nov 4, 2024

davidstosik commented Nov 4, 2024

davidstosik commented Oct 7, 2024 •

edited

Loading

davidstosik commented Oct 21, 2024 •

edited

Loading

djmb commented Oct 22, 2024 •

edited

Loading

davidstosik commented Oct 29, 2024 •

edited

Loading

davidstosik commented Oct 30, 2024 •

edited

Loading

davidstosik commented Oct 30, 2024 •

edited

Loading

davidstosik commented Nov 3, 2024 •

edited

Loading

davidstosik Nov 4, 2024 •

edited

Loading