Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix hanging test due to interlocking(?) #53215

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

davidstosik
Copy link
Contributor

@davidstosik davidstosik commented Oct 7, 2024

Alright, I'm not going to pretend that I understand exactly what's happening, but the fact is that this seems to fix something, so I'll try to explain what I understand.

The problem

Running the following test hangs forever:

$ bin/test test/engine/commands_test.rb:62
Run options: -v --seed 4349

# Running:

Rails::Engine::CommandsTest#test_server_command_broadcast_logs = 💥 Hangs here

Somehow I seem to have reproduced it with the command that would run on
CI:

$ TEST_DIR=engine TESTOPTS="-v -n test_server_command_broadcast_logs"  bundle exec rake test

--- test/engine/commands_test.rb
/Users/sto/.rubies/ruby-3.3.5/bin/ruby -w -Itest -Ilib -I../activesupport/lib -I../actionpack/lib -I../actionview/lib -I../activemodel/lib test/engine/commands_test.rb
Run options: -v -n test_server_command_broadcast_logs --seed 59220

# Running:

Rails::Engine::CommandsTest#test_server_command_broadcast_logs = 💥 Hangs here

I'm gonna venture a guess and assume that maybe this test (and 3 others) do not run in CI because of this condition:

def available_pty?
defined?(PTY) && PTY.respond_to?(:open)
end

Is there a way we can check if these tests ever ran on CI?

What seems to be happening

I do not understand much of how PTY works, but what I have identified is that:

  1. The test hangs at net.get("/"). Using a bit of debugging, I confirmed that I was not able to curl http://localhost:3000 after spawning rails server.
  2. It appears the process is entering a lock where:
    • it will not continue processing and complete the request until rails server's output has been "consumed" (read, readlines) out of the IO object returned by PTY.new.
    • we won't be reading more output out of primary since net.get("/") is blocked

The workaround

I have no idea whether this is an acceptable fix, but parallelizing, and allowing our test thread to try and read (blocking) the PTY's output, while a different thread makes the HTTP request (blocking too), seems to be fixing the problem.

You can try running any of the two commands I shared above and confirm the test will succeed now.

Questions

  • Is there a way we can check if these tests ever ran on CI?
  • Is my fix an acceptable approach?
  • Has this test ever worked in the past? (Locally/On CI)
  • Has anything changed (maybe in Puma), making Puma's output blocking, and preventing it to continue processing a request until all its output has been consumed?

Motivation / Background

This Pull Request has been created because [REPLACE ME]

Checklist

Before submitting the PR make sure the following are checked:

  • This Pull Request is related to one change. Unrelated changes should be opened in separate PRs.
  • Commit message has a detailed description of what changed and why. If this PR fixes a related issue include it in the commit message. Ex: [Fix #issue-number]
  • Tests are added or updated if you fix a bug or add a feature.
  • CHANGELOG files are updated for the changed libraries if there is a behavior change or additional feature. Minor bug fixes and documentation changes should not be included.

@rails-bot rails-bot bot added the railties label Oct 7, 2024
@davidstosik davidstosik force-pushed the sto/fix-hanging-test branch 2 times, most recently from c7c3aa7 to 7a6d555 Compare October 8, 2024 00:20
@zzak
Copy link
Member

zzak commented Oct 19, 2024

I was under the impression that buildkite/ci was using a tty and therefor these tests were running. @yahonda can probably verify that.

@davidstosik
Copy link
Contributor Author

I was under the impression that buildkite/ci was using a tty and therefor these tests were running. @yahonda can probably verify that.

No you're right. I made a guess but I was wrong. For example on the CI build for this branch, I can see the test/engine/commands_test.rb test file has 6 runs, which matches.(here)

Now the question is, why do those tests do not hang on CI?
@zzak I'm sure you've you tried running them locally before, do they still run on your machine?

@zzak
Copy link
Member

zzak commented Oct 19, 2024

I'm unable to reproduce it locally:

bin/test -v test/engine/commands_test.rb:62
Run options: -v --seed 6467

# Running:

Rails::Engine::CommandsTest#test_server_command_broadcast_logs = 2.14 s = .

Finished in 2.145408s, 0.4661 runs/s, 2.7967 assertions/s.
1 runs, 6 assertions, 0 failures, 0 errors, 0 skips

I wonder if it has something to do with the spawn_command in that file being a unique case, compared to the other tests. That is just a hunch though.

def rails(*args, allow_failure: false, stderr: false)

@davidstosik
Copy link
Contributor Author

davidstosik commented Oct 21, 2024

@zzak I just rebooted my laptop to be sure. Haven't got any Rails app running.
The test still hangs.

(macOS Sequoia, Ruby 3.3.5)

While the test is hanging, I can see ruby listening to the port 3000, which disappears as soon as I kill the hanging test:

$  lsof -i -P | grep LISTEN | grep ruby
ruby      4656  sto    7u  IPv4 0x3ba087057954dfb8      0t0  TCP localhost:3000 (LISTEN)
ruby      4656  sto    8u  IPv6 0xc7ccd4d0b84fd561      0t0  TCP localhost:3000 (LISTEN)
ruby      4656  sto    9u  IPv4 0x4593af10ba1eeabc      0t0  TCP localhost:3000 (LISTEN)
ruby      4656  sto   10u  IPv4 0x405f2a704fdff54e      0t0  TCP connectivity-check.warp-svc:3000 (LISTEN)
ruby      4656  sto   11u  IPv4 0xe41de747218cb552      0t0  TCP connectivity-check.warp-svc:3000 (LISTEN)

Update: tried on a different Mac. Same. Will try on a Linux machine soon.

Update again: took a while to set up and run, but yeah, this test passes on Ubuntu 24.04 with Ruby 3.3.5. Takes something like 13 seconds though.

@zzak
Copy link
Member

zzak commented Oct 21, 2024

I'm on linux, but I remember there being some differences in the native threads they used.
Worst case, we can try to spike on a macos runner on GH actions to get it to repro.

@davidstosik
Copy link
Contributor Author

Yep, seems to be a macOS issue, since I managed to get it to work on Linux too.

@djmb
Copy link
Contributor

djmb commented Oct 22, 2024

@davidstosik - if you add a signal handler that dumps the thread backtraces, it might help track down where the test is hanging.

Something like:

Signal.trap "SIGPROF" do
  Thread.list.each do |thread| 
    puts thread.name; 
    puts thread.backtrace; 
    puts
  end
end

Then send a PROF signal to the test process when it is hanging.

@davidstosik
Copy link
Contributor Author

davidstosik commented Oct 29, 2024

Thanks @djmb!
I added this to the beginning of the hanging test:

      Signal.trap "SIGPROF" do
        Thread.list.each do |thread| 
          puts "🌊" * 10
          puts thread.name
          puts thread.backtrace
          puts
        end
      end

Ran the test until it hands, then executed kill -s SIGPROF {the hanged process pid}. Results are:


Rails::Engine::CommandsTest#test_server_command_broadcast_logs = 🌊🌊🌊🌊🌊🌊🌊🌊🌊🌊

/Users/sto/src/github.com/rails/rails/railties/test/engine/commands_test.rb:67:in `backtrace'
/Users/sto/src/github.com/rails/rails/railties/test/engine/commands_test.rb:67:in `block (2 levels) in test_server_command_broadcast_logs'
/Users/sto/src/github.com/rails/rails/railties/test/engine/commands_test.rb:64:in `each'
/Users/sto/src/github.com/rails/rails/railties/test/engine/commands_test.rb:64:in `block in test_server_command_broadcast_logs'
/Users/sto/.gem/ruby/3.3.1/gems/net-protocol-0.2.2/lib/net/protocol.rb:229:in `wait_readable'
/Users/sto/.gem/ruby/3.3.1/gems/net-protocol-0.2.2/lib/net/protocol.rb:229:in `rbuf_fill'
/Users/sto/.gem/ruby/3.3.1/gems/net-protocol-0.2.2/lib/net/protocol.rb:199:in `readuntil'
/Users/sto/.gem/ruby/3.3.1/gems/net-protocol-0.2.2/lib/net/protocol.rb:209:in `readline'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http/response.rb:158:in `read_status_line'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http/response.rb:147:in `read_new'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:2342:in `block in transport_request'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:2333:in `catch'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:2333:in `transport_request'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:2306:in `request'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:2299:in `block in request'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:1570:in `start'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:2297:in `request'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:1917:in `get'
/Users/sto/src/github.com/rails/rails/railties/test/engine/commands_test.rb:77:in `block in test_server_command_broadcast_logs'
<internal:kernel>:90:in `tap'
/Users/sto/src/github.com/rails/rails/railties/test/engine/commands_test.rb:76:in `test_server_command_broadcast_logs'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest/test.rb:94:in `block (2 levels) in run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest/test.rb:190:in `capture_exceptions'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest/test.rb:89:in `block in run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:367:in `time_it'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest/test.rb:88:in `run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:1207:in `run_one_method'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:446:in `run_one_method'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:433:in `block (2 levels) in run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:429:in `each'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:429:in `block in run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:471:in `on_signal'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:458:in `with_info_handler'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:428:in `run'
/Users/sto/src/github.com/rails/rails/railties/lib/rails/test_unit/line_filtering.rb:10:in `run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:331:in `block in __run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:331:in `map'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:331:in `__run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:287:in `run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:85:in `block in autorun'

🌊🌊🌊🌊🌊🌊🌊🌊🌊🌊

<internal:thread_sync>:18:in `pop'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest/parallel.rb:30:in `block (2 levels) in start'

# (abridging: the backtrace above appears 11 times) 

🌊🌊🌊🌊🌊🌊🌊🌊🌊🌊
Timeout stdlib thread
/Users/sto/.gem/ruby/3.3.1/gems/timeout-0.4.1/lib/timeout.rb:110:in `sleep'
/Users/sto/.gem/ruby/3.3.1/gems/timeout-0.4.1/lib/timeout.rb:110:in `wait'
/Users/sto/.gem/ruby/3.3.1/gems/timeout-0.4.1/lib/timeout.rb:110:in `block (2 levels) in create_timeout_thread'
/Users/sto/.gem/ruby/3.3.1/gems/timeout-0.4.1/lib/timeout.rb:108:in `synchronize'
/Users/sto/.gem/ruby/3.3.1/gems/timeout-0.4.1/lib/timeout.rb:108:in `block in create_timeout_thread'

(Most threads don't seem to have a name., but at least I have traces.)

Looking at the first thread's backtrace:

...
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http/response.rb:158:in `read_status_line'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http/response.rb:147:in `read_new'
...
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:1570:in `start'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:2297:in `request'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:1917:in `get'
/Users/sto/src/github.com/rails/rails/railties/test/engine/commands_test.rb:77:in `block in test_server_command_broadcast_logs'
...

It does look like the test is stuck on net.get("/").

@zzak
Copy link
Member

zzak commented Oct 29, 2024

I wonder if just adding a sleep after spawning the process also solves it for you, like the process may have output "Listening on..." but it's not accepting connections yet causing the request to hang.

@davidstosik
Copy link
Contributor Author

davidstosik commented Oct 30, 2024

@zzak I tried adding sleep 5 after assert_output("Listening on", primary), before Net::HTTP.new, but the test is still hanging.

I'm still not pretending I understand whatever's happening with PTY, but for the record, the change below would work as well.

I'm not suggesting we should use it, but it might help understand where the problem is?

diff --git i/railties/test/console_helpers.rb w/railties/test/console_helpers.rb
index 10d3a54602..b871051aad 100644
--- i/railties/test/console_helpers.rb
+++ w/railties/test/console_helpers.rb
@@ -12,7 +12,7 @@ def assert_output(expected, io, timeout = 10)
     output = +""
     until output.include?(expected) || Time.now > timeout
       if IO.select([io], [], [], 0.1)
-        output << io.read(1)
+        output << (io.read(1) || "")
       end
     end
 
diff --git i/railties/test/engine/commands_test.rb w/railties/test/engine/commands_test.rb
index 3e888ce0d8..1289759de3 100644
--- i/railties/test/engine/commands_test.rb
+++ w/railties/test/engine/commands_test.rb
@@ -60,9 +60,23 @@ def test_server_command_work_inside_engine
     end
 
     def test_server_command_broadcast_logs
       primary, replica = PTY.open
-      pid = spawn_command("server", replica, env: { "RAILS_ENV" => "development" })
-      assert_output("Listening on", primary)
+      out_io = File.open("test_out", "w")
+      pid = spawn_command(
+        "server", replica, env: { "RAILS_ENV" => "development" },
+        out_io: out_io
+      )
+
+      assert_output("Listening on", File.open("test_out", "r"))
 
       Net::HTTP.new("127.0.0.1", 3000).tap do |net|
         net.get("/")
@@ -73,7 +87,7 @@ def test_server_command_broadcast_logs
         assert_match("Processing by Rails::WelcomeController", logs)
       end
 
-      assert_output("Processing by Rails::WelcomeController", primary)
+      assert_output("Processing by Rails::WelcomeController", File.open("test_out", "r"))
     ensure
       kill(pid)
     end
@@ -84,9 +98,10 @@ def plugin_path
       "#{@destination_root}/bukkits"
     end
 
-    def spawn_command(command, fd, env: {})
+    def spawn_command(command, fd, env: {}, out_io: nil)
+      out_io ||= fd
       in_plugin_context(plugin_path) do
-        Process.spawn(env, "bin/rails #{command}", in: fd, out: fd, err: fd)
+        Process.spawn(env, "bin/rails #{command}", in: fd, out: out_io, err: fd)
       end
     end

As you can see, if I spawn the process to write into a file instead of the replica file descriptor returned by PTY.open, and if I read to that file instead of reading from the primary file descriptor, then the test works. (Whereas, I checked, it still does not work for me on macOS.)

@davidstosik
Copy link
Contributor Author

davidstosik commented Oct 30, 2024

I wrote a lot more than what's below, but now I've identified the issue more clearly, what's next should suffice.

If you want to see the process that got me here, you can expand this.

Here's another experiment: on main, apply this diff:

diff --git i/railties/test/engine/commands_test.rb w/railties/test/engine/commands_test.rb
index 3e888ce0d8..a740054139 100644
--- i/railties/test/engine/commands_test.rb
+++ w/railties/test/engine/commands_test.rb
@@ -60,9 +60,18 @@ def test_server_command_work_inside_engine
     end
 
     def test_server_command_broadcast_logs
+      in_plugin_context(plugin_path) do
+        puts
+        puts `pwd`.strip
+        puts "Press Enter to continue"
+        $stdin.gets
+      end
+
       primary, replica = PTY.open
       pid = spawn_command("server", replica, env: { "RAILS_ENV" => "development" })
+
       assert_output("Listening on", primary)
+      puts "Press Enter to continue"
 
       Net::HTTP.new("127.0.0.1", 3000).tap do |net|
         net.get("/")

It outputs the path to the dummy app used in the test, then pauses the test in two places:

  1. Before starting the Rails server.
  2. Before issuing the HTTP request to /.

This is how it goes:

  • Run the test.

  • At the first pause, I can open a second terminal and run tail -F {dummy app path}/test/dummy/log/development.log to see what the Rails app outputs in real time.

  • The file starts empty.

  • Then I hit Enter in the test process, to proceed to the Rails application start, and stop before making the HTTP request. The log file stays empty.

  • Then I press Enter again, and the tail -F process watching the log file now outputs this:

    Started GET "/" for 127.0.0.1 at 2024-10-30 22:55:38 +0900
       (0.8ms)  CREATE TABLE "schema_migrations" ("version" varchar NOT NULL PRIMARY KEY) /*application='Dummy'*/
       (0.1ms)  CREATE TABLE "ar_internal_metadata" ("key" varchar NOT NULL PRIMARY KEY, "value" varchar, "created_at" datetime(6) NOT NULL, "updated_at" datetime(6) NOT NULL) /*application='Dummy'*/
      ActiveRecord::SchemaMigration Load (0.0ms)  SELECT "schema_migrations"."version" FROM "schema_migrations" ORDER BY "schema_migrations"."version" ASC /*application='Dummy'*/
    Processing by Rails::WelcomeController#index as */*
      Rendering /Users/sto/src/github.com/rails/rails/railties/lib/rails/templates/rails/welcome/index.html.erb
      Rendered /Users/sto/src/github.com/rails/rails/railties/lib/rails/templates/rails/welcome/index.html.erb (Duration: 3.6ms | GC: 3.2ms)
    
  • Then the test process hangs.

  • Finally, I hit Ctrl-C on the test process.

  • Once does nothing, so I do a second time.

  • The test process is killed.

  • At the same time, this line gets added at the bottom of the log file:

    Completed 200 OK in 71931ms (Views: 71890.0ms | ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 3.2ms)
    

I don't know what it is, but it appears that the Rails server receives the request, renders its response, but is unable to complete?

If, instead of resuming the test execution after the second pause, I browse to http://localhost:3000, I get the same behaviour: Rails renders the page, but does not complete. On the browser side, spinner never stops. As soon as I kill the test, my browser shows the expected Rails Welcome page.

Important

TL;DR: macOS's pseudo terminal internal buffer size (that's a mouthful) is only 1kB, and that can't even accommodate the output generated by rails server booting and serving a single request.
In other common OSes, for example Debian-based Linux, it'll be way more than enough, like 19kB, 131kB...
Since we can't guarantee that buffer size will not decrease in the future, or that Rails server's output will not outgrow that size, I think the fix I originally suggested (use a thread so we can consume the output as it is produced) is justified.

Here's one last experiment.
Just add primary.read(196) before sending the HTTP request:

diff --git c/railties/test/engine/commands_test.rb i/railties/test/engine/commands_test.rb
index 3e888ce0d8..1781650704 100644
--- c/railties/test/engine/commands_test.rb
+++ i/railties/test/engine/commands_test.rb
@@ -64,6 +64,8 @@ def test_server_command_broadcast_logs
       pid = spawn_command("server", replica, env: { "RAILS_ENV" => "development" })
       assert_output("Listening on", primary)
 
+      primary.read(196)
+
       Net::HTTP.new("127.0.0.1", 3000).tap do |net|
         net.get("/")
       end

This is another one of those things that are not a proper fix, but help understanding what's happening.

I empties primary buffer from all the stuff Rails outputs at startup, and apparently, that's all Rails needs to be able to continue serving the request and completing the response, making the test pass successfully.

Understanding a bit more about the issue, I wrote this simple Ruby script:

require "pty"

(500..).each do |size|
  size = size*10
  puts "Spawning a process that outputs #{size} characters."
  primary, replica = PTY.open
  pid = Process.spawn("echo -n #{'a' * size}", in: replica, out: replica, err: replica)
  Process.wait(pid)
  sleep 0.1
end

Running this script will keep spawning an echo aaaa command with more as on a new PTY, until it hangs. It hangs at 1030 characters. I dialed it down and identified that spawning a process that echoes 1024 characters works, but spawning a process that echoes 1025 characters will hang!

So PTY has a 1kB buffer and hangs when that buffer is full? Is this a macOS limitation? What would be that number on Linux? Can we change it?

I found this superuser post: Ptys (Pseudo terminals) internal buffer size, which as script that confirms my assumptions:

  • On macOS, the pseudo terminal internal buffer size is 1kB (confirmed by running ptsbufsize.py).
  • On Debian, it varies depending on the version, but seems to be way higher (19kB, 131kB...)
  • Overall, I guess we can't really trust it to be big enough?

Going back to our test, here's what the Rails server outputs:

  1. When it starts:
    => Booting Puma
    => Rails 8.1.0.alpha application starting in development
    => Run `bin/rails server --help` for more startup options
    Puma starting in single mode...
    * Puma version: 6.4.3 (ruby 3.3.1-p55) ("The Eagle of Durango")
    *  Min threads: 3
    *  Max threads: 3
    *  Environment: development
    *          PID: 76984
    * Listening on http://127.0.0.1:3000
    * Listening on http://[::1]:3000
    * Listening on http://127.255.255.0:3000
    * Listening on http://127.0.2.2:3000
    * Listening on http://127.0.2.3:3000
    Use Ctrl-C to stop
    
    That's 533 characters.
  2. When it receives the request to /:
    Started GET "/" for 127.0.0.1 at 2024-10-30 23:58:29 +0900
       (0.9ms)  CREATE TABLE "schema_migrations" ("version" varchar NOT NULL PRIMARY KEY) /*application='Dummy'*/
       (0.1ms)  CREATE TABLE "ar_internal_metadata" ("key" varchar NOT NULL PRIMARY KEY, "value" varchar, "created_at" datetime(6) NOT NULL, "updated_at" datetime(6) NOT NULL) /*application='Dummy'*/
      ActiveRecord::SchemaMigration Load (0.0ms)  SELECT "schema_migrations"."version" FROM "schema_migrations" ORDER BY "schema_migrations"."version" ASC /*application='Dummy'*/
    Processing by Rails::WelcomeController#index as */*
      Rendering /Users/sto/src/github.com/rails/rails/railties/lib/rails/templates/rails/welcome/index.html.erb
      Rendered /Users/sto/src/github.com/rails/rails/railties/lib/rails/templates/rails/welcome/index.html.erb (Duration: 0.5ms | GC: 0.1ms)
    Completed 200 OK in 67ms (Views: 2.8ms | ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 3.7ms)
    
    That's 1023 characters (including color escaping codes). It's a freaky coincidence that that number would be just below the 1kB limit!

Calling assert_output "eats" all characters in the buffer until it can find the string it looked for. So assert_output("Listening to") will eat 327 characters, which explains why I can fix the test with primary.read(196)(533-327 = 196).

So what can we do?

  • Add primary.read(196) to consume as much output as possible before we launch the HTTP request. I've confirmed above that it works.
  • Use an assert_output that will "eat" more of the early output. Ideally, we can assert_output("Use Ctrl-C to stop\r" to consume all the Rails server's boot output.

In my situation however, the size of the Rails server output is so big (it depends on the paths where the partials are), that even not including \r from the assert_output above would make the output bigger than 1024 characters long, hanging the process.

I tried cloning the Rails repository in a shorter path, to see if my long paths were to blame:

gh repo clone rails/rails /tmp/rails
cd /tmp/rails/railties
bin/test -v test/engine/commands_test.rb:62

Unfortunately, the output is still too long without a fix, and the test hangs.

It looks like all the fix ideas I could come up with above are limited, and not guaranteed to continue to work in the future:

  • what if the pseudo terminal buffer size is decreased again in the future?
  • what if Rails server outputs more stuff to the screen when serving a request?

In conclusion, I think the fix I'm suggesting in this PR is reasonable: run the HTTP request in a separate thread so we can keep consuming the rails server process' output as it is produced.

@zzak
Copy link
Member

zzak commented Nov 3, 2024

@davidstosik Thanks for your detailed investigation.

I was able to reproduce the issue using a macos runner on github actions, and indeed your fix to use a thread to make the http request works well:
https://github.com/zzak/rails/actions/runs/11639390507

Here is my config for reference:
name: Debug railtie test on macos

on:
  push:
  workflow_dispatch:

jobs:
  test:
    runs-on: macos-latest
    steps:
      - name: checkout
        uses: actions/checkout@v3

      - name: setup-ruby
        uses: ruby/setup-ruby@v1
        with:
          ruby-version: 3.3
          bundler-cache: true
        env:
          BUNDLE_WITHOUT: "db:job"

      - name: run test
        env:
          BUNDLE_WITHOUT: "db:job"
        run: |
          cd railties
          bin/test test/engine/commands_test.rb:62

It indeed seems like this issue is due to pty, and there are few tests which rely on it, none of them are making a consecutive http request. I was thinking this might affect other tests, or that we should try to make it re-useable in case this comes up again. But after running the majority of the tests on that runner, and checking server_test, console_test, and dbconsole_test, this seems like a rare case and this patch is probably safe to merge.

@davidstosik
Copy link
Contributor Author

davidstosik commented Nov 3, 2024

We can generalize if/when we find other hanging tests later. 👍🏻

Comment on lines 68 to 70
Net::HTTP.new("127.0.0.1", 3000).tap do |net|
net.get("/")
end
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh by the way, why was this not simply?

Net::HTTP.get("127.0.0.1", "/", 3000)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure but @Edouard-chin added it in #49417, so maybe he has more context on this particular test. I'm surprised he did't run into this, but maybe was not using a mac at the time?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the great investigation, TIL about pty and the limitation on various platform ! Sorry I introduced the bug 😭 , I didn't realise as I didn't encounter it, strangely, even now on a mac.

Oh by the way, why was this not simply? [...]

No particular reason, feel free to change it!


For a suggestion: while your fix is great, I realise now that I don't think this test needs PTY (and the rest of this test suit neither I believe), it just needs a pair of connected IO. I tried a few dozen times, and it works correctly, but please try on your machine

diff --git a/railties/test/engine/commands_test.rb b/railties/test/engine/commands_test.rb
index 3e888ce0d8..97c7a6d1f0 100644
--- a/railties/test/engine/commands_test.rb
+++ b/railties/test/engine/commands_test.rb
@@ -60,9 +60,10 @@ def test_server_command_work_inside_engine
     end

     def test_server_command_broadcast_logs
-      primary, replica = PTY.open
-      pid = spawn_command("server", replica, env: { "RAILS_ENV" => "development" })
-      assert_output("Listening on", primary)
+      read, write = IO.pipe
+      pid = spawn_command("server", write, env: { "RAILS_ENV" => "development" })
+
+      assert_output("Listening on", read)

       Net::HTTP.new("127.0.0.1", 3000).tap do |net|
         net.get("/")
@@ -73,7 +74,7 @@ def test_server_command_broadcast_logs
         assert_match("Processing by Rails::WelcomeController", logs)
       end

-      assert_output("Processing by Rails::WelcomeController", primary)
+      assert_output("Processing by Rails::WelcomeController", read)
     ensure
       kill(pid)
     end

Copy link
Contributor Author

@davidstosik davidstosik Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Edouard-chin Thanks for your comment!

I didn't encounter it, strangely, even now on a mac.

That is indeed strange! Have you tried running this ptsbufsize.py script locally?

Here's a Ruby version too if that's easier to run.
#!/usr/bin/env ruby

require "pty"
require "timeout"

primary, replica = PTY.open

puts "Using devices #{primary.path} ; #{replica.path} as PTY."

begin
  bytes_written = 0
  loop do
    Timeout.timeout(1) do
      replica.write("a")
    end
    bytes_written += 1
  end
rescue Timeout::Error
  puts "Timed out after writing #{bytes_written} bytes."
end

On my machine it stops and outputs this:

Using devices masterpty:/dev/ttys010 ; /dev/ttys010 as PTY.
Timed out after writing 1024 bytes.

I tested in Terminal.app, iTerm2 and Alacritty, in and out of tmux, and always got the same result.
(Supposedly, those don't make a difference anyway, since PTY would be using macOS' TTY/PTY devices...)

I'll try out your suggestion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Edouard-chin Thanks again for the suggestion!

I applied the changes you suggested and it looks like the test still works. That also means we don't need to check available_pty? for this specific test.

After looking at all tests in the file, it looks like only the dbconsole one actually needs to run in a pseudo-terminal.
All the others seem to be doing just fine with IO.pipe, so I changed them to do so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the script, I can reproduce with it 100% of the time with it.
I also wrote a script to run the test in a loop and got it to hang after the test ran 4 times. I suppose there is a race condition which makes the test succeed if the buffer reads the "Listening on" (and thus add available space to the buffer) before the get request is hit.

primary.puts "quit"
end
def test_console_command_work_inside_engine
read, write = IO.pipe
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we want to keep using PTY for this test too, since it's opening an interactive Rails console?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it's probably more agnostic to IRB's behaviour (and ultimately a closer representation of user experience) to keep it here... though honestly pretty unlikely to ever come up as a concern either way.

@matthewd
Copy link
Member

matthewd commented Nov 4, 2024

Great investigation on this!

I do think we should still go with a thread wrapping the get, even with the pipe: we're in less danger of platform variance etc, but I think your first instinct is still correct, in that it's "wrong" to perform a blocking operation and just hope that there's enough buffer space to handle whatever it produces before we start reading.

@davidstosik
Copy link
Contributor Author

@matthewd Thank you for the review!
I checked IO.pipe, and indeed, it also has a buffer size limit, although way bigger than PTY: it's 64kB, which feels a lot more comfortable than PTY's 1kB!

Happy to restore the earlier threaded approach if that still feels better.

Update:

I removed the use of `PTY` for the tests that don't need it.
The original issue was caused by the buffer size of a pseudo terminal
being too small on macOS (1kB).

Note that the `dbconsole` test still use `PTY`, as it appears SQLite
requires a proper terminal to run correctly.

Other tests seem to fare well with `IO.pipe` IOs, reducing complexity.

---

Alright, I'm not going to pretend that I understand exactly what's
happening, but the fact is that this seems to fix something, so I'll try
to explain what I understand.

 ## The problem

Running the following test hangs forever:

```sh-session
$ bin/test test/engine/commands_test.rb:62
Run options: -v --seed 4349

 # Running:

Rails::Engine::CommandsTest#test_server_command_broadcast_logs =

💥 Hangs here
```

Somehow I seem to have reproduced it with the command that would run on
CI:

```sh-session
$ TEST_DIR=engine TESTOPTS="-v -n test_server_command_broadcast_logs"  bundle exec rake test

--- test/engine/commands_test.rb
/Users/sto/.rubies/ruby-3.3.5/bin/ruby -w -Itest -Ilib -I../activesupport/lib -I../actionpack/lib -I../actionview/lib -I../activemodel/lib test/engine/commands_test.rb
Run options: -v -n test_server_command_broadcast_logs --seed 59220

 # Running:

Rails::Engine::CommandsTest#test_server_command_broadcast_logs =

💥 Hangs here
```

I'm gonna venture a guess and assume that maybe this test (and 3 others)
do not run in CI because of this condition:

https://github.com/rails/rails/blob/aacbb5c0f5bdd11f0dee78da03bd6859f0cabeba/railties/test/engine/commands_test.rb#L36

https://github.com/rails/rails/blob/aacbb5c0f5bdd11f0dee78da03bd6859f0cabeba/railties/test/console_helpers.rb#L23-L25

Is there a way we can check if these tests ever ran on CI?

 ## What seems to be happening

I do not understand much of how `PTY` works, but what I have identified
is that

1. The test hangs at `net.get("/")`. Using a bit of debugging, I
   confirmed that I was not able to `curl http://localhost:3000` after
spawning `rails server`.
2. It appears the process is entering a lock where:
   - it will not continue processing and complete the request until
     `rails server`'s output has been "consumed" (`read`, `readlines`)
     out of the `IO` object returned by `PTY.new`.
   - we won't be reading more output out of `primary` since `net.get("/")`
     is blocked

 ## The workaround

I have no idea whether this is an acceptable fix, but parallelizing, and
allowing our test thread to try and read (blocking) the PTY's output,
while a different thread makes the HTTP request (blocking too), seems to
be fixing the problem.

You can try running any of the two commands I shared above and confirm
the test will succeed now.

 ## Questions

- Is there a way we can check if these tests ever ran on CI?
- Is my fix an acceptable approach?
- Has this test ever worked in the past? (Locally/On CI)
- Has anything changed (maybe in Puma), making Puma's output blocking,
  and preventing it to continue processing a request until all its
  output has been consumed?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants