handle worker failures gracefully #1038
Conversation
Good idea!

I actually also found another bug where the Worker Ready WaitGroup will never become ready (if a script crashes before calling …).
```diff
@@ -93,15 +124,47 @@ func startWorkers(fileName string, nbWorkers int, env PreparedEnv) error {
 	// TODO: make the max restart configurable
 	if _, ok := workersRequestChans.Load(absFileName); ok {
 		workersReadyWG.Add(1)
```
I also encountered this bug in the filewatcher branch. My approach was to entirely ignore the waitgroup after workers have started once, aka remove this line and do something like this:

```go
func go_frankenphp_worker_ready() {
	if !workersAreReady.Load() {
		workersReadyWG.Done()
	}
}
```
Once metrics are merged, I plan to add a 'workers ready' metric, so I want it to be toggleable.
In other news, that specific approach has a race condition (I tried it, among many other implementations), because `workersAreReady` can become true at different times and cause a deadlock, or cause `workersReadyWG` to go negative.

Example case of the problem:

- all workers become ready at the same time
- between the point where the wait group finishes and before `workersAreReady` is set to true, a worker could crash on the first request and end up here, causing the wait group to go negative and panic. It is unlikely, but possible.
```diff
@@ -122,6 +122,9 @@ type FrankenPHPContext struct {
 	// Whether the request is already closed by us
 	closed sync.Once

 	// whether the context is ready to receive requests
```
I'm unsure what it means. The thread?
Yeah, basically. This becomes `true` after the worker has called `frankenphp_handle_request` for the first time, and `false` upon restart.

The tricky bit here is that we have a WaitGroup waiting for the worker to call `frankenphp_handle_request`, but that might never happen (e.g., on a crash). In that case, we don't want to increment our WaitGroup on restart, so we need some way to track state.
So maybe "whether the thread..." could be more explicit?
Can we merge this @withinboredom?

Oh! Yes, let me fix the conflicts and add crash metrics! Totally forgot about this PR; I've been sick for a few days.
Co-authored-by: Kévin Dunglas <[email protected]>
Force-pushed from 0f5140f to 51e90da.
Added crash/restart/ready metrics.
Inspired by #1013, this change keeps track of worker failures, implements an exponential backoff, and will eventually panic if there are too many failures.
This is a better alternative to writing hundreds of megabytes per second into log files.
It uses an algorithm to try to distinguish a crashing script from a fatal application error that just needs to be logged and can safely be restarted. It does this by waiting for the script to stay up for at least "backoff * 2" seconds. If it stays up at least that long, it will continue "crash-looping" until the application cannot stay up for more than 32 seconds for at least 3 iterations. At that point, FrankenPHP itself will crash.
If we are happy with this algorithm, I'll create some tests and documentation.