Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handle worker failures gracefully #1038

Merged
merged 15 commits into from
Oct 3, 2024
Merged

handle worker failures gracefully #1038

merged 15 commits into from
Oct 3, 2024

Conversation

withinboredom
Copy link
Collaborator

Inspired by #1013, this change keeps track of worker failures, implements an exponential backoff, and will eventually panic if there are too many failures.

This is a better alternative to writing hundreds of megabytes per second into log files.

It uses an algorithm to try to detect the difference between a crashing script and a fatal application error that just needs to be logged and can be safely restarted. It does this by waiting for the script to be up for at least "backoff * 2" seconds. If it is up at least that long, it will continue "crash-looping" until the application cannot stay up more than 32 seconds for at least 3 iterations. At that point, FrankenPHP itself will crash.

If we are happy with this algorithm, I'll create some tests and documentation.

@dunglas
Copy link
Owner

dunglas commented Sep 21, 2024

Good idea!

@withinboredom
Copy link
Collaborator Author

I actually, also found another bug where the Worker Ready WaitGroup will never become ready (if a script crashes before calling go_frankenphp_worker_ready, then it will add another to the WG on restart; even if it succeeds on the next start, the WG will never become zero). I'll fix it in this PR as well.

worker.go Outdated Show resolved Hide resolved
worker.go Outdated Show resolved Hide resolved
@@ -93,15 +124,47 @@ func startWorkers(fileName string, nbWorkers int, env PreparedEnv) error {

// TODO: make the max restart configurable
if _, ok := workersRequestChans.Load(absFileName); ok {
workersReadyWG.Add(1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also encountered this bug in the filewatcher branch. My approach was to entirely ignore the waitgroup after workers have started once, aka remove this line and do something like this:

func go_frankenphp_worker_ready() {
	if !workersAreReady.Load() {
		workersReadyWG.Done()
	}
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once metrics are merged, I plan to add a 'workers ready' metric, so I want it to be toggleable.

Copy link
Collaborator Author

@withinboredom withinboredom Sep 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other news, that specific approach has a race condition (I tried it, among many other implementations) because the workersAreReady can become true at different times and cause a deadlock or workersReadyWG to go negative.

Example case of the problem:

  1. all workers become ready at the same time
  2. between the point where the wait group finishes and before workersAreReady is set to true, a worker could crash on the first request and end up here, causing the wait group to go negative and panic. It is unlikely, but possible.

docs/worker.md Outdated Show resolved Hide resolved
@@ -122,6 +122,9 @@ type FrankenPHPContext struct {
// Whether the request is already closed by us
closed sync.Once

// whether the context is ready to receive requests
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure of what it means. The thread?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. basically. This becomes true after the worker has called frankenphp_handle_request for the first time and false upon restart.

The tricky bit here is that we have a WG that is waiting for the worker to call frankenphp_handle_request, but it might never happen (e.g., a crash). In this case, we don't want to increment our WG on restart; so we need some way to track state.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So maybe "whether the thread..." could be more explicit?

testdata/failing-worker.php Outdated Show resolved Hide resolved
testdata/failing-worker.php Outdated Show resolved Hide resolved
@dunglas
Copy link
Owner

dunglas commented Oct 2, 2024

Can we merge this @withinboredom ?

@withinboredom
Copy link
Collaborator Author

Oh! Yes, let me fix the conflicts and add crash metrics! Totally forgot about this PR, I've been sick for a few days.

@withinboredom
Copy link
Collaborator Author

Added crash/restart/ready metrics.

@withinboredom withinboredom merged commit aa585f7 into main Oct 3, 2024
46 checks passed
@withinboredom withinboredom deleted the handle-failures branch October 3, 2024 19:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants