-
Notifications
You must be signed in to change notification settings - Fork 100
Avoid emitting probe metrics after context cancellation #793
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Avoid emitting probe metrics after context cancellation #793
Conversation
test passing, should be ready now |
|
||
// exit without exporting stats if the probe run was interrupted. | ||
select { | ||
case <-ctx.Done(): | ||
return | ||
default: | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since runProbe is not impacted by context, this is pretty much inconsequential, right? I'd drop ping.go from this change.
@@ -422,7 +430,7 @@ func (p *Probe) runProbe(startCtx context.Context) { | |||
if p.mode == "server" { | |||
p.runServerProbe(probeCtx, startCtx) | |||
} else { | |||
p.runOnceProbe(probeCtx) | |||
p.runOnceProbe(probeCtx, startCtx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probeCtx is already tied to startCtx, so you can just use probeCtx.
// exit without saving metrics if the probe was interrupted by context closure | ||
// while running the command | ||
select { | ||
case <-startCtx.Done(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As commented below, you can probably just use ctx here and skip passing startCtx to this method.
@AdamEAnderson Thanks for this PR. I think it makes sense. I added a few comments. |
Hi @AdamEAnderson, just wanted to check if you want to follow up on this PR. |
When a probe is stopped, i.e. via the probe removal process, sometimes the cancellation of the probe context happens while a probe run is in progress. This can cause that probe run to be interrupted mid-execution and return with a failure, which then gets reported through the metrics channel and out to the surfacer as a failed probe, even though the failure was due to probe cancellation rather than an external cause.
This PR adds a check to make sure the context is still live after the probe run is completed but before metrics are emitted. If the context is done, the probe stops without emitting the metrics.
Probe types and their susceptibilities for this issue:
sched
utility to run probes, which is susceptible to this failure.sched
has been updated in the same way ashttp