Skip to content

Avoid emitting probe metrics after context cancellation #793

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

AdamEAnderson
Copy link
Contributor

When a probe is stopped, i.e. via the probe removal process, sometimes the cancellation of the probe context happens while a probe run is in progress. This can cause that probe run to be interrupted mid-execution and return with a failure, which then gets reported through the metrics channel and out to the surfacer as a failed probe, even though the failure was due to probe cancellation rather than an external cause.

This PR adds a check to make sure the context is still live after the probe run is completed but before metrics are emitted. If the context is done, the probe stops without emitting the metrics.

Probe types and their susceptibilities for this issue:

  • dns
    • dns does not have the issue because the probe run is not interrupted by context closure
  • external (non-server)
    • this does have the issue if the context closes during the process execution as that will forcefully kill the process and result in a failure
  • external (server)
    • this does not have the issue because the metrics reporting is bounded by the context, and because process failures are not emitted as event metrics
  • grpc
    • this does not have the issue as metrics are emitted on the next loop after running probes, which involves checking the context again. Individual target cancellations are also handled because all target removals delete the probe result before the next metrics emission cycle runs.
  • http
    • has this issue, is fixed in the PR. Canceled context can lead to failures.
  • ping
    • ping does not have this issue as the inner probe runner does not respect context closure
  • tcp
    • tcp uses the sched utility to run probes, which is susceptible to this failure. sched has been updated in the same way as http
  • udp
    • udp doesn't have the issue because context cancellation will immediately stop the function that does metrics exporting
  • udp listener
    • udp listener doesn't have the issue because context cancellation will immediately stop the function that does metrics exporting

@AdamEAnderson
Copy link
Contributor Author

test passing, should be ready now

Comment on lines +536 to +543

// exit without exporting stats if the probe run was interrupted.
select {
case <-ctx.Done():
return
default:
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since runProbe is not impacted by context, this is pretty much inconsequential, right? I'd drop ping.go from this change.

@@ -422,7 +430,7 @@ func (p *Probe) runProbe(startCtx context.Context) {
if p.mode == "server" {
p.runServerProbe(probeCtx, startCtx)
} else {
p.runOnceProbe(probeCtx)
p.runOnceProbe(probeCtx, startCtx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probeCtx is already tied to startCtx, so you can just use probeCtx.

// exit without saving metrics if the probe was interrupted by context closure
// while running the command
select {
case <-startCtx.Done():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As commented below, you can probably just use ctx here and skip passing startCtx to this method.

@manugarg
Copy link
Contributor

@AdamEAnderson Thanks for this PR. I think it makes sense. I added a few comments.

@manugarg
Copy link
Contributor

Hi @AdamEAnderson, just wanted to check if you want to follow up on this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants