
daemon: instrument with more tracing; plumb more contexts #51967

Open
corhere wants to merge 5 commits into moby:master from corhere:trace-the-rest-of-the-owl

Conversation


@corhere corhere commented Jan 30, 2026

- What I did

  • Added tracing spans to the event handlers, callback functions, and background goroutines so that all operations stemming from a given event are correlated into a single trace, as we already do for API endpoint handlers.
  • Instrumented libnetwork with even more tracing spans to help with debugging.
  • Plumbed contexts throughout the codebase to propagate trace context down the call stack.

- How I did it

  • grepping for context.TODO()
  • adding context arguments to the Backend interfaces in daemon/server/... and then iterating on adding context arguments to functions until `go vet` passed got me surprisingly far, surprisingly quickly.

- How to verify it
Point a daemon at an OTEL ingestion pipeline and see what happens.
Screenshot 2026-01-29 at 11 45 10 PM
Screenshot 2026-01-29 at 11 44 17 PM

- Human readable description for the release notes

- A picture of a cute animal (not mandatory but encouraged)

Add spans to functions at the "root" of an event to logically group
spans into a single trace per API call, Swarmkit callback, event
handled, or background operation. And trace lots of interesting
libnetwork functions to assist in debugging.

Signed-off-by: Cory Snider <[email protected]>
Lots of interesting setup happens during the bridge and overlay driver
registration. Rough-in context plumbing for more correlated tracing of
driver initialization.

Signed-off-by: Cory Snider <[email protected]>
@corhere corhere added this to the 29.2.1 milestone Jan 30, 2026
@corhere corhere added kind/enhancement Enhancements are not bugs or new features but can improve usability or performance. area/daemon Core Engine labels Jan 30, 2026
@vvoland vvoland left a comment


Initial pass; my main concerns are around not handling cancellation properly during cleanups.

return err
}
defer rwLayer.Release()
defer rwLayer.Release(ctx)

This could leak the layer if ctx is cancelled.

Suggested change
defer rwLayer.Release(ctx)
defer rwLayer.Release(context.WithoutCancel(ctx))


Suggested change
if err := rl.c.LeasesService().Delete(context.WithoutCancel(ctx), lease); err != nil {

fs, err := newLayer.Mount(ctx, "")
if err != nil {
rwLayer.Release()
rwLayer.Release(ctx)

Suggested change
rwLayer.Release(ctx)
rwLayer.Release(context.WithoutCancel(ctx))

archive, err := rwlayer.TarStream()
if err != nil {
_ = rwlayer.Unmount()
_ = rwlayer.Unmount(ctx)

Suggested change
_ = rwlayer.Unmount(ctx)
_ = rwlayer.Unmount(context.WithoutCancel(ctx))

return ioutils.NewReadCloserWrapper(archive, func() error {
_ = archive.Close()
err := rwlayer.Unmount()
err := rwlayer.Unmount(ctx)

Suggested change
err := rwlayer.Unmount(ctx)
err := rwlayer.Unmount(context.WithoutCancel(ctx))

err := data.Close()
container.DetachAndUnmount(daemon.LogVolumeEvent)
daemon.Unmount(container)
daemon.Unmount(ctx, container)

WithoutCancel?

return err
}
defer daemon.Unmount(container)
defer daemon.Unmount(ctx, container)

WithoutCancel?

container.DetachAndUnmount(daemon.LogVolumeEvent)
// unmount the container's rootfs
daemon.Unmount(container)
daemon.Unmount(ctx, container)

WithoutCancel?

err := archv.Close()
container.DetachAndUnmount(daemon.LogVolumeEvent)
daemon.Unmount(container)
daemon.Unmount(ctx, container)

WithoutCancel?


// DaemonLeavesCluster informs the daemon that it has left the cluster
func (daemon *Daemon) DaemonLeavesCluster() {
func (daemon *Daemon) DaemonLeavesCluster(ctx context.Context) {

Do we need WithoutCancel for this whole method?

Contributor Author

Surprisingly, it is desirable for DaemonLeavesCluster calls to be cancelable. This function is called both when the daemon leaves a Swarm cluster and when a clustered daemon shuts down. The latter runs under a deadline.

// shutdownDaemon just wraps daemon.Shutdown() to handle a timeout in case
// d.Shutdown() is waiting too long to kill a container or, worse, is
// blocked there
func shutdownDaemon(ctx context.Context, d *daemon.Daemon) {
var cancel context.CancelFunc
if timeout := d.ShutdownTimeout(); timeout >= 0 {
ctx, cancel = context.WithTimeout(ctx, time.Duration(timeout)*time.Second)
} else {
ctx, cancel = context.WithCancel(ctx)
}
go func() {
defer cancel()
d.Shutdown(ctx)
}()
<-ctx.Done()
if errors.Is(ctx.Err(), context.DeadlineExceeded) {
log.G(ctx).Error("Force shutdown daemon")
} else {
log.G(ctx).Debug("Clean shutdown succeeded")
}
}

Draw the rest of the owl. Wire up contexts from the API handlers,
Swarmkit callbacks, event handler callbacks, and background goroutines
all the way into libnetwork and the containerd client.

Signed-off-by: Cory Snider <[email protected]>
@corhere corhere force-pushed the trace-the-rest-of-the-owl branch from 30467c7 to 14e72ee Compare January 30, 2026 21:08
}
defer func() {
if err := rwl.Unmount(); err != nil {
if err := rwl.Unmount(ctx); err != nil {

WithoutCancel

content = ioutils.NewReadCloserWrapper(data, func() error {
err := data.Close()
_ = cfs.Close()
_ = cfs.Close(ctx)

WithoutCancel

return err
}
defer cfs.Close()
defer cfs.Close(ctx)

WithoutCancel

return err
}
defer daemon.Unmount(ctr)
defer daemon.Unmount(ctx, ctr)

AFAIK the underlying code doesn't actually respect cancellation currently, but I think it's good to be explicit about the intent.

Suggested change
defer daemon.Unmount(ctx, ctr)
defer daemon.Unmount(context.WithoutCancel(ctx), ctr)

logger := log.G(ctx).WithField("container", c.ID)

rwlayer, err := daemon.imageService.GetLayerByID(c.ID)
rwlayer, err := daemon.imageService.GetLayerByID(context.Background(), c.ID)

We have a ctx here we can use:

Suggested change
rwlayer, err := daemon.imageService.GetLayerByID(context.Background(), c.ID)
rwlayer, err := daemon.imageService.GetLayerByID(ctx, c.ID)

logger(c).WithError(err).Warn("failed to mount container to get BaseFs path")
} else {
if err := daemon.Unmount(c); err != nil {
if err := daemon.Unmount(ctx, c); err != nil {

WithoutCancel

return err
}
defer daemon.Unmount(container)
defer daemon.Unmount(ctx, container)

WithoutCancel

return nil, err
}
defer daemon.Unmount(c)
defer daemon.Unmount(ctx, c)

WithoutCancel

defer func() {
if retErr != nil {
err = daemon.cleanupContainer(ctr, backend.ContainerRmConfig{
err = daemon.cleanupContainer(ctx, ctr, backend.ContainerRmConfig{

WithoutCancel

Comment on lines +171 to +172
if err := daemon.removeMountPoints(ctx, ctr, config.RemoveVolume); err != nil {
log.G(ctx).Error(err)

We don't actually return on error (only log), so we should either:

  1. WithoutCancel
  2. Return error

Looks like all other context-aware calls in this method are actually using WithoutCancel already, perhaps we should just make the whole ctx uncancellable:

// at the top
ctx = context.WithoutCancel(ctx)


Or maybe it's fine for the cleanupContainer itself to be able to cancel in the middle, and let the caller decide (AFAIK most of the calls pass an uncancellable context already)?

Not sure where the boundary is tbh.

@vvoland vvoland modified the milestones: 29.2.1, 29.3.0 Feb 2, 2026

Labels

area/builder/buildkit (Build), area/builder/classic-builder (Build), area/builder (Build), area/daemon (Core Engine), area/images (Image Distribution), area/networking (Networking), area/swarm, area/testing, area/volumes (Volumes), containerd-integration (Issues and PRs related to containerd integration), kind/enhancement (Enhancements are not bugs or new features but can improve usability or performance.), platform/windows
