
Add per-filter spans for distributed tracing #37339

Open
kosamson opened this issue Nov 25, 2024 · 7 comments
Labels: area/tracing, enhancement (Feature requests. Not bugs or questions.)

Comments

kosamson commented Nov 25, 2024

Title: Add per-filter spans for distributed tracing

Description:
Certain API Gateway products enable debugging transactions as they are processed by the data plane at a granular policy or filter level. One example is Google's Apigee Edge, which supports the Trace tool. The Apigee Trace tool allows for troubleshooting client-to-Apigee, Apigee-to-target, and intra-Apigee (between Apigee policies) transaction flows.

In Envoy, the distributed tracing support covers the first two scenarios, but I do not see support for tracing intra-Envoy flows, such as a transaction being processed by a series of L4/L7 filters. As of now, Envoy emits one span for a transaction passing through the data plane to an upstream cluster, and potentially more spans for remote service calls made by HTTP filters such as External Authorization or Rate Limiting.

It would be great if Envoy could emit more granular (internal) spans for each filter processed as part of a single transaction. This would make it easier to troubleshoot scenarios such as determining which synchronous filter(s) are adding latency to the overall transaction. For the most granular data, this might require each filter to implement its own tracing logic. However, it could be useful to generalize some tracing behavior, managed by the Envoy worker thread, across all L4/L7 filters so that we can at least get per-filter processing duration data.
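For context, a minimal sketch of how request tracing is typically enabled today, at the HTTP connection manager level; the Zipkin provider and the `zipkin` cluster name are illustrative, and other providers work the same way. This produces the single per-request span described above, with no per-filter breakdown:

```yaml
# http_connection_manager excerpt: tracing is configured once per HCM, so each
# request gets one Envoy-generated span (plus any spans emitted by filters that
# call out to remote services, e.g. ext_authz or ratelimit).
tracing:
  provider:
    name: envoy.tracers.zipkin
    typed_config:
      "@type": type.googleapis.com/envoy.config.trace.v3.ZipkinConfig
      collector_cluster: zipkin            # assumed cluster name
      collector_endpoint: /api/v2/spans
      collector_endpoint_version: HTTP_JSON
```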

kosamson added the enhancement (Feature requests. Not bugs or questions.) and triage (Issue requires triage) labels on Nov 25, 2024
kosamson (Author) commented Nov 25, 2024

An alternative I have used to troubleshoot potential intra-Envoy latency issues is the %COMMON_DURATION(START:END:PRECISION)% command operator in Envoy access logging. This helps narrow down whether latency is introduced by a downstream/upstream peer or by Envoy itself. However, it does not provide granular enough information about what is causing the latency when Envoy turns out to be the source.
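For reference, a rough sketch of the kind of access-log format I mean. The time-point identifiers passed to %COMMON_DURATION% (DS_RX_BEG, US_TX_BEG, US_RX_END) are from memory and should be checked against the command operator reference for your Envoy version; the stdout logger and field names are just illustrative:

```yaml
access_log:
- name: envoy.access_loggers.stdout
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
    log_format:
      text_format_source:
        # Splits total time into "Envoy-internal before upstream send" vs.
        # "waiting on upstream"; time point names are assumptions.
        inline_string: |
          envoy_rx_to_upstream_tx_ms=%COMMON_DURATION(DS_RX_BEG:US_TX_BEG:ms)% upstream_ms=%COMMON_DURATION(US_TX_BEG:US_RX_END:ms)% total_ms=%DURATION%
```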

kosamson (Author) commented Nov 25, 2024

To get closer parity with the Apigee Trace tool, it might be worth considering adding this per-filter data to the Tap filter. Ideally, for each filter we could see both of the following (a rough sketch of today's Tap filter setup follows the list):

  1. Overall processing duration (useful for troubleshooting transaction latency)
  2. Request/response data (headers, payload, and trailers) before/after the filter processes the transaction (useful for troubleshooting malformed requests/responses impacting business logic)
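
As a concrete starting point, a minimal sketch of how the Tap filter is wired up today, once per HTTP filter chain rather than per filter; the config_id is a hypothetical name, and the actual tap match/output configuration would then be installed via the /tap admin endpoint:

```yaml
http_filters:
- name: envoy.filters.http.tap
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.tap.v3.Tap
    common_config:
      admin_config:
        config_id: per_filter_debug_tap   # hypothetical config id
- name: envoy.filters.http.router
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```

The ask here would be to extend something like this so the tapped output included per-filter timing and before/after request/response snapshots, rather than a single capture for the whole chain.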

RyanTheOptimist added the area/tracing label and removed the triage (Issue requires triage) label on Nov 26, 2024
RyanTheOptimist (Contributor) commented:

@wbpcode

wbpcode (Member) commented Nov 27, 2024

Nothing is free: more detailed tracing means more overhead on the critical path and a more complex code base, and I don't think we can accept a further decrease in Envoy's performance.

In most cases users don't need this, because Envoy's filters are generally very fast. And if there is a performance problem, perf would be more useful than filter-level tracing.

Of course, I don't mean the feature is senseless; I only mean it may not be worth the overhead and investment. I think the best alternative would be the ExecutionContext.

The ExecutionContext provides very fine-grained probes (including at the filter level) that could be extended to record per-filter overhead. By default the ExecutionContext is compiled out, but it's easy to enable it in a third-party build if it's required.

kosamson (Author) commented Nov 27, 2024

> Nothing is free: more detailed tracing means more overhead on the critical path and a more complex code base, and I don't think we can accept a further decrease in Envoy's performance.
>
> In most cases users don't need this, because Envoy's filters are generally very fast. And if there is a performance problem, perf would be more useful than filter-level tracing.
>
> Of course, I don't mean the feature is senseless; I only mean it may not be worth the overhead and investment. I think the best alternative would be the ExecutionContext.
>
> The ExecutionContext provides very fine-grained probes (including at the filter level) that could be extended to record per-filter overhead. By default the ExecutionContext is compiled out, but it's easy to enable it in a third-party build if it's required.

Thank you for the reply, I appreciate the feedback and information on this feature suggestion.

If more detailed tracing will add more overhead/latency to the key data plane path, then I agree that it probably is not worth adding (unless it was in some sort of "debug" binary separate from the standard Envoy binary).

I am not familiar with the perf and ExecutionContext features/components, but let me do some research to see whether they would satisfy what I am looking for.

For ExecutionContext, I found this GitHub issue which explains it in detail, but I am not sure what you mean by perf. Did you mean to refer to envoy-perf?

kosamson (Author) commented:

> In most cases users don't need this, because Envoy's filters are generally very fast. And if there is a performance problem, perf would be more useful than filter-level tracing.

For the standard Envoy filters this is totally reasonable, but my concern is custom filters built in-house by Envoy operators or by Envoy-based vendors (Solo.io, Tetrate, etc.). It is probably still more correct to use envoy-perf and ExecutionContext to debug this in an isolated test/debug environment. However, for hard-to-reproduce production issues, not having a detailed trace available for impacted transactions live in production is a drawback.

wbpcode (Member) commented Nov 27, 2024

Re: perf

I mean the Linux perf tool 🤣
