
Add per-filter spans for distributed tracing #37339

Open
kosamson opened this issue Nov 25, 2024 · 7 comments
Labels: area/tracing, enhancement (Feature requests. Not bugs or questions.)

Comments

kosamson commented Nov 25, 2024

Title: Add per-filter spans for distributed tracing

Description:
Certain API Gateway products enable debugging transactions as they are processed by the data plane at a granular policy or filter level. One example is Google's Apigee Edge, which supports the Trace tool. The Apigee Trace tool allows for troubleshooting client-to-Apigee, Apigee-to-target, and intra-Apigee (between Apigee policies) transaction flows.

In Envoy, the distributed tracing support covers the first two scenarios, but I do not see support for tracing intra-Envoy flows, such as a transaction being processed by a series of L4/L7 filters. As of now, Envoy emits one span for a transaction passing through the data plane to an upstream cluster, and potentially more spans for remote service calls made by HTTP filters such as External Authorization or Rate Limiting.

It would be great if Envoy could emit more granular (internal) spans for each filter processed as part of a single transaction. This would make it easier to troubleshoot scenarios such as determining which synchronous filter(s) are adding latency to the overall transaction. For the most granular data, this might require each filter to implement its own tracing logic. However, it could be useful to generalize some tracing behavior, managed by the Envoy worker thread, across all L4/L7 filters so that we can at least get per-filter processing duration data.
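For context, a minimal sketch of how request tracing is typically enabled today, at the HTTP connection manager level; the Zipkin provider and the `zipkin` cluster name are illustrative, and other providers work the same way. This produces the single per-request span described above, with no per-filter breakdown:

```yaml
# http_connection_manager excerpt: tracing is configured once per HCM, so each
# request gets one Envoy-generated span (plus any spans emitted by filters that
# call out to remote services, e.g. ext_authz or ratelimit).
tracing:
  provider:
    name: envoy.tracers.zipkin
    typed_config:
      "@type": type.googleapis.com/envoy.config.trace.v3.ZipkinConfig
      collector_cluster: zipkin            # assumed cluster name
      collector_endpoint: /api/v2/spans
      collector_endpoint_version: HTTP_JSON
```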

kosamson added the enhancement (Feature requests. Not bugs or questions.) and triage (Issue requires triage) labels on Nov 25, 2024
kosamson (Author) commented Nov 25, 2024

An alternative I have used to troubleshoot potential intra-Envoy latency issues is the %COMMON_DURATION(START:END:PRECISION)% command operator in Envoy access logging. This helps narrow down whether latency is introduced by a downstream/upstream peer or by Envoy itself. However, it does not provide granular enough information about what is causing the latency when Envoy turns out to be the source.
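For reference, a rough sketch of the kind of access-log format I mean. The time-point identifiers passed to %COMMON_DURATION% (DS_RX_BEG, US_TX_BEG, US_RX_END) are from memory and should be checked against the command operator reference for your Envoy version; the stdout logger and field names are just illustrative:

```yaml
access_log:
- name: envoy.access_loggers.stdout
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
    log_format:
      text_format_source:
        # Splits total time into "Envoy-internal before upstream send" vs.
        # "waiting on upstream"; time point names are assumptions.
        inline_string: |
          envoy_rx_to_upstream_tx_ms=%COMMON_DURATION(DS_RX_BEG:US_TX_BEG:ms)% upstream_ms=%COMMON_DURATION(US_TX_BEG:US_RX_END:ms)% total_ms=%DURATION%
```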

kosamson (Author) commented Nov 25, 2024

To get closer parity with the Apigee Trace tool, it might be worth considering adding this per-filter data to the Tap filter. Ideally, for each filter we could see both of the following (a rough sketch of today's Tap filter setup follows the list):

  1. Overall processing duration (useful for troubleshooting transaction latency)
  2. Request/response data (headers, payload, and trailers) before/after the filter processes the transaction (useful for troubleshooting malformed requests/responses impacting business logic)
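
As a concrete starting point, a minimal sketch of how the Tap filter is wired up today, once per HTTP filter chain rather than per filter; the config_id is a hypothetical name, and the actual tap match/output configuration would then be installed via the /tap admin endpoint:

```yaml
http_filters:
- name: envoy.filters.http.tap
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.tap.v3.Tap
    common_config:
      admin_config:
        config_id: per_filter_debug_tap   # hypothetical config id
- name: envoy.filters.http.router
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```

The ask here would be to extend something like this so the tapped output included per-filter timing and before/after request/response snapshots, rather than a single capture for the whole chain.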

RyanTheOptimist added the area/tracing label and removed the triage (Issue requires triage) label on Nov 26, 2024
RyanTheOptimist (Contributor) commented:

@wbpcode

wbpcode (Member) commented Nov 27, 2024

Nothing is free: more detailed tracing means more overhead on the critical path and a more complex code base, and I don't think we can accept a further decrease in Envoy's performance.

In most cases users don't need this, because Envoy's filters are generally very fast. And if there is a performance problem, perf would be more useful than filter-level tracing.

Of course, I don't mean the feature is senseless; I only mean it may not be worth the overhead and investment. I think the best alternative would be the ExecutionContext.

The ExecutionContext provides very fine-grained probes (including at the filter level) that could be extended to record per-filter overhead. By default the ExecutionContext is compiled out, but it's easy to enable it in a third-party build if it's required.

kosamson (Author) commented Nov 27, 2024

> Nothing is free: more detailed tracing means more overhead on the critical path and a more complex code base, and I don't think we can accept a further decrease in Envoy's performance.
>
> In most cases users don't need this, because Envoy's filters are generally very fast. And if there is a performance problem, perf would be more useful than filter-level tracing.
>
> Of course, I don't mean the feature is senseless; I only mean it may not be worth the overhead and investment. I think the best alternative would be the ExecutionContext.
>
> The ExecutionContext provides very fine-grained probes (including at the filter level) that could be extended to record per-filter overhead. By default the ExecutionContext is compiled out, but it's easy to enable it in a third-party build if it's required.

Thank you for the reply, I appreciate the feedback and information on this feature suggestion.

If more detailed tracing will add more overhead/latency to the key data plane path, then I agree that it probably is not worth adding (unless it was in some sort of "debug" binary separate from the standard Envoy binary).

I am not familiar with the perf and ExecutionContext features/components, but let me do some research to see whether they would satisfy what I am looking for.

For ExecutionContext, I found this GitHub issue which explains it in detail, but I am not sure what you mean by perf. Did you mean to refer to envoy-perf?

kosamson (Author) commented:

> In most cases users don't need this, because Envoy's filters are generally very fast. And if there is a performance problem, perf would be more useful than filter-level tracing.

For the standard Envoy filters this is totally reasonable, but my concern is custom filters built in-house by Envoy operators or by Envoy-based vendors (Solo.io, Tetrate, etc.). It is probably still more correct to use envoy-perf and ExecutionContext to debug this in an isolated test/debug environment. However, for hard-to-reproduce production issues, not having a detailed trace available for impacted transactions live in production is a drawback.

wbpcode (Member) commented Nov 27, 2024

Re: perf

I mean the Linux perf tool 🤣
