[Serve] Export speculative decoding metrics to OTEL/Prometheus #6184
Open
pperanich wants to merge 1 commit into modular:main from
Conversation
Contributor
fyi @AustinDoolittle / @wrongbad
Contributor
This change makes sense to me. Thank you!
One request: can we keep printing the cumulative acceptance rates instead of the per-batch acceptance rates? We may move the KV cache hit rate to cumulative. Folks have been confused about how to interpret the per-batch cache hit rate in cases like chunked prefill.
Summary
The speculative decoding pipeline computes acceptance stats and logs them to
console via `pretty_format()`, but never publishes them to the telemetry
system. This PR bridges that gap by adding three OTEL counters at the
DETAILED metric level:

- `maxserve.speculative.draft_tokens_generated`
- `maxserve.speculative.draft_tokens_accepted`
- `maxserve.speculative.bonus_tokens_used`: this field already existed on
  `SpeculativeDecodingMetrics` but was never propagated to `BatchMetrics`.
  Adding it required touching the dataclass, `create()`, and the test
  constructor. It could be dropped if deriving it from the other two counters
  is preferred.

`BatchMetrics.publish_metrics()` emits these counters only when
`draft_tokens_generated > 0`, avoiding noise when speculative decoding
is not active.
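As a rough sketch of the conditional emission described above (the `Counter` class, the field names on `BatchMetrics`, and the module-level counter objects are simplified stand-ins for illustration, not the actual MAX serve or OTEL API):

```python
from dataclasses import dataclass


# Hypothetical stand-in for an OTEL counter; real code would register
# opentelemetry counters at the DETAILED metric level instead.
class Counter:
    def __init__(self, name: str) -> None:
        self.name = name
        self.value = 0

    def add(self, amount: int) -> None:
        self.value += amount


draft_tokens_generated = Counter("maxserve.speculative.draft_tokens_generated")
draft_tokens_accepted = Counter("maxserve.speculative.draft_tokens_accepted")
bonus_tokens_used = Counter("maxserve.speculative.bonus_tokens_used")


@dataclass
class BatchMetrics:
    # Illustrative field names, not the actual dataclass fields.
    draft_generated: int = 0
    draft_accepted: int = 0
    bonus_used: int = 0

    def publish_metrics(self) -> None:
        # Emit only when speculative decoding actually ran this batch, so
        # non-speculative deployments produce no extra metric series.
        if self.draft_generated > 0:
            draft_tokens_generated.add(self.draft_generated)
            draft_tokens_accepted.add(self.draft_accepted)
            bonus_tokens_used.add(self.bonus_used)
```

Since OTEL counters are additive, each `publish_metrics()` call contributes only that batch's delta, which is why the per-batch reset below matters.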
Additionally, `SpeculativeDecodingMetrics` is now reset after each batch in
`BatchMetrics.create()`, matching the existing `kv_cache.reset_metrics()`
pattern. Without this reset, `BatchMetrics.create()` reads cumulative
lifetime totals rather than per-batch deltas. This was already an issue
for `pretty_format()` console logs (which showed ever-growing totals
instead of per-batch stats), and would cause the new OTEL counters to
double-count since counters are additive.
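The read-then-reset pattern could be sketched as follows (`create_batch_snapshot` is a hypothetical simplification of `BatchMetrics.create()`, and the two fields shown are illustrative):

```python
class SpeculativeDecodingMetrics:
    """Cumulative counters the pipeline updates while a batch runs."""

    def __init__(self) -> None:
        self.draft_tokens_generated = 0
        self.draft_tokens_accepted = 0

    def reset(self) -> None:
        self.draft_tokens_generated = 0
        self.draft_tokens_accepted = 0


def create_batch_snapshot(spec: SpeculativeDecodingMetrics) -> dict:
    # Snapshot this batch's values, then reset so the next call sees only
    # the next batch's deltas (mirroring the kv_cache.reset_metrics() pattern).
    snapshot = {
        "draft_tokens_generated": spec.draft_tokens_generated,
        "draft_tokens_accepted": spec.draft_tokens_accepted,
    }
    spec.reset()
    return snapshot
```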
Note: This changes the existing `pretty_format()` console log output
from cumulative lifetime stats to per-batch stats (e.g., "Draft Tokens:
15/30" instead of an ever-growing total). Every other metric in the log
line (batch size, throughput, KV cache stats) was already per-batch;
the speculative decoding fields were the only cumulative outlier.
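For illustration, a per-batch log line in the spirit of `pretty_format()` might look like this (the function signature and exact formatting are assumptions, not the actual implementation):

```python
def pretty_format(batch_size: int, accepted: int, generated: int) -> str:
    # Per-batch view: "Draft Tokens: 15/30" resets every batch rather than
    # growing over the server's lifetime.
    rate = accepted / generated if generated else 0.0
    return (
        f"Batch: {batch_size} | "
        f"Draft Tokens: {accepted}/{generated} ({rate:.0%} accepted)"
    )
```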
Testing
- `test_create_resets_speculative_metrics_between_batches`: regression
  test that calls `BatchMetrics.create()` twice with the same
  `SpeculativeDecodingMetrics` object. Verifies the second call returns
  only the new batch's values, not cumulative totals. Fails without the
  `reset()` fix (batch 2 reads 30 instead of 10).

Checklist

- [x] `./bazelw run format` to format my changes
- [x] `Assisted-by:` trailer in my commit message or this PR description

Assisted-by: AI
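The regression test described under Testing could be sketched like this, with `create()` standing in for the relevant slice of `BatchMetrics.create()` (a simplified illustration, not the test as committed):

```python
class SpeculativeDecodingMetrics:
    def __init__(self) -> None:
        self.draft_tokens_generated = 0

    def reset(self) -> None:
        self.draft_tokens_generated = 0


def create(spec: SpeculativeDecodingMetrics) -> int:
    # Simplified stand-in for BatchMetrics.create(): read, then reset.
    value = spec.draft_tokens_generated
    spec.reset()
    return value


def test_create_resets_speculative_metrics_between_batches() -> None:
    spec = SpeculativeDecodingMetrics()
    spec.draft_tokens_generated += 20          # batch 1
    assert create(spec) == 20
    spec.draft_tokens_generated += 10          # batch 2
    # Without reset() inside create(), this would read 30 (cumulative).
    assert create(spec) == 10
```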