
[Serve] Export speculative decoding metrics to OTEL/Prometheus#6184

Open
pperanich wants to merge 1 commit into modular:main from pperanich:spec-decode-metrics

Conversation

@pperanich

Summary

The speculative decoding pipeline computes acceptance stats and logs them to
console via pretty_format(), but never publishes them to the telemetry
system. This PR bridges that gap by adding three OTEL counters at DETAILED
metric level:

  • maxserve.speculative.draft_tokens_generated
  • maxserve.speculative.draft_tokens_accepted
  • maxserve.speculative.bonus_tokens_used — this field already existed on
    SpeculativeDecodingMetrics but was never propagated to BatchMetrics.
    Adding it required touching the dataclass, create(), and the test
    constructor. It could be dropped if deriving it from the other two
    counters is preferred.

BatchMetrics.publish_metrics() emits these counters only when
draft_tokens_generated > 0, avoiding noise when speculative decoding
is not active.
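As a rough illustration of the gating described above, the sketch below uses a minimal StubCounter stand-in for the real OTEL counter API; the field and metric names follow the PR text, but the surrounding structure is hypothetical, not the actual MAX serve implementation.

```python
from dataclasses import dataclass


@dataclass
class StubCounter:
    """Minimal stand-in for an OTEL counter (counters are additive)."""
    total: int = 0

    def add(self, value: int) -> None:
        self.total += value


# Stand-ins for the three counters named in this PR.
DRAFT_GENERATED = StubCounter()  # maxserve.speculative.draft_tokens_generated
DRAFT_ACCEPTED = StubCounter()   # maxserve.speculative.draft_tokens_accepted
BONUS_USED = StubCounter()       # maxserve.speculative.bonus_tokens_used


@dataclass
class BatchMetrics:
    draft_tokens_generated: int = 0
    draft_tokens_accepted: int = 0
    bonus_tokens_used: int = 0

    def publish_metrics(self) -> None:
        # Only emit when speculative decoding actually ran this batch,
        # avoiding zero-valued noise for non-speculative pipelines.
        if self.draft_tokens_generated > 0:
            DRAFT_GENERATED.add(self.draft_tokens_generated)
            DRAFT_ACCEPTED.add(self.draft_tokens_accepted)
            BONUS_USED.add(self.bonus_tokens_used)


BatchMetrics(draft_tokens_generated=30, draft_tokens_accepted=15).publish_metrics()
BatchMetrics().publish_metrics()  # non-speculative batch: nothing emitted
print(DRAFT_GENERATED.total, DRAFT_ACCEPTED.total)  # 30 15
```

With the guard in place, pipelines that never generate draft tokens produce no speculative time series at all, rather than a stream of zeros.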

Additionally, SpeculativeDecodingMetrics is now reset after each batch in
BatchMetrics.create(), matching the existing kv_cache.reset_metrics()
pattern. Without this reset, BatchMetrics.create() reads cumulative
lifetime totals rather than per-batch deltas. This was already an issue
for pretty_format() console logs (which showed ever-growing totals
instead of per-batch stats), and would cause the new OTEL counters to
double-count since counters are additive.
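The snapshot-then-reset pattern can be sketched as follows; the class and method names mirror the PR description, but create_batch_snapshot() is a hypothetical simplification of what BatchMetrics.create() does with the metrics object.

```python
from dataclasses import dataclass


@dataclass
class SpeculativeDecodingMetrics:
    draft_tokens_generated: int = 0
    draft_tokens_accepted: int = 0

    def reset(self) -> None:
        self.draft_tokens_generated = 0
        self.draft_tokens_accepted = 0


def create_batch_snapshot(spec: SpeculativeDecodingMetrics) -> tuple[int, int]:
    # Read the per-batch delta, then reset so the next call does not
    # see cumulative lifetime totals (which would double-count once
    # exported through additive OTEL counters).
    snapshot = (spec.draft_tokens_generated, spec.draft_tokens_accepted)
    spec.reset()
    return snapshot


spec = SpeculativeDecodingMetrics(draft_tokens_generated=20, draft_tokens_accepted=8)
print(create_batch_snapshot(spec))      # (20, 8)
print(spec.draft_tokens_generated)      # 0 — ready for the next batch
```

This matches the existing kv_cache.reset_metrics() convention: the stats object accumulates during a batch and is drained exactly once per create() call.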

Note: This changes the existing pretty_format() console log output
from cumulative lifetime stats to per-batch stats (e.g., "Draft Tokens:
15/30" instead of an ever-growing total). Every other metric in the log
line (batch size, throughput, KV cache stats) was already per-batch —
the speculative decoding fields were the only cumulative outlier.

Testing

  • test_create_resets_speculative_metrics_between_batches — regression
    test that calls BatchMetrics.create() twice with the same
    SpeculativeDecodingMetrics object. Verifies the second call returns
    only the new batch's values, not cumulative totals. Fails without the
    reset() fix (batch 2 reads 30 instead of 10).
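The shape of that regression test can be sketched like this; it is a self-contained approximation, not the actual test file, and create_batch_snapshot() here stands in for BatchMetrics.create().

```python
from dataclasses import dataclass


@dataclass
class SpeculativeDecodingMetrics:
    draft_tokens_generated: int = 0

    def reset(self) -> None:
        self.draft_tokens_generated = 0


def create_batch_snapshot(spec: SpeculativeDecodingMetrics) -> int:
    # Hypothetical stand-in for BatchMetrics.create(): snapshot, then reset.
    value = spec.draft_tokens_generated
    spec.reset()
    return value


def test_create_resets_speculative_metrics_between_batches() -> None:
    spec = SpeculativeDecodingMetrics()
    spec.draft_tokens_generated += 20  # batch 1 activity
    assert create_batch_snapshot(spec) == 20
    spec.draft_tokens_generated += 10  # batch 2 activity
    # Without the reset() call, this would read the cumulative total
    # instead of the per-batch delta.
    assert create_batch_snapshot(spec) == 10


test_create_resets_speculative_metrics_between_batches()
```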

Checklist

  • PR is small and focused
  • I ran ./bazelw run format to format my changes
  • I added or updated tests to cover my changes
  • If AI tools assisted with this contribution, I have included an
    Assisted-by: trailer in my commit message or this PR description

Assisted-by: AI

@pperanich pperanich requested a review from a team as a code owner March 16, 2026 13:46
@abduld
Contributor

abduld commented Mar 19, 2026

fyi @AustinDoolittle / @wrongbad

Contributor

@atomicapple0 left a comment


This change makes sense to me. Thank you!

One request: can we keep printing the cumulative acceptance rates instead of the per-batch acceptance rates? We may move the KV cache hit rate to cumulative as well. Folks have been confused about how to interpret the per-batch cache hit rate in cases like chunked prefill.

