
[Serve] Export speculative decoding metrics to OTEL/Prometheus#6184

Open
pperanich wants to merge 1 commit into modular:main from pperanich:spec-decode-metrics

Conversation

@pperanich

Summary

The speculative decoding pipeline computes acceptance stats and logs them to
console via pretty_format(), but never publishes them to the telemetry
system. This PR bridges that gap by adding three OTEL counters at DETAILED
metric level:

  • maxserve.speculative.draft_tokens_generated
  • maxserve.speculative.draft_tokens_accepted
  • maxserve.speculative.bonus_tokens_used — this field already existed on
    SpeculativeDecodingMetrics but was never propagated to BatchMetrics.
    Adding it required touching the dataclass, create(), and the test
    constructor. It could be dropped if deriving it from the other two
    counters is preferred.

BatchMetrics.publish_metrics() emits these counters only when
draft_tokens_generated > 0, avoiding noise when speculative decoding
is not active.
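As a rough illustration of the gating described above, the sketch below uses a minimal StubCounter stand-in for the real OTEL counter API; the field and metric names follow the PR text, but the surrounding structure is hypothetical, not the actual MAX serve implementation.

```python
from dataclasses import dataclass


@dataclass
class StubCounter:
    """Minimal stand-in for an OTEL counter (counters are additive)."""
    total: int = 0

    def add(self, value: int) -> None:
        self.total += value


# Stand-ins for the three counters named in this PR.
DRAFT_GENERATED = StubCounter()  # maxserve.speculative.draft_tokens_generated
DRAFT_ACCEPTED = StubCounter()   # maxserve.speculative.draft_tokens_accepted
BONUS_USED = StubCounter()       # maxserve.speculative.bonus_tokens_used


@dataclass
class BatchMetrics:
    draft_tokens_generated: int = 0
    draft_tokens_accepted: int = 0
    bonus_tokens_used: int = 0

    def publish_metrics(self) -> None:
        # Only emit when speculative decoding actually ran this batch,
        # avoiding zero-valued noise for non-speculative pipelines.
        if self.draft_tokens_generated > 0:
            DRAFT_GENERATED.add(self.draft_tokens_generated)
            DRAFT_ACCEPTED.add(self.draft_tokens_accepted)
            BONUS_USED.add(self.bonus_tokens_used)


BatchMetrics(draft_tokens_generated=30, draft_tokens_accepted=15).publish_metrics()
BatchMetrics().publish_metrics()  # non-speculative batch: nothing emitted
print(DRAFT_GENERATED.total, DRAFT_ACCEPTED.total)  # 30 15
```

With the guard in place, pipelines that never generate draft tokens produce no speculative time series at all, rather than a stream of zeros.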

Additionally, SpeculativeDecodingMetrics is now reset after each batch in
BatchMetrics.create(), matching the existing kv_cache.reset_metrics()
pattern. Without this reset, BatchMetrics.create() reads cumulative
lifetime totals rather than per-batch deltas. This was already an issue
for pretty_format() console logs (which showed ever-growing totals
instead of per-batch stats), and would cause the new OTEL counters to
double-count since counters are additive.
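The snapshot-then-reset pattern can be sketched as follows; the class and method names mirror the PR description, but create_batch_snapshot() is a hypothetical simplification of what BatchMetrics.create() does with the metrics object.

```python
from dataclasses import dataclass


@dataclass
class SpeculativeDecodingMetrics:
    draft_tokens_generated: int = 0
    draft_tokens_accepted: int = 0

    def reset(self) -> None:
        self.draft_tokens_generated = 0
        self.draft_tokens_accepted = 0


def create_batch_snapshot(spec: SpeculativeDecodingMetrics) -> tuple[int, int]:
    # Read the per-batch delta, then reset so the next call does not
    # see cumulative lifetime totals (which would double-count once
    # exported through additive OTEL counters).
    snapshot = (spec.draft_tokens_generated, spec.draft_tokens_accepted)
    spec.reset()
    return snapshot


spec = SpeculativeDecodingMetrics(draft_tokens_generated=20, draft_tokens_accepted=8)
print(create_batch_snapshot(spec))      # (20, 8)
print(spec.draft_tokens_generated)      # 0 — ready for the next batch
```

This matches the existing kv_cache.reset_metrics() convention: the stats object accumulates during a batch and is drained exactly once per create() call.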

Note: This changes the existing pretty_format() console log output
from cumulative lifetime stats to per-batch stats (e.g., "Draft Tokens:
15/30" instead of an ever-growing total). Every other metric in the log
line (batch size, throughput, KV cache stats) was already per-batch —
the speculative decoding fields were the only cumulative outlier.

Testing

  • test_create_resets_speculative_metrics_between_batches — regression
    test that calls BatchMetrics.create() twice with the same
    SpeculativeDecodingMetrics object. Verifies the second call returns
    only the new batch's values, not cumulative totals. Fails without the
    reset() fix (batch 2 reads 30 instead of 10).
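The shape of that regression test can be sketched like this; it is a self-contained approximation, not the actual test file, and create_batch_snapshot() here stands in for BatchMetrics.create().

```python
from dataclasses import dataclass


@dataclass
class SpeculativeDecodingMetrics:
    draft_tokens_generated: int = 0

    def reset(self) -> None:
        self.draft_tokens_generated = 0


def create_batch_snapshot(spec: SpeculativeDecodingMetrics) -> int:
    # Hypothetical stand-in for BatchMetrics.create(): snapshot, then reset.
    value = spec.draft_tokens_generated
    spec.reset()
    return value


def test_create_resets_speculative_metrics_between_batches() -> None:
    spec = SpeculativeDecodingMetrics()
    spec.draft_tokens_generated += 20  # batch 1 activity
    assert create_batch_snapshot(spec) == 20
    spec.draft_tokens_generated += 10  # batch 2 activity
    # Without the reset() call, this would read the cumulative total
    # instead of the per-batch delta.
    assert create_batch_snapshot(spec) == 10


test_create_resets_speculative_metrics_between_batches()
```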

Checklist

  • PR is small and focused
  • I ran ./bazelw run format to format my changes
  • I added or updated tests to cover my changes
  • If AI tools assisted with this contribution, I have included an
    Assisted-by: trailer in my commit message or this PR description

Assisted-by: AI

@pperanich pperanich requested a review from a team as a code owner March 16, 2026 13:46
@abduld
Contributor

abduld commented Mar 19, 2026

fyi @AustinDoolittle / @wrongbad

Contributor

@atomicapple0 left a comment


This change makes sense to me. Thank you!

One request: can we keep printing the cumulative acceptance rates instead of the per-batch acceptance rates? We may move the KV cache hit rate to cumulative as well. Folks have been confused about how to interpret the per-batch cache hit rate in cases like chunked prefill.

