Conversation
nandini12396 (Contributor) commented Dec 8, 2025

### Problem:

1. Race condition and quota leaks: Multiple threads could check the fetch quota before any usage had been recorded, allowing all of them to bypass the limit simultaneously (see the race sketch after this list). Additionally, in multi-partition fetches, quota was reserved per partition but could leak if some partitions failed or were throttled, leading to quota exhaustion over time.

2. Startup race condition: RemoteLogManager initialized with default quotas (Long.MAX_VALUE, i.e. unlimited) and relied on dynamic config updates to apply the configured values, creating a window (roughly 100 ms to 5 s) where operations could exceed the configured quotas.
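
A small, hypothetical demo of the check-then-record race in item 1 (none of these names are Kafka's): both fetch threads observe zero recorded usage, both pass the check, and together they exceed the per-window limit.

```java
import java.util.concurrent.atomic.AtomicLong;

// Two concurrent "remote fetches" against a 100-byte limit. Each checks the
// quota, then fetches, then records usage, so both pass the check before
// either has recorded anything.
public class CheckThenRecordRace {
    static final long LIMIT_BYTES = 100;
    static final AtomicLong recordedBytes = new AtomicLong();

    public static void main(String[] args) throws InterruptedException {
        Runnable fetch = () -> {
            if (recordedBytes.get() < LIMIT_BYTES) {     // 1. check quota
                sleep(50);                               // 2. remote fetch runs here
                recordedBytes.addAndGet(90);             // 3. record usage afterwards
                System.out.println(Thread.currentThread().getName() + " dispatched 90 bytes");
            }
        };
        Thread t1 = new Thread(fetch, "fetch-1");
        Thread t2 = new Thread(fetch, "fetch-2");
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println("recorded=" + recordedBytes.get() + ", limit=" + LIMIT_BYTES);
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```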

### Solution:

1. Atomic quota reservation (see the SimpleFetchQuota sketch below)
   - Added RLMQuotaManager.recordAndGetThrottleTimeMs() to atomically record usage and check the quota in a single synchronized operation
   - Added a quotaReservedBytes field to RemoteStorageFetchInfo to track per-partition reservations
   - Modified ReplicaManager to call recordAndCheckFetchQuota() BEFORE dispatching the remote fetch, ensuring quota is reserved atomically based on adjustedMaxBytes
   - If throttled, the reservation is released immediately since the fetch won't execute
   - RemoteLogReader adjusts the quota by the delta (actual - reserved) after the fetch completes
   - On error, the full reservation is released to prevent leaks

2. Eager startup quota initialization (see the startup sketch below)
   - Added BrokerServer.applyRemoteLogQuotas() to eagerly apply quota configs immediately after RemoteLogManager creation
   - Ensures quotas are correct before the broker starts serving requests
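
A minimal, self-contained sketch of the reserve-before-dispatch flow in item 1, assuming a single fixed measurement window; the names here (SimpleFetchQuota, adjust) are illustrative stand-ins, not the actual RLMQuotaManager API, which is built on Kafka's metrics machinery.

```java
// Illustrative model of atomic quota reservation; not the real RLMQuotaManager.
public class SimpleFetchQuota {
    private final double quotaBytesPerSec;  // configured remote fetch quota
    private final long windowStartMs;       // single window for brevity (no rolling)
    private double recordedBytes;           // bytes reserved/recorded so far

    public SimpleFetchQuota(double quotaBytesPerSec, long nowMs) {
        this.quotaBytesPerSec = quotaBytesPerSec;
        this.windowStartMs = nowMs;
    }

    /**
     * Atomically record {@code bytes} and return how long the caller should be
     * throttled. Record and check share one lock, so two threads can no longer
     * both observe "no usage yet" and bypass the limit together. Throttling is
     * driven by usage recorded by earlier requests, so a single oversized fetch
     * is still admitted.
     */
    public synchronized long recordAndGetThrottleTimeMs(long bytes, long nowMs) {
        double elapsedSec = Math.max(nowMs - windowStartMs, 1) / 1000.0;
        double priorBytes = recordedBytes;
        recordedBytes += bytes;                              // reserve up front
        if (priorBytes / elapsedSec <= quotaBytesPerSec)
            return 0;                                        // admit this fetch
        // Throttle until the average rate of previously recorded bytes falls
        // back under the quota; the caller releases the reservation it just made.
        double targetSec = priorBytes / quotaBytesPerSec;
        return Math.round((targetSec - elapsedSec) * 1000);
    }

    /** Adjust an earlier reservation; a negative delta releases bytes. */
    public synchronized void adjust(long deltaBytes) {
        recordedBytes = Math.max(0, recordedBytes + deltaBytes);
    }
}
```

In this model, the ReplicaManager reservation step corresponds to recordAndGetThrottleTimeMs(adjustedMaxBytes, now); a throttled caller releases with adjust(-adjustedMaxBytes) because the fetch never runs, RemoteLogReader settles a completed fetch with adjust(actualBytes - adjustedMaxBytes), and a failed read releases the whole reservation with adjust(-adjustedMaxBytes).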

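And a sketch of the eager startup initialization in item 2; RemoteLogQuotas and createAndApplyRemoteLogQuotas are hypothetical stand-ins for the BrokerServer/RemoteLogManager wiring (kept in Java for consistency with the other sketches). The point is only the ordering: configured values are applied right after construction instead of waiting for the dynamic-config listener.

```java
// Illustrative ordering only; not the actual BrokerServer code.
final class RemoteLogQuotas {
    // Defaults mirror the problem statement: unlimited until configured.
    private volatile long copyBytesPerSec = Long.MAX_VALUE;
    private volatile long fetchBytesPerSec = Long.MAX_VALUE;

    void updateCopyQuota(long bytesPerSec)  { this.copyBytesPerSec = bytesPerSec; }
    void updateFetchQuota(long bytesPerSec) { this.fetchBytesPerSec = bytesPerSec; }
    long copyQuota()  { return copyBytesPerSec; }
    long fetchQuota() { return fetchBytesPerSec; }
}

final class BrokerStartupSketch {
    /**
     * Apply the statically configured quotas immediately after the quota holder
     * is created, before the broker starts serving requests. This closes the
     * window in which the Long.MAX_VALUE defaults are still in effect.
     */
    static RemoteLogQuotas createAndApplyRemoteLogQuotas(long configuredCopyQuota,
                                                         long configuredFetchQuota) {
        RemoteLogQuotas quotas = new RemoteLogQuotas();
        quotas.updateCopyQuota(configuredCopyQuota);      // eager, at startup
        quotas.updateFetchQuota(configuredFetchQuota);
        return quotas;
    }
}
```

The same two setters remain the hooks for dynamic config updates, so later changes still apply; the fix only changes when the first values take effect.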

github-actions bot added the triage, core, storage, and tiered-storage labels on Dec 8, 2025
showuon (Member) commented Dec 10, 2025

@abhijeetk88 @kamalcph @satishd , I think the issue is valid. WDYT?

github-actions bot commented

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

kamalcph (Contributor) commented

Thanks for the PR! I went over a first pass and have the questions below:

> Startup race condition: RemoteLogManager initialized with default quotas (Long.MAX_VALUE = unlimited) and relied on dynamic config updates to apply correct values, creating a window where operations could exceed configured quotas.

Could you add a unit test to cover this case?

> Multiple threads could check quota before any recorded usage, allowing all to bypass limits simultaneously.

Also, cover this case with a unit test to understand the issue.

> Modified ReplicaManager to call recordAndCheckFetchQuota() BEFORE dispatching remote fetch,

If the remoteFetchQuotaBytesPerSecond is set to 25 MB/s and there is a 30 MB message, will the consumer get stuck? The previous behaviour was to allow the consumption to continue.

> On error, releases the full reservation to prevent leaks

How can Kafka differentiate whether the error comes from remote storage or from processing the response? I think this is a good improvement. Usually, we don't expect errors in RLMM or any other Kafka component while processing the response.

nandini12396 (Contributor, Author) commented Dec 16, 2025

Thanks so much for the review and reply! I've updated the PR with the tests.

> If the remoteFetchQuotaBytesPerSecond is set to 25 MB/s and there is a 30 MB message, will the consumer get stuck? The previous behaviour was to allow the consumption to continue.

No. The behavior follows Kafka's standard quota handling pattern used in other fetch paths - we allow at least one fetch to proceed even if it exceeds the quota, but subsequent requests will be throttled to bring the average back within limits.
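
Concretely, a tiny numeric illustration (hypothetical numbers, not taken from the PR): with a 25 MB/s fetch quota and a single 30 MB message, the first fetch is admitted and only the next request is delayed until the average rate is back under the quota.

```java
public class ThrottleExample {
    public static void main(String[] args) {
        double quotaMBPerSec = 25.0;
        double recordedMB = 30.0;    // the oversized fetch that was allowed through
        double elapsedSec = 1.0;     // time elapsed in the measurement window
        // Delay the next fetch until recordedMB / totalSec == quotaMBPerSec.
        double targetSec = recordedMB / quotaMBPerSec;                  // 1.2 s
        long throttleMs = Math.round((targetSec - elapsedSec) * 1000);  // 200 ms
        System.out.println("next remote fetch throttled for " + throttleMs + " ms");
    }
}
```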

> How can Kafka differentiate whether the error comes from remote storage or from processing the response?

You're right that this is an improvement worth discussing. Currently, the implementation releases the quota reservation on any error during remote fetch processing. In practice, though, most errors probably occur after the bandwidth has already been consumed, so releasing the full reservation in those cases would under-count usage. We could refine this to release only on specific error types. What's your preference?

github-actions bot commented

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.
