
Conversation

@nandini12396
Contributor

This change addresses high GC pressure by allocating tiered storage fetch buffers in direct (off-heap) memory instead of the JVM heap. When direct memory is exhausted, the system gracefully falls back to heap allocation with a warning.

Problem:
During tiered storage reads, heap-allocated buffers bypass the young generation and go directly to the old generation (humongous allocations). Under high read load, these accumulate rapidly and trigger frequent, expensive G1 Old Generation collections, causing significant GC pauses.

Solution:

  • Introduced DirectBufferPool that pools direct buffers using WeakReferences, allowing GC to reclaim buffers under memory pressure
  • Modified RemoteLogInputStream to use pooled direct buffers instead of per-request heap allocation
  • Graceful fallback to heap allocation when direct memory is exhausted (see the sketch below)
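
A minimal sketch of this pool-plus-fallback idea; the class shape and method names here are illustrative and may differ from the actual code in the PR:

```java
import java.lang.ref.WeakReference;
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch only; the actual DirectBufferPool in this PR may differ.
public class DirectBufferPool {
    // Only weak references are held, so the GC can reclaim pooled buffers
    // (and their native memory, via the buffer's cleaner) under memory pressure.
    private final ConcurrentLinkedQueue<WeakReference<ByteBuffer>> pool = new ConcurrentLinkedQueue<>();

    public ByteBuffer acquire(int size) {
        WeakReference<ByteBuffer> ref;
        while ((ref = pool.poll()) != null) {
            ByteBuffer buffer = ref.get();
            if (buffer != null && buffer.capacity() >= size) {
                buffer.clear();
                buffer.limit(size);
                return buffer;
            }
        }
        try {
            return ByteBuffer.allocateDirect(size);
        } catch (OutOfMemoryError e) {
            // Direct memory exhausted: gracefully fall back to a heap buffer.
            return ByteBuffer.allocate(size);
        }
    }

    public void release(ByteBuffer buffer) {
        if (buffer.isDirect()) {
            pool.offer(new WeakReference<>(buffer));
        }
        // Heap fallback buffers are simply left to normal GC.
    }
}
```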

@github-actions github-actions bot added the triage (PRs from the community), storage (Pull requests that target the storage module), tiered-storage (Related to the Tiered Storage feature), and clients labels Dec 7, 2025
@apoorvmittal10
Contributor

Thanks for the PR. I can see that new metrics have been introduced in this PR, which fall under Monitoring and require a KIP (https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals).

Member

@showuon showuon left a comment


@nandini12396, thanks for the PR! Some high-level comments:

  1. It looks to me that introducing a buffer pool with indirect (heap) buffers could fix the GC pressure issue, right? Is there any other reason we want to use direct buffers?
  2. New configs/metrics need to go through a KIP.

@github-actions github-actions bot removed the triage (PRs from the community) label Dec 11, 2025
@nandini12396
Contributor Author

nandini12396 commented Dec 11, 2025

Hi @showuon, thanks for your review!
Pooling heap buffers would reduce allocation frequency but doesn't eliminate GC pressure. In G1GC, objects >32MB (half the region size) are "humongous" and skip the young generation entirely; they go straight to old gen. Even with pooling, these buffers:

  • Get scanned during every GC cycle (even if reused)
  • Contribute to heap occupancy that triggers GC
  • Can only be collected in expensive mixed/full GCs.

In tiered storage, maxBytes can reach 55MB+ based on replica.fetch.max.bytes and replica.fetch.response.max.bytes. With a 4GB heap at IHOP=35%, just ~25 concurrent fetches (25 × 55MB = 1.375GB) trigger old GC.

Direct buffers move the data off-heap entirely into native memory where GC doesn't see it. We also get zero-copy I/O since the data is already in native memory for socket writes.
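
For context on the zero-copy point: when a heap ByteBuffer is written to a socket channel, the JDK first copies it into an internal temporary direct buffer, whereas a direct buffer can be handed to the OS as-is. A rough illustration, using placeholder channels and sizes rather than anything from this PR:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

public final class DirectRelayExample {
    // Streams a remote segment chunk into a direct buffer and writes it out.
    // With a heap buffer, the JDK copies the bytes into an internal direct
    // buffer before each channel write; a direct buffer skips that extra copy.
    static void relay(ReadableByteChannel in, WritableByteChannel out, int chunkSize) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocateDirect(chunkSize); // off-heap, so it never adds heap occupancy
        while (in.read(buffer) != -1) {
            buffer.flip();
            while (buffer.hasRemaining()) {
                out.write(buffer);
            }
            buffer.clear();
        }
    }
}
```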

To share some test results (heap buffers vs. direct buffers):

  • GC frequency: every ~100ms vs. every 30-40s
  • Heap after GC: 1.1-1.3GB vs. 325MB
  • Humongous regions: 546-689 vs. ~270 (50% reduction)

I will remove the metrics changes from this PR so we can focus on the buffer pool implementation.

@nandini12396 nandini12396 force-pushed the KAFKA-19967-reduce-gc-tiered-storage-direct-memory branch from 9a49d38 to d2592f2 Compare December 15, 2025 13:20
@nandini12396
Copy link
Contributor Author

I've updated the PR without the metrics changes so it focuses on fixing the issue. Could you please take a look and review? Thank you!

Contributor

@kamalcph kamalcph left a comment


Default consumer configs are max.partition.fetch.bytes = 1 MB and fetch.max.bytes = 50 MB. If the messages in the topic are ~1 MB, then a 1 MB heap buffer might be fine for those cases.

We may have to provide another config to choose between direct and heap buffers when enabling the buffer pool, since using direct buffers might cause page faults and potentially impact produce latencies.

Thanks for the patch!
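
A minimal sketch of the kind of direct/heap toggle suggested above; all names here are hypothetical, and DirectBufferPool refers to the earlier sketch, not the PR's actual classes:

```java
import java.nio.ByteBuffer;

// Hypothetical call-site sketch: choose heap or direct allocation based on the
// proposed config. DirectBufferPool here refers to the sketch shown earlier.
final class FetchBufferAllocator {
    private final boolean useDirectBuffers; // would be driven by the new config
    private final DirectBufferPool pool;

    FetchBufferAllocator(boolean useDirectBuffers, DirectBufferPool pool) {
        this.useDirectBuffers = useDirectBuffers;
        this.pool = pool;
    }

    ByteBuffer allocate(int size) {
        // Small fetches (e.g. the ~1 MB consumer defaults) can stay on the heap;
        // large tiered-storage reads use the direct pool only when enabled.
        return useDirectBuffers ? pool.acquire(size) : ByteBuffer.allocate(size);
    }
}
```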

MEDIUM,
REMOTE_LIST_OFFSETS_REQUEST_TIMEOUT_MS_DOC);
REMOTE_LIST_OFFSETS_REQUEST_TIMEOUT_MS_DOC)
.define(REMOTE_LOG_DIRECT_BUFFER_POOL_ENABLED_PROP,
Contributor


could you mark the config as internal until the KIP gets approved?

.define -> .defineInternal
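
For reference, a sketch of what the .defineInternal form could look like, using one of ConfigDef's defineInternal overloads; the key string, default value, and importance below are placeholders, and only the constant name comes from the diff above:

```java
import org.apache.kafka.common.config.ConfigDef;

// Sketch only: the key string, default value, and importance are placeholders;
// the constant name comes from the PR diff above.
public class RemoteLogConfigSketch {
    static final String REMOTE_LOG_DIRECT_BUFFER_POOL_ENABLED_PROP =
            "remote.log.direct.buffer.pool.enable";   // hypothetical key string

    static final ConfigDef CONFIG_DEF = new ConfigDef()
            // defineInternal registers the config without exposing it in the
            // generated public documentation, so it can ship ahead of a KIP.
            .defineInternal(REMOTE_LOG_DIRECT_BUFFER_POOL_ENABLED_PROP,
                    ConfigDef.Type.BOOLEAN,
                    false,
                    ConfigDef.Importance.MEDIUM);
}
```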

Contributor Author


updated

@nandini12396 nandini12396 force-pushed the KAFKA-19967-reduce-gc-tiered-storage-direct-memory branch from d2592f2 to 6da88aa Compare December 16, 2025 22:06
@nandini12396 nandini12396 force-pushed the KAFKA-19967-reduce-gc-tiered-storage-direct-memory branch from 6da88aa to 343a450 Compare December 16, 2025 22:40