
Extend memory_prefetch framework with intra-command dictPrefetchKeys() API and optimize MGET #14899

Open
mpozniak95 wants to merge 3 commits into redis:unstable from mpozniak95:mget_prefetch_framework

Conversation

@mpozniak95 mpozniak95 commented Mar 17, 2026

Motivation

The existing memory_prefetch.c framework amortizes dictionary lookup cache misses across commands in I/O-thread batches, but provides no mechanism for a single multi-key command to prefetch its own keys. When a command like MGET key1 ... key100 executes, each lookupKeyRead() triggers a chain of cache misses (hash bucket → entry → kvobj → value data). With 5M+ keys, nearly every access is an L3/DRAM miss. For MGET 100 keys this means ~400 sequential cache misses with the CPU stalling on each one.

This PR refactors the prefetch state machine to be reusable and applies it inside mgetCommand as the first consumer.

Changes

1. New DictPrefetchCtx abstraction (memory_prefetch.c)

Extracted a lightweight context struct from the global PrefetchCommandsBatch. All state machine functions (prefetchBucket, prefetchEntry, prefetchKVObject, prefetchValueData) were refactored to operate on a DictPrefetchCtx * parameter instead of accessing the global batch directly. The existing cross-command I/O-thread path creates a DictPrefetchCtx from the global batch fields — behavior is fully backward-compatible.

2. New public API (memory_prefetch.h)

void dictPrefetchKeys(struct dict **dicts, void **keys, size_t nkeys,
                      PrefetchGetValueDataFunc get_val_data);
void *prefetchGetObjectValuePtr(const void *val);

dictPrefetchKeys() stack-allocates KeyPrefetchInfo[nkeys] and runs the same 5-state round-robin machine (BUCKET → ENTRY → KVOBJ → VALDATA → DONE). Zero heap allocations in the hot path.

3. Cross-command awareness (memory_prefetch.c + server.h)

Added PENDING_CMD_KEYS_PREFETCHED flag to pendingCommand. When addCommandToBatch() successfully adds all keys of a command to the cross-command batch, it sets this flag. mgetCommand checks the flag and skips intra-command prefetching when keys are already warm — avoiding redundant work when I/O threads have already prefetched everything.

4. Optimized mgetCommand (t_string.c)

Processes keys in batches of 16 with three fast paths:

  • Single key: direct lookup, zero overhead
  • Empty dict: NULL-reply loop with lookupKeyRead() for side effects
  • Cross-command prefetched: simple loop when I/O threads already warmed the cache

Benchmark results

Tested on an r8i.4xlarge instance running Ubuntu 24.04 with the redis-benchmarks-specification memtier suites; best of 3 iterations.

Single-threaded (no I/O threads)

| Test | DB size | Value size | Main (ops/s) | Optimized (ops/s) | Δ |
|------|---------|------------|--------------|-------------------|---|
| MGET 100 keys | 5M | 100B | 26,325 | 39,573 | +50% |
| MGET 30 keys, P=10 | 5M | 100B | 84,181 | 122,095 | +45% |
| MGET 10 keys | 5M | 100B | 119,446 | 147,563 | +24% |
| MGET 30 keys | 5M | 100B | 74,634 | 91,602 | +23% |
| MGET 20 keys | 5M | 100B | 95,328 | 112,204 | +18% |
| MGET 5 keys | 5M | 512B | 134,749 | 155,004 | +15% |
| MGET 5 keys | 5M | 1KiB | 127,420 | 145,808 | +14% |
| MGET 5 keys, P=10 | 5M | 512B | 280,572 | 314,293 | +12% |
| MGET 5 keys | 5M | 100B | 155,014 | 170,170 | +10% |
| MGET 2 keys | 5M | 100B | 174,118 | 189,762 | +9% |
| MGET 5 keys, P=10 | 5M | 1KiB | 133,109 | 145,478 | +9% |
| MGET 5 keys | 1M | 1KiB | 119,035 | 126,642 | +6% |
| MGET 2 keys, P=10 | 5M | 100B | 885,126 | 927,125 | +5% |
| MGET 100 keys, P=10 | 5M | 100B | 49,941 | 51,870 | +4% |
| MGET 5 keys, P=10 | 5M | 100B | 478,288 | 494,179 | +3% |

Single-threaded: +3% to +50% across all tests, zero regressions.

4 I/O threads

| Test | DB size | Value size | Main (ops/s) | Optimized (ops/s) | Δ |
|------|---------|------------|--------------|-------------------|---|
| MGET 30 keys, P=10 | 5M | 100B | 102,871 | 173,263 | +68% |
| MGET 100 keys, P=10 | 5M | 100B | 28,133 | 44,083 | +57% |
| MGET 5 keys, P=10 | 5M | 100B | 720,426 | 843,673 | +17% |
| MGET 10 keys | 5M | 100B | 376,988 | 413,916 | +10% |
| MGET 2 keys | 5M | 100B | 474,214 | 506,097 | +7% |
| MGET 2 keys, P=10 | 5M | 100B | 1,989,495 | 2,113,546 | +6% |
| MGET 5 keys | 5M | 512B | 453,154 | 467,093 | +3% |
| MGET 30 keys | 5M | 100B | 207,203 | 211,638 | +2% |
| MGET 5 keys, P=10 | 5M | 512B | 522,417 | 530,606 | +2% |
| MGET 5 keys | 5M | 100B | 459,781 | 466,000 | +1% |
| MGET 5 keys, P=10 | 5M | 1KiB | 298,161 | 298,208 | ~0% |
| MGET 5 keys | 5M | 1KiB | 298,122 | 298,262 | ~0% |
| HMGET 5 fields, P=10 | 1M | 100B | 2,042,931 | 2,017,445 | -1% |
| MGET 5 keys | 1M | 1KiB | 135,522 | 131,678 | -3% |
| MGET 20 keys | 5M | 100B | 261,383 | 252,574 | -3% |
| MGET 100 keys | 5M | 100B | 77,969 | 69,382 | -11% |

4 I/O threads: up to +68% on pipelined workloads. Regressions appear on non-pipelined high-key-count tests (20 and 100 keys), possibly due to partial overlap with cross-command prefetching.

16 I/O threads

| Test | DB size | Value size | Main (ops/s) | Optimized (ops/s) | Δ |
|------|---------|------------|--------------|-------------------|---|
| MGET 30 keys, P=10 | 5M | 100B | 114,994 | 179,161 | +56% |
| MGET 100 keys, P=10 | 5M | 100B | 31,648 | 48,894 | +54% |
| MGET 5 keys, P=10 | 5M | 512B | 484,134 | 564,920 | +17% |
| MGET 5 keys, P=10 | 5M | 100B | 533,706 | 612,046 | +15% |
| MGET 5 keys | 5M | 100B | 418,656 | 432,231 | +3% |
| MGET 5 keys | 5M | 512B | 406,153 | 416,785 | +3% |
| MGET 10 keys | 5M | 100B | 323,832 | 331,224 | +2% |
| MGET 2 keys, P=10 | 5M | 100B | 1,491,821 | 1,525,486 | +2% |
| MGET 2 keys | 5M | 100B | 522,589 | 524,531 | ~0% |
| HMGET 5 fields, P=10 | 1M | 100B | 1,715,306 | 1,722,662 | ~0% |
| MGET 5 keys | 5M | 1KiB | 298,251 | 298,136 | ~0% |
| MGET 5 keys, P=10 | 5M | 1KiB | 298,381 | 298,229 | ~0% |
| MGET 30 keys | 5M | 100B | 163,822 | 157,660 | -4% |
| MGET 5 keys | 1M | 1KiB | 136,597 | 130,641 | -4% |
| MGET 20 keys | 5M | 100B | 209,638 | 198,278 | -5% |
| MGET 100 keys | 5M | 100B | 65,570 | 59,888 | -9% |

16 I/O threads: up to +56% on pipelined workloads. Same regression pattern as 4 I/O threads on non-pipelined high-key-count tests.


Note

Medium Risk
Touches performance-critical key-lookup paths by refactoring the prefetch state machine and changing mgetCommand execution flow; includes new stack-based per-batch state and a new pending-command flag that could affect prefetch behavior under load.

Overview
Refactors the dict-lookup prefetch state machine to run off a new DictPrefetchCtx, and exposes it as a public dictPrefetchKeys() API (plus prefetchGetObjectValuePtr() callback) so a single multi-key command can prefetch its own keys.

Optimizes MGET to prefetch keys in batches (default 16) before sequential lookupKeyRead() calls, with fast paths for single-key/empty-dict cases and a skip when keys were already warmed by cross-command batching.

Adds PENDING_CMD_KEYS_PREFETCHED and updates addCommandToBatch() to set it only when all keys for a pending command were included in the cross-command prefetch batch, avoiding redundant intra-command prefetching when batching partially fills.

Written by Cursor Bugbot for commit b0a5456.

…keys

Add PENDING_CMD_KEYS_PREFETCHED flag to pendingCommand. Set it in
addCommandToBatch() when a command is added to the cross-command
prefetch batch. In mgetCommand, check the flag and fall back to
plain sequential lookups when keys are already warm from the batch.

This eliminates redundant prefetch work that caused regressions
of up to -16% with many I/O threads, while preserving the large
gains (+52-54%) for pipelined workloads where the batch overflows.

CLAassistant commented Mar 17, 2026

CLA assistant check
All committers have signed the CLA.

@fcostaoliveira fcostaoliveira requested a review from sundb March 17, 2026 17:22
@sundb sundb requested a review from ShooterIT March 18, 2026 00:33
cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.



/* Stack-allocate per-key state — callers should keep nkeys bounded
* (typically ≤ 16–32) to avoid excessive stack usage. */
KeyPrefetchInfo pf_info[nkeys];

Uninitialized VLA when dictPrefetchKeys receives nkeys of 2

Low Severity

The KeyPrefetchInfo pf_info[nkeys] VLA is stack-allocated but never zero-initialized before ctxInit runs. While ctxInit does set state for every element, it only sets ht_idx, current_entry, and current_kv for keys whose dict is non-null and non-empty. For keys marked PREFETCH_DONE in ctxInit (empty/null dict), the remaining fields like key_hash contain indeterminate values. This is benign today since PREFETCH_DONE keys are skipped, but it's fragile — any future state machine change that reads those fields could silently use garbage data.



/* Run the prefetch state machine — after this call, dict buckets,
* entries, kv objects, and value data for all n keys are in cache. */
dictPrefetchKeys(dicts, keys, n, prefetchGetObjectValuePtr);

Redundant prefetch work on final single-key batch

Low Severity

When the total number of keys is not a multiple of MGET_BATCH (16), the final batch may have n == 1. dictPrefetchKeys returns immediately for nkeys <= 1, so the last key gets no prefetch benefit. However, when the total numkeys == 2, the entire call to dictPrefetchKeys in the first (and only) batch proceeds with nkeys == 2, which is fine. The issue is limited to uneven tail batches of large MGET calls — the last single key misses prefetch while all others get it. This is minor but worth noting.

