Extend memory_prefetch framework with intra-command dictPrefetchKeys() API and optimize MGET #14899
mpozniak95 wants to merge 3 commits into redis:unstable
Conversation
…keys Add PENDING_CMD_KEYS_PREFETCHED flag to pendingCommand. Set it in addCommandToBatch() when a command is added to the cross-command prefetch batch. In mgetCommand, check the flag and fall back to plain sequential lookups when keys are already warm from the batch. This eliminates redundant prefetch work that caused regressions of up to -16% with many I/O threads, while preserving the large gains (+52-54%) for pipelined workloads where the batch overflows.
```c
/* Stack-allocate per-key state — callers should keep nkeys bounded
 * (typically ≤ 16–32) to avoid excessive stack usage. */
KeyPrefetchInfo pf_info[nkeys];
```
Uninitialized VLA when dictPrefetchKeys receives nkeys of 2
Low Severity
The KeyPrefetchInfo pf_info[nkeys] VLA is stack-allocated but never zero-initialized before ctxInit runs. While ctxInit does set state for every element, it only sets ht_idx, current_entry, and current_kv for keys whose dict is non-null and non-empty. For keys marked PREFETCH_DONE in ctxInit (empty/null dict), the remaining fields like key_hash contain indeterminate values. This is benign today since PREFETCH_DONE keys are skipped, but it's fragile — any future state machine change that reads those fields could silently use garbage data.
```c
/* Run the prefetch state machine — after this call, dict buckets,
 * entries, kv objects, and value data for all n keys are in cache. */
dictPrefetchKeys(dicts, keys, n, prefetchGetObjectValuePtr);
```
Redundant prefetch work on final single-key batch
Low Severity
When the total number of keys is not a multiple of MGET_BATCH (16), the final batch may have n == 1. dictPrefetchKeys returns immediately for nkeys <= 1, so the last key gets no prefetch benefit. However, when the total numkeys == 2, the entire call to dictPrefetchKeys in the first (and only) batch proceeds with nkeys == 2, which is fine. The issue is limited to uneven tail batches of large MGET calls — the last single key misses prefetch while all others get it. This is minor but worth noting.


Motivation
The existing `memory_prefetch.c` framework amortizes dictionary-lookup cache misses across commands in I/O-thread batches, but provides no mechanism for a single multi-key command to prefetch its own keys. When a command like `MGET key1 ... key100` executes, each `lookupKeyRead()` triggers a chain of cache misses (hash bucket → entry → kvobj → value data). With 5M+ keys, nearly every access is an L3/DRAM miss. For an MGET of 100 keys this means ~400 sequential cache misses, with the CPU stalling on each one.

This PR refactors the prefetch state machine to be reusable and applies it inside `mgetCommand` as the first consumer.

Changes
1. New `DictPrefetchCtx` abstraction (`memory_prefetch.c`)

Extracted a lightweight context struct from the global `PrefetchCommandsBatch`. All state-machine functions (`prefetchBucket`, `prefetchEntry`, `prefetchKVObject`, `prefetchValueData`) were refactored to operate on a `DictPrefetchCtx *` parameter instead of accessing the global `batch` directly. The existing cross-command I/O-thread path creates a `DictPrefetchCtx` from the global `batch` fields — behavior is fully backward-compatible.

2. New public API (`memory_prefetch.h`)

`dictPrefetchKeys()` stack-allocates `KeyPrefetchInfo[nkeys]` and runs the same 5-state round-robin machine (BUCKET → ENTRY → KVOBJ → VALDATA → DONE). Zero heap allocations in the hot path.

3. Cross-command awareness (`memory_prefetch.c` + `server.h`)

Added a `PENDING_CMD_KEYS_PREFETCHED` flag to `pendingCommand`. When `addCommandToBatch()` successfully adds all keys of a command to the cross-command batch, it sets this flag. `mgetCommand` checks the flag and skips intra-command prefetching when keys are already warm — avoiding redundant work when I/O threads have already prefetched everything.

4. Optimized `mgetCommand` (`t_string.c`)

Processes keys in batches of 16 with three fast paths:
- `lookupKeyRead()` for side effects

Benchmark results
Tested on r8i.4xlarge, Ubuntu 24.04, with `redis-benchmarks-specification` memtier suites; best of 3 iterations.

Single-threaded (no I/O threads)
Single-threaded: +3% to +50% across all tests, zero regressions.
4 I/O threads
4 I/O threads: up to +68% on pipelined workloads. Regressions on non-pipelined high-key-count tests (20, 100 keys) possibly due to partial overlap with cross-command prefetching.
16 I/O threads
16 I/O threads: up to +56% on pipelined workloads. Same regression pattern as 4 I/O threads on non-pipelined high-key-count tests.
Note
Medium Risk
Touches performance-critical key-lookup paths by refactoring the prefetch state machine and changing `mgetCommand` execution flow; includes new stack-based per-batch state and a new pending-command flag that could affect prefetch behavior under load.

Overview
- Refactors the dict-lookup prefetch state machine to run off a new `DictPrefetchCtx`, and exposes it as a public `dictPrefetchKeys()` API (plus a `prefetchGetObjectValuePtr()` callback) so a single multi-key command can prefetch its own keys.
- Optimizes `MGET` to prefetch keys in batches (default 16) before sequential `lookupKeyRead()` calls, with fast paths for single-key/empty-dict cases and a skip when keys were already warmed by cross-command batching.
- Adds `PENDING_CMD_KEYS_PREFETCHED` and updates `addCommandToBatch()` to set it only when all keys for a pending command were included in the cross-command prefetch batch, avoiding redundant intra-command prefetching when batching partially fills.

Written by Cursor Bugbot for commit b0a5456.