
Extend memory_prefetch framework with intra-command dictPrefetchKeys() API and optimize MGET #14899

Open
mpozniak95 wants to merge 3 commits into redis:unstable from mpozniak95:mget_prefetch_framework

Conversation

@mpozniak95 mpozniak95 commented Mar 17, 2026

Motivation

The existing memory_prefetch.c framework amortizes dictionary lookup cache misses across commands in I/O-thread batches, but provides no mechanism for a single multi-key command to prefetch its own keys. When a command like MGET key1 ... key100 executes, each lookupKeyRead() triggers a chain of cache misses (hash bucket → entry → kvobj → value data). With 5M+ keys, nearly every access is an L3/DRAM miss. For MGET 100 keys this means ~400 sequential cache misses with the CPU stalling on each one.

This PR refactors the prefetch state machine to be reusable and applies it inside mgetCommand as the first consumer.

Changes

1. New DictPrefetchCtx abstraction (memory_prefetch.c)

Extracted a lightweight context struct from the global PrefetchCommandsBatch. All state machine functions (prefetchBucket, prefetchEntry, prefetchKVObject, prefetchValueData) were refactored to operate on a DictPrefetchCtx * parameter instead of accessing the global batch directly. The existing cross-command I/O-thread path creates a DictPrefetchCtx from the global batch fields — behavior is fully backward-compatible.

2. New public API (memory_prefetch.h)

void dictPrefetchKeys(struct dict **dicts, void **keys, size_t nkeys,
                      PrefetchGetValueDataFunc get_val_data);
void *prefetchGetObjectValuePtr(const void *val);

dictPrefetchKeys() stack-allocates KeyPrefetchInfo[nkeys] and runs the same 5-state round-robin machine (BUCKET → ENTRY → KVOBJ → VALDATA → DONE). Zero heap allocations in the hot path.

3. Cross-command awareness (memory_prefetch.c + server.h)

Added PENDING_CMD_KEYS_PREFETCHED flag to pendingCommand. When addCommandToBatch() successfully adds all keys of a command to the cross-command batch, it sets this flag. mgetCommand checks the flag and skips intra-command prefetching when keys are already warm — avoiding redundant work when I/O threads have already prefetched everything.

4. Optimized mgetCommand (t_string.c)

Processes keys in batches of 16 with three fast paths:

  • Single key: direct lookup, zero overhead
  • Empty dict: NULL-reply loop with lookupKeyRead() for side effects
  • Cross-command prefetched: simple loop when I/O threads already warmed the cache

Benchmark results

Tested on an r8i.4xlarge instance running Ubuntu 24.04 with the redis-benchmarks-specification memtier suites; best of 3 iterations.

Single-threaded (no I/O threads)

| Test | DB size | Value size | Main (ops/s) | Optimized (ops/s) | Δ |
|------|---------|------------|--------------|-------------------|---|
| MGET 100 keys | 5M | 100B | 26,325 | 39,573 | +50% |
| MGET 30 keys, P=10 | 5M | 100B | 84,181 | 122,095 | +45% |
| MGET 10 keys | 5M | 100B | 119,446 | 147,563 | +24% |
| MGET 30 keys | 5M | 100B | 74,634 | 91,602 | +23% |
| MGET 20 keys | 5M | 100B | 95,328 | 112,204 | +18% |
| MGET 5 keys | 5M | 512B | 134,749 | 155,004 | +15% |
| MGET 5 keys | 5M | 1KiB | 127,420 | 145,808 | +14% |
| MGET 5 keys, P=10 | 5M | 512B | 280,572 | 314,293 | +12% |
| MGET 5 keys | 5M | 100B | 155,014 | 170,170 | +10% |
| MGET 2 keys | 5M | 100B | 174,118 | 189,762 | +9% |
| MGET 5 keys, P=10 | 5M | 1KiB | 133,109 | 145,478 | +9% |
| MGET 5 keys | 1M | 1KiB | 119,035 | 126,642 | +6% |
| MGET 2 keys, P=10 | 5M | 100B | 885,126 | 927,125 | +5% |
| MGET 100 keys, P=10 | 5M | 100B | 49,941 | 51,870 | +4% |
| MGET 5 keys, P=10 | 5M | 100B | 478,288 | 494,179 | +3% |

Single-threaded: +3% to +50% across all tests, zero regressions.

4 I/O threads

| Test | DB size | Value size | Main (ops/s) | Optimized (ops/s) | Δ |
|------|---------|------------|--------------|-------------------|---|
| MGET 30 keys, P=10 | 5M | 100B | 102,871 | 173,263 | +68% |
| MGET 100 keys, P=10 | 5M | 100B | 28,133 | 44,083 | +57% |
| MGET 5 keys, P=10 | 5M | 100B | 720,426 | 843,673 | +17% |
| MGET 10 keys | 5M | 100B | 376,988 | 413,916 | +10% |
| MGET 2 keys | 5M | 100B | 474,214 | 506,097 | +7% |
| MGET 2 keys, P=10 | 5M | 100B | 1,989,495 | 2,113,546 | +6% |
| MGET 5 keys | 5M | 512B | 453,154 | 467,093 | +3% |
| MGET 30 keys | 5M | 100B | 207,203 | 211,638 | +2% |
| MGET 5 keys, P=10 | 5M | 512B | 522,417 | 530,606 | +2% |
| MGET 5 keys | 5M | 100B | 459,781 | 466,000 | +1% |
| MGET 5 keys, P=10 | 5M | 1KiB | 298,161 | 298,208 | ~0% |
| MGET 5 keys | 5M | 1KiB | 298,122 | 298,262 | ~0% |
| HMGET 5 fields, P=10 | 1M | 100B | 2,042,931 | 2,017,445 | -1% |
| MGET 5 keys | 1M | 1KiB | 135,522 | 131,678 | -3% |
| MGET 20 keys | 5M | 100B | 261,383 | 252,574 | -3% |
| MGET 100 keys | 5M | 100B | 77,969 | 69,382 | -11% |

4 I/O threads: up to +68% on pipelined workloads. Regressions appear on non-pipelined high-key-count tests (20 and 100 keys), possibly due to partial overlap with cross-command prefetching.

16 I/O threads

| Test | DB size | Value size | Main (ops/s) | Optimized (ops/s) | Δ |
|------|---------|------------|--------------|-------------------|---|
| MGET 30 keys, P=10 | 5M | 100B | 114,994 | 179,161 | +56% |
| MGET 100 keys, P=10 | 5M | 100B | 31,648 | 48,894 | +54% |
| MGET 5 keys, P=10 | 5M | 512B | 484,134 | 564,920 | +17% |
| MGET 5 keys, P=10 | 5M | 100B | 533,706 | 612,046 | +15% |
| MGET 5 keys | 5M | 100B | 418,656 | 432,231 | +3% |
| MGET 5 keys | 5M | 512B | 406,153 | 416,785 | +3% |
| MGET 10 keys | 5M | 100B | 323,832 | 331,224 | +2% |
| MGET 2 keys, P=10 | 5M | 100B | 1,491,821 | 1,525,486 | +2% |
| MGET 2 keys | 5M | 100B | 522,589 | 524,531 | ~0% |
| HMGET 5 fields, P=10 | 1M | 100B | 1,715,306 | 1,722,662 | ~0% |
| MGET 5 keys | 5M | 1KiB | 298,251 | 298,136 | ~0% |
| MGET 5 keys, P=10 | 5M | 1KiB | 298,381 | 298,229 | ~0% |
| MGET 30 keys | 5M | 100B | 163,822 | 157,660 | -4% |
| MGET 5 keys | 1M | 1KiB | 136,597 | 130,641 | -4% |
| MGET 20 keys | 5M | 100B | 209,638 | 198,278 | -5% |
| MGET 100 keys | 5M | 100B | 65,570 | 59,888 | -9% |

16 I/O threads: up to +56% on pipelined workloads. Same regression pattern as 4 I/O threads on non-pipelined high-key-count tests.


Note

Medium Risk
Touches performance-critical key-lookup paths by refactoring the prefetch state machine and changing mgetCommand execution flow; includes new stack-based per-batch state and a new pending-command flag that could affect prefetch behavior under load.

Overview
Refactors the dict-lookup prefetch state machine to run off a new DictPrefetchCtx, and exposes it as a public dictPrefetchKeys() API (plus prefetchGetObjectValuePtr() callback) so a single multi-key command can prefetch its own keys.

Optimizes MGET to prefetch keys in batches (default 16) before sequential lookupKeyRead() calls, with fast paths for single-key/empty-dict cases and a skip when keys were already warmed by cross-command batching.

Adds PENDING_CMD_KEYS_PREFETCHED and updates addCommandToBatch() to set it only when all keys for a pending command were included in the cross-command prefetch batch, avoiding redundant intra-command prefetching when batching partially fills.

Written by Cursor Bugbot for commit b0a5456.

…keys

Add PENDING_CMD_KEYS_PREFETCHED flag to pendingCommand. Set it in
addCommandToBatch() when a command is added to the cross-command
prefetch batch. In mgetCommand, check the flag and fall back to
plain sequential lookups when keys are already warm from the batch.

This eliminates redundant prefetch work that caused regressions
of up to -16% with many I/O threads, while preserving the large
gains (+52-54%) for pipelined workloads where the batch overflows.

CLAassistant commented Mar 17, 2026

CLA assistant check
All committers have signed the CLA.

@fcostaoliveira fcostaoliveira requested a review from sundb March 17, 2026 17:22
@sundb sundb requested a review from ShooterIT March 18, 2026 00:33
cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.



/* Stack-allocate per-key state — callers should keep nkeys bounded
* (typically ≤ 16–32) to avoid excessive stack usage. */
KeyPrefetchInfo pf_info[nkeys];

Uninitialized VLA when dictPrefetchKeys receives nkeys of 2

Low Severity

The KeyPrefetchInfo pf_info[nkeys] VLA is stack-allocated but never zero-initialized before ctxInit runs. While ctxInit does set state for every element, it only sets ht_idx, current_entry, and current_kv for keys whose dict is non-null and non-empty. For keys marked PREFETCH_DONE in ctxInit (empty/null dict), the remaining fields like key_hash contain indeterminate values. This is benign today since PREFETCH_DONE keys are skipped, but it's fragile — any future state machine change that reads those fields could silently use garbage data.



/* Run the prefetch state machine — after this call, dict buckets,
* entries, kv objects, and value data for all n keys are in cache. */
dictPrefetchKeys(dicts, keys, n, prefetchGetObjectValuePtr);

Redundant prefetch work on final single-key batch

Low Severity

When the total number of keys is not a multiple of MGET_BATCH (16), the final batch may have n == 1. dictPrefetchKeys returns immediately for nkeys <= 1, so the last key gets no prefetch benefit. However, when the total numkeys == 2, the entire call to dictPrefetchKeys in the first (and only) batch proceeds with nkeys == 2, which is fine. The issue is limited to uneven tail batches of large MGET calls — the last single key misses prefetch while all others get it. This is minor but worth noting.

