Add last_profiled_frame field to thread state for remote profilers
Remote profilers that sample call stacks from external processes need to read the entire frame chain on every sample. For deep stacks, this is expensive since most of the stack is typically unchanged between samples.

This adds a `last_profiled_frame` pointer that remote profilers can use to implement a caching optimization. When sampling, a profiler writes the current frame address here. The eval loop then keeps this pointer valid by updating it to the parent frame in _PyEval_FrameClearAndPop. This creates a "high-water mark" that always points to a frame still on the stack, allowing profilers to skip reading unchanged portions of the stack.

The new write in ceval.c is guarded so there is zero overhead when profiling isn't active (the field starts out NULL and the branch is perfectly predictable).
pablogsal committed Dec 1, 2025
commit b2ca1aca447acf9490e0d28fdda314d78d7a571e
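
(For illustration of the caching pattern described in the commit message: a minimal profiler-side sketch. `read_remote`/`write_remote` are hypothetical process_vm_readv-style helpers and the `off_*` parameters stand in for offsets taken from `_Py_DebugOffsets`; none of this code is part of the change itself.)

/* Sketch only: hypothetical remote-profiler sampling loop, not CPython code. */
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical helpers wrapping process_vm_readv()/writev()-style access. */
extern int read_remote(pid_t pid, uintptr_t addr, void *buf, size_t len);
extern int write_remote(pid_t pid, uintptr_t addr, const void *buf, size_t len);

/* Read only the frames pushed since the previous sample.  Frames at and below
 * last_profiled_frame are still on the stack and unchanged, so they can be
 * served from a local cache.  On the first sample last_profiled is 0 (NULL),
 * and the loop simply walks the whole chain. */
static size_t
collect_new_frames(pid_t pid, uintptr_t tstate_addr,
                   size_t off_current, size_t off_last_profiled,
                   size_t off_frame_previous,
                   uintptr_t *out, size_t max_out)
{
    uintptr_t current = 0, last_profiled = 0;
    read_remote(pid, tstate_addr + off_current, &current, sizeof current);
    read_remote(pid, tstate_addr + off_last_profiled, &last_profiled, sizeof last_profiled);

    size_t n = 0;
    uintptr_t frame = current;
    while (frame != 0 && frame != last_profiled && n < max_out) {
        out[n++] = frame;                       /* not seen in the previous sample */
        read_remote(pid, frame + off_frame_previous, &frame, sizeof frame);
    }

    /* Record the new high-water mark; _PyEval_FrameClearAndPop keeps it
     * pointing at a frame that is still on the stack from here on. */
    write_remote(pid, tstate_addr + off_last_profiled, &current, sizeof current);
    return n;
}

A real profiler would additionally pause the target or validate the reads; that is omitted here.
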
2 changes: 2 additions & 0 deletions Include/cpython/pystate.h
@@ -135,6 +135,8 @@ struct _ts {
/* Pointer to currently executing frame. */
struct _PyInterpreterFrame *current_frame;

struct _PyInterpreterFrame *last_profiled_frame;

Py_tracefunc c_profilefunc;
Py_tracefunc c_tracefunc;
PyObject *c_profileobj;
2 changes: 2 additions & 0 deletions Include/internal/pycore_debug_offsets.h
@@ -102,6 +102,7 @@ typedef struct _Py_DebugOffsets {
uint64_t next;
uint64_t interp;
uint64_t current_frame;
uint64_t last_profiled_frame;
uint64_t thread_id;
uint64_t native_thread_id;
uint64_t datastack_chunk;
@@ -272,6 +273,7 @@ typedef struct _Py_DebugOffsets {
.next = offsetof(PyThreadState, next), \
.interp = offsetof(PyThreadState, interp), \
.current_frame = offsetof(PyThreadState, current_frame), \
.last_profiled_frame = offsetof(PyThreadState, last_profiled_frame), \
.thread_id = offsetof(PyThreadState, thread_id), \
.native_thread_id = offsetof(PyThreadState, native_thread_id), \
.datastack_chunk = offsetof(PyThreadState, datastack_chunk), \
18 changes: 18 additions & 0 deletions InternalDocs/frames.md
@@ -111,6 +111,24 @@ The shim frame points to a special code object containing the `INTERPRETER_EXIT`
instruction which cleans up the shim frame and returns.


### Remote Profiling Frame Cache

The `last_profiled_frame` field in `PyThreadState` supports an optimization for
remote profilers that sample call stacks from external processes. When a remote
profiler reads the call stack, it writes the current frame address to this field.
The eval loop then keeps this pointer valid by updating it to the parent frame
whenever a frame returns (in `_PyEval_FrameClearAndPop`).

This creates a "high-water mark" that always points to a frame still on the stack.
On subsequent samples, the profiler can walk from `current_frame` until it reaches
`last_profiled_frame`, knowing that frames from that point downward are unchanged
and can be retrieved from a cache. This significantly reduces the number of remote
memory reads needed when call stacks are deep and stable at their base.

The update in `_PyEval_FrameClearAndPop` is guarded: it only writes when
`last_profiled_frame` is non-NULL, avoiding any overhead when profiling is inactive.
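
(Purely illustrative, not part of the documented text or of this change: once the new frames above `last_profiled_frame` have been read, a full sample can be rebuilt by concatenating them with the cached suffix from the previous sample. The array layout and names below are hypothetical; `cached_start` would be found by the caller, e.g. with a linear search for the pointer value.)

/* Hypothetical sketch: merge newly read frames with the cached suffix.
 *   new_frames[]  - frames read above last_profiled_frame (innermost first)
 *   cached[]      - the previous sample, with cached_start being the index of
 *                   last_profiled_frame within it */
#include <stddef.h>
#include <stdint.h>

static size_t
merge_sample(const uintptr_t *new_frames, size_t n_new,
             const uintptr_t *cached, size_t n_cached, size_t cached_start,
             uintptr_t *out, size_t max_out)
{
    size_t n = 0;
    /* Frames above the high-water mark: freshly read from the target process. */
    for (size_t i = 0; i < n_new && n < max_out; i++) {
        out[n++] = new_frames[i];
    }
    /* Frames at and below the high-water mark: unchanged, reused from the cache. */
    for (size_t i = cached_start; i < n_cached && n < max_out; i++) {
        out[n++] = cached[i];
    }
    return n;
}
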


### The Instruction Pointer

`_PyInterpreterFrame` has two fields which are used to maintain the instruction
7 changes: 7 additions & 0 deletions Python/ceval.c
@@ -2004,6 +2004,13 @@ clear_gen_frame(PyThreadState *tstate, _PyInterpreterFrame * frame)
void
_PyEval_FrameClearAndPop(PyThreadState *tstate, _PyInterpreterFrame * frame)
{
// Update last_profiled_frame for remote profiler frame caching.
// By this point, tstate->current_frame is already set to the parent frame.
// The guarded check avoids writes when profiling is not active (predictable branch).
if (tstate->last_profiled_frame != NULL) {
pablogsal (Member, Author) commented on Dec 1, 2025:

TL;DR: This should be obvious, but in case anyone is worried, I did my homework: the guard check has no measurable cost, and pyperformance shows 1.00x (no change). It's effectively free.


This is the only modification outside the profiler itself, and before anyone wonders whether it has any effect: it has no measurable effect. Here is how I measured it.

The guard check added to _PyEval_FrameClearAndPop compiles down to just two instructions:

278cab: cmpq   $0x0,0x50(%rdi)    ; Compare last_profiled_frame with NULL
278cb0: je     278cba             ; Jump if equal (skip the write)

During normal operation when no profiler is attached, last_profiled_frame is always NULL, which means this branch always takes the exact same path. Modern CPUs predict this perfectly after just a few iterations and never mispredict it again. A perfectly predicted branch executes speculatively with zero pipeline stalls, making it effectively free.

I confirmed this using Linux perf with hardware performance counters. I ran the Python test suite along with some selected tests (test_list, test_tokenize, test_gc, test_dict, test_ast, test_compile) at the maximum sample rate (99,999 Hz), collecting separate data files for CPU cycles, branch mispredictions, and cache misses:

perf record -F 99999 -g -o ./perf_cycles.data -- ./python -m test 
perf record -e branch-misses -F 99999 -g -o ./perf_branch_misses.data -- ./python -m test 

First, I checked how much the entire function contributes to total CPU time:

$ perf report -i ./perf_cycles.data --stdio --sort=symbol | grep FrameClearAndPop
# Samples: 20K of event 'cpu_atom/cycles/P'
     0.10%     0.09%  [.] _PyEval_FrameClearAndPop
# Samples: 422K of event 'cpu_core/cycles/P'
     0.12%     0.10%  [.] _PyEval_FrameClearAndPop

The whole function is only 0.10% of cycles on P-cores and 0.09% on E-cores, so we're already in negligible territory. But the real question is whether the guard check causes any branch mispredictions.

I checked the function's contribution to total branch misses:

$ perf report -i ./perf_branch_misses.data --stdio --sort=symbol | grep FrameClearAndPop
# Samples: 12K of event 'cpu_atom/branch-misses/'
     0.07%     0.06%  [.] _PyEval_FrameClearAndPop
# Samples: 162K of event 'cpu_core/branch-misses/'
     0.11%     0.11%  [.] _PyEval_FrameClearAndPop

The entire function is only 0.11% of total branch misses. But within that, how much does our guard check contribute? I used perf annotate to see exactly which instructions caused branch misses within the function:

$ perf annotate -i ./perf_branch_misses.data _PyEval_FrameClearAndPop --stdio

This command reads the branch misprediction samples and maps them to specific instructions, showing what percentage of the function's branch misses occurred at each location. The result for our guard check:

; The guard check - if (tstate->last_profiled_frame != NULL)
    0.00 :   278cab: cmpq   $0x0,0x50(%rdi)   ; ← 0.00% branch misses
    0.00 :   278cb0: je     278cba            ; ← 0.00% branch misses (PERFECTLY PREDICTED)

Zero. Not a single branch misprediction was sampled at that instruction across hundreds of thousands of samples. The CPU's branch predictor correctly predicts this branch every single time because it always takes the same path.

For comparison, here's more of the annotated output showing other branches in the same function:

    0.00 :   278cb0: je     278cba    ; Guard check - 0.00% misses
   32.25 :   278cbe: jne    278d00    ; frame->owner check - 32.25% misses
   50.13 :   278cc8: call   2c39e0    ; Function call
   21.62 :   278ce5: je     278d60    ; Refcount check - 21.62% misses

The frame ownership check (frame->owner == FRAME_OWNED_BY_THREAD) accounts for 32.25% of the function's branch misses, and the refcount check (--op->ob_refcnt == 0) accounts for 21.62%. These are data-dependent branches that the CPU cannot predict perfectly. Our guard check contributes exactly 0.00% because it is perfectly predictable, unlike these other branches that depend on runtime data.

The overall Python branch miss rate is already very low (0.03% of all branches), and the guard check contributes nothing to this.

Finally, I ran pyperformance comparing main (ea51e745c713) against this PR (8d4a83894398). The geometric mean across all benchmarks is 1.00x, confirming no measurable regression in real-world workloads:

Pyperformance run:

All benchmarks:

| Benchmark | main-ea51e745c713 | PR-8d4a83894398 |
|---|---|---|
| subparsers | 100 ms | 97.8 ms: 1.02x faster |
| async_generators | 646 ms | 664 ms: 1.03x slower |
| bpe_tokeniser | 6.87 sec | 7.06 sec: 1.03x slower |
| comprehensions | 27.3 us | 26.6 us: 1.03x faster |
| coverage | 134 ms | 131 ms: 1.02x faster |
| crypto_pyaes | 121 ms | 124 ms: 1.02x slower |
| deepcopy_reduce | 6.94 us | 6.76 us: 1.03x faster |
| fannkuch | 656 ms | 640 ms: 1.02x faster |
| generators | 45.4 ms | 44.4 ms: 1.02x faster |
| logging_format | 17.4 us | 17.0 us: 1.02x faster |
| logging_simple | 15.2 us | 14.8 us: 1.03x faster |
| mdp | 2.09 sec | 2.13 sec: 1.02x slower |
| nbody | 176 ms | 180 ms: 1.02x slower |
| pickle_pure_python | 647 us | 630 us: 1.03x faster |
| regex_compile | 227 ms | 223 ms: 1.02x faster |
| regex_dna | 255 ms | 262 ms: 1.03x slower |
| regex_effbot | 4.23 ms | 4.34 ms: 1.03x slower |
| scimark_monte_carlo | 108 ms | 110 ms: 1.02x slower |
| scimark_sor | 188 ms | 184 ms: 1.02x faster |
| spectral_norm | 154 ms | 150 ms: 1.03x faster |
| unpack_sequence | 69.2 ns | 71.1 ns: 1.03x slower |
| xdsl_constant_fold | 97.6 ms | 95.4 ms: 1.02x faster |
| xml_etree_generate | 148 ms | 144 ms: 1.03x faster |
| xml_etree_process | 107 ms | 105 ms: 1.02x faster |
| Geometric mean | (ref) | 1.00x faster |

Benchmark hidden because not significant (85): 2to3, many_optionals, async_tree_none, async_tree_cpu_io_mixed, async_tree_cpu_io_mixed_tg, async_tree_eager, async_tree_eager_cpu_io_mixed, async_tree_eager_cpu_io_mixed_tg, async_tree_eager_io, async_tree_eager_io_tg, async_tree_eager_memoization, async_tree_eager_memoization_tg, async_tree_eager_tg, async_tree_io, async_tree_io_tg, async_tree_memoization, async_tree_memoization_tg, async_tree_none_tg, asyncio_tcp, asyncio_tcp_ssl, asyncio_websockets, chameleon, chaos, bench_mp_pool, bench_thread_pool, coroutines, dask, deepcopy, deepcopy_memo, deltablue, django_template, docutils, dulwich_log, float, create_gc_cycles, gc_traversal, genshi_text, genshi_xml, go, hexiom, html5lib, json_dumps, json_loads, logging_silent, mako, meteor_contest, nqueens, pathlib, pickle, pickle_dict, pickle_list, pidigits, pprint_safe_repr, pprint_pformat, pyflate, python_startup, python_startup_no_site, raytrace, regex_v8, richards, richards_super, scimark_fft, scimark_lu, scimark_sparse_mat_mult, sphinx, sqlalchemy_declarative, sqlalchemy_imperative, sqlglot_v2_normalize, sqlglot_v2_optimize, sqlglot_v2_parse, sqlglot_v2_transpile, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, telco, tomli_loads, tornado_http, typing_runtime_protocols, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse

So the conclusion is that the branch is perfectly predicted, adds no memory traffic beyond reading a value that is already in the L1 cache (tstate is hot), and avoids dirtying a cache line when the profiler is not attached. Zero measurable cost.

tstate->last_profiled_frame = tstate->current_frame;
}

if (frame->owner == FRAME_OWNED_BY_THREAD) {
clear_thread_frame(tstate, frame);
}