gh-138122: Implement frame caching in RemoteUnwinder to reduce memory reads #142137
Merged
21 commits
b2ca1ac  Add last_profiled_frame field to thread state for remote profilers
f36111b  Move write_memory implementation to shared remote_debug.h header
d847cf7  Implement frame caching in RemoteUnwinder to reduce memory reads
19047cf  Enable frame caching by default in the sampling profiler
c56c1e1  Add tests for frame caching in remote debugging
adb0429  Add thread test
315ac74  Add NEWS entry
7d7daa5  Clear stale last_profiled_frame values when initializing RemoteUnwinder
f16f8a5  Fix NEWS entry Sphinx references
b4e3295  Regen stuff
d2aa8a4  Fix frame cache miss handling in RemoteUnwinder to return complete st…
6e88425  make test more reliable
cf15c90  Rework the cache
a9a9efe  make test more reliable
c5307fb  Return only from the frame profiled
5ba13c5  Bump the expected frames
7ecbe13  Add stats
2711907  Regen files
8df2106  Fix frame cache to always re-read current frame for accurate line num…
1944a02  Fixes for small edge cases, docs and cache stats
e528bc1  Fix limits

(all commits by pablogsal)
Add last_profiled_frame field to thread state for remote profilers
Remote profilers that sample call stacks from external processes need to read the entire frame chain on every sample. For deep stacks this is expensive, since most of the stack is typically unchanged between samples.

This adds a `last_profiled_frame` pointer that remote profilers can use to implement a caching optimization. When sampling, a profiler writes the current frame address here. The eval loop then keeps this pointer valid by updating it to the parent frame in `_PyEval_FrameClearAndPop`. This creates a "high-water mark" that always points to a frame still on the stack, allowing profilers to skip reading unchanged portions of the stack.

The check in ceval.c is guarded so there's zero overhead when profiling isn't active (the field starts NULL and the branch is predictable).
commit b2ca1aca447acf9490e0d28fdda314d78d7a571e
TL;DR: This should be obvious, but in case anyone is worried: I did my homework, and this guard check has zero measurable cost; pyperformance shows 1.00x (no change). It's literally free.

This is the only place outside the profiler that is modified, and before anyone wonders whether it has any effect: the answer is that it has literally zero effect. Here is how I measured it.
The guard check added to `_PyEval_FrameClearAndPop` compiles down to just two instructions. During normal operation when no profiler is attached, `last_profiled_frame` is always `NULL`, which means this branch always takes the exact same path. Modern CPUs predict this perfectly after just a few iterations and never mispredict it again. A perfectly predicted branch executes speculatively with zero pipeline stalls, making it effectively free.

I confirmed this using Linux `perf` with hardware performance counters. I ran the Python test suite and some selected tests (test_list, test_tokenize, test_gc, test_dict, test_ast, test_compile) at the maximum sample rate (99,999 Hz), collecting separate data files for CPU cycles, branch mispredictions, and cache misses.

First, I checked how much the entire function contributes to total CPU time:
```
$ perf report -i ./perf_cycles.data --stdio --sort=symbol | grep FrameClearAndPop
```

The whole function is only 0.10% of cycles on P-cores and 0.09% on E-cores, so we're already in negligible territory. But the real question is whether the guard check causes any branch mispredictions.
I checked the function's contribution to total branch misses:
```
$ perf report -i ./perf_branch_misses.data --stdio --sort=symbol | grep FrameClearAndPop
```

The entire function is only 0.11% of total branch misses. But within that, how much does our guard check contribute? I used `perf annotate` to see exactly which instructions caused branch misses within the function. This command reads the branch misprediction samples and maps them to specific instructions, showing what percentage of the function's branch misses occurred at each location. The result for our guard check:
Zero. Not a single branch misprediction was sampled at that instruction across hundreds of thousands of samples. The CPU's branch predictor correctly predicts this branch every single time because it always takes the same path.
For comparison, here's more of the annotated output showing other branches in the same function:
The frame ownership check (`frame->owner == FRAME_OWNED_BY_THREAD`) accounts for 32.25% of the function's branch misses, and the refcount check (`--op->ob_refcnt == 0`) accounts for 21.62%. These are data-dependent branches that the CPU cannot predict perfectly. Our guard check contributes exactly 0.00% because it is perfectly predictable, unlike these other branches that depend on runtime data.

The overall Python branch miss rate is already very low (0.03% of all branches), and the guard check contributes nothing to it.
Finally, I ran pyperformance comparing main (ea51e745c713) against this PR (8d4a83894398). The geometric mean across all benchmarks is 1.00x, confirming no measurable regression in real-world workloads.

All benchmarks:
Benchmark hidden because not significant (85): 2to3, many_optionals, async_tree_none, async_tree_cpu_io_mixed, async_tree_cpu_io_mixed_tg, async_tree_eager, async_tree_eager_cpu_io_mixed, async_tree_eager_cpu_io_mixed_tg, async_tree_eager_io, async_tree_eager_io_tg, async_tree_eager_memoization, async_tree_eager_memoization_tg, async_tree_eager_tg, async_tree_io, async_tree_io_tg, async_tree_memoization, async_tree_memoization_tg, async_tree_none_tg, asyncio_tcp, asyncio_tcp_ssl, asyncio_websockets, chameleon, chaos, bench_mp_pool, bench_thread_pool, coroutines, dask, deepcopy, deepcopy_memo, deltablue, django_template, docutils, dulwich_log, float, create_gc_cycles, gc_traversal, genshi_text, genshi_xml, go, hexiom, html5lib, json_dumps, json_loads, logging_silent, mako, meteor_contest, nqueens, pathlib, pickle, pickle_dict, pickle_list, pidigits, pprint_safe_repr, pprint_pformat, pyflate, python_startup, python_startup_no_site, raytrace, regex_v8, richards, richards_super, scimark_fft, scimark_lu, scimark_sparse_mat_mult, sphinx, sqlalchemy_declarative, sqlalchemy_imperative, sqlglot_v2_normalize, sqlglot_v2_optimize, sqlglot_v2_parse, sqlglot_v2_transpile, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, telco, tomli_loads, tornado_http, typing_runtime_protocols, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse
So the conclusion is that the branch is perfectly predicted, adds no memory traffic beyond reading a value already in L1 cache (tstate is hot), and avoids cache line dirtying when the profiler is not attached. Zero cost.