gh-127022: Simplify PyStackRef_FromPyObjectSteal #127024

colesbury merged 5 commits into python:main
This gets rid of the immortal check in `PyStackRef_FromPyObjectSteal()`. Overall, this improves performance by about 2% in the free threading build. This also renames `PyStackRef_Is()` to `PyStackRef_IsExactly()` because the macro requires that the tag bits of the arguments match, which is only true in certain special cases.
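As a rough illustration of the shape of this change (all names, types, and tag layouts below are invented stand-ins, not CPython's actual definitions): stealing a reference no longer needs to test for immortality and set a tag bit, and the renamed raw-bits comparison is only correct when the tag bits of both operands already agree.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of a tagged stack reference, loosely modeled on the
 * free-threading build's _PyStackRef. Tag values are illustrative only. */
typedef struct { uintptr_t bits; } StackRef;

#define TAG_DEFERRED ((uintptr_t)1)  /* low bit marks a deferred reference */
#define TAG_MASK     ((uintptr_t)1)

/* Before the change, stealing a reference branched on immortality to decide
 * whether to tag it. After the change, the steal is a plain bit copy. */
static inline StackRef stackref_from_obj_steal(void *obj) {
    return (StackRef){ (uintptr_t)obj };   /* no immortal check, no tagging */
}

/* "Exactly": raw bit equality, so the tag bits of both sides must match. */
static inline int stackref_is_exactly(StackRef a, StackRef b) {
    return a.bits == b.bits;
}

/* Safe variant: masks out the deferred bit before comparing to an object. */
static inline int stackref_refers_to(StackRef a, void *obj) {
    return (a.bits & ~TAG_MASK) == (uintptr_t)obj;
}
```

The masked comparison is always safe; the raw-bits comparison is only valid when the producer of both values guarantees matching tags, which is the point of the rename.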
Force-pushed from 2c43ad0 to 5583ac0.
Said benchmark: https://github.com/facebookexperimental/free-threading-benchmarking/tree/main/results/bm-20241118-3.14.0a1+-ed7085a-NOGIL I was thinking of how this breaks the nice encapsulation we have :(, but a 2% speedup is too good to give up.
Co-authored-by: Pieter Eendebak <[email protected]>
Python/bytecodes.c (outdated)

```diff
 replaced op(_POP_JUMP_IF_TRUE, (cond -- )) {
     assert(PyStackRef_BoolCheck(cond));
-    int flag = PyStackRef_Is(cond, PyStackRef_True);
+    int flag = PyStackRef_IsExactly(cond, PyStackRef_True);
```
Why do we use PyStackRef_IsExactly here (which doesn't mask out the deferred bit) but use PyStackRef_IsFalse (which does mask out the deferred bit) in _POP_JUMP_IF_FALSE above? Is this the rare case where it's safe?
Our codegen ensures that these ops only see True or False. That's often by adding a TO_BOOL immediately before, which may be folded into COMPARE_OP. The preceding TO_BOOL, including in COMPARE_OP, ensures the canonical representation of PyStackRef_False or PyStackRef_True with the deferred bit set.
However, there are two places in codegen.c that omit the TO_BOOL because they have other reasons to know that the result is exactly a boolean:
- codegen.c, lines 678 to 682 at 09c240f
- codegen.c, lines 5746 to 5749 at 09c240f
The COMPARE_OPs here still generate bools, but not always in the canonical representation. So we can either:

- Modify `COMPARE_OP` to ensure the canonical representation like https://github.com/colesbury/cpython/blob/5583ac0c311132e36ef458842e087945898ffdec/Python/bytecodes.c#L2409-L2416
- Use `PyStackRef_IsFalse` (instead of `PyStackRef_IsExactly`) in the `JUMP_IF_FALSE`
- Modify the codegen by inserting `TO_BOOL` in those two spots.
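The canonical-representation issue above can be sketched as follows (names invented for illustration; this is not CPython's actual layout): `TO_BOOL` yields `Py_True`/`Py_False` with the deferred bit set, while a `COMPARE_OP` that skips `TO_BOOL` may yield the same singleton without the bit, so a raw-bits comparison misses the match where a masked comparison does not.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch with invented names. */
typedef struct { uintptr_t bits; } BoolRef;

#define BOOL_TAG_DEFERRED ((uintptr_t)1)

static int py_true_singleton;  /* stand-in for the Py_True singleton */

/* Canonical form: pointer with the deferred bit set (what TO_BOOL yields). */
static BoolRef make_canonical_true(void) {
    return (BoolRef){ (uintptr_t)&py_true_singleton | BOOL_TAG_DEFERRED };
}

/* Non-canonical form: same object, deferred bit clear (what a COMPARE_OP
 * that is not followed by TO_BOOL may yield). */
static BoolRef make_untagged_true(void) {
    return (BoolRef){ (uintptr_t)&py_true_singleton };
}

/* PyStackRef_IsExactly analogue: raw bit equality, tag bits must match. */
static int boolref_is_exactly(BoolRef a, BoolRef b) {
    return a.bits == b.bits;
}

/* PyStackRef_IsTrue analogue: masks the deferred bit, always safe. */
static int boolref_is_true(BoolRef a) {
    return (a.bits & ~BOOL_TAG_DEFERRED) == (uintptr_t)&py_true_singleton;
}
```

Both forms refer to the same object, so the masked check answers correctly for either, while the exact check silently distinguishes them.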
That makes sense, thanks for the explanation. Since using PyStackRef_IsExactly safely is sensitive to code generation changes, I might suggest using it only when we're sure it actually matters for performance, and defaulting everywhere else to the variants that mask out the deferred bits, since those are always safe. I'd guess that this wouldn't affect the performance improvement of this change much, since that should come from avoiding the tagging in _PyStackRef_FromPyObjectSteal. I don't feel super strongly, though.
I'll switch to using PyStackRef_IsFalse and PyStackRef_IsTrue.
I'm no longer convinced that PyStackRef_IsExactly is actually a performance win (and I didn't see it in measurements). I think we have issues with code generation quality that we'll need to address later. Things like POP_JUMP_IF_NONE are composed of _IS_NONE and _POP_JUMP_IF_TRUE and we pack the intermediate result in a tagged _PyStackRef. Clang does a pretty good job of optimizing through it. GCC less so: https://gcc.godbolt.org/z/Ejs8c78qd.
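The pack/unpack pattern described above can be sketched like this (all names are illustrative stand-ins, not the actual uop implementations): `_IS_NONE` boxes its boolean answer into a tagged reference, and `_POP_JUMP_IF_TRUE` unboxes it one uop later, leaving the compiler to optimize the round trip away.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of how a superinstruction like POP_JUMP_IF_NONE decomposes into
 * _IS_NONE + _POP_JUMP_IF_TRUE. Names and tag layout are invented. */
typedef struct { uintptr_t bits; } TaggedRef;

#define TR_DEFERRED ((uintptr_t)1)

static int none_singleton, true_singleton, false_singleton;  /* stand-ins */

/* _IS_NONE analogue: packs the comparison result as a canonical tagged bool. */
static TaggedRef uop_is_none(TaggedRef v) {
    int is_none = (v.bits & ~TR_DEFERRED) == (uintptr_t)&none_singleton;
    void *obj = is_none ? (void *)&true_singleton : (void *)&false_singleton;
    return (TaggedRef){ (uintptr_t)obj | TR_DEFERRED };
}

/* _POP_JUMP_IF_TRUE analogue: unpacks the tagged bool one uop later. */
static bool uop_pop_jump_if_true(TaggedRef cond) {
    return (cond.bits & ~TR_DEFERRED) == (uintptr_t)&true_singleton;
}
```

Whether the boxing disappears in the generated machine code depends on how well the C compiler sees through the intermediate `TaggedRef`, which matches the Clang-vs-GCC observation above.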
No, the previous checks were okay when
Benchmark on most recent changes: https://github.com/facebookexperimental/free-threading-benchmarking/tree/main/results/bm-20241122-3.14.0a1+-a9e4872-NOGIL#vs-base