gh-131798: Support generator frames in the JIT optimizer
This PR covers the iteration of generators in the JIT optimizer.
Benchmark results: 1.5% faster on AArch64 macOS https://github.com/facebookexperimental/free-threading-benchmarking/blob/main/results/bm-20260102-3.15.0a3%2B-8f7b4f4-JIT/bm-20260102-macm4pro-arm64-Fidget%252dSpinner-cover_more_frames-3.15.0a3%2B-8f7b4f4-vs-base.md
0% faster on x86-64 Linux. https://github.com/facebookexperimental/free-threading-benchmarking/blob/main/results/bm-20260102-3.15.0a3%2B-8f7b4f4-JIT/bm-20260102-vultr-x86_64-Fidget%252dSpinner-cover_more_frames-3.15.0a3%2B-8f7b4f4-vs-base.md
I think the macOS results are a little suspicious, maybe the baseline was when the machine was too hot or something? No clue.
- Issue: gh-131798
When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.
This PR also does loop peeling. We should probably do that, but in another PR where we can assess its merits and when to peel, and when not.
No this doesn't peel. The problem is that generators have a JUMP_BACKWARD in them. If we don't do this, the generators are not traced through and just end early. In fact, none of the generator tests form a loop or pass if I don't add this change. So to test generators, I need this in this PR.
Note that we still close a loop when we see it in one iteration, not a peeled iteration. This only affects tracing inner loops, such as when tracing through a generator, or a double nested for loop. In those cases, the inner loop is now traced one more iteration, which is what we want for generators.
Going through a backward jump is equivalent to loop peeling. Even if the loops are strangely shaped. If we stop tracing on the backwards jump then we should, in theory, still be able to jit generators, just spread across more than one trace. (The first trace stops at the back edge, and the second one extends that back into the caller) Unfortunately we don't record state on exits, so we would lose a lot of information in the second trace if we did that.
I don't suppose you have stats for this?
I don't suppose you have stats for this?
No I do not.
How can you tell the impact of allowing the extra backward edge in the trace? I suspect that it is beneficial, but some data would be good. Any toy examples with a dump of traces?
For this example:
def f(n):
for i in range(n):
yield i + i
def testfunc(n):
for _ in f(n):
pass
import sys
testfunc(4096)
sys._dump_tracelets("hello.gvz")
Before this PR dumps:
digraph ideal {
rankdir = "LR"
executor_0x590ef51207a0 [
shape = none
label = <<table border="0" cellspacing="0">
<tr><td port="start" border="1" ><b>Executor</b></td></tr>
<tr><td border="1" >testfunc: 6</td></tr>
<tr><td port="i0" border="1" >_START_EXECUTOR op0=97920776013728</td></tr>
<tr><td port="i1" border="1" >_MAKE_WARM op0=0</td></tr>
<tr><td port="i2" border="1" >_SET_IP op0=129295294395342</td></tr>
<tr><td port="i3" border="1" >_CHECK_PERIODIC op0=0</td></tr>
<tr><td port="i4" border="1" >_CHECK_VALIDITY op0=0</td></tr>
<tr><td port="i5" border="1" >_SET_IP op0=129295294395336</td></tr>
<tr><td port="i6" border="1" >_FOR_ITER_GEN_FRAME op0=0</td></tr>
<tr><td port="i7" border="1" >_SPILL_OR_RELOAD op0=129295304702230</td></tr>
<tr><td port="i8" border="1" >_PUSH_FRAME op0=129295294845520</td></tr>
<tr><td port="i9" border="1" >_GUARD_IP__PUSH_FRAME op0=129295304702226</td></tr>
<tr><td port="i10" border="1" >_TIER2_RESUME_CHECK op0=0</td></tr>
<tr><td port="i11" border="1" >_SET_IP op0=129295304702228</td></tr>
<tr><td port="i12" border="1" >_SPILL_OR_RELOAD op0=0</td></tr>
<tr><td port="i13" border="1" >_POP_TOP op0=0</td></tr>
<tr><td port="i14" border="1" >_CHECK_VALIDITY op0=0</td></tr>
<tr><td port="i15" border="1" >_SET_IP op0=129295304702230</td></tr>
<tr><td port="i16" border="1" >_CHECK_PERIODIC op0=0</td></tr>
<tr><td port="i17" border="1" >_EXIT_TRACE op0=97920776013856</td></tr>
</table>>
]
}
After this PR:
digraph ideal {
rankdir = "LR"
executor_0x63db39341ae0 [
shape = none
label = <<table border="0" cellspacing="0">
<tr><td port="start" border="1" ><b>Executor</b></td></tr>
<tr><td border="1" >testfunc: 6</td></tr>
<tr><td port="i0" border="1" >_START_EXECUTOR op0=109793208703712</td></tr>
<tr><td port="i1" border="1" >_MAKE_WARM op0=0</td></tr>
<tr><td port="i2" border="1" >_SET_IP op0=128718879585230</td></tr>
<tr><td port="i3" border="1" >_CHECK_PERIODIC op0=0</td></tr>
<tr><td port="i4" border="1" >_CHECK_VALIDITY op0=0</td></tr>
<tr><td port="i5" border="1" >_SET_IP op0=128718879585224</td></tr>
<tr><td port="i6" border="1" >_FOR_ITER_GEN_FRAME op0=0</td></tr>
<tr><td port="i7" border="1" >_SPILL_OR_RELOAD op0=128718889892118</td></tr>
<tr><td port="i8" border="1" >_PUSH_FRAME op0=128718880035408</td></tr>
<tr><td port="i9" border="1" >_GUARD_IP__PUSH_FRAME op0=128718889892114</td></tr>
<tr><td port="i10" border="1" >_TIER2_RESUME_CHECK op0=0</td></tr>
<tr><td port="i11" border="1" >_POP_TOP_NOP op0=0</td></tr>
<tr><td port="i12" border="1" >_SET_IP op0=128718889892118</td></tr>
<tr><td port="i13" border="1" >_CHECK_PERIODIC op0=0</td></tr>
<tr><td port="i14" border="1" >_CHECK_VALIDITY op0=0</td></tr>
<tr><td port="i15" border="1" >_ITER_CHECK_RANGE op0=0</td></tr>
<tr><td port="i16" border="1" >_GUARD_NOT_EXHAUSTED_RANGE op0=0</td></tr>
<tr><td port="i17" border="1" >_ITER_NEXT_RANGE op0=0</td></tr>
<tr><td port="i18" border="1" >_SET_IP op0=128718889892096</td></tr>
<tr><td port="i19" border="1" >_SPILL_OR_RELOAD op0=0</td></tr>
<tr><td port="i20" border="1" >_STORE_FAST_1 op0=0</td></tr>
<tr><td port="i21" border="1" >_CHECK_VALIDITY op0=0</td></tr>
<tr><td port="i22" border="1" >_LOAD_FAST_BORROW_1 op0=0</td></tr>
<tr><td port="i23" border="1" >_LOAD_FAST_BORROW_1 op0=0</td></tr>
<tr><td port="i24" border="1" >_GUARD_TOS_OVERFLOWED op0=0</td></tr>
<tr><td port="i25" border="1" >_BINARY_OP_ADD_INT op0=0</td></tr>
<tr><td port="i26" border="1" >_POP_TOP_NOP op0=0</td></tr>
<tr><td port="i27" border="1" >_POP_TOP_NOP op0=0</td></tr>
<tr><td port="i28" border="1" >_SET_IP op0=128718889892112</td></tr>
<tr><td port="i29" border="1" >_YIELD_VALUE op0=128718880031568</td></tr>
<tr><td port="i30" border="1" >_GUARD_IP_YIELD_VALUE op0=128718879585228</td></tr>
<tr><td port="i31" border="1" >_SET_IP op0=128718879585228</td></tr>
<tr><td port="i32" border="1" >_STORE_FAST_1 op0=0</td></tr>
<tr><td port="i33" border="1" >_JUMP_TO_TOP op0=0</td></tr>
</table>>
]
}
The trace is significantly improved.
In the above example, the trace is wrongly cut short because of the JUMP_BACKWARD caused by the inner generator.
I patched main with just the changes to optimizer_bytecodes.c, but not the change to JUMP_BACKWARD handling, and I get this pair of traces (running for a bit longer to warm the side exit):
digraph ideal {
rankdir = "LR"
executor_0xaca35b8e89d0 [
shape = none
label = <<table border="0" cellspacing="0">
<tr><td port="start" border="1" ><b>Executor</b></td></tr>
<tr><td border="1" >f: 3</td></tr>
<tr><td port="i0" border="1" >_START_EXECUTOR op0=189817615714768</td></tr>
<tr><td port="i1" border="1" >_MAKE_WARM op0=0</td></tr>
<tr><td port="i2" border="1" >_SET_IP op0=248217676888342</td></tr>
<tr><td port="i3" border="1" >_CHECK_PERIODIC op0=0</td></tr>
<tr><td port="i4" border="1" >_CHECK_VALIDITY op0=0</td></tr>
<tr><td port="i5" border="1" >_ITER_CHECK_RANGE op0=0</td></tr>
<tr><td port="i6" border="1" >_GUARD_NOT_EXHAUSTED_RANGE op0=0</td></tr>
<tr><td port="i7" border="1" >_ITER_NEXT_RANGE op0=0</td></tr>
<tr><td port="i8" border="1" >_SET_IP op0=248217676888320</td></tr>
<tr><td port="i9" border="1" >_SWAP_FAST_1 op0=0</td></tr>
<tr><td port="i10" border="1" >_SPILL_OR_RELOAD op0=248217676888336</td></tr>
<tr><td port="i11" border="1" >_POP_TOP op0=0</td></tr>
<tr><td port="i12" border="1" >_CHECK_VALIDITY op0=0</td></tr>
<tr><td port="i13" border="1" >_LOAD_FAST_BORROW_1 op0=0</td></tr>
<tr><td port="i14" border="1" >_LOAD_FAST_BORROW_1 op0=0</td></tr>
<tr><td port="i15" border="1" >_GUARD_TOS_OVERFLOWED op0=0</td></tr>
<tr><td port="i16" border="1" >_BINARY_OP_ADD_INT op0=0</td></tr>
<tr><td port="i17" border="1" >_POP_TOP_NOP op0=0</td></tr>
<tr><td port="i18" border="1" >_POP_TOP_NOP op0=0</td></tr>
<tr><td port="i19" border="1" >_SET_IP op0=248217676888336</td></tr>
<tr><td port="i20" border="1" >_YIELD_VALUE op0=248217676203216</td></tr>
<tr><td port="i21" border="1" >_GUARD_IP_YIELD_VALUE op0=248217675777164</td></tr>
<tr><td port="i22" border="1" >_SET_IP op0=248217675777164</td></tr>
<tr><td port="i23" border="1" >_SWAP_FAST_1 op0=0</td></tr>
<tr><td port="i24" border="1" >_POP_TOP op0=0</td></tr>
<tr><td port="i25" border="1" >_EXIT_TRACE op0=189817615714896</td></tr>
</table>>
]
executor_0xaca35b8e89d0:i25 -> executor_0xaca35b8f5fc0:start
executor_0xaca35b8f5fc0 [
shape = none
label = <<table border="0" cellspacing="0">
<tr><td port="start" border="1" ><b>Executor</b></td></tr>
<tr><td border="1" >testfunc: 6</td></tr>
<tr><td port="i0" border="1" >_START_EXECUTOR op0=189817615769536</td></tr>
<tr><td port="i1" border="1" >_MAKE_WARM op0=0</td></tr>
<tr><td port="i2" border="1" >_SET_IP op0=248217675777166</td></tr>
<tr><td port="i3" border="1" >_CHECK_PERIODIC op0=0</td></tr>
<tr><td port="i4" border="1" >_CHECK_VALIDITY op0=0</td></tr>
<tr><td port="i5" border="1" >_SET_IP op0=248217675777160</td></tr>
<tr><td port="i6" border="1" >_FOR_ITER_GEN_FRAME op0=0</td></tr>
<tr><td port="i7" border="1" >_SPILL_OR_RELOAD op0=248217676888342</td></tr>
<tr><td port="i8" border="1" >_PUSH_FRAME op0=248217675654416</td></tr>
<tr><td port="i9" border="1" >_GUARD_IP__PUSH_FRAME op0=248217676888338</td></tr>
<tr><td port="i10" border="1" >_TIER2_RESUME_CHECK op0=0</td></tr>
<tr><td port="i11" border="1" >_POP_TOP_NOP op0=0</td></tr>
<tr><td port="i12" border="1" >_SET_IP op0=248217676888342</td></tr>
<tr><td port="i13" border="1" >_CHECK_PERIODIC op0=0</td></tr>
<tr><td port="i14" border="1" >_EXIT_TRACE op0=189817615769664</td></tr>
</table>>
]
executor_0xaca35b8f5fc0:i14 -> executor_0xaca35b8e89d0:start
}
Which covers the same code as the single trace in your example.
So, we don't need the changes to JUMP_BACKWARD to support generator frames, but it does give a better trace in this case.
Given that, could we make the changes to backward edge handling in another PR, as I expect they are more general than just for generators and I would like to keep the changes separate.
@markshannon removed the changes and limited it only to the optimizer.
You'll need to update the test, which is a bit awkward as there are two executors to check if you're looking for the _YIELD_VALUE.
We really need a proper test framework for the optimizer.
@markshannon test fixed and verified on my machine.
Android build failure looks unrelated