y-sq
About the test failures:

1. Some `aot_eager` tests failed due to:
```
E RecursionError: maximum recursion depth exceeded while calling a Python object
/opt/conda/envs/venv/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:309: RecursionError
_ test_aot_eager[dtype2-True-ScalingType.DYNAMIC-ScalingType.DELAYED-ScalingType.STATIC-True-True] _
```
From...
@vkuzo, it's not included in Triton 3.1. The PR (#4222) shows up in the diff between 3.1.x and main: https://github.com/triton-lang/triton/compare/release/3.1.x...main
@pytorchbot label "topic: not user facing"
@vkuzo thanks for sharing the error. ~~Let me look into it.~~ Updates: I updated the PR to fix this error, and also attached the trace + generated kernels in the diff's...
@vkuzo thanks for sharing the SAC script. I'll check the SAC case. In SAC, is the following the ideal case for delayed scaling? Activations: - In fwd, amax + cast...
> How does this interact with the cooperative reductions from @jansel ?

@Chillee, currently this change should have no interaction with the cooperative reductions. The option "defer_reduction_split" only takes...
Hi @jansel and @eellison , thanks for your comments about the cooperative reductions. I did some tests to see if it can directly help the fp8 cases. **Context of float8...
Update on cooperative reductions performance (cc @eellison): I re-ran the single reduction case with `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1` as @eellison suggested, but the performance didn't change much: With `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1`: Input shape = `torch.randn(3072,...
I tried different TRITON_MAX_RSPLIT values. With a suitable RSPLIT value, the performance is very close to split_reduction. TRITON_MAX_RSPLIT = 256, ``` TRITON KERNELS BANDWIDTH INFO (/tmp/torchinductor_shuqiyang/tmpf4fc6aqq/fi/cfikuh67balrjcaj27uqvppm44ot7hk6vq4llisvdfwv2ke3ycs5.py) 0.032ms 0.050 GB 1590.97GB/s 100.00%...
@eellison thanks for the quick response. Yes, I'll work on the fx pass solution next, and likely land it in the torchao repo...? (cc @vkuzo the idea of the fx pass...