Add split-K heuristic for decode attention

Open Aya-ZIbra opened this issue 2 months ago • 1 comments

Summary: This diff adds an automatic split-K size heuristic for the Blackwell FMHA decode kernel to optimize GPU utilization.

Added get_splitk_heuristic(), which automatically computes an optimal split-K size.

The heuristic ensures split sizes are multiples of TileN (256) and disables split-K when only 1 split would occur.
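As a rough illustration of the two constraints described above, here is a minimal Python sketch. It is not the code from this diff: the function name `splitk_size_sketch`, its parameters, the "fill the SMs" target, and the convention of returning 0 to mean "split-K disabled" are all assumptions made for the example. Only the TileN value of 256, the multiple-of-TileN rule, and the single-split cutoff come from the description.

```python
TILE_N = 256  # TileN as stated in the PR description


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for positive b."""
    return (a + b - 1) // b


def splitk_size_sketch(seq_len_kv: int, batch: int, num_kv_heads: int,
                       num_sms: int, tile_n: int = TILE_N) -> int:
    """Hypothetical split-K size heuristic (illustration only).

    Returns a split size along the KV sequence that is a multiple of
    tile_n, or 0 (assumed convention) when split-K should be disabled.
    """
    if seq_len_kv <= 0:
        return 0

    # Assumption: without split-K, one CTA is launched per (batch, KV head).
    base_ctas = max(1, batch * num_kv_heads)

    # Assumption: aim for roughly one CTA per SM by splitting the KV axis.
    target_splits = max(1, num_sms // base_ctas)

    # Divide the sequence evenly, then round up to a multiple of tile_n.
    raw_size = ceil_div(seq_len_kv, target_splits)
    split_size = ceil_div(raw_size, tile_n) * tile_n

    # Disable split-K when the rounded size leaves only one split.
    if ceil_div(seq_len_kv, split_size) <= 1:
        return 0
    return split_size
```

For example, under these assumptions `splitk_size_sketch(8192, 1, 8, 148)` yields a split size of 512 (16 splits), while a 256-token sequence or a batch large enough to fill the GPU on its own yields 0.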

Performance: Benchmarks show a consistent 15-34% speedup over Triton split-K across all tested configurations.

Reviewed By: jianyuh

Differential Revision: D89016012

Dec 15 '25 16:12 Aya-ZIbra

@Aya-ZIbra has exported this pull request. If you are a Meta employee, you can view the originating Diff in D89016012.

Dec 15 '25 16:12 meta-codesync[bot]

This pull request has been merged in pytorch/FBGEMM@3086dd201373085b01da748f663285f98b1572c8.

Dec 16 '25 22:12 meta-codesync[bot]