FBGEMM Support bf16 in blackwell cutlass decode attention kernel

Summary:

Reduce the number of pipeline stages to avoid exceeding the shared memory (smem) limit.
Add a static_assert so that an smem capacity violation is raised at compile time rather than at runtime.
Select the TMEM intrinsics based on sizeof(Element).
Update the unit tests to include bf16.
Also label each decode kernel test name with its corresponding test parameters.
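
For illustration, the compile-time smem check and the sizeof(Element) dispatch could look roughly like the sketch below. This is not the code from the diff: every name here (kSmemCapacityBytes, TmemOp8b, TmemOp16b, TmemOpFor, SharedStorage, DecodeKernelTraits) is hypothetical, the capacity value, tile sizes, and stage counts are assumptions, and std::uint8_t / std::uint16_t stand in for the 1-byte and 2-byte (bf16) element types.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Assumed per-block shared-memory budget in bytes. Hypothetical value for
// this sketch; the real limit depends on the target architecture.
constexpr std::size_t kSmemCapacityBytes = 227 * 1024;

// Hypothetical stand-ins for the 8-bit and 16-bit TMEM access paths.
struct TmemOp8b  { static constexpr int kBits = 8; };
struct TmemOp16b { static constexpr int kBits = 16; };

// Select the TMEM op from the element width instead of hard-coding one type.
template <class Element>
struct TmemOpFor {
  static_assert(sizeof(Element) == 1 || sizeof(Element) == 2,
                "only 1-byte and 2-byte elements are handled in this sketch");
  using type = typename std::conditional<sizeof(Element) == 1,
                                         TmemOp8b, TmemOp16b>::type;
};

// Hypothetical staged K/V buffers: one tile of each per pipeline stage.
template <class Element, int TileN, int HeadDim, int Stages>
struct SharedStorage {
  Element k[Stages][TileN * HeadDim];
  Element v[Stages][TileN * HeadDim];
};

template <class Element, int TileN, int HeadDim, int Stages>
struct DecodeKernelTraits {
  using Storage = SharedStorage<Element, TileN, HeadDim, Stages>;
  using TmemOp  = typename TmemOpFor<Element>::type;

  static constexpr std::size_t kSmemBytes = sizeof(Storage);

  // Fails the build, not the launch, when the buffers do not fit.
  static_assert(kSmemBytes <= kSmemCapacityBytes,
                "shared storage exceeds smem capacity; reduce pipeline stages");
};

// A 2-byte element doubles the per-stage footprint, so the stage count has
// to drop. The stage counts below are illustrative only.
using OneByteTraits = DecodeKernelTraits<std::uint8_t, 128, 128, 7>;
using TwoByteTraits = DecodeKernelTraits<std::uint16_t, 128, 128, 3>;
```

Under these assumed sizes, instantiating DecodeKernelTraits<std::uint16_t, 128, 128, 4> would need 262144 bytes and stop the build at the static_assert, which is the failure mode the summary wants to surface during compilation.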

Differential Revision: D82991495

Sep 23 '25 02:09 Aya-ZIbra

Name	Link
Latest commit	0887844f15928ef8facb7fe688ba0918106446b7
Latest deploy log	https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/68d46815002a910008efbd3d
Deploy Preview	https://deploy-preview-4916--pytorch-fbgemm-docs.netlify.app

Sep 23 '25 02:09 netlify[bot]

@Aya-ZIbra has exported this pull request. If you are a Meta employee, you can view the originating diff in D82991495.

Sep 23 '25 02:09 facebook-github-bot

Support bf16 in blackwell cutlass decode attention kernel