JohnAlphaIII

2 issues by JohnAlphaIII

Warp kernel crashes for some input data in fp16 and bf16. E.g.
```
[B C T ]
[2, 2, 32768] -- works
[4, 2, 32768] -- doesn't
[2, 4, 32768]...
```

Hi, there is a performance [benchmark](https://github.com/NVIDIA/cutlass/blob/main/media/images/cutlass-3.8-blackwell-gemm-peak-performance.svg) in README.md, but there is no link to the code to reproduce it. Can you please point me to the source code for this...

question
? - Needs Triage