Pradeep Ramani
The `BLOCK_M >= 128` requirement likely comes from the fact that only the `cooperative` kernel schedule is supported today in CUTLASS 3.5. If support for other kernel schedules is added (`tma_warpspecialized` or `tma_warpspecialized_pingpong`) -...
It's possible to do Row x Row, but it is not as performant, since we need to transpose on the fly. @hwu36
Could you share more info on exactly which C++ kernel is being picked in both cases? You may have to pick a custom tile size instead of what the...
The CTA tile is roughly decided based on the following:
* An integer multiple of the MMA instruction shape M, N, K
* Having 128 bytes in the contiguous dimension (not...
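The two rules of thumb above can be sketched as compile-time checks. This is only an illustration: `Shape3`, `is_multiple_of_mma`, and `fills_128_bytes` are hypothetical helpers, not CUTLASS or CuTe APIs, and reading the second rule as "at least 128 bytes" is an assumption, since the original comment is cut off.

```cpp
#include <cstddef>

// Hypothetical helper type for an (M, N, K) shape; not part of CUTLASS or CuTe.
struct Shape3 {
    int m, n, k;
};

// Rule 1: every CTA tile extent is an integer multiple of the
// corresponding MMA instruction extent.
constexpr bool is_multiple_of_mma(Shape3 cta, Shape3 mma) {
    return cta.m % mma.m == 0 && cta.n % mma.n == 0 && cta.k % mma.k == 0;
}

// Rule 2 (assumed reading): the contiguous dimension of the tile
// covers at least 128 bytes.
constexpr bool fills_128_bytes(int contiguous_elems, std::size_t elem_bytes) {
    return contiguous_elems > 0 &&
           static_cast<std::size_t>(contiguous_elems) * elem_bytes >= 128;
}

// Illustrative values only: a 128x128x64 CTA tile, a 64x8x16 MMA shape,
// and 2-byte elements with K as the contiguous dimension (64 * 2 = 128 bytes).
static_assert(is_multiple_of_mma(Shape3{128, 128, 64}, Shape3{64, 8, 16}),
              "CTA tile should be a multiple of the MMA shape");
static_assert(fills_128_bytes(64, 2),
              "contiguous dimension should cover 128 bytes");
```

With these assumptions, a 128x128x32 tile of 2-byte elements that is contiguous in K would fail the second check, because 32 elements cover only 64 bytes.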
Closing PR since it has been merged into 3.5.1 as [examples/cute/tutorial/wgmma_sm90.cu](https://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/wgmma_sm90.cu)
[examples/cute/tutorial/hopper/wgmma_sm90.cu](https://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/hopper/wgmma_sm90.cu) is mainly an educational example / tutorial. Please use the CUTLASS profiler to find the most performant config.
CuTeDSL requires a minimum of CTK 12.9, and the appropriate driver version is `>=525.60.13`. For the general driver-toolkit compatibility matrix, you can refer to https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id6. For more details...