Pradeep Ramani

Results 8 comments of Pradeep Ramani

The `BLOCK_M >= 128` requirement likely comes from the fact that only `cooperative` kernel support exists today in CUTLASS 3.5. If support for other kernel schedules is added (`tma_warpspecialized` or `tma_warpspecialized_pingpong`) -...

It's possible to do Row x Row, but it is not as performant, since we need to transpose on the fly. @hwu36

Could you share more info on exactly which C++ kernel is being picked in both cases? You may have to pick a custom tile size instead of what the...

The CTA tile is roughly decided based on the following:

* An integer multiple of MMA instruction shape M, N, K
* Having 128 bytes in the contiguous dimension (not...
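As a rough illustration of the two criteria visible above, the sketch below checks a candidate CTA tile against an MMA instruction shape and a 128-byte contiguous extent. It is not CUTLASS code: `Shape`, `is_multiple_of_mma`, `contiguous_bytes`, and `check_cta_tile` are hypothetical helper names, and it assumes K is the contiguous dimension (K-major operands) and that "128 bytes" means "at least 128 bytes". The original comment is truncated, so any further criteria are not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    """An (M, N, K) shape, used for both the CTA tile and the MMA instruction."""
    m: int
    n: int
    k: int


def is_multiple_of_mma(tile: Shape, mma: Shape) -> bool:
    """True if every tile extent is an integer multiple of the MMA extent."""
    return tile.m % mma.m == 0 and tile.n % mma.n == 0 and tile.k % mma.k == 0


def contiguous_bytes(tile_k: int, element_bits: int) -> int:
    """Bytes spanned by one tile row along K (assumed to be the contiguous dimension)."""
    return tile_k * element_bits // 8


def check_cta_tile(tile: Shape, mma: Shape, element_bits: int, min_bytes: int = 128) -> bool:
    """Apply both criteria: MMA-shape multiple and at least `min_bytes` contiguous bytes."""
    return is_multiple_of_mma(tile, mma) and contiguous_bytes(tile.k, element_bits) >= min_bytes


if __name__ == "__main__":
    mma = Shape(64, 128, 16)  # illustrative MMA instruction shape for 16-bit inputs
    for tile in (Shape(128, 128, 64), Shape(128, 128, 32), Shape(96, 128, 64)):
        print(tile, check_cta_tile(tile, mma, element_bits=16))
```

With 16-bit elements, a K extent of 64 gives exactly 128 contiguous bytes, so the first tile passes; the second fails on the byte criterion and the third fails because 96 is not a multiple of the MMA M extent.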

Closing PR since it has been merged into 3.5.1 as [examples/cute/tutorial/wgmma_sm90.cu](https://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/wgmma_sm90.cu)

[examples/cute/tutorial/hopper/wgmma_sm90.cu](https://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/hopper/wgmma_sm90.cu) is mainly an educational example / tutorial - please use the CUTLASS profiler for measuring the best / most performant config.

CuTeDSL requires a minimum of CTK 12.9, and the appropriate driver version is `>=525.60.13`. You can refer here for the driver-toolkit compatibility matrix in general: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id6 For more details...
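A minimal sketch of checking a driver version string against the `>=525.60.13` bound quoted above. The helper names `parse_version` and `driver_meets_minimum` are hypothetical, and the sketch assumes the driver version is a dot-separated string of integers; how you obtain that string is left to your environment.

```python
MIN_DRIVER = "525.60.13"  # lower bound quoted in the comment above


def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '535.104.05' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def driver_meets_minimum(installed: str, minimum: str = MIN_DRIVER) -> bool:
    """Compare versions component by component rather than as plain strings."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    for candidate in ("525.60.13", "535.104.05", "525.60.12"):
        print(candidate, driver_meets_minimum(candidate))
```

Comparing integer tuples avoids the pitfall of string comparison, where a component such as `104` would sort before `60`.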