Pradeep Ramani comments

Results: 8 comments of Pradeep Ramani

[FEA] FP8 grouped gemm kernel without TMA

The `BLOCK_M >= 128` requirement likely comes from the fact that only the `cooperative` kernel is supported today in CUTLASS 3.5. If support for other kernel schedules is added (`tma_warpspecialized` or `tma_warpspecialized_pingpong`) -...

Fix typos in convolution tests

@hwu36

[QST] Row major for int8 matrix multiplications?

It's possible to do Row x Row, but it is not as performant, since we need to transpose on the fly. @hwu36

[QST] Hopper mixed precision gemm always worse than FP8

Could you share more info on which exact C++ kernel is being picked in both cases? You may have to pick a custom tile size instead of what the...

Add GEMM Kernel Example for Hopper H100 Tensor Cores

The CTA tile is roughly decided based on the following:
* An integer multiple of the MMA instruction shape M, N, K
* Having 128 bytes in the contiguous dimension (not...
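The two criteria visible in the comment above can be sketched as compile-time checks. This is a minimal illustration, not CUTLASS code: `Shape3`, `is_multiple_of_mma`, and `contiguous_bytes` are hypothetical helper names, and the 64x8x16 instruction shape, the 128x128x64 tile, and the 2-byte element size are assumed example values. The rest of the list is truncated in the source, so any further criteria are not covered here.

```cpp
#include <cstddef>

// Hypothetical helper type: an (M, N, K) extent triple.
struct Shape3 {
    int m;
    int n;
    int k;
};

// Criterion 1: every CTA tile extent is an integer multiple of the
// corresponding MMA instruction extent.
constexpr bool is_multiple_of_mma(Shape3 tile, Shape3 mma) {
    return tile.m % mma.m == 0 && tile.n % mma.n == 0 && tile.k % mma.k == 0;
}

// Criterion 2 (as far as the comment states it): the number of bytes the
// tile spans along the contiguous dimension of an operand.
constexpr std::size_t contiguous_bytes(int contiguous_extent,
                                       std::size_t element_bytes) {
    return static_cast<std::size_t>(contiguous_extent) * element_bytes;
}

// Assumed example values, chosen only for illustration.
constexpr Shape3 kExampleMma{64, 8, 16};
constexpr Shape3 kExampleTile{128, 128, 64};
constexpr std::size_t kExampleElementBytes = 2;  // e.g. a 16-bit element

static_assert(is_multiple_of_mma(kExampleTile, kExampleMma),
              "tile extents must be multiples of the MMA extents");
// K is assumed to be the contiguous dimension: 64 elements * 2 bytes = 128.
static_assert(contiguous_bytes(kExampleTile.k, kExampleElementBytes) == 128,
              "expected 128 bytes along the contiguous dimension");
```

With these helpers, a tile such as 96x128x64 fails the first check against the assumed 64x8x16 instruction shape, because 96 is not a multiple of 64.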

Add GEMM Kernel Example for Hopper H100 Tensor Cores

Closing PR since it has been merged into 3.5.1 as [examples/cute/tutorial/wgmma_sm90.cu](https://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/wgmma_sm90.cu)

[QST] [WGMMA] Throughput Improvement

[examples/cute/tutorial/hopper/wgmma_sm90.cu](https://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/hopper/wgmma_sm90.cu) is mainly an educational example / tutorial. Please use the CUTLASS profiler for measuring the best / most performant config.

[QST] CuTe DSL Driver Requirements

CuTe DSL requires a minimum of CTK 12.9, and the appropriate driver version is `>=525.60.13`. You can refer here for the driver-toolkit compatibility matrix in general: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id6 For more details...
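As a rough sketch of how the `>=525.60.13` threshold could be checked against a driver version string, the snippet below compares dotted versions field by field. It does not query the driver; `DriverVersion`, `parse_driver_version`, and `driver_at_least` are hypothetical helpers, the installed-version strings are made-up inputs, and the minimum is simply the value quoted in the comment above. It assumes C++17 for `std::string_view`.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical holder for the three dotted fields of a driver version.
struct DriverVersion {
    int parts[3];
};

// Parse up to three dot-separated numeric fields; missing fields stay 0.
// Non-digit, non-dot characters are ignored.
constexpr DriverVersion parse_driver_version(std::string_view text) {
    DriverVersion version{{0, 0, 0}};
    std::size_t field = 0;
    for (std::size_t i = 0; i < text.size() && field < 3; ++i) {
        const char c = text[i];
        if (c == '.') {
            ++field;
        } else if (c >= '0' && c <= '9') {
            version.parts[field] = version.parts[field] * 10 + (c - '0');
        }
    }
    return version;
}

// True when `installed` is greater than or equal to `minimum`,
// comparing fields left to right.
constexpr bool driver_at_least(std::string_view installed,
                               std::string_view minimum) {
    const DriverVersion a = parse_driver_version(installed);
    const DriverVersion b = parse_driver_version(minimum);
    for (int i = 0; i < 3; ++i) {
        if (a.parts[i] != b.parts[i]) {
            return a.parts[i] > b.parts[i];
        }
    }
    return true;  // all fields equal
}

// Minimum taken from the comment above; the inputs are example strings.
constexpr std::string_view kMinDriver = "525.60.13";
static_assert(driver_at_least("535.104.05", kMinDriver), "newer driver passes");
static_assert(!driver_at_least("525.60.12", kMinDriver), "older patch fails");
```

Comparing numeric fields rather than raw strings avoids ordering mistakes such as treating "525.100" as older than "525.60".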