[QST] Is there an fp16xfp16 GEMM sample using CUTE with performance comparable to cuBLAS?

Open xiaonans opened this issue 1 year ago • 0 comments

What is your question? I want to write my own fused fp16xfp16 GEMM kernel with CUTE, but I cannot find a tutorial or sample code whose performance is comparable to cuBLAS.

I noticed the tutorials in https://github.com/NVIDIA/cutlass/tree/main/examples/cute/tutorial, which include fp32xfp32 and int8xint8 GEMMs, but the performance of the int8xint8 GEMM is not good enough. I also found a third-party fp16xfp16 GEMM written with CUTE at https://github.com/leimao/CUDA-GEMM-Optimization?tab=readme-ov-file, but as its README shows, its performance is not yet comparable to cuBLAS. Could CUTE provide an official fp16xfp16 GEMM kernel with good performance that I can build on?

Aug 06 '24 07:08 xiaonans