Add GEMM Kernel Example for Hopper H100 Tensor Cores
Summary:
This PR introduces a single-file example of a General Matrix Multiply (GEMM) CUDA kernel designed specifically for NVIDIA’s Hopper H100 tensor cores. The example leverages key components from CuTe and CUTLASS libraries to illustrate the essential concepts required for developing high-performance GEMM operations on this cutting-edge hardware.
Motivation:
Writing efficient GEMM kernels is critical for many high-performance computing (HPC) and deep learning applications. However, achieving this efficiency on tensor cores, particularly the latest Hopper H100, involves a steep learning curve due to the complexity of the hardware and software stack. This example serves as a concise, accessible introduction that cuts through several layers of software abstraction, demonstrating the pieces necessary to harness Hopper H100 tensor cores effectively.
Features:
- Concise implementation: ~100 lines of device code, making it easy to read and digest.
- Showcases several core concepts:
  - Software pipelining
  - TMA, warp-group MMA (wgmma), and async barrier instructions
  - Tiling of global and shared memory, including swizzling
  - Important CuTe / CUTLASS idioms and primitives
- Educational value: a simple baseline for quick extension and functional experimentation.
How do you decide the CTA tile sizes? I assume valid sizes are defined somewhere in the instruction documentation. Can you point me to where to look? https://github.com/IonThruster/cutlass/blob/3c70422f62d95f40aa1788de827fe8aedc8d888b/examples/cute/tutorial/wgmma_sm90.cu#L284C6-L284C36
The CTA tile is chosen roughly based on the following:
- An integer multiple of the MMA instruction shape in M, N, and K
- 128 bytes in the contiguous dimension (not required, but recommended)
- Two or more MMA_K blocks per tile, so multiple MMAs are in flight (we have 4 in this example)
Thank you very much for your example! Could you add a verification step to check that the CuTe result is correct, for example by adding a segment of code that calls cuBLAS to perform the same calculation?
half alpha_ = half(1.f);
half beta_  = half(0.f);

// Run the same GEMM through cuBLAS as a reference.
thrust::device_vector<TC> d_C_blas = h_C;
cublasHandle_t handle;
cublasCreate(&handle);
cublasStatus_t ret = cublasHgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                                 m, n, k,
                                 &alpha_,
                                 (half *)d_A.data().get(), ldA,
                                 (half *)d_B.data().get(), ldB,
                                 &beta_,
                                 (half *)d_C_blas.data().get(), ldC);
if (ret != CUBLAS_STATUS_SUCCESS) {
  printf("blas err = %d, str = %s\n", ret, cublasGetStatusString(ret));
}
cublasDestroy(handle);

// Compare element-wise against the CuTe result on the host.
thrust::host_vector<TC> blas_result = d_C_blas;
float threshold = 0.01f;
int err_cnt = 0;
for (int i = 0; i < m * n; ++i) {
  float v1 = cute_result[i];
  float v2 = blas_result[i];
  if (fabsf(v2 - v1) > threshold) {
    err_cnt++;
    printf("%d: cute_result = %f, blas_result = %f\n", i, v1, v2);
  }
}
if (err_cnt == 0) {
  printf("cute result matches blas result!\n");
}
Closing this PR since it has been merged into 3.5.1 as examples/cute/tutorial/wgmma_sm90.cu.