cusparseLtMatmul example is much slower than cublasGemmEx
Hi, guys,
I compiled the cuSPARSELt example code from https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSELt/matmul with the default problem size and used Nsight Systems to profile the execution. I found it launches many kernels, which makes the process slow:
cusparseLt:
cublas:
I then tried increasing the problem size m, n, k to 320, 320, 640; cuSPARSELt is still much slower:
cusparseLt:
cublas:
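For reference, I collected the traces with Nsight Systems roughly like this (the binary and output names are just placeholders for my local builds):
nsys profile -o cusparselt_matmul ./matmul_example
nsys profile -o cublas_gemm ./gemm_test
nsys stats cusparselt_matmul.nsys-rep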
I used libcusparseLt.so.0.6.3.2, installed via apt-get following the official guide. CUDA version: 12.2; hardware: NVIDIA A100.
I am also wondering whether it is expected that the libraries are installed in /usr/lib/x86_64-linux-gnu rather than in the CUDA directory.
Any advice is appreciated! Thanks.
@SimonSongg Could you double-check that the data types and layouts are the same in cuSPARSELt and cuBLAS?
Hi @j4yan, thanks for the reply. This is the code I used to test cuBLAS:
#include <iostream>
#include <cstdlib>
#include <cuda_runtime_api.h>
#include <cuda_fp16.h> // for __half
#include <cublas_v2.h>
using AB_t = __half;
using C_t = __half;
using COMPUTE_t = float;
#define CHECK_CUDA(call) \
{ \
cudaError_t err = (call); \
if (err != cudaSuccess) { \
std::cerr << "CUDA error: " << cudaGetErrorString(err) \
<< " (code " << err << ") at " << __LINE__ << std::endl; \
return EXIT_FAILURE; \
} \
}
#define CHECK_CUBLAS(call) \
{ \
cublasStatus_t err = (call); \
if (err != CUBLAS_STATUS_SUCCESS) { \
std::cerr << "cuBLAS error: " << err \
<< " at " << __LINE__ << std::endl; \
return EXIT_FAILURE; \
} \
}
int main() {
constexpr int m = 320;
constexpr int n = 320;
constexpr int k = 640;
float alpha = 1.0f;
float beta = 1.0f;
size_t sizeA = m * k * sizeof(__half);
size_t sizeB = k * n * sizeof(__half);
size_t sizeC = m * n * sizeof(__half);
auto hA = new AB_t[sizeA / sizeof(AB_t)];
auto hB = new AB_t[sizeB / sizeof(AB_t)];
auto hC = new C_t[sizeC / sizeof(C_t)];
// Initialize matrices
for (int i = 0; i < m * k; i++)
hA[i] = static_cast<AB_t>(static_cast<float>(std::rand() % 5 - 2)); // -2 ~ 2
for (int i = 0; i < k * n; i++)
hB[i] = static_cast<AB_t>(static_cast<float>(std::rand() % 5 - 2));
for (int i = 0; i < m * n; i++)
hC[i] = static_cast<C_t>(static_cast<float>(std::rand() % 5 - 2));
AB_t *dA, *dB; // device buffers hold __half data
C_t *dC;
CHECK_CUDA(cudaMalloc((void**)&dA, sizeA));
CHECK_CUDA(cudaMalloc((void**)&dB, sizeB));
CHECK_CUDA(cudaMalloc((void**)&dC, sizeC));
CHECK_CUDA(cudaMemcpy(dA, hA, sizeA, cudaMemcpyHostToDevice));
CHECK_CUDA(cudaMemcpy(dB, hB, sizeB, cudaMemcpyHostToDevice));
CHECK_CUDA(cudaMemcpy(dC, hC, sizeC, cudaMemcpyHostToDevice));
cublasHandle_t handle;
CHECK_CUBLAS(cublasCreate(&handle));
// CUDA Timer setup
cudaEvent_t start, stop;
CHECK_CUDA(cudaEventCreate(&start));
CHECK_CUDA(cudaEventCreate(&stop));
CHECK_CUDA(cudaEventRecord(start, nullptr));
CHECK_CUBLAS(cublasGemmEx(handle,
CUBLAS_OP_N, CUBLAS_OP_N,
m, n, k,
&alpha,
dA, CUDA_R_16F, m,
dB, CUDA_R_16F, k,
&beta,
dC, CUDA_R_16F, m,
CUDA_R_32F,
CUBLAS_GEMM_DEFAULT));
CHECK_CUDA(cudaEventRecord(stop, nullptr));
CHECK_CUDA(cudaEventSynchronize(stop));
float milliseconds = 0;
CHECK_CUDA(cudaEventElapsedTime(&milliseconds, start, stop));
std::cout << "Matrix multiplication took " << milliseconds << " ms" << std::endl;
CHECK_CUDA(cudaMemcpy(hC, dC, sizeC, cudaMemcpyDeviceToHost));
// Cleanup
CHECK_CUDA(cudaEventDestroy(start));
CHECK_CUDA(cudaEventDestroy(stop));
cudaFree(dA);
cudaFree(dB);
cudaFree(dC);
delete[] hA; // allocated with new[], so use delete[] instead of free()
delete[] hB;
delete[] hC;
cublasDestroy(handle);
return 0;
}
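For reference, I build and run it roughly like this (the file name is just what I use locally; sm_80 targets the A100):
nvcc -O3 -arch=sm_80 gemm_test.cu -o gemm_test -lcublas
./gemm_test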
I used FP16 to align with the example code provided for cuSPARSELt, and the conclusion is still the same.
I am wondering whether it is expected that so many kernels are launched during the execution of the cuSPARSELt example code (as I showed previously). It looks strange: I just copied and pasted the example code from this repository. Is there a bug in the example code that leads to this behavior?
Thanks!
Hi @j4yan, I found that if I set matmul_search=false, only one kernel is launched (as below), and the calculation result is correct.
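Roughly, the change is just skipping the search call in the sample. The snippet below is a sketch based on my local copy; the variable names (handle, plan, dA_compressed, dB, dC, dD, d_workspace, streams, num_streams) follow the sample and may differ slightly from upstream:
bool matmul_search = false; // skip auto-tuning
if (matmul_search) {
    // auto-tuning: launches many candidate kernels to find the fastest algorithm
    CHECK_CUSPARSE( cusparseLtMatmulSearch(&handle, &plan, &alpha,
                                           dA_compressed, dB, &beta,
                                           dC, dD, d_workspace,
                                           streams, num_streams) )
}
// the product itself is a single kernel launch
CHECK_CUSPARSE( cusparseLtMatmul(&handle, &plan, &alpha, dA_compressed, dB,
                                 &beta, dC, dD, d_workspace,
                                 streams, num_streams) )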
Many kernels are launched by cusparseLtMatmulSearch(); setting matmul_search=false disables that routine. For small problem sizes like 320 x 320 x 640 you probably observe much speedup against dense GEMM. I'd suggest timing many matmul calls using cudaEvent; a rough sketch is below. Also, the layouts of the dense GEMM (A and B are both column-major) are different from the layouts of the sparse GEMM (A is row-major and B is column-major).
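Something like this, assuming the sample's existing handle, plan, device buffers, and CHECK_CUSPARSE macro (the iteration count is arbitrary):
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
constexpr int iters = 100;
cudaEventRecord(start, nullptr);
for (int i = 0; i < iters; ++i) {
    // time only the matmul itself, after compression/setup (and any search) is done
    CHECK_CUSPARSE( cusparseLtMatmul(&handle, &plan, &alpha, dA_compressed, dB,
                                     &beta, dC, dD, d_workspace,
                                     streams, num_streams) )
}
cudaEventRecord(stop, nullptr);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("average cusparseLtMatmul time: %f ms\n", ms / iters);
cudaEventDestroy(start);
cudaEventDestroy(stop);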
Thanks for the reply.
Why does matmul_search make the GEMM launch so many kernels? Also, I did use the small problem size 320 x 320 x 640, and I used a for loop to run it 10 times; the latency seems similar to dense GEMM. Could it be due to the layout? I will check soon.
Thanks!
@SimonSongg cusparseLtMatmulSearch() is the auto-tuning API. Sorry, I meant that for very small sizes you won't observe much speedup.