cusparseLtMatmul example is much slower than cublasGemmEx
Hi, guys,
I compiled the cuSPARSELt example code from https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSELt/matmul with the default problem size and used Nsight Systems to profile the execution. I found it launches many kernels, which makes the process slow:
cusparseLt:
cublas:
I then tried increasing the problem size m, n, k to 320, 320, 640; cuSPARSELt is still much slower:
cusparseLt:
cublas:
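For reference, I collected the traces with Nsight Systems roughly like this (the binary and output names are just placeholders for my local builds):
nsys profile -o cusparselt_matmul ./matmul_example
nsys profile -o cublas_gemm ./gemm_test
nsys stats cusparselt_matmul.nsys-rep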
I used libcusparseLt.so.0.6.3.2, installed via apt-get following the official guide. CUDA version: 12.2; hardware: NVIDIA A100.
I am also wondering whether it is expected that the libraries are installed in /usr/lib/x86_64-linux-gnu rather than in the CUDA directory.
Any advice is appreciated! Thanks.
@SimonSongg Could you double-check that the data types and layouts are the same in cuSPARSELt and cuBLAS?
Hi @j4yan, thanks for the reply. This is the code I used to test cuBLAS:
#include <iostream>
#include <cstdlib>
#include <cuda_runtime_api.h>
#include <cuda_fp16.h> // for __half
#include <cublas_v2.h>
using AB_t = __half;
using C_t = __half;
using COMPUTE_t = float;
#define CHECK_CUDA(call) \
{ \
cudaError_t err = (call); \
if (err != cudaSuccess) { \
std::cerr << "CUDA error: " << cudaGetErrorString(err) \
<< " (code " << err << ") at " << __LINE__ << std::endl; \
return EXIT_FAILURE; \
} \
}
#define CHECK_CUBLAS(call) \
{ \
cublasStatus_t err = (call); \
if (err != CUBLAS_STATUS_SUCCESS) { \
std::cerr << "cuBLAS error: " << err \
<< " at " << __LINE__ << std::endl; \
return EXIT_FAILURE; \
} \
}
int main() {
constexpr int m = 320;
constexpr int n = 320;
constexpr int k = 640;
float alpha = 1.0f;
float beta = 1.0f;
size_t sizeA = m * k * sizeof(__half);
size_t sizeB = k * n * sizeof(__half);
size_t sizeC = m * n * sizeof(__half);
auto hA = new AB_t[sizeA / sizeof(AB_t)];
auto hB = new AB_t[sizeB / sizeof(AB_t)];
auto hC = new C_t[sizeC / sizeof(C_t)];
// Initialize matrices
for (int i = 0; i < m * k; i++)
hA[i] = static_cast<AB_t>(static_cast<float>(std::rand() % 5 - 2)); // -2 ~ 2
for (int i = 0; i < k * n; i++)
hB[i] = static_cast<AB_t>(static_cast<float>(std::rand() % 5 - 2));
for (int i = 0; i < m * n; i++)
hC[i] = static_cast<C_t>(static_cast<float>(std::rand() % 5 - 2));
AB_t *dA, *dB; // device buffers hold __half data
C_t *dC;
CHECK_CUDA(cudaMalloc((void**)&dA, sizeA));
CHECK_CUDA(cudaMalloc((void**)&dB, sizeB));
CHECK_CUDA(cudaMalloc((void**)&dC, sizeC));
CHECK_CUDA(cudaMemcpy(dA, hA, sizeA, cudaMemcpyHostToDevice));
CHECK_CUDA(cudaMemcpy(dB, hB, sizeB, cudaMemcpyHostToDevice));
CHECK_CUDA(cudaMemcpy(dC, hC, sizeC, cudaMemcpyHostToDevice));
cublasHandle_t handle;
CHECK_CUBLAS(cublasCreate(&handle));
// CUDA Timer setup
cudaEvent_t start, stop;
CHECK_CUDA(cudaEventCreate(&start));
CHECK_CUDA(cudaEventCreate(&stop));
CHECK_CUDA(cudaEventRecord(start, nullptr));
CHECK_CUBLAS(cublasGemmEx(handle,
CUBLAS_OP_N, CUBLAS_OP_N,
m, n, k,
&alpha,
dA, CUDA_R_16F, m,
dB, CUDA_R_16F, k,
&beta,
dC, CUDA_R_16F, m,
CUDA_R_32F,
CUBLAS_GEMM_DEFAULT));
CHECK_CUDA(cudaEventRecord(stop, nullptr));
CHECK_CUDA(cudaEventSynchronize(stop));
float milliseconds = 0;
CHECK_CUDA(cudaEventElapsedTime(&milliseconds, start, stop));
std::cout << "Matrix multiplication took " << milliseconds << " ms" << std::endl;
CHECK_CUDA(cudaMemcpy(hC, dC, sizeC, cudaMemcpyDeviceToHost));
// Cleanup
CHECK_CUDA(cudaEventDestroy(start));
CHECK_CUDA(cudaEventDestroy(stop));
cudaFree(dA);
cudaFree(dB);
cudaFree(dC);
delete[] hA; // allocated with new[], so use delete[] instead of free()
delete[] hB;
delete[] hC;
cublasDestroy(handle);
return 0;
}
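For reference, I build and run it roughly like this (the file name is just what I use locally; sm_80 targets the A100):
nvcc -O3 -arch=sm_80 gemm_test.cu -o gemm_test -lcublas
./gemm_test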
I used FP16 to align with the example code provided for cuSPARSELt, and the conclusion is still the same.
I am wondering whether it is expected that so many kernels are launched during the execution of the cuSPARSELt example code (as I showed previously). It looks strange: I just copied and pasted the example code from this repository. Is there a bug in the example code that leads to this behavior?
Thanks!
Hi @j4yan, I found that if I set matmul_search=false, only one kernel is launched (as below), and the calculation result is correct.
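Roughly, the change is just skipping the search call in the sample. The snippet below is a sketch based on my local copy; the variable names (handle, plan, dA_compressed, dB, dC, dD, d_workspace, streams, num_streams) follow the sample and may differ slightly from upstream:
bool matmul_search = false; // skip auto-tuning
if (matmul_search) {
    // auto-tuning: launches many candidate kernels to find the fastest algorithm
    CHECK_CUSPARSE( cusparseLtMatmulSearch(&handle, &plan, &alpha,
                                           dA_compressed, dB, &beta,
                                           dC, dD, d_workspace,
                                           streams, num_streams) )
}
// the product itself is a single kernel launch
CHECK_CUSPARSE( cusparseLtMatmul(&handle, &plan, &alpha, dA_compressed, dB,
                                 &beta, dC, dD, d_workspace,
                                 streams, num_streams) )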
Many kernels are launched by cusparseLtMatmulSearch(); setting matmul_search=false disables that routine. For small problem sizes like 320 x 320 x 640 you probably observe much speedup against dense GEMM. I'd suggest timing many matmul calls using cudaEvent; a rough sketch is below. Also, the layouts of the dense GEMM (A and B are both column-major) are different from the layouts of the sparse GEMM (A is row-major and B is column-major).
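Something like this, assuming the sample's existing handle, plan, device buffers, and CHECK_CUSPARSE macro (the iteration count is arbitrary):
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
constexpr int iters = 100;
cudaEventRecord(start, nullptr);
for (int i = 0; i < iters; ++i) {
    // time only the matmul itself, after compression/setup (and any search) is done
    CHECK_CUSPARSE( cusparseLtMatmul(&handle, &plan, &alpha, dA_compressed, dB,
                                     &beta, dC, dD, d_workspace,
                                     streams, num_streams) )
}
cudaEventRecord(stop, nullptr);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("average cusparseLtMatmul time: %f ms\n", ms / iters);
cudaEventDestroy(start);
cudaEventDestroy(stop);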
Thanks for the reply.
Why does matmul_search make the GEMM launch so many kernels? Also, I did use the small problem size 320 x 320 x 640, and I used a for loop to run it 10 times; the latency seems similar to dense GEMM. Could it be due to the layout? I will check soon.
Thanks!
@SimonSongg cusparseLtMatmulSearch() is the auto-tuning API. Sorry, I meant that for very small sizes you won't observe much speedup.