Jianfeng Yan
@SimonSongg Could you double check the data types and layouts are the same in cuSPARSELt and cuBLAS?
Many kernels are launched by cusparseLtMatmulSearch(); setting matmul_search=false disables this routine. For small problem sizes like 320 x 320 x 640 you probably observe much speedup against dense...
@SimonSongg cusparseLtMatmulSearch() is the auto-tuning API. Sorry I mean for very small sizes you **won't** observe much speedup.
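The usage pattern behind the two comments above can be sketched roughly as follows. This is a non-compilable fragment, not a complete program: it assumes a handle, plan, and device buffers were already created through the usual cuSPARSELt init/prune/compress sequence, and the variable names (dA_compressed, d_workspace, etc.) are hypothetical.

```cpp
// Sketch only. cusparseLtMatmulSearch() runs many candidate kernels (the
// extra launches visible in the profiler) and records the best algorithm in
// the plan; subsequent cusparseLtMatmul() calls reuse that choice.
float alpha = 1.0f, beta = 0.0f;
if (matmul_search) {
    // Auto-tune once, typically during warm-up.
    cusparseLtMatmulSearch(&handle, &plan, &alpha, dA_compressed, dB,
                           &beta, dC, dD, d_workspace, streams, num_streams);
}
// Steady-state calls use the (possibly tuned) plan with no extra launches.
cusparseLtMatmul(&handle, &plan, &alpha, dA_compressed, dB,
                 &beta, dC, dD, d_workspace, streams, num_streams);
```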
@Septend-fun Is it possible to make a reproducer?
Hi @Zor-X-L In order to use fp4 you have to specify the scale modes of A/B/output matrices and the corresponding scale pointers; see https://docs.nvidia.com/cuda/cusparselt/types.html#cusparseltmatmuldescattribute-t. Could you give it a try?...
Hi @Zor-X-L 1. Could you try half of CompressedSize() for compressed_size? This is actually a bug for fp4 and will be fixed in the next release. 2. Yes, it's the right way...
@Zor-X-L 1. You are right. Half of the CompressedSize() is not correct because it also halves the amount of metadata. 2. Yes please try batched. I can't think of any...
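For reference, the safe pattern is to let the library report the sizes rather than scaling them by hand. A rough, non-compilable sketch (assumes an initialized handle/plan and a dense device matrix dA; names are hypothetical):

```cpp
// Sketch only. The compressed buffer holds both the packed nonzeros AND the
// 2:4 sparsity metadata, which is why simply halving the queried size for
// fp4 under-allocates the metadata portion.
size_t compressed_size = 0, compressed_buffer_size = 0;
cusparseLtSpMMACompressedSize(&handle, &plan,
                              &compressed_size, &compressed_buffer_size);
void *dA_compressed = nullptr, *dA_compressed_buffer = nullptr;
cudaMalloc(&dA_compressed, compressed_size);
cudaMalloc(&dA_compressed_buffer, compressed_buffer_size);
cusparseLtSpMMACompress(&handle, &plan, dA, dA_compressed,
                        dA_compressed_buffer, stream);
```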
@zhoeujei 1. In [the documentation](https://docs.nvidia.com/cuda/cusparselt/#cusparselt-a-high-performance-cuda-library-for-sparse-matrix-matrix-multiplication), "NVIDIA cuSPARSELt is a high-performance CUDA library dedicated to general matrix-matrix operations in which **at least** one operand is a sparse matrix" 2. Currently only...
@zhoeujei Just to follow up. Does the above reply resolve your issue? If yes, let's close.
@JanuszL I think we can close.