
cuSPARSELt FP4 Issue on RTX 5090

Zor-X-L opened this issue 10 months ago • 6 comments

I'm trying to achieve the "3352 Effective AI TOPS / TFLOPS using the Sparsity Feature" figure on the RTX 5090.

My code (matmul_fp4.cpp.txt) is based on the cuSPARSELt example 1. It works well with FP16, INT8, and FP8, but not FP4.

With m=n=k=32, cusparseLtMatmul failed with an internal error (7): matmul_fp4-5.log

With m=n=k>=64, cusparseLtMatmulSearch failed with an internal error (7): matmul_fp4-6.log

Interestingly, compressed_size is actually bigger than A_size.

What am I doing wrong?

P.S. nvidia-smi.log

Zor-X-L commented on Mar 14, 2025

Hi @Zor-X-L, in order to use FP4 you have to specify the scale modes of the A/B/output matrices and the corresponding scale pointers; see https://docs.nvidia.com/cuda/cusparselt/types.html#cusparseltmatmuldescattribute-t. Could you give it a try? If you still can't get it working, I can share some code snippets.
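Roughly, the setup looks like this. This is a sketch only: `handle` and `matmul` are the `cusparseLtHandle_t` and `cusparseLtMatmulDescriptor_t` from the sample, `CHECK_CUDA`/`CHECK_CUSPARSE` are the sample's error macros, and the attribute enum names and scale-mode value below are placeholders that must be replaced with the actual values from `cusparseLtMatmulDescAttribute_t` on the linked page.

```cpp
// Sketch only: set the scale mode and the device scale pointer for A and B
// (and similarly for the output) on the matmul descriptor before creating the
// plan. The attribute names used here are PLACEHOLDERS; take the real enum
// values from cusparseLtMatmulDescAttribute_t in the docs linked above.
float* dA_scale   = nullptr;   // device buffer holding A's scale(s)
float* dB_scale   = nullptr;   // device buffer holding B's scale(s)
int    scale_mode = 0;         // placeholder: pick the FP4 scale mode from the docs
CHECK_CUDA( cudaMalloc(&dA_scale, sizeof(float)) )
CHECK_CUDA( cudaMalloc(&dB_scale, sizeof(float)) )

CHECK_CUSPARSE( cusparseLtMatmulDescSetAttribute(&handle, &matmul,
                    /* placeholder */ CUSPARSELT_MATMUL_A_SCALE_MODE,
                    &scale_mode, sizeof(scale_mode)) )
CHECK_CUSPARSE( cusparseLtMatmulDescSetAttribute(&handle, &matmul,
                    /* placeholder */ CUSPARSELT_MATMUL_B_SCALE_MODE,
                    &scale_mode, sizeof(scale_mode)) )
CHECK_CUSPARSE( cusparseLtMatmulDescSetAttribute(&handle, &matmul,
                    /* placeholder */ CUSPARSELT_MATMUL_A_SCALE_POINTER,
                    &dA_scale, sizeof(dA_scale)) )
CHECK_CUSPARSE( cusparseLtMatmulDescSetAttribute(&handle, &matmul,
                    /* placeholder */ CUSPARSELT_MATMUL_B_SCALE_POINTER,
                    &dB_scale, sizeof(dB_scale)) )
```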

For small problems like 64x64x64, compressed_size can be larger than A_size.

j4yan commented on Mar 18, 2025

Hi @j4yan, thanks for the help. The program worked after I specified the scales of A and B.

But I can only get around 1680-1740 TFLOPS (with 8192x8192x8192), far from 3300 TFLOPS.
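For reference, here is the sanity check I use for those numbers (my assumption: the benchmark counts the dense-equivalent 2*m*n*k FLOPs per matmul; the inputs are taken from one of the 8192^3 lines in the log below):

```cpp
#include <cstdio>

int main() {
    // Dense-equivalent FLOP count for one log entry below:
    // m = n = k = 8192, repeat = 100, elapsed = 65.23 ms (total over all repeats)
    const double m = 8192, n = 8192, k = 8192, repeat = 100;
    const double seconds = 65.23e-3;
    const double tflops  = 2.0 * m * n * k * repeat / seconds / 1e12;
    std::printf("%.0f TFLOPS\n", tflops);   // ~1686, matching the logged value
    return 0;
}
```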

Is the performance expected right now?

Things that I am not quite certain about:

  1. A_size is consistently smaller than compressed_size. Is this alright?
  2. I packed the data of A and B into __nv_fp4x2_e2m1 (two FP4 values per byte); see the sketch after this list. Is this the right thing to do?
  3. For now, D is actually C (in-place accumulation) and the data type of C and D is FP16. From the documentation there is another mode where C is FP16 and D is FP4. Would that mode be a lot faster? I'm trying it, but the preliminary performance results are about the same.
  4. Would doing multiple matrix multiplications simultaneously help?
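Regarding point 2, this is roughly how I pack the host data. It is a minimal sketch assuming CUDA 12.8's <cuda_fp4.h>; the float2 constructor of __nv_fp4x2_e2m1 and the low/high nibble ordering are assumptions to be checked against the header, and pack_fp4 is just an illustrative helper, not part of the benchmark.

```cpp
#include <cuda_fp4.h>   // __nv_fp4x2_e2m1, CUDA 12.8+
#include <vector>

// Two E2M1 values per byte, so an m x k FP4 matrix occupies m*k/2 bytes.
std::vector<__nv_fp4x2_e2m1> pack_fp4(const std::vector<float>& src) {
    std::vector<__nv_fp4x2_e2m1> dst(src.size() / 2);
    for (size_t i = 0; i < dst.size(); ++i) {
        // Converts a pair of floats into one packed FP4x2 byte; which element
        // lands in the low vs. high nibble is defined by the header.
        dst[i] = __nv_fp4x2_e2m1(float2{src[2 * i], src[2 * i + 1]});
    }
    return dst;
}
```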

source code: matmul_fp4.cpp

log:

[email protected]:~/cuda-matmul-bench$ ./matmul_fp4
Usage: ./matmul_fp4 [dim_index_start] [dim_index_end] [repeat]
 default values: dim_index_start=0, dim_index_end=16, repeat=1
A_size=8192, compressed_size=10240, compressed_buffer_size=98368
dim_index=0, m=128, n=128, k=128, repeat=1, milliseconds=0.009280, tflops=0.451972
matmul_example test PASSED
A_size=32768, compressed_size=36864, compressed_buffer_size=98368
dim_index=1, m=256, n=256, k=256, repeat=1, milliseconds=0.007360, tflops=4.559026
matmul_example test PASSED
A_size=73728, compressed_size=86016, compressed_buffer_size=98368
dim_index=2, m=384, n=384, k=384, repeat=1, milliseconds=0.008320, tflops=13.611323
matmul_example test PASSED
A_size=131072, compressed_size=147456, compressed_buffer_size=98368
dim_index=3, m=512, n=512, k=512, repeat=1, milliseconds=0.015744, tflops=17.050016
matmul_example test PASSED
A_size=294912, compressed_size=331776, compressed_buffer_size=98368
dim_index=4, m=768, n=768, k=768, repeat=1, milliseconds=0.009664, tflops=93.746864
A_size=524288, compressed_size=589824, compressed_buffer_size=98368
dim_index=5, m=1024, n=1024, k=1024, repeat=1, milliseconds=0.008896, tflops=241.398792
A_size=991232, compressed_size=1126400, compressed_buffer_size=98368
dim_index=6, m=1408, n=1408, k=1408, repeat=1, milliseconds=0.012800, tflops=436.142080
A_size=2097152, compressed_size=2359296, compressed_buffer_size=98368
dim_index=7, m=2048, n=2048, k=2048, repeat=1, milliseconds=0.022592, tflops=760.440357
A_size=4333568, compressed_size=4898816, compressed_buffer_size=98368
dim_index=8, m=2944, n=2944, k=2944, repeat=1, milliseconds=0.050240, tflops=1015.766254
A_size=8388608, compressed_size=9437184, compressed_buffer_size=98368
dim_index=9, m=4096, n=4096, k=4096, repeat=1, milliseconds=0.100128, tflops=1372.632515
A_size=16588800, compressed_size=18708480, compressed_buffer_size=98368
dim_index=10, m=5760, n=5760, k=5760, repeat=1, milliseconds=0.274272, tflops=1393.528930
A_size=33554432, compressed_size=37748736, compressed_buffer_size=98368
dim_index=11, m=8192, n=8192, k=8192, repeat=1, milliseconds=0.674048, tflops=1631.206769
A_size=67837952, compressed_size=76410880, compressed_buffer_size=98368
dim_index=12, m=11648, n=11648, k=11648, repeat=1, milliseconds=2.113888, tflops=1495.209738
A_size=134217728, compressed_size=150994944, compressed_buffer_size=98368
dim_index=13, m=16384, n=16384, k=16384, repeat=1, milliseconds=10.116544, tflops=869.476073
A_size=268378112, compressed_size=302110720, compressed_buffer_size=98368
dim_index=14, m=23168, n=23168, k=23168, repeat=1, milliseconds=33.278336, tflops=747.367170
A_size=536870912, compressed_size=603979776, compressed_buffer_size=98368
dim_index=15, m=32768, n=32768, k=32768, repeat=1, milliseconds=89.622368, tflops=785.169449
A_size=1073512448, compressed_size=1207701504, compressed_buffer_size=98368
dim_index=16, m=46336, n=46336, k=46336, repeat=1, milliseconds=259.053986, tflops=768.060359
[email protected]:~/cuda-matmul-bench$ ./matmul_fp4 11 11 10
Usage: ./matmul_fp4 [dim_index_start] [dim_index_end] [repeat]
 default values: dim_index_start=0, dim_index_end=16, repeat=1
A_size=33554432, compressed_size=37748736, compressed_buffer_size=98368
dim_index=11, m=8192, n=8192, k=8192, repeat=10, milliseconds=6.615072, tflops=1662.131110
[email protected]:~/cuda-matmul-bench$ ./matmul_fp4 11 11 100
Usage: ./matmul_fp4 [dim_index_start] [dim_index_end] [repeat]
 default values: dim_index_start=0, dim_index_end=16, repeat=1
A_size=33554432, compressed_size=37748736, compressed_buffer_size=98368
dim_index=11, m=8192, n=8192, k=8192, repeat=100, milliseconds=65.179779, tflops=1686.890742
[email protected]:~/cuda-matmul-bench$ ./matmul_fp4 11 11 1000
Usage: ./matmul_fp4 [dim_index_start] [dim_index_end] [repeat]
 default values: dim_index_start=0, dim_index_end=16, repeat=1
A_size=33554432, compressed_size=37748736, compressed_buffer_size=98368
dim_index=11, m=8192, n=8192, k=8192, repeat=1000, milliseconds=690.939331, tflops=1591.328768
[email protected]:~/cuda-matmul-bench$ ./matmul_fp4 11 11 50
Usage: ./matmul_fp4 [dim_index_start] [dim_index_end] [repeat]
 default values: dim_index_start=0, dim_index_end=16, repeat=1
A_size=33554432, compressed_size=37748736, compressed_buffer_size=98368
dim_index=11, m=8192, n=8192, k=8192, repeat=50, milliseconds=32.795200, tflops=1676.330140
[email protected]:~/cuda-matmul-bench$ ./matmul_fp4 11 11 500
Usage: ./matmul_fp4 [dim_index_start] [dim_index_end] [repeat]
 default values: dim_index_start=0, dim_index_end=16, repeat=1
A_size=33554432, compressed_size=37748736, compressed_buffer_size=98368
dim_index=11, m=8192, n=8192, k=8192, repeat=500, milliseconds=336.627838, tflops=1633.126449
[email protected]:~/cuda-matmul-bench$ ./matmul_fp4 11 11 100
Usage: ./matmul_fp4 [dim_index_start] [dim_index_end] [repeat]
 default values: dim_index_start=0, dim_index_end=16, repeat=1
A_size=33554432, compressed_size=37748736, compressed_buffer_size=98368
dim_index=11, m=8192, n=8192, k=8192, repeat=100, milliseconds=65.227615, tflops=1685.653553

Zor-X-L commented on Mar 20, 2025

Hi @Zor-X-L

  1. Could you try half of CompressedSize() for compressed_size? This is actually a bug for FP4 and will be fixed in the next release.
  2. Yes, packing A/B into __nv_fp4x2_e2m1 is the right way.
  3. I don't expect D=FP4 to be significantly faster than D=FP16. The instructions for these two modes have the same throughput, and the overhead of loading/storing D can be mostly hidden.
  4. Do you mean batched GEMM? If so, it's easier to achieve higher throughput, especially for small and medium sizes, due to higher SM utilization; a rough setup sketch follows below.
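If you try it, the setup is along these lines. Sketch only: matA/matB/matC, handle, and CHECK_CUSPARSE are the variables and macros from the sample, and the attribute names and stride semantics should be verified against cusparseLtMatDescAttribute_t in the docs for your version.

```cpp
// Sketch: batching in cuSPARSELt is configured per matrix descriptor.
// The strides are the distance between consecutive matrices of the batch.
int     num_batches    = 8;                 // example batch count
int64_t batch_stride_A = (int64_t)m * k;    // in elements (verify against the docs)
int64_t batch_stride_B = (int64_t)k * n;
int64_t batch_stride_C = (int64_t)m * n;

CHECK_CUSPARSE( cusparseLtMatDescSetAttribute(&handle, &matA,
                    CUSPARSELT_MAT_NUM_BATCHES, &num_batches, sizeof(num_batches)) )
CHECK_CUSPARSE( cusparseLtMatDescSetAttribute(&handle, &matB,
                    CUSPARSELT_MAT_NUM_BATCHES, &num_batches, sizeof(num_batches)) )
CHECK_CUSPARSE( cusparseLtMatDescSetAttribute(&handle, &matC,
                    CUSPARSELT_MAT_NUM_BATCHES, &num_batches, sizeof(num_batches)) )
CHECK_CUSPARSE( cusparseLtMatDescSetAttribute(&handle, &matA,
                    CUSPARSELT_MAT_BATCH_STRIDE, &batch_stride_A, sizeof(batch_stride_A)) )
CHECK_CUSPARSE( cusparseLtMatDescSetAttribute(&handle, &matB,
                    CUSPARSELT_MAT_BATCH_STRIDE, &batch_stride_B, sizeof(batch_stride_B)) )
CHECK_CUSPARSE( cusparseLtMatDescSetAttribute(&handle, &matC,
                    CUSPARSELT_MAT_BATCH_STRIDE, &batch_stride_C, sizeof(batch_stride_C)) )
```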

As for the performance, I don't have an RTX 5090 at hand, so I can't tell whether the number is expected or not. I will update once I have access to one.

j4yan commented on Mar 25, 2025

Hi @j4yan,

  1. Half of CompressedSize() is 56.25% of A_size for matrix sizes >= 256, and I got an internal error from MatmulSearch() at matrix size 4096. From the PTX docs I think compressed_size should be at least 62.5% of A_size, but that value doesn't work either; I don't know why. If it doesn't affect the performance, I'll let it go for now.
  2. Maybe I should try batched GEMM. Are there any other ways to achieve higher TFLOPS?
  3. With an RTX 5070 Ti, the best performance I can get is 874 TFLOPS (about 62% of the peak according to the whitepaper). When repeating the multiplication for about a minute, it drops to about 810 TFLOPS. nvidia-smi shows the power usage is either 299 W or 300 W, with the power cap at 300 W. The situation with the RTX 5090 is about the same: the power usage is either 599 W or 600 W. Does this mean the performance is limited by the power cap and cannot be improved further?

Zor-X-L commented on Mar 26, 2025

@Zor-X-L

  1. You are right. Half of the CompressedSize() is not correct because it also halves the amount of metadata.
  2. Yes, please try batched GEMM. I can't think of any other ways to get better throughput. We'll keep optimizing the performance.
  3. The drop is caused by power throttling. We don't expect the initial performance to be sustained: due to the power limit, the clocks will get lower. You can monitor the power draw and clocks with nvidia-smi --query-gpu or NVML.
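For example, something like this minimal NVML polling loop (link with -lnvidia-ml; the nvidia-smi equivalent is roughly `nvidia-smi --query-gpu=power.draw,clocks.sm --format=csv -l 1`):

```cpp
#include <nvml.h>
#include <cstdio>
#include <unistd.h>

// Prints power draw and SM clock of GPU 0 once per second for one minute.
int main() {
    nvmlDevice_t dev;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) return 1;
    for (int i = 0; i < 60; ++i) {
        unsigned int power_mw = 0, sm_mhz = 0;
        nvmlDeviceGetPowerUsage(dev, &power_mw);             // milliwatts
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_mhz); // MHz
        std::printf("power = %.1f W, SM clock = %u MHz\n", power_mw / 1000.0, sm_mhz);
        sleep(1);
    }
    nvmlShutdown();
    return 0;
}
```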

j4yan commented on Apr 4, 2025

@Zor-X-L I also want to know whether there is any new progress on this issue, and how to achieve 3352 TOPS (FP4 with sparsity) on the RTX 5090. Thanks!

Andy1314Chen commented on Jun 29, 2025