
[QST] Memory-bound nvfp4 grouped gemm

Open divchenko opened this issue 9 months ago • 4 comments

I'm trying to get the best memory bandwidth out of a B200 nvfp4 grouped (ptr-based) gemm. I'm running example 75 with:

./75_blackwell_grouped_gemm_block_scaled --m=16 --n=2048 --k=7168 --groups=32
./75_blackwell_grouped_gemm_block_scaled --m=16 --n=7168 --k=2048 --groups=32

I tweaked tiles (not much to choose from there) and clusters. No matter what, I'm not getting much higher than 3 TB/sec; I would like to get closer to 6 TB/sec. This is a fairly large gemm with a runtime over 60 us, so it's far away from kernel launch overheads.

If I run the same gemm with mxfp8, memory bandwidth is much closer to my desired 6 TB/sec.

divchenko avatar Apr 27 '25 01:04 divchenko
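Editor's note: the bandwidth figures above can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes nvfp4 operands at 0.5 byte/element, one e4m3 scale byte per 16-element block, and a bf16 output; these are plausible defaults for example 75 but are assumptions, not taken from the thread.

```python
# Back-of-envelope DRAM traffic for
# ./75_blackwell_grouped_gemm_block_scaled --m=16 --n=2048 --k=7168 --groups=32
# Assumptions (not confirmed by the thread): fp4 operands at 0.5 byte/element,
# one e4m3 scale byte per 16-element block, bf16 output.
m, n, k, groups = 16, 2048, 7168, 32

a_bytes = m * k // 2                # fp4 A operand
b_bytes = n * k // 2                # fp4 B operand
sf_bytes = (m * k + n * k) // 16    # e4m3 scale factors, block size 16
d_bytes = m * n * 2                 # bf16 output

per_group = a_bytes + b_bytes + sf_bytes + d_bytes
total = per_group * groups
print(f"total traffic: {total / 1e6:.1f} MB")  # ~268.4 MB

# Achieved bandwidth for a few hypothetical runtimes: ~89 us matches the
# reported ~3 TB/sec, and 6 TB/sec would need the kernel to finish in ~45 us.
for runtime_us in (45, 60, 89):
    bw = total / (runtime_us * 1e-6) / 1e12
    print(f"{runtime_us} us -> {bw:.2f} TB/s")
```

Under these assumptions the problem moves about 268 MB per launch, so the reported numbers imply a runtime near 89 us rather than the ~45 us that 6 TB/sec would require.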

I also see the same pattern for a non-grouped gemm (same --m and --k as above, but with a very large --n to ensure there is enough data to load).

divchenko avatar Apr 28 '25 05:04 divchenko

Grouped gemm is supported in the profiler; you could use the CUTLASS profiler to pick the best kernel.

cc += @ANIKET-SHIVAM

hwu36 avatar Apr 29 '25 02:04 hwu36

Update: I've transposed the gemm, i.e. using TNN instead of TNT with a 256x64x256 tile on 2 SMs, so that the N dimension is the smallest. This gives slightly better perf: I can see it climbing closer to 4 TB/sec, but that's still well short of my 6 TB/sec target.

divchenko avatar Apr 29 '25 06:04 divchenko
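Editor's note: a crude tile-reuse model helps explain why orientation matters here. With a CTA tile (TM, TN), operand A is streamed from DRAM roughly ceil(n/TN) times and B roughly ceil(m/TM) times. The tile shapes below are illustrative (128x128 for the original orientation is a guess; 256x64 comes from the comment above), and the model ignores L2 reuse and scale-factor traffic.

```python
import math

def traffic(m, n, k, tm, tn, elem_bytes=0.5, out_bytes=2):
    """Crude DRAM-traffic model: each operand is re-read once per tile
    row/column of the other output dimension. Ignores L2 caching and
    scale-factor bytes."""
    a = m * k * elem_bytes * math.ceil(n / tn)  # A re-read per tile column
    b = n * k * elem_bytes * math.ceil(m / tm)  # B re-read per tile row
    d = m * n * out_bytes                       # output written once
    return a + b + d

# Original orientation (m=16, n=2048) with a hypothetical 128x128 tile:
# the tiny A operand is re-read 16x, but it is small, so the penalty is modest.
orig = traffic(16, 2048, 7168, 128, 128)

# Transposed orientation (m=2048, n=16) with the 256x64 tile mentioned above:
# the large operand is now read exactly once.
swapped = traffic(2048, 16, 7168, 256, 64)

print(f"original:   {orig / 1e6:.2f} MB per group")
print(f"transposed: {swapped / 1e6:.2f} MB per group")
```

In this model both orientations stay within roughly 12% of the ~7.4 MB minimum per group, which suggests the remaining gap to 6 TB/sec is probably not caused by redundant operand loads alone.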

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar May 29 '25 07:05 github-actions[bot]

@hwu36 @ANIKET-SHIVAM, it looks like the CUTLASS profiler does not support nvfp4 grouped gemm. Below is my command:

tools/profiler/cutlass_profiler --operation=grouped_gemm --m=4096 --n=4096 --k=4096 --num_groups=4 --runtime_input_datatype_a=e2m1 --runtime_input_datatype_b=e2m1

Any ideas why CUTLASS nvfp4 grouped gemm is slow in memory-bound cases?

fei-xx avatar Aug 17 '25 04:08 fei-xx

BTW, does blackwell_grouped_gemm_block_scaled support split-k or sliced-k?

fei-xx avatar Aug 18 '25 00:08 fei-xx
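Editor's note: for context on the split-k question, split-k (and stream-k) mainly matter when the output has too few tiles to fill the machine, which is the usual worry for skinny shapes like m=16. A quick occupancy check, where both the 128x128 tile and the 148-SM count are illustrative assumptions rather than figures from the thread:

```python
import math

m, n, groups = 16, 2048, 32
tm, tn = 128, 128   # hypothetical CTA tile shape
sms = 148           # illustrative SM count; check your actual device

# One tile row (16 <= 128) times 16 tile columns per group, times 32 groups.
ctas = math.ceil(m / tm) * math.ceil(n / tn) * groups
waves = ctas / sms
print(f"{ctas} CTAs over {sms} SMs = {waves:.2f} waves")
```

Under these assumptions the grouped problem already launches a few full waves of CTAs, so split-k would mostly help with the quantization of the final partial wave rather than raw occupancy.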

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar Nov 16 '25 01:11 github-actions[bot]