cutlass [QST] [WGMMA] Throughput Improvement

I am currently testing examples/cute/tutorials/hopper/wgmma_sm90.cu on an H800. The throughput is quite low, ~240 TFLOPS, compared to the ~970 TFLOPS reported in the paper for the same case. So I profiled with Nsight Compute and found that Compute Throughput is only 34% and Memory Throughput is 56%.

It seems like the bottleneck should be on the compute side. I am pretty sure WGMMA is working, because wgmma.mma_async shows up in the .ptx file and HGMMA shows up in the SASS. I didn't change anything in the code, and every time I compile, I get this message:

ptxas info    : (C7510) Potential Performance Loss: wgmma.mma_async instructions are serialized due to wgmma pipeline crossing function boundary at a function call in the function

It is only a warning, not an error, but it sounds like a big deal, so I am not sure whether it is the reason for the low throughput.

This is my compilation code:

#!/bin/bash
INCLUDE_PATH1="../../../../../cutlass/include"
INCLUDE_PATH2="../../../../../cutlass/tools/util/include"
INCLUDE_PATH3="../../../../../cutlass/examples/common"

nvcc -I${INCLUDE_PATH1} -I${INCLUDE_PATH2} -I${INCLUDE_PATH3} --std=c++17 --expt-relaxed-constexpr -O1 -o wgmma_sm90 wgmma_sm90.cu -arch=sm_90a -DCUTE_ARCH_SM90A_ENABLED -lcuda

nvcc -I${INCLUDE_PATH1} -I${INCLUDE_PATH2} -I${INCLUDE_PATH3} --std=c++17 --expt-relaxed-constexpr -O1 -ptx wgmma_sm90.cu -arch=sm_90a -o wgmma_sm90.ptx

I noticed that in a previous issue someone mentioned that a different optimization level such as "-O2" could help, but that is not the case for me. I tried -O0, -O1, -O2, and -O3, and the message still shows up.

This is my NVCC version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0

This is my CUDA and Driver info:

NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2

What I really want is to get Compute Throughput as high as possible, so any help with improving it would be appreciated.

Jun 06 '25 10:06 humble-version-966

Please use our CMake to generate the compile flags. We don't support hand-rolled build systems.
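
A minimal sketch of such a build, assuming a CUTLASS checkout at `$HOME/cutlass` (a hypothetical path) and that the tutorial is exposed as a CMake target named `cute_tutorial_wgmma_sm90` (an assumption; check the targets your build tree actually generates). `CUTLASS_NVCC_ARCHS=90a` asks CUTLASS's CMake for the arch-specific `sm_90a` target that WGMMA needs:

```shell
#!/bin/bash
# Sketch only: the checkout path, build directory, and target name are assumptions.
CUTLASS_DIR="${CUTLASS_DIR:-$HOME/cutlass}"
BUILD_DIR="${BUILD_DIR:-${CUTLASS_DIR}/build}"
CUTLASS_NVCC_ARCHS="90a"
TARGET="cute_tutorial_wgmma_sm90"

if [ -f "${CUTLASS_DIR}/CMakeLists.txt" ]; then
  # Let CUTLASS's CMake choose the nvcc flags instead of passing them by hand.
  cmake -S "${CUTLASS_DIR}" -B "${BUILD_DIR}" \
        -DCUTLASS_NVCC_ARCHS="${CUTLASS_NVCC_ARCHS}"
  # Build only the tutorial executable rather than the whole tree.
  cmake --build "${BUILD_DIR}" --target "${TARGET}" -j
else
  echo "No CUTLASS checkout found at ${CUTLASS_DIR}; set CUTLASS_DIR first." >&2
fi
```

Whether this removes the C7510 warning will depend on the toolkit version and the flags CMake ends up generating, so it is worth re-checking the ptxas output after the rebuild.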

Jun 06 '25 12:06 thakkarV

examples/cute/tutorials/hopper/wgmma_sm90.cu is mainly an educational example / tutorial - please use the CUTLASS profiler for measuring the best / most performant config.
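
A rough sketch of a profiler run, assuming a build directory at `$HOME/cutlass/build` (a hypothetical path) that was already configured by CUTLASS's CMake with `CUTLASS_NVCC_ARCHS=90a`. The problem size below is only illustrative, and which kernels get swept depends on what the library was built with:

```shell
#!/bin/bash
# Sketch only: BUILD_DIR and the M/N/K problem size are assumptions.
BUILD_DIR="${BUILD_DIR:-$HOME/cutlass/build}"
PROFILER="${BUILD_DIR}/tools/profiler/cutlass_profiler"
M=8192
N=8192
K=8192

if [ -d "${BUILD_DIR}" ]; then
  # Build the profiler from the already configured tree.
  cmake --build "${BUILD_DIR}" --target cutlass_profiler -j
fi

if [ -x "${PROFILER}" ]; then
  # Sweep the tensor-op GEMM kernels that were built for this problem size.
  "${PROFILER}" --operation=Gemm --op_class=tensorop \
                --m="${M}" --n="${N}" --k="${K}"
else
  echo "cutlass_profiler not found at ${PROFILER}; build it first." >&2
fi
```

The profiler reports a throughput figure per kernel, which gives a fairer baseline to compare against the tutorial's number than a theoretical peak does.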

Jun 07 '25 04:06 IonThruster
