[QST] [WGMMA] Throughput Improvement
I am currently testing examples/cute/tutorials/hopper/wgmma_sm90.cu on an H800. The throughput is quite low, ~240 TFLOPS, compared with the ~970 TFLOPS reported in the paper for the same case. I profiled the kernel with Nsight Compute and found that Compute Throughput is only 34% and Memory Throughput is 56%.
It seems like the bottleneck should be on the compute side. I am pretty sure WGMMA is being used, because wgmma.mma_async appears in the .ptx file and HGMMA appears in the SASS. I didn't change anything in the code, and every time I compile I get this message:
ptxas info : (C7510) Potential Performance Loss: wgmma.mma_async instructions are serialized due to wgmma pipeline crossing function boundary at a function call in the function
It is only a warning, not an error, but it sounds significant, and I am not sure whether it is the reason for the low throughput.
This is my build script:
#!/bin/bash
INCLUDE_PATH1="../../../../../cutlass/include"
INCLUDE_PATH2="../../../../../cutlass/tools/util/include"
INCLUDE_PATH3="../../../../../cutlass/examples/common"
nvcc -I${INCLUDE_PATH1} -I${INCLUDE_PATH2} -I${INCLUDE_PATH3} --std=c++17 --expt-relaxed-constexpr -O1 -o wgmma_sm90 wgmma_sm90.cu -arch=sm_90a -DCUTE_ARCH_SM90A_ENABLED -lcuda
nvcc -I${INCLUDE_PATH1} -I${INCLUDE_PATH2} -I${INCLUDE_PATH3} --std=c++17 --expt-relaxed-constexpr -O1 -ptx wgmma_sm90.cu -arch=sm_90a -o wgmma_sm90.ptx
I noticed that in a previous issue someone mentioned that choosing a different optimization level, such as "-O2", could help. That is not the case for me: I tried -O0, -O1, -O2, and -O3, and the message still shows up every time.
This is my NVCC version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0
This is my CUDA and Driver info:
NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2
What I really want is to get Compute Throughput as high as possible, so any help with improving it would be appreciated.
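For reference, a TFLOPS figure for a GEMM like the ones quoted above is normally derived from the problem size and the measured kernel time, counting each multiply-add as 2 floating-point operations. The helper below is a minimal sketch of that arithmetic; the function names (gemm_flops, gemm_tflops, percent_of_reference) are hypothetical and not part of CUTLASS, and it assumes the elapsed time is an average per-kernel time in milliseconds (for example, from CUDA events).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers (not part of CUTLASS) for turning a measured GEMM
// kernel time into a TFLOPS figure.

// FLOP count of one M x N x K GEMM: each of the M*N outputs needs K
// multiply-adds, and one multiply-add is counted as 2 floating-point ops.
inline double gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Achieved throughput in TFLOPS, given the average time of one kernel
// launch in milliseconds (assumed to be measured by the caller).
inline double gemm_tflops(std::int64_t m, std::int64_t n, std::int64_t k,
                          double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    const double seconds = elapsed_ms * 1e-3;
    return gemm_flops(m, n, k) / seconds / 1e12;
}

// Achieved throughput as a percentage of a reference figure, such as a
// published number or the hardware peak.
inline double percent_of_reference(double achieved_tflops,
                                   double reference_tflops) {
    assert(reference_tflops > 0.0);
    return 100.0 * achieved_tflops / reference_tflops;
}
```

With the numbers from this post, percent_of_reference(240.0, 970.0) comes out to roughly 25%, so the measured kernel is at about a quarter of the figure reported in the paper. Checking the timing and the FLOP count this way helps rule out a measurement error before looking at the kernel itself.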
Please use our CMake build to generate the compile flags. We don't support hand-rolled build systems.
examples/cute/tutorials/hopper/wgmma_sm90.cu is mainly an educational example and tutorial, not a tuned kernel. Please use the CUTLASS profiler to find and measure the most performant configuration.