[QST][CuTe] How to dump ptxas compiling information in cute dsl?
Previously, we could use `--ptxas-options=--verbose` to dump ptxas info, including the number of registers and barriers used and whether the kernel has register spills. Is there an equivalent in the CuTe DSL?
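For reference, with nvcc that flag typically produces ptxas output along these lines (the function name and numbers here are only illustrative):

```
ptxas info    : Compiling entry function '_Z9my_kernelPf' for 'sm_90'
ptxas info    : Function properties for _Z9my_kernelPf
    16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
ptxas info    : Used 128 registers, 380 bytes cmem[0]
```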
This feature is on the roadmap. Hoping to have it out later this year.
I found that we can use the driver API `cuFuncGetAttribute` to get `local_size_bytes` for analyzing register spill. Example code with a CuTe DSL kernel is as follows (I don't know whether this is a recommended way to do it):
```python
def dump_kernel_attributes(compiled_kernel):
    from cuda.bindings import driver
    from cutlass.utils import HardwareInfo
    import torch

    device_id = torch.cuda.current_device()
    hardware_info = HardwareInfo(device_id=device_id)
    # Grab the raw CUfunction handle from the compiled CuTe DSL kernel.
    kernel = next(iter(compiled_kernel.cuda_modules.modules)).kernel_ptr
    # More attributes are listed at:
    # https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1g5e92a1b0d8d1b82cb00dcfb2de15961b
    local_size_bytes = hardware_info._checkCudaErrors(
        driver.cuFuncGetAttribute(
            driver.CUfunction_attribute.CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES,
            kernel,
        )
    )
    num_regs = hardware_info._checkCudaErrors(
        driver.cuFuncGetAttribute(
            driver.CUfunction_attribute.CU_FUNC_ATTRIBUTE_NUM_REGS,
            kernel,
        )
    )
    print("--- Kernel Info ---")
    print(f"local_size_bytes: {local_size_bytes}")
    print(f"num_regs: {num_regs}")
    print("--- End Kernel Info ---")
```
Hi,
Thanks for sharing your method. What is the expected type for the compiled_kernel parameter? I'm getting an error when I try your method:
```
kernel = next(iter(compiled_kernel.cuda_modules.modules)).kernel_ptr
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'CudaDialectJitCompiledFunction' object has no attribute 'cuda_modules'
```
Thanks for your help!
The API changed after CuTe DSL 4.3. The right way now is as follows.
(Note that the kernel should be compiled with the `--keep-cubin` option, and `compiled_kernel` is the output of `cute.compile`.)
```python
def dump_kernel_attributes(compiled_kernel):
    from cuda.bindings import driver
    from cutlass.utils import HardwareInfo
    import torch

    device_id = torch.cuda.current_device()
    hardware_info = HardwareInfo(device_id=device_id)
    # The cubin is only retained when compiling with --keep-cubin.
    cubin_data = compiled_kernel.artifacts.CUBIN
    assert cubin_data is not None, "cubin_data is None, need '--keep-cubin' option when compiling"
    # Load the cubin as a library and pull out its (single) kernel handle.
    cuda_library = hardware_info._checkCudaErrors(
        driver.cuLibraryLoadData(cubin_data, None, None, 0, None, None, 0)
    )
    kernels = hardware_info._checkCudaErrors(
        driver.cuLibraryEnumerateKernels(1, cuda_library)
    )
    kernel = hardware_info._checkCudaErrors(driver.cuKernelGetFunction(kernels[0]))
    # More attributes are listed at:
    # https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1g5e92a1b0d8d1b82cb00dcfb2de15961b
    local_size_bytes = hardware_info._checkCudaErrors(
        driver.cuFuncGetAttribute(
            driver.CUfunction_attribute.CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES,
            kernel,
        )
    )
    num_regs = hardware_info._checkCudaErrors(
        driver.cuFuncGetAttribute(
            driver.CUfunction_attribute.CU_FUNC_ATTRIBUTE_NUM_REGS,
            kernel,
        )
    )
    print("--- Kernel Info ---")
    print(f"local_size_bytes: {local_size_bytes}")
    print(f"num_regs: {num_regs}")
    print("--- End Kernel Info ---")
```
Thanks a lot, I will try it!