[QST][CuTe] How to dump ptxas compiling information in cute dsl?
Previously, we could use `--ptxas-options=--verbose` to dump ptxas info, including the number of registers and barriers used and whether the kernel has register spills. Is there an equivalent in the CuTe DSL?
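For reference, with nvcc that flag typically produces ptxas output along these lines (the function name and numbers here are only illustrative):

```
ptxas info    : Compiling entry function '_Z9my_kernelPf' for 'sm_90'
ptxas info    : Function properties for _Z9my_kernelPf
    16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
ptxas info    : Used 128 registers, 380 bytes cmem[0]
```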
This feature is on the roadmap. Hoping to have it out later this year.
I found that we can use the driver API `cuFuncGetAttribute` to get `local_size_bytes` for analyzing register spill. Example code with a CuTe DSL kernel is as follows (I don't know whether this is a recommended way to do it):
```python
def dump_kernel_attributes(compiled_kernel):
    from cuda.bindings import driver
    from cutlass.utils import HardwareInfo
    import torch

    device_id = torch.cuda.current_device()
    hardware_info = HardwareInfo(device_id=device_id)
    # Grab the raw CUfunction handle from the compiled CuTe DSL kernel.
    kernel = next(iter(compiled_kernel.cuda_modules.modules)).kernel_ptr
    # More attributes are listed at:
    # https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1g5e92a1b0d8d1b82cb00dcfb2de15961b
    local_size_bytes = hardware_info._checkCudaErrors(
        driver.cuFuncGetAttribute(
            driver.CUfunction_attribute.CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES,
            kernel,
        )
    )
    num_regs = hardware_info._checkCudaErrors(
        driver.cuFuncGetAttribute(
            driver.CUfunction_attribute.CU_FUNC_ATTRIBUTE_NUM_REGS,
            kernel,
        )
    )
    print("--- Kernel Info ---")
    print(f"local_size_bytes: {local_size_bytes}")
    print(f"num_regs: {num_regs}")
    print("--- End Kernel Info ---")
```
Hi,
Thanks for sharing your method. What is the expected type for the compiled_kernel parameter? I'm getting an error when I try your method:
```
kernel = next(iter(compiled_kernel.cuda_modules.modules)).kernel_ptr
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'CudaDialectJitCompiledFunction' object has no attribute 'cuda_modules'
```
Thanks for your help!
The API changed after CuTe DSL 4.3. The right way now is as follows.
(Note that the kernel should be compiled with the `--keep-cubin` option, and `compiled_kernel` is the output of `cute.compile`.)
```python
def dump_kernel_attributes(compiled_kernel):
    from cuda.bindings import driver
    from cutlass.utils import HardwareInfo
    import torch

    device_id = torch.cuda.current_device()
    hardware_info = HardwareInfo(device_id=device_id)
    # The cubin is only retained when compiling with --keep-cubin.
    cubin_data = compiled_kernel.artifacts.CUBIN
    assert cubin_data is not None, "cubin_data is None, need '--keep-cubin' option when compiling"
    # Load the cubin as a library and pull out its (single) kernel handle.
    cuda_library = hardware_info._checkCudaErrors(
        driver.cuLibraryLoadData(cubin_data, None, None, 0, None, None, 0)
    )
    kernels = hardware_info._checkCudaErrors(
        driver.cuLibraryEnumerateKernels(1, cuda_library)
    )
    kernel = hardware_info._checkCudaErrors(driver.cuKernelGetFunction(kernels[0]))
    # More attributes are listed at:
    # https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1g5e92a1b0d8d1b82cb00dcfb2de15961b
    local_size_bytes = hardware_info._checkCudaErrors(
        driver.cuFuncGetAttribute(
            driver.CUfunction_attribute.CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES,
            kernel,
        )
    )
    num_regs = hardware_info._checkCudaErrors(
        driver.cuFuncGetAttribute(
            driver.CUfunction_attribute.CU_FUNC_ATTRIBUTE_NUM_REGS,
            kernel,
        )
    )
    print("--- Kernel Info ---")
    print(f"local_size_bytes: {local_size_bytes}")
    print(f"num_regs: {num_regs}")
    print("--- End Kernel Info ---")
```
Thanks a lot, I will try it!