RFC: Add support for `launch_attr` in `LaunchConfig` ctor
Today, LaunchConfig only supports the cuLaunchKernel driver API for launching kernels on a single GPU. For broader use cases that need inter-SM synchronization or multi-GPU synchronization, one must use cuLaunchCooperativeKernel to launch kernels safely, in a deadlock-free manner. To support this, LaunchConfig(..., launch_attr=None) could be extended with an optional launch_attr that sets the equivalent cuda-python data type for CUlaunchAttribute.
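For illustration, a usage sketch of what the proposed parameter could look like from the user's side (launch_attr does not exist today; the helper name and the exact launch() call order are assumptions):

```python
# Illustrative only: `launch_attr` is the proposed (not yet existing) parameter.
from cuda.bindings import driver
from cuda.core.experimental import Device, LaunchConfig, launch


def cooperative_launch(kernel, *args):
    # `kernel` is assumed to be a compiled cuda.core kernel object.
    dev = Device()
    dev.set_current()
    stream = dev.create_stream()

    # Driver-level attribute requesting a cooperative launch, so that
    # grid-wide sync inside the kernel is deadlock-free.
    coop = driver.CUlaunchAttribute()
    coop.id = driver.CUlaunchAttributeID.CU_LAUNCH_ATTRIBUTE_COOPERATIVE
    coop.value.cooperative = 1

    config = LaunchConfig(grid=16, block=256, launch_attr=[coop])  # proposed
    launch(stream, config, kernel, *args)
    stream.sync()
```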
Background:
This issue came out of the discussion at https://github.com/NVIDIA/numba-cuda/issues/128#issuecomment-2702689412. The existing CUDA driver bindings in numba-cuda choose cuLaunchCooperativeKernel or cuLaunchKernel based on whether grid.sync() appears in the kernel; to migrate this to cuda.core, one needs the ability to select the launch-API variant at runtime based on the LaunchConfig.
It seems we just need to add LaunchConfig.cooperative_launch: bool = False, and map it to CU_LAUNCH_ATTRIBUTE_COOPERATIVE when preparing the launch config for cuLaunchKernelEx. What would be the right way to test this capability? Write a dummy kernel that only calls grid.sync() and launch it with/without setting the boolean to True?
To make it 100% clear: internal dispatching to cuLaunchCooperativeKernel() is not necessary; cuLaunchKernelEx() is guaranteed to work if the attribute is set.
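A minimal sketch of that mapping, i.e. what cuda.core could do when assembling the driver-side config for cuLaunchKernelEx (enum and field names are from cuda.bindings; the helper name is made up):

```python
from cuda.bindings import driver


def _to_driver_config(grid, block, cooperative_launch=False):
    # Build the CUlaunchConfig consumed by cuLaunchKernelEx.
    drv_cfg = driver.CUlaunchConfig()
    drv_cfg.gridDimX, drv_cfg.gridDimY, drv_cfg.gridDimZ = grid
    drv_cfg.blockDimX, drv_cfg.blockDimY, drv_cfg.blockDimZ = block

    attrs = []
    if cooperative_launch:
        # With this attribute set, cuLaunchKernelEx behaves like
        # cuLaunchCooperativeKernel; no separate dispatch is needed.
        attr = driver.CUlaunchAttribute()
        attr.id = driver.CUlaunchAttributeID.CU_LAUNCH_ATTRIBUTE_COOPERATIVE
        attr.value.cooperative = 1
        attrs.append(attr)
    drv_cfg.attrs = attrs
    drv_cfg.numAttrs = len(attrs)
    return drv_cfg
```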
> It seems we just need to add LaunchConfig.cooperative_launch: bool = False, and map it to CU_LAUNCH_ATTRIBUTE_COOPERATIVE when preparing the launch config for cuLaunchKernelEx. What would be the right way to test this capability? Write a dummy kernel that only calls grid.sync() and launch it with/without setting the boolean to True?
I think it's more than just using the appropriate launch attribute to avoid deadlocks/hangs. To keep the concerns separated by design choice, I am splitting the discussion into two options:
- Option (a): if LaunchConfig only exposes a cooperative_launch attribute, then in the multi-GPU or grid-sync use case there need to be additional legalization checks to make sure that gridDims is consistent across all GPUs/all blocks, and that it does not exceed the maximum number of co-resident blocks per GPU (max active blocks per SM, queryable using cudaOccupancyMaxActiveBlocksPerMultiprocessor, times the SM count). See e.g. https://numba.readthedocs.io/en/stable/cuda/cooperative_groups.html#cooperative-launches
- Option (b): if LaunchConfig exposes a generic class object LaunchAttribute that maps 1:1 to CUlaunchAttribute, wouldn't that allow a more extensible interface for other launch-attribute use cases? (A sketch of what such an interface could look like follows below.)
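For concreteness, a sketch of option (b)'s surface; none of these names exist in cuda.core today, and the value translation is only stubbed out for the cooperative case:

```python
from dataclasses import dataclass, field
from typing import Any

from cuda.bindings import driver


@dataclass
class LaunchAttribute:
    """Hypothetical 1:1 wrapper over the driver's CUlaunchAttribute."""
    id: driver.CUlaunchAttributeID
    value: Any

    def to_driver(self) -> driver.CUlaunchAttribute:
        attr = driver.CUlaunchAttribute()
        attr.id = self.id
        if self.id == driver.CUlaunchAttributeID.CU_LAUNCH_ATTRIBUTE_COOPERATIVE:
            attr.value.cooperative = int(self.value)
        # ... one branch per supported attribute id, translating `value`
        # into the matching CUlaunchAttributeValue union member ...
        return attr


@dataclass
class LaunchConfig:
    grid: tuple = (1, 1, 1)
    block: tuple = (1, 1, 1)
    launch_attr: list[LaunchAttribute] = field(default_factory=list)
```

Option (a) could then fall out as sugar: cooperative_launch=True would simply append LaunchAttribute(CU_LAUNCH_ATTRIBUTE_COOPERATIVE, True) to the list.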
> What would be the right way to test this capability? Write a dummy kernel that only calls grid.sync() and launch it with/without setting the boolean to True?
I think there are a few different things to test:
- Throw an error if the launch fails because of too many blocks or inconsistent grids across GPUs (a test sketch follows this list)
- Throw an error if the launch hangs due to use of grid.sync() without cuLaunchKernelEx(CU_LAUNCH_ATTRIBUTE_COOPERATIVE)
- Throw an error if a cooperative launch is requested but driver support is not present
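For the first bullet, a hedged pytest sketch. It assumes option (a)'s cooperative_launch flag, a cooperative_kernel fixture that compiles a grid-sync kernel, and that cuda.core surfaces the driver's CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE in the exception message:

```python
import pytest

from cuda.core.experimental import Device, LaunchConfig, launch


def test_cooperative_grid_too_large(cooperative_kernel):
    dev = Device()
    dev.set_current()
    stream = dev.create_stream()

    # Request far more blocks than can ever be co-resident; the driver
    # should reject the launch up front instead of letting grid.sync()
    # deadlock at runtime.
    config = LaunchConfig(grid=2**16, block=1024, cooperative_launch=True)
    with pytest.raises(Exception, match="COOPERATIVE_LAUNCH_TOO_LARGE"):
        launch(stream, config, cooperative_kernel)
        stream.sync()  # in case the error only surfaces at synchronization
```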
> - Option (a): if LaunchConfig only exposes a cooperative_launch attribute, then in the multi-GPU or grid-sync use case there need to be additional legalization checks [...]
> - Option (b): if LaunchConfig exposes a generic class object LaunchAttribute that maps 1:1 to CUlaunchAttribute, wouldn't that allow a more extensible interface for other launch-attribute use cases?
My current thought is leaning toward option a, but I do see the value of cuda.core doing some sanity checks for users. For example, we do this for checking thread block cluster support:
https://github.com/NVIDIA/cuda-python/blob/96d3ba008f05a2a864da097afca8d14c5c038bbf/cuda_core/cuda/core/experimental/_launcher.py#L73-L83
However, it should still be as lightweight as possible. We don't want cuda.core.launch() to show up in the critical path.
FWIW the occupancy query APIs are tracked in #504 due to a request from the CUTLASS team, but they seem needed as part of this discussion too.
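For reference, the option (a) legalization bound could be computed with the occupancy query tracked in #504; a minimal sketch against the raw cuda.bindings driver API (func is a CUfunction handle; the helper name is made up and error handling is elided):

```python
from cuda.bindings import driver


def max_cooperative_blocks(func, device, block_size, shmem_bytes=0):
    # Blocks that can be simultaneously resident on one SM for this kernel...
    _, per_sm = driver.cuOccupancyMaxActiveBlocksPerMultiprocessor(
        func, block_size, shmem_bytes
    )
    # ...times the number of SMs gives the largest grid a cooperative
    # launch may use on this GPU.
    _, sm_count = driver.cuDeviceGetAttribute(
        driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, device
    )
    return per_sm * sm_count
```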
> My current thought is leaning toward option a, but I do see the value of cuda.core doing some sanity checks for users. For example, we do this for checking thread block cluster support:
cuda-python/cuda_core/cuda/core/experimental/_launcher.py, lines 73 to 83 in 96d3ba0:

```python
if self.cluster is not None:
    if not _use_ex:
        err, drvers = driver.cuDriverGetVersion()
        drvers_fmt = f" (got driver version {drvers})" if err == driver.CUresult.CUDA_SUCCESS else ""
        raise CUDAError(f"thread block clusters require cuda.bindings & driver 11.8+{drvers_fmt}")
    cc = Device().compute_capability
    if cc < (9, 0):
        raise CUDAError(
            f"thread block clusters are not supported on devices with compute capability < 9.0 (got {cc})"
        )
    self.cluster = cast_to_3_tuple("LaunchConfig.cluster", self.cluster)
```
> However, it should still be as lightweight as possible. We don't want cuda.core.launch() to show up in the critical path.
Great. We could also add an attribute option that makes the checks conditional for power users who need to optimize for latency in the critical path. I am okay with option (a) too, adding others on a use-case basis.
@leofang I did some more digging, and there are additional use cases in multi-GPU kernels where the following launch attributes can be used; this makes option (b) more attractive. (A construction sketch follows the list.)
(a) Use case: enables Cooperative Group Arrays on SM90+
- CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION
- CU_LAUNCH_ATTRIBUTE_CLUSTER_SCHEDULING_POLICY_PREFERENCE

(b) Use case: enables memory sync interference avoidance between kernels in different domains (SM90+)
- CU_LAUNCH_ATTRIBUTE_MEM_SYNC_DOMAIN

(c) Use case: enables tracing events with CUDA graphs
- CU_LAUNCH_ATTRIBUTE_LAUNCH_COMPLETION_EVENT
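To illustrate why the 1:1 mapping is convenient, a sketch of how two of these would be populated through cuda.bindings (the union member spellings clusterDim and memSyncDomain follow the C-level CUlaunchAttributeValue and should be treated as assumptions here):

```python
from cuda.bindings import driver

# (a) Request a 2x1x1 thread block cluster (SM90+).
cluster = driver.CUlaunchAttribute()
cluster.id = driver.CUlaunchAttributeID.CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION
cluster.value.clusterDim.x = 2
cluster.value.clusterDim.y = 1
cluster.value.clusterDim.z = 1

# (b) Place the kernel in the "remote" memory sync domain (SM90+).
domain = driver.CUlaunchAttribute()
domain.id = driver.CUlaunchAttributeID.CU_LAUNCH_ATTRIBUTE_MEM_SYNC_DOMAIN
domain.value.memSyncDomain = driver.CUlaunchMemSyncDomain.CU_LAUNCH_MEM_SYNC_DOMAIN_REMOTE

# Both slot into the same CUlaunchConfig consumed by cuLaunchKernelEx,
# which is exactly the 1:1 shape a generic LaunchAttribute would wrap.
cfg = driver.CUlaunchConfig()
cfg.attrs = [cluster, domain]
cfg.numAttrs = 2
```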
> Throw an error if the launch hangs due to use of grid.sync() without cuLaunchKernelEx(CU_LAUNCH_ATTRIBUTE_COOPERATIVE)
This does not seem to hang; it seems to raise a sticky CUDA_ERROR_LAUNCH_FAILED error. @pciolkosz Does this sound right?
Yes, in grid.sync() there is a check whether the launch was cooperative, and I think it traps if it wasn't, so the observed behavior makes sense.
Cooperative launch is implemented in #676. Moving this issue to P1 / parking lot. We still want to keep this open for covering more launch attributes.