RFC: Add support for `launch_attr` in `LaunchConfig` ctor
Today, LaunchConfig only supports the cuLaunchKernel driver API for launching kernels on a single GPU. For broader use cases that need inter-SM synchronization or multi-GPU synchronization, one must use cuLaunchCooperativeKernel to launch kernels safely, in a deadlock-free manner. To support this, LaunchConfig(..., launch_attr=None) could be extended with an optional launch_attr that sets the equivalent cuda-python data type for CUlaunchAttribute.
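For illustration, a usage sketch of what the proposed parameter could look like from the user's side (launch_attr does not exist today; the helper name and the exact launch() call order are assumptions):

```python
# Illustrative only: `launch_attr` is the proposed (not yet existing) parameter.
from cuda.bindings import driver
from cuda.core.experimental import Device, LaunchConfig, launch


def cooperative_launch(kernel, *args):
    # `kernel` is assumed to be a compiled cuda.core kernel object.
    dev = Device()
    dev.set_current()
    stream = dev.create_stream()

    # Driver-level attribute requesting a cooperative launch, so that
    # grid-wide sync inside the kernel is deadlock-free.
    coop = driver.CUlaunchAttribute()
    coop.id = driver.CUlaunchAttributeID.CU_LAUNCH_ATTRIBUTE_COOPERATIVE
    coop.value.cooperative = 1

    config = LaunchConfig(grid=16, block=256, launch_attr=[coop])  # proposed
    launch(stream, config, kernel, *args)
    stream.sync()
```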
Background:
This issue came out of the discussion at https://github.com/NVIDIA/numba-cuda/issues/128#issuecomment-2702689412. The existing CUDA driver bindings in numba-cuda choose cuLaunchCooperativeKernel or cuLaunchKernel based on whether grid.sync() appears in the kernel; to migrate this to cuda.core, one needs the ability to select the launch-API variant at runtime based on the LaunchConfig.
It seems we just need to add LaunchConfig.cooperative_launch: bool = False, and map it to CU_LAUNCH_ATTRIBUTE_COOPERATIVE when preparing the launch config for cuLaunchKernelEx. What would be the right way to test this capability? Write a dummy kernel that only calls grid.sync() and launch it with/without setting the boolean to True?
To make it 100% clear: internal dispatching to cuLaunchCooperativeKernel() is not necessary; cuLaunchKernelEx() is guaranteed to work if the attribute is set.
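A minimal sketch of that mapping, i.e. what cuda.core could do when assembling the driver-side config for cuLaunchKernelEx (enum and field names are from cuda.bindings; the helper name is made up):

```python
from cuda.bindings import driver


def _to_driver_config(grid, block, cooperative_launch=False):
    # Build the CUlaunchConfig consumed by cuLaunchKernelEx.
    drv_cfg = driver.CUlaunchConfig()
    drv_cfg.gridDimX, drv_cfg.gridDimY, drv_cfg.gridDimZ = grid
    drv_cfg.blockDimX, drv_cfg.blockDimY, drv_cfg.blockDimZ = block

    attrs = []
    if cooperative_launch:
        # With this attribute set, cuLaunchKernelEx behaves like
        # cuLaunchCooperativeKernel; no separate dispatch is needed.
        attr = driver.CUlaunchAttribute()
        attr.id = driver.CUlaunchAttributeID.CU_LAUNCH_ATTRIBUTE_COOPERATIVE
        attr.value.cooperative = 1
        attrs.append(attr)
    drv_cfg.attrs = attrs
    drv_cfg.numAttrs = len(attrs)
    return drv_cfg
```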
> It seems we just need to add LaunchConfig.cooperative_launch: bool = False, and map it to CU_LAUNCH_ATTRIBUTE_COOPERATIVE when preparing the launch config for cuLaunchKernelEx. What would be the right way to test this capability? Write a dummy kernel that only calls grid.sync() and launch it with/without setting the boolean to True?
I think it's more than just using the appropriate launch attribute to avoid deadlocks/hangs. To keep the concerns separated by design choice, I am splitting the discussion into two options:
- Option (a): if LaunchConfig only exposes a cooperative_launch attribute, then in the multi-GPU or grid-sync use case there need to be additional legalization checks to make sure that gridDims is consistent across all GPUs/all blocks, and that it does not exceed the maximum number of co-resident blocks per GPU (max active blocks per SM, queryable using cudaOccupancyMaxActiveBlocksPerMultiprocessor, times the SM count). See e.g. https://numba.readthedocs.io/en/stable/cuda/cooperative_groups.html#cooperative-launches
- Option (b): if LaunchConfig exposes a generic class object LaunchAttribute that maps 1:1 to CUlaunchAttribute, wouldn't that allow a more extensible interface for other launch-attribute use cases? (A sketch of what such an interface could look like follows below.)
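For concreteness, a sketch of option (b)'s surface; none of these names exist in cuda.core today, and the value translation is only stubbed out for the cooperative case:

```python
from dataclasses import dataclass, field
from typing import Any

from cuda.bindings import driver


@dataclass
class LaunchAttribute:
    """Hypothetical 1:1 wrapper over the driver's CUlaunchAttribute."""
    id: driver.CUlaunchAttributeID
    value: Any

    def to_driver(self) -> driver.CUlaunchAttribute:
        attr = driver.CUlaunchAttribute()
        attr.id = self.id
        if self.id == driver.CUlaunchAttributeID.CU_LAUNCH_ATTRIBUTE_COOPERATIVE:
            attr.value.cooperative = int(self.value)
        # ... one branch per supported attribute id, translating `value`
        # into the matching CUlaunchAttributeValue union member ...
        return attr


@dataclass
class LaunchConfig:
    grid: tuple = (1, 1, 1)
    block: tuple = (1, 1, 1)
    launch_attr: list[LaunchAttribute] = field(default_factory=list)
```

Option (a) could then fall out as sugar: cooperative_launch=True would simply append LaunchAttribute(CU_LAUNCH_ATTRIBUTE_COOPERATIVE, True) to the list.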
> What would be the right way to test this capability? Write a dummy kernel that only calls grid.sync() and launch it with/without setting the boolean to True?
I think there are a few different things to test:
- Throw an error if the launch fails because of too many blocks or inconsistent grids across GPUs (a test sketch follows this list)
- Throw an error if the launch hangs due to use of grid.sync() without cuLaunchKernelEx(CU_LAUNCH_ATTRIBUTE_COOPERATIVE)
- Throw an error if a cooperative launch is requested but driver support is not present
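For the first bullet, a hedged pytest sketch. It assumes option (a)'s cooperative_launch flag, a cooperative_kernel fixture that compiles a grid-sync kernel, and that cuda.core surfaces the driver's CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE in the exception message:

```python
import pytest

from cuda.core.experimental import Device, LaunchConfig, launch


def test_cooperative_grid_too_large(cooperative_kernel):
    dev = Device()
    dev.set_current()
    stream = dev.create_stream()

    # Request far more blocks than can ever be co-resident; the driver
    # should reject the launch up front instead of letting grid.sync()
    # deadlock at runtime.
    config = LaunchConfig(grid=2**16, block=1024, cooperative_launch=True)
    with pytest.raises(Exception, match="COOPERATIVE_LAUNCH_TOO_LARGE"):
        launch(stream, config, cooperative_kernel)
        stream.sync()  # in case the error only surfaces at synchronization
```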
> - Option (a): if LaunchConfig only exposes a cooperative_launch attribute, then in the multi-GPU or grid-sync use case there need to be additional legalization checks [...]
> - Option (b): if LaunchConfig exposes a generic class object LaunchAttribute that maps 1:1 to CUlaunchAttribute, wouldn't that allow a more extensible interface for other launch-attribute use cases?
My current thought is leaning toward option a, but I do see the value of cuda.core doing some sanity checks for users. For example, we do this for checking thread block cluster support:
https://github.com/NVIDIA/cuda-python/blob/96d3ba008f05a2a864da097afca8d14c5c038bbf/cuda_core/cuda/core/experimental/_launcher.py#L73-L83
However, it should still be as lightweight as possible. We don't want cuda.core.launch() to show up in the critical path.
FWIW the occupancy query APIs are tracked in #504 due to a request from the CUTLASS team, but they seem needed as part of this discussion too.
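For reference, the option (a) legalization bound could be computed with the occupancy query tracked in #504; a minimal sketch against the raw cuda.bindings driver API (func is a CUfunction handle; the helper name is made up and error handling is elided):

```python
from cuda.bindings import driver


def max_cooperative_blocks(func, device, block_size, shmem_bytes=0):
    # Blocks that can be simultaneously resident on one SM for this kernel...
    _, per_sm = driver.cuOccupancyMaxActiveBlocksPerMultiprocessor(
        func, block_size, shmem_bytes
    )
    # ...times the number of SMs gives the largest grid a cooperative
    # launch may use on this GPU.
    _, sm_count = driver.cuDeviceGetAttribute(
        driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, device
    )
    return per_sm * sm_count
```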
> My current thought is leaning toward option a, but I do see the value of cuda.core doing some sanity checks for users. For example, we do this for checking thread block cluster support:
cuda-python/cuda_core/cuda/core/experimental/_launcher.py, lines 73 to 83 in 96d3ba0:

```python
if self.cluster is not None:
    if not _use_ex:
        err, drvers = driver.cuDriverGetVersion()
        drvers_fmt = f" (got driver version {drvers})" if err == driver.CUresult.CUDA_SUCCESS else ""
        raise CUDAError(f"thread block clusters require cuda.bindings & driver 11.8+{drvers_fmt}")
    cc = Device().compute_capability
    if cc < (9, 0):
        raise CUDAError(
            f"thread block clusters are not supported on devices with compute capability < 9.0 (got {cc})"
        )
    self.cluster = cast_to_3_tuple("LaunchConfig.cluster", self.cluster)
```
> However, it should still be as lightweight as possible. We don't want cuda.core.launch() to show up in the critical path.
Great. We could also add an attribute option that makes the checks conditional for power users who need to optimize for latency in the critical path. I am okay with option (a) too, adding others on a use-case basis.
@leofang I did some more digging, and there are additional use cases in multi-GPU kernels where the following launch attributes can be used; this makes option (b) more attractive. (A construction sketch follows the list.)
(a) Use case: enables Cooperative Group Arrays on SM90+
- CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION
- CU_LAUNCH_ATTRIBUTE_CLUSTER_SCHEDULING_POLICY_PREFERENCE

(b) Use case: enables memory sync interference avoidance between kernels in different domains (SM90+)
- CU_LAUNCH_ATTRIBUTE_MEM_SYNC_DOMAIN

(c) Use case: enables tracing events with CUDA graphs
- CU_LAUNCH_ATTRIBUTE_LAUNCH_COMPLETION_EVENT
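To illustrate why the 1:1 mapping is convenient, a sketch of how two of these would be populated through cuda.bindings (the union member spellings clusterDim and memSyncDomain follow the C-level CUlaunchAttributeValue and should be treated as assumptions here):

```python
from cuda.bindings import driver

# (a) Request a 2x1x1 thread block cluster (SM90+).
cluster = driver.CUlaunchAttribute()
cluster.id = driver.CUlaunchAttributeID.CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION
cluster.value.clusterDim.x = 2
cluster.value.clusterDim.y = 1
cluster.value.clusterDim.z = 1

# (b) Place the kernel in the "remote" memory sync domain (SM90+).
domain = driver.CUlaunchAttribute()
domain.id = driver.CUlaunchAttributeID.CU_LAUNCH_ATTRIBUTE_MEM_SYNC_DOMAIN
domain.value.memSyncDomain = driver.CUlaunchMemSyncDomain.CU_LAUNCH_MEM_SYNC_DOMAIN_REMOTE

# Both slot into the same CUlaunchConfig consumed by cuLaunchKernelEx,
# which is exactly the 1:1 shape a generic LaunchAttribute would wrap.
cfg = driver.CUlaunchConfig()
cfg.attrs = [cluster, domain]
cfg.numAttrs = 2
```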
> Throw an error if the launch hangs due to use of grid.sync() without cuLaunchKernelEx(CU_LAUNCH_ATTRIBUTE_COOPERATIVE)
This does not seem to hang; it seems to raise a sticky CUDA_ERROR_LAUNCH_FAILED error. @pciolkosz Does this sound right?
Yes, in grid.sync() there is a check whether the launch was cooperative, and I think it traps if it wasn't, so the observed behavior makes sense.
Cooperative launch is implemented in #676. Moving this issue to P1 / parking lot. We still want to keep this open for covering more launch attributes.