[BUG] sm_count may be ignored in persistent GEMMs
Description
When specifying arguments to a GEMM kernel via GemmUniversalArguments, the user may set sm_count to carve out multiprocessors for other concurrent work. However, I've found that persistent GEMM kernels ignore this field and use all SMs regardless of its value. I believe this is because of this conditional branch:
https://github.com/NVIDIA/cutlass/blob/e9627ce55b42fd2599f58cd4396da9380954def0/include/cutlass/gemm/kernel/tile_scheduler_params.h#L1790
given that max_active_clusters is populated by cudaOccupancyMaxActiveClusters, which does not take sm_count into account. The following patch resolved my issue:
```diff
diff --git a/include/cutlass/gemm/kernel/tile_scheduler_params.h b/include/cutlass/gemm/kernel/tile_scheduler_params.h
index 9ac78311..1c646009 100644
--- a/include/cutlass/gemm/kernel/tile_scheduler_params.h
+++ b/include/cutlass/gemm/kernel/tile_scheduler_params.h
@@ -263,11 +263,13 @@ struct PersistentTileSchedulerSm90Params {
   // In case the maximum number of clusters that could co-exist on the target device is
   // already calculated using cudaOccupancyMaxActiveClusters
   else if (max_active_clusters != 0) {
+    auto max_launchable_clusters = possibly_truncate(max_active_clusters, sm_count / cluster_size);
+
     if (raster_order == RasterOrder::AlongN) {
-      launch_grid.y = max_active_clusters * cluster_shape.n();
+      launch_grid.y = max_launchable_clusters * cluster_shape.n();
     }
     else {
-      launch_grid.x = max_active_clusters * cluster_shape.m();
+      launch_grid.x = max_launchable_clusters * cluster_shape.m();
     }
     CUTLASS_TRACE_HOST("get_grid_shape(): Proposed GridDims by the scheduler using cudaOccupancyMaxActiveClusters = "
       "(" << launch_grid.x << ", " << launch_grid.y << ", " << launch_grid.z << ")\n");
```
However, I am not familiar enough with this code to know whether the change has unintended effects.
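For clarity, here is a minimal standalone sketch of the arithmetic the patch is aiming for. It is illustrative only, not CUTLASS code: the numeric values are made-up examples, and std::min stands in for possibly_truncate.

```cpp
#include <algorithm>
#include <cstdio>

int main() {
  // Assumed example values, not measurements from a real device.
  int max_active_clusters = 66;  // e.g. what cudaOccupancyMaxActiveClusters might report
  int sm_count            = 64;  // user-requested SM carve-out (KernelHardwareInfo::sm_count)
  int cluster_size        = 2;   // cluster_shape.m() * cluster_shape.n()

  // Cap the occupancy-derived cluster count by the number of clusters that fit in
  // sm_count, mirroring possibly_truncate(max_active_clusters, sm_count / cluster_size).
  int max_launchable_clusters = std::min(max_active_clusters, sm_count / cluster_size);

  std::printf("launchable clusters: %d -> %d SMs used (<= sm_count = %d)\n",
              max_launchable_clusters, max_launchable_clusters * cluster_size, sm_count);
  return 0;
}
```

Without the cap, the grid is sized from max_active_clusters alone, so the persistent kernel occupies every SM on the device.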
Steps/Code to reproduce bug
I don't have a minimal reproduction example, but this should happen on any persistent GEMM launch that specifies sm_count and leaves max_active_clusters to be auto-populated; see the sketch below.
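For reference, here is a hedged sketch of the kind of launch I mean, modeled on the CUTLASS 3.x Hopper GEMM examples; `Gemm`, the pointers, strides, and scalars are placeholders, and the exact argument layout depends on the collective builder configuration:

```cpp
// Sketch only: not a verified reproduction, just the shape of the launch.
cutlass::KernelHardwareInfo hw_info;
hw_info.device_id = 0;
// Request roughly half of the device's SMs so the rest stay free for concurrent work.
hw_info.sm_count =
    cutlass::KernelHardwareInfo::query_device_multiprocessor_count(/*device_id=*/0) / 2;

typename Gemm::Arguments args{
  cutlass::gemm::GemmUniversalMode::kGemm,
  {M, N, K, /*L=*/1},                                   // problem shape
  {ptr_A, stride_A, ptr_B, stride_B},                   // mainloop arguments
  {{alpha, beta}, ptr_C, stride_C, ptr_D, stride_D},    // epilogue arguments
  hw_info                                               // carries sm_count
};
// max_active_clusters is left to be auto-populated by the scheduler. With a persistent
// tile scheduler, the launch still fills every SM, because the grid is sized from
// cudaOccupancyMaxActiveClusters without consulting sm_count.
```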
Expected behavior
Persistent GEMMs launched with a non-default sm_count use at most sm_count SMs.
Environment details:
- CUTLASS commit b78588d1630aa6643bf021613717bafb705df4ef
- Ubuntu 22.04
- CUDA Toolkit 12.4
- H100
CC @jackkosaian
@thakkarV, @hwu36, @jackkosaian it would be good if we could take a look at this and get it resolved before the 3.8 tagging.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
Hello, has this issue been resolved? Tag 3.8 has been released for a while now. @manishucsd
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.