[BUG] CUTLASS 3.6 profiler doesn't read Instantiation Level that we pass in (Hopper SM90)
Describe the bug
1st bug: 4-digit CUTLASS_LIBRARY_INSTANTIATION_LEVEL is not used.
Here it said that CUTLASS 3.6 profiler can use an additional flag, CUTLASS_LIBRARY_INSTANTIATION_LEVEL , to instantiate all possible combinations. It also said that "The CUTLASS profiler employs a four-digit integer level (global instantiation level) mechanism to manage the generation of kernel configurations.".
However, I found that this 4-digit CUTLASS_LIBRARY_INSTANTIATION_LEVEL is not used when compiling, unless it is set to max.
The process of reading CUTLASS_LIBRARY_INSTANTIATION_LEVEL is as follows:
-
Here
generator.pyreads this argument. -
Here this argument is used in
__init__()of classManifest. It's also effective when it is set it tomax. -
Here when we are actually
GenerateSM90kernels,get_sm90_instantiation_level()function only returns some fixed default level.
2nd bug: when I manually modify the default level in the generator.py, I get compile error with warp specialization kernels (it says stage < 2).
The error message when make cutlass_profiler is as follows:
~/cutlass/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp(135): error: static assertion failed with "Specialization requires Stages set to value 2 or more."
static_assert(DispatchPolicy::Stages >= 2, "Specialization requires Stages set to value 2 or more.");
^
detected during instantiation of class "cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<Stages, ClusterShape, KernelSchedule>, TileShape_, ElementA_, StrideA_, ElementB_, StrideB_, TiledMma_, GmemTiledCopyA_, SmemLayoutAtomA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyB_, SmemLayoutAtomB_, SmemCopyAtomB_, TransformB_> [with Stages=1, ClusterShape=cute::tuple<cute::_4, cute::_4, cute::_1>, KernelSchedule=cutlass::gemm::KernelTmaWarpSpecializedCooperative, TileShape_=cute::tuple<cute::_128, cute::_128, cute::_256>, ElementA_=cutlass::half_t, StrideA_=cute::tuple<cute::C<1>, int64_t, int64_t>, ElementB_=cutlass::half_t, StrideB_=cute::tuple<cute::C<1>, int64_t, int64_t>, TiledMma_=cute::TiledMMA<cute::MMA_Atom<cute::SM90::GMMA::MMA_64x128x16_F32F16F16_SS<cute::SM90::GMMA::Major::MN, cute::SM90::GMMA::Major::MN, cute::SM90::GMMA::ScaleIn::One, cute::SM90::GMMA::ScaleIn::One>>, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::C<0>>>, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore>>, GmemTiledCopyA_=cute::SM90_TMA_LOAD_MULTICAST, SmemLayoutAtomA_=cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::_64, cute::_8>, cute::tuple<cute::_1, cute::C<64>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyB_=cute::SM90_TMA_LOAD_MULTICAST, SmemLayoutAtomB_=cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::_64, cute::_8>, cute::tuple<cute::_1, cute::C<64>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity]" at line 65 of ~i/cutlass/build/tools/library/generated/gemm/90/f16_s64x64x16gemm_f16/cutlass3x_sm90_tensorop_s64x64x16gemm_f16_f16_f32_f16_f16_128x128x256_4x4x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu
Steps/Code to reproduce bug
- Change this line's
default_level=131todefault_level=1592; change this line'sdefault_level=101todefault_level=1592. - Build FP16 kernel by:
cmake .. \ -DCUTLASS_NVCC_ARCHS="90a" \ -DCUTLASS_LIBRARY_KERNELS="cutlass3x_sm90_tensorop_*gemm_f16_f16_f32_f16_f16_*" \ -DCUTLASS_LIBRARY_EXCLUDE_KERNELS="*_fastaccum_*" \ -DCUTLASS_LIBRARY_INSTANTIATION_LEVEL="1000" \ -DCUTLASS_UNITY_BUILD_ENABLED=ON make cutlass_profiler -j - Get the compile error shown above.
This compile error also occurs when building FP8 kernels.
Additional context
Also, when simply set CUTLASS_LIBRARY_INSTANTIATION_LEVEL to max, we will also get compile error when building lots of rare shapes like this
@alihassanijr , could you please help with this?
Thanks for the detailed description @tonyjie .
Regarding 1, you are correct, CUTLASS_LIBRARY_INSTANTIATION_LEVEL is limited to pre-defined string levels like "default" and "max". I don't exactly remember why that was, some of it was to preserve previous behavior, but if we want to directly expose the numeric levels to users, we can do that. I'll have to defer to @hwu36 to make that call.
Regarding 2, unfortunately that looks like a much deeper issue, which is that the heuristics set up in the refactored generator don't catch all "invalid" kernels. I'll try and see if I can build with the "max" level and add the necessary heuristics. I'll update this issue soon.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
This issue should be resolved as of the 3.7 release.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.