[QST] Do we need to tune CUTLASS GEMM to use it for all shapes?
I saw the example here: https://github.com/NVIDIA/cutlass/blob/main/examples/00_basic_gemm/basic_gemm.cu
It doesn't require configs for tile size, grid size, warp size, etc.
My questions are:
- Does it look for the best config automatically, or do I just get a kernel with a default config?
- Do I need to tune it when using it for dynamic shapes? If so, how?
Thank you!
No, the example just picks one kernel at compile time and has no way of picking the best kernel for your specific problem shape and domain constraints. Use the profiler to search for the best kernel.
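For example (a minimal sketch; the flag spellings follow the profiler docs, but the kernel filter is an assumption you should adjust to the kernels actually built into your profiler), you can benchmark one problem size like this:

# Build the cutlass_profiler target first, then benchmark FP16 tensor-op
# GEMM kernels at a fixed problem size and save the results to CSV.
./tools/profiler/cutlass_profiler --operation=Gemm \
  --m=4096 --n=4096 --k=4096 \
  --A=f16:column --B=f16:column --C=f16:column \
  --kernels="cutlass_tensorop_*gemm*" \
  --output=report.csv

The CSV reports runtime and achieved GFLOP/s per kernel, so you can pick the fastest instance for that shape.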
I see the grouped GEMM example, so the kernel there isn't necessarily the best-performing one, right?
If I want to generate kernel configs suitable for many shapes, do I need to run the profiler for each shape, record the configurations, and then select among them at inference time?
Is there an example for that?
Thank you!
https://github.com/NVIDIA/cutlass/blob/main/examples/24_gemm_grouped/gemm_grouped.cu#L1190
// Redefine GEMM with different GroupScheduleMode_
using GemmKernel = typename cutlass::gemm::kernel::DefaultGemmGrouped<
    typename Gemm_::ElementA,
    typename Gemm_::LayoutA,
    Gemm_::kTransformA,
    Gemm_::kAlignmentA,
    typename Gemm_::ElementB,
    typename Gemm_::LayoutB,
    Gemm_::kTransformB,
    Gemm_::kAlignmentB,
    typename Gemm_::ElementC,
    typename Gemm_::LayoutC,
    typename Gemm_::ElementAccumulator,
    typename Gemm_::OperatorClass,
    typename Gemm_::ArchTag,
    typename Gemm_::ThreadblockShape,
    typename Gemm_::WarpShape,
    typename Gemm_::InstructionShape,
    typename Gemm_::EpilogueOutputOp,
    typename Gemm_::ThreadblockSwizzle,
    Gemm_::kStages,
    GroupScheduleMode_>::GemmKernel;

using Gemm = cutlass::gemm::device::GemmGrouped<GemmKernel>;
I see the grouped GEMM in TensorRT-LLM. Is this a real application that always gets good performance for dynamic shapes? Thank you!
https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/groupGemm.cu#L202
if (minKN >= 8)
{
    groupedGemmType_<16, 32, 64, 16, 32, 64, 8, 8, 4>(problem_sizes, ptrA, ptrB, ptrC, ptrD,
        gemmParamsWorkSpace, gemmParamsWorkSpaceSize, gemmWorkSpace, gemmWorkspaceSize, dataType, stream);
}
else if (minKN >= 4)
{
    groupedGemmType_<16, 32, 64, 16, 32, 64, 8, 4, 4>(problem_sizes, ptrA, ptrB, ptrC, ptrD,
        gemmParamsWorkSpace, gemmParamsWorkSpaceSize, gemmWorkSpace, gemmWorkspaceSize, dataType, stream);
}
else if (minKN >= 2)
{
    groupedGemmType_<16, 32, 64, 16, 32, 64, 8, 2, 2>(problem_sizes, ptrA, ptrB, ptrC, ptrD,
        gemmParamsWorkSpace, gemmParamsWorkSpaceSize, gemmWorkSpace, gemmWorkspaceSize, dataType, stream);
}
else if (minKN >= 1)
{
    groupedGemmType_<16, 32, 64, 16, 32, 64, 8, 1, 2>(problem_sizes, ptrA, ptrB, ptrC, ptrD,
        gemmParamsWorkSpace, gemmParamsWorkSpaceSize, gemmWorkSpace, gemmWorkspaceSize, dataType, stream);
}
Generally, a single kernel instance will not perform well across all possible input shapes it supports. You will have to create a runtime that dispatches to different kernels based on input shapes, like in the TensorRT-LLM snippet you pasted.
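A minimal sketch of such a dispatch layer (the bucket boundaries and the run_gemm_* launchers below are hypothetical stand-ins for CUTLASS kernel instances you would pick from profiler results):

// Hypothetical launchers, each wrapping a CUTLASS GEMM instantiated with
// the tile config the profiler found best for that shape bucket.
void run_gemm_small_m(int m, int n, int k);   // e.g. CtaShape16x128x64
void run_gemm_medium_m(int m, int n, int k);  // e.g. CtaShape64x128x64
void run_gemm_large_m(int m, int n, int k);   // e.g. CtaShape128x256x64

// Runtime dispatch on the dynamic M dimension, mirroring the minKN
// dispatch in the TensorRT-LLM snippet above.
void run_gemm(int m, int n, int k)
{
    if (m <= 16)
        run_gemm_small_m(m, n, k);
    else if (m <= 64)
        run_gemm_medium_m(m, n, k);
    else
        run_gemm_large_m(m, n, k);
}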
Sorry for the naive questions, but I'm confused about this.
I read the code in TensorRT-LLM about tuning GEMMs and found these CUTLASS configs:
// Note: The shapes are in the format MxNxK. The K shape of the runtime config MUST match the K shape
// in the kernel layout details when doing weight only quantization.
enum class CutlassTileConfig
{
    // Signals that we should run heuristics to choose a config
    Undefined,
    // Signals that we should run heuristics to choose a config
    ChooseWithHeuristic,
    // SIMT config
    CtaShape128x128x8_WarpShape64x64x8,
    // TensorCore configs CTA_N = 128, CTA_K = 64
    // Warp configs for M=16
    CtaShape16x128x64_WarpShape16x32x64,
    // Warp configs for M=32
    CtaShape32x128x64_WarpShape32x32x64,
    // Warp configs for M=64
    CtaShape64x128x64_WarpShape32x64x64,
    CtaShape64x64x128_WarpShape32x64x64,
    CtaShape64x128x64_WarpShape64x32x64,
    // Warp configs for M=128
    CtaShape128x64x64_WarpShape64x32x64,
    CtaShape128x128x64_WarpShape64x32x64,
    CtaShape128x128x64_WarpShape64x64x64,
    CtaShape128x128x64_WarpShape128x32x64,
    CtaShape128x256x64_WarpShape64x64x64,
    // Warp configs for M=256
    CtaShape256x128x64_WarpShape64x64x64,
    // TensorCore config CTA_N = 64, CTA_K = 128
    CtaShape128x64x128_WarpShape64x32x128,
    // TensorCore config CTA_N = 256, CTA_K = 64
    CtaShape16x256x64_WarpShape16x64x64,
    // TensorCore config CTA_N = 256, CTA_K = 128
    CtaShape16x256x128_WarpShape16x64x128
};
The best configs for all shapes are chosen from these candidates.
I'm confused about how they picked these. For example, why is K always at least 64?
I know the MMA instruction shape is something like 16x8x16 or 16x8x8, so why can't the tile K be 16 or 32?
I'd like to learn how to choose candidate configs for CUTLASS kernels.
Do you have any suggestions for examples or guidelines on this? I've been confused about it for days.
Thanks a lot!
The warp tile size K should be bigger than the MMA instruction K so that we can run multiple MMAs in the inner loop, using the MMAs to hide other latencies.
The CUTLASS profiler supports grouped GEMM for SM90 and SM100. If you want the best config for your problem sizes, you can run the CUTLASS profiler with all your possible problem sizes to find the best one.
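To make the first point concrete (illustrative numbers, not a fixed rule): with a warp tile K of 64 and an MMA instruction K of 16, each warp-tile K step issues four independent MMA instructions back to back, which gives the scheduler enough math to overlap with shared-memory loads:

// Illustrative arithmetic for the warp-K vs. instruction-K relationship.
constexpr int WarpK = 64;  // warp tile K, as in WarpShape64x64x64
constexpr int InstK = 16;  // MMA instruction K, as in mma shape 16x8x16
constexpr int MmasPerKStep = WarpK / InstK;  // 4 MMAs per inner-loop step
static_assert(MmasPerKStep > 1, "need multiple MMAs in flight to hide latency");

With a warp tile K of 16 there would be only one MMA per step and nothing to hide the load latency behind.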
So the candidates in the TensorRT-LLM kernel configs are only a subset of all possible configs, chosen for good performance rather than the absolute best, right?
Correct. In general, there can be O(1e5) to O(1e6) candidate kernels for a given problem.
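A rough back-of-the-envelope count (the axis sizes below are illustrative assumptions, not CUTLASS's actual generator dimensions) shows how the product of a few independent tuning axes reaches that range:

// Illustrative estimate of the kernel search-space size.
constexpr int CtaShapes  = 20;  // threadblock tile choices
constexpr int WarpShapes = 8;   // warp tile choices per CTA tile
constexpr int Stages     = 5;   // software pipeline depths
constexpr int Swizzles   = 4;   // threadblock swizzle functors
constexpr int Alignments = 4;   // A/B alignment variants
constexpr int Layouts    = 4;   // NN/NT/TN/TT
constexpr int Epilogues  = 8;   // epilogue and data-type variants
constexpr long long Candidates = 1LL * CtaShapes * WarpShapes * Stages
                               * Swizzles * Alignments * Layouts * Epilogues;
// Candidates == 409600, i.e. in the O(1e5)-O(1e6) range quoted above.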
Thank you. This seems to have a lot in common with OpenAI Triton's autotuning.
I want to know how to update the default kernel parameters correctly and make them take effect. There seem to be too many related documents, and I'm a beginner.
I think you need to give a specific example of what you are confused about.