[QST] How to profile CUTLASS with all of the optimization techniques?
What is your question? Hi,
Thanks for the great work! I have recently been exploring the performance improvements that come from the optimizations in CUTLASS. I want to profile each optimization introduced in CUTLASS and learn which ones matter most. Specifically, for a GEMM operator, I want to know how much performance each optimization contributes, such as the data layout design, threadblock swizzle, Tensor Cores, and so on. It is similar to an ablation study. Can you give me some suggestions for this exploration?
Thanks, Yang
Your best bet is to take the SM80 mainloop from CUTLASS 3.x and replace the tiled MMA and tiled copies, change the pipeline depth, etc. You can find a test case for it here: https://github.com/NVIDIA/cutlass/blob/main/test/unit/gemm/device/sm80_gemm_f16_f16_f32_tensor_op_f32.cu
Note that the 3.x Ampere kernels are not optimized to the same extent as the 2.x API kernels, but you can only do what you are trying to do with the CuTe-based implementations.
Hi @thakkarV, thanks for your quick response! I already know how to change the tile size and the number of pipeline stages. Is there a way to evaluate the threadblock swizzle or other performance-specific features in CUTLASS? Thanks a lot!
You can change the thread or value layouts of the copies to change the coalescing or vectorization extents. You can change the smem layouts to different swizzles, or to any affine layout you want, to change the STS/LDS vectorization, the bank conflict count, etc. There are tons of things you can do.
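To make the bank conflict point concrete, here is a minimal host-side sketch that counts conflicts for three shared memory layouts of a 32x32 tile of 4-byte elements. It does not use CuTe at all. The function names, the 32-bank model with 4-byte words, and the access pattern (each thread of a warp reads one row of the same column) are assumptions chosen for illustration, not CUTLASS code.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

constexpr int kBanks = 32;  // assumed: 32 shared memory banks, 4 bytes wide
constexpr int kTile = 32;   // assumed: 32x32 tile of 4-byte elements

// Each layout maps a logical (row, col) to a word offset in shared memory.
// These names are hypothetical and are not part of CUTLASS or CuTe.
int row_major(int r, int c) { return r * kTile + c; }           // stride (32, 1)
int padded(int r, int c) { return r * (kTile + 1) + c; }        // stride (33, 1)
int xor_swizzled(int r, int c) { return r * kTile + (c ^ r); }  // XOR swizzle

// Worst-case number of threads in one warp that hit the same bank when
// thread t reads element (t, col), i.e. a column-wise access.
template <class Layout>
int max_conflict_degree(Layout layout, int col) {
  assert(col >= 0 && col < kTile);
  std::array<int, kBanks> hits{};  // zero-initialized hit count per bank
  for (int t = 0; t < 32; ++t) {
    hits[layout(t, col) % kBanks]++;
  }
  return *std::max_element(hits.begin(), hits.end());
}
```

Under this model, `max_conflict_degree(row_major, 0)` reports a 32-way conflict, while the padded and XOR-swizzled layouts both report 1 (conflict-free) for the same column access. In a real ablation you would confirm the effect with the profiler's shared memory bank conflict counters rather than with a model like this.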
Hi @thakkarV, thanks for your reply! I have successfully completed my ablation study. I am curious about what you mean by "affine layout". Can you point me to some materials? Thanks a lot!
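The thread does not answer this question. One plausible reading, and it is only an assumption about what was meant, is that an "affine layout" is a plain shape-and-stride layout: the memory offset is a base plus a stride-weighted sum of the coordinates. Row-major, column-major, and padded layouts all fit that description, while an XOR swizzle does not. The sketch below checks this property numerically. `StrideLayout2D`, `xor_swizzle8`, and `is_affine` are hypothetical names for illustration, not CuTe types.

```cpp
// A 2D shape-and-stride layout: offset = r * stride_r + c * stride_c.
// Hypothetical type for illustration; not a CuTe class.
struct StrideLayout2D {
  int stride_r;
  int stride_c;
  int operator()(int r, int c) const { return r * stride_r + c * stride_c; }
};

// An 8-column XOR swizzle, which permutes columns differently in each row.
int xor_swizzle8(int r, int c) { return r * 8 + (c ^ r); }

// Returns true if f(r, c) == base + r * sr + c * sc over the whole
// rows x cols domain, with base and strides inferred from f itself.
// Requires rows >= 2 and cols >= 2.
template <class F>
bool is_affine(F f, int rows, int cols) {
  const int base = f(0, 0);
  const int sr = f(1, 0) - base;  // inferred row stride
  const int sc = f(0, 1) - base;  // inferred column stride
  for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) {
      if (f(r, c) != base + r * sr + c * sc) return false;
    }
  }
  return true;
}
```

With an 8x8 domain, `is_affine(StrideLayout2D{8, 1}, 8, 8)` (row-major), `is_affine(StrideLayout2D{1, 8}, 8, 8)` (column-major), and `is_affine(StrideLayout2D{9, 1}, 8, 8)` (padded) all return true, while `is_affine(xor_swizzle8, 8, 8)` returns false. The CuTe layout documentation in the CUTLASS repository is the place to look for the library's own definitions.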