cutlass [QST] What's the motivation of example 63?

Hi! I am trying to understand where the technique in example 63 fits among the broader set of techniques that PDL (programmatic dependent launch) is good for. Specifically, the example proposes a prefetch warp group that can be triggered by the previous kernel's execution. More broadly, however, if the GPU block scheduler has already decided to schedule a thread block from the next kernel onto the same SM as a thread block from the previous kernel, then the assumption is that there are enough resources (SMEM...) to avoid conflicts.

Given that context, when would this technique actually be used? Or was the example intended more as a proof of concept than as a directly applicable pattern?

May 31 '25 22:05 NihalPotdar

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programmatic-dependent-launch-and-synchronization

Jun 02 '25 13:06 mnicely
