Fix TMEM address read/write race in example 77
The MMA warp category's TMEM address write to shared memory is made visible to the epilogue/correction warp categories implicitly due to intermediate barriers between the warps which synchronize the MMA output. But in the 0 KV tile case, there is no such barrier with the epilogue so the write might not be visible. In that case the epilogue warp can read undefined data from shared memory for the TMEM address and the tcgen05::dealloc may fail.
This patch attempts to fix this issue by only allocating the TMEM when there is at least 1 KV tile by a persistent CTA. If there is at least one KV tile, the TMEM address should be correctly synchronized, even if the final work tile processed has 0 KV.
Another potential solution is to perform the dealloc in the MMA warp category and add extra synchronization from the TMEM consumers signaling that their usage is complete.
P.S. PTX ISA Manual states that
When .cta_group::1 is specified, one warp from the CTA must perform the allocation and de-allocation.
which is ambiguously phrased in my opinion, as this could either mean that the same warp must perform the alloc and dealloc (which is not the case here), or different single warps can perform alloc and dealloc for a given TMEM allocation.
@hwu36 , @IonThruster , @v0i0 , @richardmcai , Should compute-sanitizer be able to catch such issues? Does CUTLASS CI runs compute-sanitizer to rule out such issues?
Iirc synccheck can find this (due to the smem read / write) thats how I recently found something like this