Fix MMA promotion interval assertions
For BLOCK_SIZE_K=256, GmmaFP8Accumulation has accum_promotion_interval=4 but mma_count_per_mainloop_iteration=8, so a non-FP8-fast-accum kernel never promotes to the FP32 accumulators. This PR fixes the incorrect assertion by replacing the 4 with the actual number of MMA instructions issued.
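To illustrate why the mismatch matters, here is a minimal host-side sketch in Python. It is not the CUTLASS implementation. It assumes the promotion check is an equality comparison between a running MMA counter and the interval, which would be consistent with the "never promote" behavior described above. The function name `count_promotions` and the `k_tiles` parameter are hypothetical; only the values 4 and 8 come from this PR.

```python
def count_promotions(k_tiles, mma_count_per_mainloop_iteration, accum_promotion_interval):
    """Count mid-loop promotions under an ASSUMED equality-based check.

    Each mainloop iteration adds mma_count_per_mainloop_iteration to a
    running counter; a promotion fires only when the counter equals
    accum_promotion_interval exactly, after which the counter resets.
    """
    mma_count = 0
    promotions = 0
    for _ in range(k_tiles):
        mma_count += mma_count_per_mainloop_iteration
        if mma_count == accum_promotion_interval:  # assumed check, not taken from the source
            promotions += 1
            mma_count = 0
    return promotions


# Values from the PR description: the counter goes 8, 16, 24, ... and never equals 4.
mismatched = count_promotions(k_tiles=16, mma_count_per_mainloop_iteration=8,
                              accum_promotion_interval=4)

# With the interval matching the MMAs actually issued per iteration,
# every iteration triggers a promotion.
matched = count_promotions(k_tiles=16, mma_count_per_mainloop_iteration=8,
                           accum_promotion_interval=8)
```

Under this assumption, the counter steps straight past the interval whenever the per-iteration MMA count exceeds it, so the mismatched configuration records no mid-loop promotions at all.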
Could anyone reply to this? I do think it's a serious bug: with BLOCK_SIZE_K=256, the FP8 training loss curve was much worse than it should be with non-FP8-fast-accum.
@IonThruster
This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.
@yzhaiustc can we please put it on the list for 3.6?
sure.
@manishucsd