[BUG][Inductor-EVT] Python EVT tracer generates incorrect code when assigning accumulator to output D

Open mlazos opened this issue 8 months ago • 3 comments

Describe the bug

When the accum argument is assigned to D, invalid code is generated: the alias EVTD is used but never defined (line 18 in the output code). This case should either raise a hard error or generate the definition of the type alias.

Steps/Code to reproduce bug

script output code

Expected behavior

Generate buildable code

Environment details (please complete the following information): Meta devgpu, although this should repro on any other machine.

To work around this, I ensure that accum is not assigned to D (swapping D and E in the example yields valid code).

cc @thakkarV, @mnicely, @henrylhtsang, @eellison

mlazos avatar May 12 '25 21:05 mlazos

@apuaaChen, @jackkosaian , could you please take a look?

hwu36 avatar May 13 '25 02:05 hwu36

I've done more digging. It looks like this always happens if D is used anywhere other than as the output (EVTD is not generated).

mlazos avatar May 16 '25 22:05 mlazos

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Jun 15 '25 23:06 github-actions[bot]

Hi @mlazos, I think that's expected for sm90. "C" and "D" are hardcoded in the epilogue, and "D" should always take the output of EVT. This design enables smem reuse between C and D under certain conditions, despite the inconvenience you found.

The following pattern works:

def evt_direct_store(accum):
    F = accum
    D = F + 1
    return D, F

Note that sm80 doesn't have this restriction as all load/stores are generated by EVT.

@jackkosaian @hwu36 for viz

apuaaChen avatar Jun 16 '25 21:06 apuaaChen
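
For readers following along, the renaming can be sketched in plain Python. The original reproducer script is not included in this thread, so `epilogue_d_reused` below is a hypothetical reconstruction of the failing shape (accum assigned to D, with D then feeding another op), and `epilogue_d_final` is the working pattern from the comment above. Both are ordinary functions run on floats; nothing here calls CUTLASS or the EVT tracer.

```python
def epilogue_d_reused(accum):
    # Hypothetical failing shape: D aliases accum and is read by a later op.
    D = accum
    E = D + 1
    return D, E


def epilogue_d_final(accum):
    # Pattern from the thread: the raw accumulator gets its own name (F),
    # and D is only ever the final result of the tree.
    F = accum
    D = F + 1
    return D, F


# Both versions produce the same two values; only the output names differ.
print(sorted(epilogue_d_reused(2.0)))
print(sorted(epilogue_d_final(2.0)))
```

The point of the sketch is only that the rewrite is a renaming of outputs: no value is lost, but the caller has to read the raw accumulator from F rather than from D.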

To be clear, you're saying D needs to be the final result of the tree, right?

If I use C in this, how does that work? Are there restrictions on the ops I can perform on C?

Thanks for the help so far! These are open questions I've been wondering about for a while.

mlazos avatar Jun 18 '25 00:06 mlazos

Hi @mlazos, there are no restrictions on C as far as I remember.

Btw, the 4.1 release adds verification that D is the final result of the tree. Here is a unit test tracking that: https://github.com/NVIDIA/cutlass/blob/main/test/python/cutlass/evt/evt_store_sm80_90.py#L53

apuaaChen avatar Jul 18 '25 20:07 apuaaChen
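
The rule discussed here (D must be the final result, so no other op may read it) can be approximated with a small standalone lint before tracing. The sketch below is an assumption-laden illustration, not the verification added in 4.1: `reads_of_d` is a hypothetical helper that uses only the standard-library `ast` module, takes the epilogue source as a string, and flags any assignment whose right-hand side reads D.

```python
import ast


def reads_of_d(source: str) -> list[str]:
    """Hypothetical lint (not CUTLASS code): report assignments that read D."""
    func = ast.parse(source).body[0]  # assumes the source is a single function
    problems = []
    for stmt in func.body:
        if not isinstance(stmt, ast.Assign):
            continue
        targets = [t.id for t in stmt.targets if isinstance(t, ast.Name)]
        # Every name that appears on the right-hand side of this assignment.
        reads = {n.id for n in ast.walk(stmt.value) if isinstance(n, ast.Name)}
        if "D" in reads:
            problems.append(f"D is read when computing {', '.join(targets)}")
    return problems


# Hypothetical failing shape: D feeds a later op.
BAD = """
def evt_bad(accum):
    D = accum
    E = D + 1
    return D, E
"""

# Working pattern from earlier in the thread.
GOOD = """
def evt_direct_store(accum):
    F = accum
    D = F + 1
    return D, F
"""

print(reads_of_d(BAD))
print(reads_of_d(GOOD))
```

This only inspects top-level assignments by name, so it would miss reads of D inside nested calls spread across helper functions; it is meant as a quick pre-check, with the tracer's own verification remaining the authority.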

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar Oct 17 '25 16:10 github-actions[bot]