[BUG][Inductor-EVT] Python EVT tracer generates incorrect code when assigning accumulator to output D
Describe the bug
When the accumulator argument (accum) is assigned to D, the tracer generates invalid code: the type alias EVTD is used but never defined (line 18 in the output code). This case should either raise a hard error or the tracer should generate the definition of the type alias.
Steps/Code to reproduce bug
Expected behavior
Generate buildable code.
Environment details
Meta devgpu, although this should reproduce on any other machine.
To work around this, I ensure that accum is not assigned to D (swapping D and E in the example yields valid code).
cc @thakkarV, @mnicely, @henrylhtsang, @eellison
@apuaaChen, @jackkosaian , could you please take a look?
I've done more digging: this appears to happen whenever D is used anywhere other than as the output (EVTD is not generated).
Hi @mlazos, I think that's expected for sm90. "C" and "D" are hardcoded in the epilogue, and "D" should always take the output of EVT. This design enables smem reuse between C and D under certain conditions, despite the inconvenience you found.
The following pattern works:
def evt_direct_store(accum):
    F = accum
    D = F + 1
    return D, F
Note that sm80 doesn't have this restriction as all load/stores are generated by EVT.
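To make the restriction concrete, the two shapes can be compared as ordinary Python functions. This is only a sketch under stated assumptions: `evt_accum_to_d` is a hypothetical name for the form that triggers the problem (the original reproducer is not included in this thread), and calling these functions on plain floats says nothing about the code the tracer emits. It only shows that both forms produce the same pair of values, with the roles of D and the auxiliary output swapped.

```python
def evt_accum_to_d(accum):
    # Hypothetical problematic form on sm90: D is bound directly to the
    # accumulator and then read again to compute another output.
    D = accum
    F = D + 1
    return D, F


def evt_direct_store(accum):
    # Working form from this thread: the accumulator goes to the auxiliary
    # output F, and D is the final result of the tree.
    F = accum
    D = F + 1
    return D, F


if __name__ == "__main__":
    # Same two values in both cases; only the output names are swapped.
    print(evt_accum_to_d(2.0))
    print(evt_direct_store(2.0))
```

In other words, the workaround amounts to renaming: whichever tensor is the last node of the tree is called D, and the caller reads the raw accumulator from the auxiliary output instead.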
@jackkosaian @hwu36 for viz
To be clear, you're saying D needs to be the final result of the tree, right?
If I use C in this, how does that work? Are there restrictions on the ops I can perform on C?
Thanks for the help so far! These are open questions I've been wondering about for a while.
Hi @mlazos, there are no restrictions on C as far as I remember.
By the way, the 4.1 release adds verification that D is the final result of the tree. Here is a unit test tracking that: https://github.com/NVIDIA/cutlass/blob/main/test/python/cutlass/evt/evt_store_sm80_90.py#L53
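For anyone on a release without that verification, the rule can be approximated with a small source-level check before handing an epilogue function to the tracer. The sketch below is not the CUTLASS implementation: `find_d_problems` is a hypothetical helper built on Python's standard `ast` module, and the two conditions it flags (D read outside the return statement, D bound directly to the accumulator) are assumptions drawn from the symptoms described in this thread.

```python
import ast
import textwrap


def find_d_problems(source, accum_name="accum"):
    """Return a list of messages; an empty list means no problem was found."""
    tree = ast.parse(textwrap.dedent(source))
    problems = []

    # Reads of D inside a return statement are the intended use.
    allowed = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Return):
            for child in ast.walk(node):
                if isinstance(child, ast.Name) and child.id == "D":
                    allowed.add(id(child))

    for node in ast.walk(tree):
        # Any other read of D means D feeds another node of the tree.
        if (isinstance(node, ast.Name) and node.id == "D"
                and isinstance(node.ctx, ast.Load)
                and id(node) not in allowed):
            problems.append(f"line {node.lineno}: D is read outside the return")
        # D = accum binds D to the accumulator without any EVT node.
        if isinstance(node, ast.Assign):
            targets_d = any(isinstance(t, ast.Name) and t.id == "D"
                            for t in node.targets)
            if (targets_d and isinstance(node.value, ast.Name)
                    and node.value.id == accum_name):
                problems.append(
                    f"line {node.lineno}: D is assigned directly from {accum_name}")
    return problems


GOOD = """
def evt_direct_store(accum):
    F = accum
    D = F + 1
    return D, F
"""

# Hypothetical example of the problematic shape.
BAD = """
def evt_accum_to_d(accum):
    D = accum
    F = D + 1
    return D, F
"""

if __name__ == "__main__":
    print(find_d_problems(GOOD))
    print(find_d_problems(BAD))
```

The check works on source text rather than on a traced graph, so it is deliberately conservative and would miss aliases such as `X = D` followed by reads of `X` further down only in the sense that it reports the first read of D and nothing after it.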