Avoid LDGSTS routing by changing default copy to be universalcopy
https://github.com/NVIDIA/cutlass/issues/1672
This PR changes the default copy to be UniversalCopy so the LDGSTS instruction is avoided, and downstream users will need to specify the copy type if they want to use it which is more intuitive.
Note: I imagine since all of the tests and examples that are configured to use DefaultCopy will need to now transition over to AutoVectorizingCopyWithAssumedAlignment<128> as that was the previous copy type. This way we can preserve behavior across benchmarks and tests. Before I make a bunch of changes across files though I'd like to get feedback now incase I'm missing anything.
cc @thakkarV
Any ideas on other places of CUTLASS that might need help? Perhaps any Ampere related issues?
@ccecka I like this change. wdyt?
how about removing select_elementwise_copy and just using UniversalCopy<SrcType,DstType>{} by default to avoid LDGSTS?
Or (un)setting CUTE_ARCH_CP_ASYNC_SM80_ENABLED so that select_elementwise_copy always chooses UniversalCopy.
All of these changes look very dangerous to me.
Or (un)setting
CUTE_ARCH_CP_ASYNC_SM80_ENABLEDso thatselect_elementwise_copyalways choosesUniversalCopy.
The problem as I understood it was that logically it's a bit unintuitive that DefaultCopy dispatches to this instruction because the user doesn't actually know that it's doing this unless they inspect under the hood. Wouldn't unsetting this be equivalent to the user just specifying UniversalCopy from the get go? They have to know what's going on in the first place, and it doesn't prevent them from having to discover they need more synchronizations.
All of these changes look very dangerous to me.
I'm probably missing some context here. Could you give an example situation of where this change could go wrong? (Might be a stupid question 😅 )
how about removing
select_elementwise_copyand just usingUniversalCopy<SrcType,DstType>{}by default to avoid LDGSTS?
I think this could work too, actually, IIUC if downstream code specifies that it wants to do a specific async cp it'll go through a different codepath from this anyway. Also won't need to change the default type so less code changes overall.
@ccecka and I had a conversation about this internally, and have some thoughts.
- we likely do not want to disable this auto LDGSTS. This is because it is only enabled when SM80 is set
- when the source tensor is tagged to have a gmem pointer
- when the destination tensor is tagged to have an smem pointer
As for the MR itself, you are replacing a copy with a universal copy with assumed alignment of 128b. This is very dangerous because not all tensors may have that alignment.
We are working on a nicer fix of this internally that entails adding a simple assignment copy type that is only used when no atom is specified.
As for the MR itself, you are replacing a copy with a universal copy with assumed alignment of 128b. This is very dangerous because not all tensors may have that alignment.
I actually made a mistake I believe and I meant to do AutoVectorizingCopyWithAssumedAlignment<8>; and not 128... That is my bad. With that being said I'm curious to know if I'm missing any context - the DefaultCopy is set to AutoVectorizingCopyWithAssumedAlignment<8>;, right? If all DefaultCopys in the codebase were replaced with this explicit type why would it be dangerous? Or was that only the case because in the code I mistakenly put 128 bits?
This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.
This PR has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates.