cutlass Avoid LDGSTS routing by changing default copy to be universalcopy

https://github.com/NVIDIA/cutlass/issues/1672

This PR changes the default copy to be UniversalCopy so the LDGSTS instruction is avoided, and downstream users will need to specify the copy type if they want to use it which is more intuitive.

Note: I imagine since all of the tests and examples that are configured to use DefaultCopy will need to now transition over to AutoVectorizingCopyWithAssumedAlignment<128> as that was the previous copy type. This way we can preserve behavior across benchmarks and tests. Before I make a bunch of changes across files though I'd like to get feedback now incase I'm missing anything.

Aug 01 '24 18:08 ZelboK

cc @thakkarV

Any ideas on other places of CUTLASS that might need help? Perhaps any Ampere related issues?

Aug 01 '24 18:08 ZelboK

@ccecka I like this change. wdyt?

Aug 01 '24 20:08 thakkarV

how about removing select_elementwise_copy and just using UniversalCopy<SrcType,DstType>{} by default to avoid LDGSTS?

Aug 02 '24 01:08 mammoth831

Or (un)setting CUTE_ARCH_CP_ASYNC_SM80_ENABLED so that select_elementwise_copy always chooses UniversalCopy.

All of these changes look very dangerous to me.

Aug 02 '24 02:08 ccecka

Or (un)setting CUTE_ARCH_CP_ASYNC_SM80_ENABLED so that select_elementwise_copy always chooses UniversalCopy.

The problem as I understood it was that logically it's a bit unintuitive that DefaultCopy dispatches to this instruction because the user doesn't actually know that it's doing this unless they inspect under the hood. Wouldn't unsetting this be equivalent to the user just specifying UniversalCopy from the get go? They have to know what's going on in the first place, and it doesn't prevent them from having to discover they need more synchronizations.

All of these changes look very dangerous to me.

I'm probably missing some context here. Could you give an example situation of where this change could go wrong? (Might be a stupid question 😅 )

Aug 02 '24 02:08 ZelboK

how about removing select_elementwise_copy and just using UniversalCopy<SrcType,DstType>{} by default to avoid LDGSTS?

I think this could work too, actually, IIUC if downstream code specifies that it wants to do a specific async cp it'll go through a different codepath from this anyway. Also won't need to change the default type so less code changes overall.

Aug 02 '24 02:08 ZelboK

@ccecka and I had a conversation about this internally, and have some thoughts.

we likely do not want to disable this auto LDGSTS. This is because it is only enabled when SM80 is set
when the source tensor is tagged to have a gmem pointer
when the destination tensor is tagged to have an smem pointer

As for the MR itself, you are replacing a copy with a universal copy with assumed alignment of 128b. This is very dangerous because not all tensors may have that alignment.

We are working on a nicer fix of this internally that entails adding a simple assignment copy type that is only used when no atom is specified.

Aug 07 '24 01:08 thakkarV

As for the MR itself, you are replacing a copy with a universal copy with assumed alignment of 128b. This is very dangerous because not all tensors may have that alignment.

I actually made a mistake I believe and I meant to do AutoVectorizingCopyWithAssumedAlignment<8>; and not 128... That is my bad. With that being said I'm curious to know if I'm missing any context - the DefaultCopy is set to AutoVectorizingCopyWithAssumedAlignment<8>;, right? If all DefaultCopys in the codebase were replaced with this explicit type why would it be dangerous? Or was that only the case because in the code I mistakenly put 128 bits?

Aug 07 '24 17:08 ZelboK

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

Sep 06 '24 18:09 github-actions[bot]

This PR has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates.

Dec 05 '24 18:12 github-actions[bot]