mammoth831
> Yes, you can replace these lines with `fast_divmod`.

Hi, the `p_` and `q_` are computed from `start_r` and `start_s` (`start_r`->`start_h_`->`p_`), while `start_r` and `start_s` are decided by the thread...
cc @hwu36 @ccecka Could you please help to check this issue?
Just try:

```c++
#include
using namespace cute;

int main() {
  auto a = Layout{};
  print_layout(a);
  auto b = composition(Swizzle{}, a);
  print_layout(b);
}
```

```
(_4,_4):(_4,_1) 0 1 2 3 +----+----+----+----+...
```
If you treat it as an 8-bank SRAM and each swizzling chunk size is 2, then the swizzle pattern you want is a common case for the generic swizzle functor....
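To make the idea of a generic swizzle functor concrete, here is a minimal host-side sketch in plain C++ that does not depend on CuTe. It assumes the usual three-parameter form (`B` bits to XOR, `M` low bits left untouched, `S` shift distance), and the helper name `swizzle` as well as the parameters `B=2, M=1, S=2` are illustrative assumptions for an 8-wide row with 2-element chunks, not necessarily the pattern asked about above.

```c++
#include <cassert>
#include <cstdint>

// Hypothetical helper: XOR `B` bits taken from position [M+S, M+S+B)
// into the bits at position [M, M+B). The lowest `M` bits are untouched,
// so elements inside one chunk of 2^M stay contiguous.
inline std::uint32_t swizzle(std::uint32_t offset, int B, int M, int S) {
  assert(B >= 0 && M >= 0 && S >= 0);
  const std::uint32_t bit_mask = (1u << B) - 1u;
  const std::uint32_t src_mask = bit_mask << (M + S);  // bits that select the XOR pattern
  return offset ^ ((offset & src_mask) >> S);
}

// Bank of an element, assuming one element per bank and `num_banks` banks.
inline std::uint32_t bank_of(std::uint32_t offset, std::uint32_t num_banks) {
  return offset % num_banks;
}
```

With the assumed parameters, column 0 of a row-major 4x8 tile sits at offsets 0, 8, 16, 24, which all fall in bank 0 before swizzling. After `swizzle(offset, 2, 1, 2)` the same elements land at offsets 0, 10, 20, 30, that is, banks 0, 2, 4, 6, so the four accesses no longer conflict.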
How about removing `select_elementwise_copy` and just using `UniversalCopy{}` by default to avoid LDGSTS?