[QST] How to use TMA to store strided subtiles?
What is your question?
Suppose we are computing a 4x4 result tensor but only have a 2x4 smem buffer.
When the results come from different warpgroups, e.g. the first two rows of the results from warpgroup0 and the last two rows from warpgroup1, do we have to store them in two passes?
In the first pass, each warpgroup writes half of its results to smem and its first thread issues a 1x4 TMA store; in the second pass, each warpgroup writes the other half and again issues a 1x4 TMA store.
// 4x4 gtensor
      0    1    2    3
   +----+----+----+----+
 0 |  0 |  1 |  2 |  3 |
   +----+----+----+----+
 1 |  4 |  5 |  6 |  7 |
   +----+----+----+----+
 2 |  8 |  9 | 10 | 11 |
   +----+----+----+----+
 3 | 12 | 13 | 14 | 15 |
   +----+----+----+----+
// 2x4 stensor - first pass
      0    1    2    3
   +----+----+----+----+
 0 |  0 |  1 |  2 |  3 | --> write to the 1st row of the gtensor
   +----+----+----+----+
 1 |  8 |  9 | 10 | 11 | --> write to the 3rd row of the gtensor
   +----+----+----+----+
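And the second pass, filling in the remaining rows:
// 2x4 stensor - second pass
      0    1    2    3
   +----+----+----+----+
 0 |  4 |  5 |  6 |  7 | --> write to the 2nd row of the gtensor
   +----+----+----+----+
 1 | 12 | 13 | 14 | 15 | --> write to the 4th row of the gtensor
   +----+----+----+----+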
I tried to construct a suitable TMA copy, but there seems to be no API that fits this pattern. How should this situation be handled with TMA?
#include <cute/tensor.hpp>
#include <cute/atom/copy_atom.hpp>
#include <vector>
using namespace cute;

int main() {
  using CopyOp = SM90_TMA_STORE;
  std::vector<float> d(16); // placeholder; real TMA usage needs device memory
  auto gtensor = make_tensor(make_gmem_ptr(d.data()),
                             make_layout(make_shape(4, 4), make_stride(4, _1{}))); // 4x4 results
  auto slayout = make_layout(make_shape(_1{}, _4{}), GenRowMajor{});               // 1x4 tma tile
  auto stensor = make_tensor(d.data(),
                             make_layout(make_shape(_2{}, _4{}), GenRowMajor{}));  // stands in for the 2x4 smem
  // Let cluster size = 2 to extend the tma thread layout, but that parameter is
  // meant for multicast and leads to wrong slices and offsets when partitioning...
  auto tma = make_tma_copy(CopyOp{}, gtensor, slayout, Int<2>{});

  // How to get a tma tiled copy that can partition stensor and gtensor and do
  // the copy like this (device-side pseudocode; a TMA store takes no mbarrier)?
  auto slice_tma  = tma.get_slice(threadIdx.x / 128);
  auto tma_tensor = tma.get_tma_tensor(shape(gtensor));
  copy(tma, slice_tma.partition_S(stensor)(_, 0), slice_tma.partition_D(tma_tensor)(_, 0));
  copy(tma, slice_tma.partition_S(stensor)(_, 1), slice_tma.partition_D(tma_tensor)(_, 1));
}
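One possible way to express this without the cluster-size trick (a minimal sketch under assumptions, not a confirmed recommendation from this thread): build a single TMA over the whole 4x4 gtensor with a 1x4 box and cluster size 1. partition_D of the coordinate tensor then exposes all four 1x4 row boxes, and each warpgroup's first thread picks the box it targets on each pass. The kernel name store_kernel, the 128-thread warpgroup split, and the row mapping 2*wg + p are illustrative assumptions.

#include <cute/tensor.hpp>
#include <cute/atom/copy_atom.hpp>
using namespace cute;

// Device side: each warpgroup reuses its smem row across two passes and
// steers each 1x4 TMA store to a different gmem row by indexing the
// partitioned coordinate tensor.
template <class TmaStore, class GShape>
__global__ void store_kernel(const __grid_constant__ TmaStore tma, GShape gshape) {
  __shared__ float smem[2 * 4];
  auto stensor = make_tensor(make_smem_ptr(smem),
                             make_layout(make_shape(_2{}, _4{}), GenRowMajor{}));

  Tensor mD   = tma.get_tma_tensor(gshape);   // (4,4) coordinate tensor
  auto   cta  = tma.get_slice(Int<0>{});      // single slice: no multicast
  Tensor tDgD = cta.partition_D(mD);          // (TMA,4,1): the four 1x4 row boxes
  Tensor tDsD = cta.partition_S(stensor);     // (TMA,2,1): the two smem rows

  int wg = threadIdx.x / 128;                 // warpgroup id: 0 or 1

  for (int p = 0; p < 2; ++p) {               // two passes through the 2x4 smem
    // ... each warpgroup writes its next row of results into smem row `wg` ...
    tma_store_fence();                        // make smem writes visible to TMA
    __syncthreads();
    if (threadIdx.x % 128 == 0) {             // first thread of each warpgroup
      // pass p: smem row `wg` goes to gmem row 2*wg + p; no mbarrier is used,
      // TMA stores complete via commit/wait instead
      copy(tma, tDsD(_, wg, 0), tDgD(_, 2 * wg + p, 0));
      tma_store_arrive();
    }
    tma_store_wait<0>();                      // no-op for non-issuing threads
    __syncthreads();                          // smem rows are now safe to reuse
  }
}

// Host side: one TMA descriptor over the whole 4x4 gtensor with a 1x4 box.
void launch(float* dptr) {                    // dptr: device memory for 4x4 results
  auto gtensor = make_tensor(make_gmem_ptr(dptr),
                             make_layout(make_shape(4, 4), make_stride(4, _1{})));
  auto slayout = make_layout(make_shape(_1{}, _4{}), GenRowMajor{});
  auto tma     = make_tma_copy(SM90_TMA_STORE{}, gtensor, slayout); // cluster size 1
  store_kernel<<<1, 256>>>(tma, shape(gtensor));
}

The key point is that one TMA descriptor can be issued several times with different coordinates, so no per-subtile descriptor and no multicast machinery are needed; completion is tracked with tma_store_arrive()/tma_store_wait<>() rather than an mbarrier.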