Jin Wang

11 comments of Jin Wang

@lw921014 could you please post the threadblock and warp tile sizes? In case you haven't tried it yet, please sanity-check that the following requirements are met: problem_N = threadblock_N...

```
problem0_N = threadblock0_N = warp0_N
problem1_N = threadblock1_N = warp1_N
```

problem0_N doesn't have to be the same as problem1_N; it is only required that problem0_N = problem1_K.
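For illustration, here is a minimal sketch of those constraints as compile-time and runtime checks, assuming CUTLASS 2.x's `GemmShape`/`GemmCoord` types; the tile sizes are placeholders chosen for the example, not values from this thread:

```cpp
#include "cutlass/gemm_coord.h"
#include "cutlass/gemm/gemm.h"

// Placeholder tile sizes, for illustration only.
using ThreadblockShape0 = cutlass::gemm::GemmShape<64, 64, 32>;
using WarpShape0        = cutlass::gemm::GemmShape<32, 64, 32>;
using ThreadblockShape1 = cutlass::gemm::GemmShape<64, 128, 32>;
using WarpShape1        = cutlass::gemm::GemmShape<32, 128, 32>;

// Tile-level constraints can be checked at compile time.
static_assert(ThreadblockShape0::kN == WarpShape0::kN,
              "threadblock0_N must equal warp0_N");
static_assert(ThreadblockShape1::kN == WarpShape1::kN,
              "threadblock1_N must equal warp1_N");

// Problem-size constraints are runtime checks.
bool b2b_shapes_ok(cutlass::gemm::GemmCoord problem0,
                   cutlass::gemm::GemmCoord problem1) {
  return problem0.n() == ThreadblockShape0::kN   // problem0_N = threadblock0_N
      && problem1.n() == ThreadblockShape1::kN   // problem1_N = threadblock1_N
      && problem0.n() == problem1.k();           // GEMM0's output feeds GEMM1's K
}
```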

You'll need the same number of warps for each GEMM. In your example above, you use 4 warps for the 1st GEMM but only 2 warps for the 2nd GEMM.
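As a sketch of where the warp count comes from (same placeholder shapes as above; this mirrors how CUTLASS derives the per-threadblock warp count from the tile shapes):

```cpp
#include "cutlass/gemm/gemm.h"

using ThreadblockShape0 = cutlass::gemm::GemmShape<64, 64, 32>;
using WarpShape0        = cutlass::gemm::GemmShape<32, 64, 32>;
using ThreadblockShape1 = cutlass::gemm::GemmShape<64, 128, 32>;
using WarpShape1        = cutlass::gemm::GemmShape<32, 128, 32>;

// Warps per threadblock = number of warp tiles covering the threadblock tile.
constexpr int kWarpCount0 = (ThreadblockShape0::kM / WarpShape0::kM) *
                            (ThreadblockShape0::kN / WarpShape0::kN) *
                            (ThreadblockShape0::kK / WarpShape0::kK);  // 2*1*1 = 2

constexpr int kWarpCount1 = (ThreadblockShape1::kM / WarpShape1::kM) *
                            (ThreadblockShape1::kN / WarpShape1::kN) *
                            (ThreadblockShape1::kK / WarpShape1::kK);  // 2*1*1 = 2

// Both GEMMs run inside one fused kernel launch, so the counts must agree.
static_assert(kWarpCount0 == kWarpCount1,
              "both GEMMs must use the same number of warps");
```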

Your code snippet shows the function run_nonfused_gemm_f16(). Did you also change the tile sizes for run_fused_gemm_f16()?

Here is a combination that works for me:

```cpp
using ThreadblockShape0 = cutlass::gemm::GemmShape<...>;
using WarpShape0        = cutlass::gemm::GemmShape<...>;
using ThreadblockShape1 = cutlass::gemm::GemmShape<...>;
using WarpShape1        = cutlass::gemm::GemmShape<...>;
using InstructionShape  = cutlass::gemm::GemmShape<...>;
```

Note that...

> warp_N must be the same as problem_N

It is true for this example, since we need the input A matrix of the 2nd GEMM to be RF-resident, and each warp computes...

Along the same lines, you may hit the shared memory size limit for large threadblock_N. On Turing, the shared memory size is 64KB. Using threadblock size (64,256,32) as an example, you'll need...
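A back-of-the-envelope sketch of that estimate, assuming an f16 mainloop that double-buffers the A and B tiles (2 stages); CUTLASS's exact shared-memory layout differs slightly:

```cpp
#include <cstdio>

int main() {
  constexpr int kM = 64, kN = 256, kK = 32;  // threadblock tile (64,256,32)
  constexpr int kStages = 2;                 // assumed double-buffered pipeline
  constexpr int kBytesPerElement = 2;        // f16
  constexpr int kSmemA = kM * kK * kBytesPerElement;  // A tile per stage: 4 KB
  constexpr int kSmemB = kK * kN * kBytesPerElement;  // B tile per stage: 16 KB
  constexpr int kSmemMainloop = (kSmemA + kSmemB) * kStages;  // 40 KB total
  std::printf("mainloop smem: %d bytes of the 64KB Turing budget\n",
              kSmemMainloop);
  // The second GEMM's tiles and the epilogue add on top of this, which is
  // how a large threadblock_N can run past the 64KB limit.
  return 0;
}
```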

> I find in this [line](https://github.com/NVIDIA/cutlass/blob/master/examples/13_two_tensor_op_fusion/b2b_gemm_f16t_f16n_f16t_tensor_op_f16_sm75.h#L145), EpilogueOutputOp0 is set to `cutlass::epilogue::thread::ScaleType::OnlyAlphaScaling`; does this ignore beta?

Yes, `cutlass::epilogue::thread::ScaleType::OnlyAlphaScaling` only applies to an epilogue with beta=0. If you have both alpha and beta,...
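A hedged sketch of the two scale modes, with template arguments following CUTLASS 2.x's `LinearCombinationRelu` (verify against your version's headers):

```cpp
#include "cutlass/numeric_types.h"
#include "cutlass/epilogue/thread/linear_combination_relu.h"

using ElementOutput  = cutlass::half_t;
using ElementCompute = cutlass::half_t;
constexpr int kElementsPerAccess =
    128 / cutlass::sizeof_bits<ElementOutput>::value;  // 8 for f16

// beta is ignored: D = relu(alpha * accumulator)
using EpilogueOutputOpAlphaOnly =
    cutlass::epilogue::thread::LinearCombinationRelu<
        ElementOutput, kElementsPerAccess, ElementOutput, ElementCompute,
        cutlass::epilogue::thread::ScaleType::OnlyAlphaScaling>;

// beta is applied: D = relu(alpha * accumulator + beta * C)
using EpilogueOutputOpWithBeta =
    cutlass::epilogue::thread::LinearCombinationRelu<
        ElementOutput, kElementsPerAccess, ElementOutput, ElementCompute,
        cutlass::epilogue::thread::ScaleType::Default>;
```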

I'm working on a solution to add residual support for the warp-tile iterator (either `WarpIteratorA1` for shmem-resident fusion or `FragmentIteratorA1` for RF-resident fusion). The vector iterator for bias/scaling is also required...