Tri Dao
Hi @ptillet and the Triton team! Super happy that FlashAttention is now implemented in [Triton](https://github.com/openai/triton/blob/master/python/tutorials/06-fused-attention.py) 🚀. However, it seems to give wrong output when head dim != 64. For example,...
Hi, VkFFT says they support zero padding: "Native zero padding to model open systems (up to 2x faster than simply padding input array with zeros). Can specify the range of...
This GMMA shape is being used in the FA3 backward pass for head dim 256 (tile size 64 x 80, split into 2 warpgroups). cc @thakkarV
```
using namespace cute;
auto tmem_layout = make_layout(make_shape(_128{}, _160{}), make_stride(Int{}, _1{}));
Tensor A = make_tensor(make_tmem_ptr(0), tmem_layout);
auto load = make_tmem_copy(SM100_TMEM_LOAD_32dp32b1x{}, A);
```
This fails to compile, with shape_div error. This...
The compiler hangs on this code, where we call copy with the wrong copy atom / predicates. The compiler should still error out instead of hanging. **Steps/Code to reproduce bug**...
Wrap nvvm.barrier_arrive. This function is useful in FA3 for inter-warpgroup overlap.
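As a rough illustration of what such a wrapper maps to at the hardware level, here is a minimal CUDA sketch of the named-barrier arrive/sync pattern used for inter-warpgroup overlap, written with inline PTX (`bar.arrive` / `bar.sync`). The helper and kernel names are made up for this sketch and are not part of any existing API.

```cuda
#include <cstdint>

// Hypothetical helpers showing the PTX an nvvm.barrier_arrive wrapper would
// ultimately lower to: a non-blocking arrival on a named barrier, paired with
// a blocking bar.sync on the other side.
__device__ __forceinline__ void named_barrier_arrive(uint32_t barrier_id, uint32_t num_threads) {
    // bar.arrive: signal arrival on barrier `barrier_id` without waiting.
    asm volatile("bar.arrive %0, %1;\n" :: "r"(barrier_id), "r"(num_threads));
}

__device__ __forceinline__ void named_barrier_sync(uint32_t barrier_id, uint32_t num_threads) {
    // bar.sync: block until `num_threads` threads have arrived on barrier `barrier_id`.
    asm volatile("bar.sync %0, %1;\n" :: "r"(barrier_id), "r"(num_threads));
}

// Illustrative overlap pattern between two warpgroups (assumes blockDim.x == 256):
// the producer warpgroup arrives without blocking and keeps doing independent work,
// while the consumer warpgroup waits for it.
__global__ void overlap_example() {
    uint32_t warpgroup = threadIdx.x / 128;
    constexpr uint32_t kBarrierId = 1;    // avoid slot 0, which __syncthreads() uses
    constexpr uint32_t kThreads   = 256;  // total threads participating in the barrier

    if (warpgroup == 0) {
        // ... produce data for the other warpgroup ...
        named_barrier_arrive(kBarrierId, kThreads);  // signal and keep going
        // ... continue with independent work, overlapping with the consumer ...
    } else {
        named_barrier_sync(kBarrierId, kThreads);    // wait for the producer
        // ... consume the data ...
    }
}
```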
I'm trying to use Cute-DSL on GB200 but I'm unable to install it as the wheel doesn't work on arm64:
```
nvidia_cutlass_dsl-4.0.0.dev1-cp312-cp312-manylinux_2_28_x86_64.whl is not a supported wheel on this platform.
```
...
**Describe the bug** If the loop bounds are dynamic and the step size is negative, the for loop is compiled incorrectly. In the example below, it does not enter the for loop at all....
**Is your feature request related to a problem? Please describe.** I'm planning to use these warp vote instructions to coordinate threads in the same warp (see the sketch below). **Describe the solution you'd like** Can...
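For context, the coordination pattern in question is what CUDA already exposes through its warp vote intrinsics (`__any_sync`, `__all_sync`, `__ballot_sync`). Below is a small, purely illustrative CUDA sketch of that pattern; the kernel, its arguments, and its logic are invented for this example.

```cuda
#include <cstdint>

// Illustrative use of warp vote intrinsics to coordinate lanes within a warp.
// Assumes blockDim.x is a multiple of 32 so every lane of each warp participates.
__global__ void warp_vote_example(const int* flags, int* warp_counts, int n) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;
    bool has_work = (tid < n) && (flags[tid] != 0);

    unsigned mask = 0xFFFFFFFFu;

    // Every lane learns whether any lane in the warp has work to do.
    if (__any_sync(mask, has_work)) {
        // Ballot returns a per-lane bitmask of who voted true; lane 0 uses it
        // here to count the active lanes for this warp.
        unsigned ballot = __ballot_sync(mask, has_work);
        if (lane == 0) {
            int warp_id = tid / 32;
            warp_counts[warp_id] = __popc(ballot);
        }
    }

    // __all_sync: true only if every lane in the warp voted true.
    bool everyone = __all_sync(mask, has_work);
    (void)everyone;
}
```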
[BUG] Cutlass and Cute-DSL generate suboptimal code for UMMA that uses more registers than necessary
**Describe the bug** The SASS code for UMMA uses more registers than necessary: the registers holding the tmem address and the instruction descriptor (idesc) keep changing between instructions. This is important as it affects...