Tri Dao

12 issues by Tri Dao

Hi @ptillet and the Triton team! Super happy that FlashAttention is now implemented in [Triton](https://github.com/openai/triton/blob/master/python/tutorials/06-fused-attention.py) 🚀. However, it seems to give wrong output when head dim != 64. For example,...

enhancement

Hi, VkFFT says they support zero padding: "Native zero padding to model open systems (up to 2x faster than simply padding input array with zeros). Can specify the range of...

This GMMA shape is being used in FA3 backward pass for headdim 256 (tile size 64 x 80, split into 2 WGs). cc @thakkarV

```
using namespace cute;
auto tmem_layout = make_layout(make_shape(_128{}, _160{}), make_stride(Int{}, _1{}));
Tensor A = make_tensor(make_tmem_ptr(0), tmem_layout);
auto load = make_tmem_copy(SM100_TMEM_LOAD_32dp32b1x{}, A);
```
This fails to compile, with shape_div error. This...

bug
? - Needs Triage
inactive-30d

The compiler hangs with this code, where we call copy with the wrong copy atom / predicates. The compiler should still error out instead of hanging. **Steps/Code to reproduce bug**...

bug
? - Needs Triage
inactive-30d

Wrap nvvm.barrier_arrive. This function is useful in FA3 for inter-warpgroup overlap.

I'm trying to use Cute-DSL on GB200 but I'm unable to install it as the wheel doesn't work on arm64: ``` nvidia_cutlass_dsl-4.0.0.dev1-cp312-cp312-manylinux_2_28_x86_64.whl is not a supported wheel on this platform....

feature request

**Describe the bug** If the loop is dynamic, with negative step size, the for loop is wrong. In the example below, it does not enter the for loop at all....

bug
? - Needs Triage
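The reproducer in the report above is elided, so the following is only a sketch of the behavior one would presumably expect from a loop with a negative step, assuming it should match Python's built-in `range`. The helper names `trip_count` and `reference_loop` are hypothetical and are not part of any DSL API.

```python
def trip_count(start: int, stop: int, step: int) -> int:
    """Number of iterations of `for i in range(start, stop, step)`.

    Assumption: the expected semantics are those of Python's built-in range.
    """
    if step == 0:
        raise ValueError("step must be nonzero")
    if step > 0:
        # Counting up: iterate while i < stop.
        return max(0, (stop - start + step - 1) // step)
    # Counting down: iterate while i > stop.
    return max(0, (start - stop - step - 1) // (-step))


def reference_loop(start: int, stop: int, step: int) -> list[int]:
    """Indices a correct loop would visit (hypothetical reference helper)."""
    return list(range(start, stop, step))


if __name__ == "__main__":
    # With a negative step and start > stop, the loop body should run.
    print(reference_loop(10, 0, -3), trip_count(10, 0, -3))
```

A loop that never enters its body for inputs such as `(10, 0, -3)` would disagree with this reference, which is the symptom the report describes.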

**Is your feature request related to a problem? Please describe.** I'm planning to use these warp vote instructions to coordinate threads in the same warp. **Describe the solution you'd like** Can...

feature request
? - Needs Triage

**Describe the bug** The SASS code for UMMA uses more registers than necessary: the registers holding tmem address and idesc keep changing between instructions. This is important as it affects...

bug
CuTe DSL