Tri Dao
Hi @ptillet and the Triton team! Super happy that FlashAttention is now implemented in [Triton](https://github.com/openai/triton/blob/master/python/tutorials/06-fused-attention.py) 🚀. However, it seems to give wrong output when head dim != 64. For example,...
Hi, VkFFT says they support zero padding: "Native zero padding to model open systems (up to 2x faster than simply padding input array with zeros). Can specify the range of...
This GMMA shape is being used in the FA3 backward pass for head dim 256 (tile size 64 x 80, split into 2 warpgroups). cc @thakkarV
```
using namespace cute;
auto tmem_layout = make_layout(make_shape(_128{}, _160{}), make_stride(Int{}, _1{}));
Tensor A = make_tensor(make_tmem_ptr(0), tmem_layout);
auto load = make_tmem_copy(SM100_TMEM_LOAD_32dp32b1x{}, A);
```
This fails to compile, with shape_div error. This...
The compiler hangs on this code, where we call copy with the wrong copy atom / predicates. The compiler should still error out instead of hanging. **Steps/Code to reproduce bug**...
Wrap nvvm.barrier_arrive. This function is useful in FA3 for inter-warpgroup overlap.
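As a rough illustration of what such a wrapper maps to at the hardware level, here is a minimal CUDA sketch of the named-barrier arrive/sync pattern used for inter-warpgroup overlap, written with inline PTX (`bar.arrive` / `bar.sync`). The helper and kernel names are made up for this sketch and are not part of any existing API.

```cuda
#include <cstdint>

// Hypothetical helpers showing the PTX an nvvm.barrier_arrive wrapper would
// ultimately lower to: a non-blocking arrival on a named barrier, paired with
// a blocking bar.sync on the other side.
__device__ __forceinline__ void named_barrier_arrive(uint32_t barrier_id, uint32_t num_threads) {
    // bar.arrive: signal arrival on barrier `barrier_id` without waiting.
    asm volatile("bar.arrive %0, %1;\n" :: "r"(barrier_id), "r"(num_threads));
}

__device__ __forceinline__ void named_barrier_sync(uint32_t barrier_id, uint32_t num_threads) {
    // bar.sync: block until `num_threads` threads have arrived on barrier `barrier_id`.
    asm volatile("bar.sync %0, %1;\n" :: "r"(barrier_id), "r"(num_threads));
}

// Illustrative overlap pattern between two warpgroups (assumes blockDim.x == 256):
// the producer warpgroup arrives without blocking and keeps doing independent work,
// while the consumer warpgroup waits for it.
__global__ void overlap_example() {
    uint32_t warpgroup = threadIdx.x / 128;
    constexpr uint32_t kBarrierId = 1;    // avoid slot 0, which __syncthreads() uses
    constexpr uint32_t kThreads   = 256;  // total threads participating in the barrier

    if (warpgroup == 0) {
        // ... produce data for the other warpgroup ...
        named_barrier_arrive(kBarrierId, kThreads);  // signal and keep going
        // ... continue with independent work, overlapping with the consumer ...
    } else {
        named_barrier_sync(kBarrierId, kThreads);    // wait for the producer
        // ... consume the data ...
    }
}
```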
I'm trying to use Cute-DSL on GB200 but I'm unable to install it as the wheel doesn't work on arm64:
```
nvidia_cutlass_dsl-4.0.0.dev1-cp312-cp312-manylinux_2_28_x86_64.whl is not a supported wheel on this platform.
```
...
**Describe the bug** If the loop bounds are dynamic and the step size is negative, the for loop is compiled incorrectly. In the example below, it does not enter the for loop at all....
**Is your feature request related to a problem? Please describe.** I'm planning to use these warp vote instructions to coordinate threads in the same warp (see the sketch below). **Describe the solution you'd like** Can...
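For context, the coordination pattern in question is what CUDA already exposes through its warp vote intrinsics (`__any_sync`, `__all_sync`, `__ballot_sync`). Below is a small, purely illustrative CUDA sketch of that pattern; the kernel, its arguments, and its logic are invented for this example.

```cuda
#include <cstdint>

// Illustrative use of warp vote intrinsics to coordinate lanes within a warp.
// Assumes blockDim.x is a multiple of 32 so every lane of each warp participates.
__global__ void warp_vote_example(const int* flags, int* warp_counts, int n) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;
    bool has_work = (tid < n) && (flags[tid] != 0);

    unsigned mask = 0xFFFFFFFFu;

    // Every lane learns whether any lane in the warp has work to do.
    if (__any_sync(mask, has_work)) {
        // Ballot returns a per-lane bitmask of who voted true; lane 0 uses it
        // here to count the active lanes for this warp.
        unsigned ballot = __ballot_sync(mask, has_work);
        if (lane == 0) {
            int warp_id = tid / 32;
            warp_counts[warp_id] = __popc(ballot);
        }
    }

    // __all_sync: true only if every lane in the warp voted true.
    bool everyone = __all_sync(mask, has_work);
    (void)everyone;
}
```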
[BUG] Cutlass and Cute-DSL generate suboptimal code for UMMA that uses more registers than necessary
**Describe the bug** The SASS code for UMMA uses more registers than necessary: the registers holding the tmem address and the instruction descriptor (idesc) keep changing between instructions. This is important as it affects...