Results: 32 comments by Bert Maher

The provided triton.ops.matmul appears to do so: https://github.com/openai/triton/blob/main/python/triton/ops/matmul.py#L96
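For context, a minimal usage sketch of the linked op, assuming a Triton build that still ships the `triton.ops` module; the shapes, dtype, and tolerances below are illustrative only.

```python
import torch
import triton.ops

# Illustrative inputs; triton.ops.matmul dispatches to the autotuned kernel
# defined in python/triton/ops/matmul.py.
a = torch.randn(512, 512, device="cuda", dtype=torch.float16)
b = torch.randn(512, 512, device="cuda", dtype=torch.float16)

c = triton.ops.matmul(a, b)
torch.testing.assert_close(c, a @ b, rtol=1e-2, atol=1e-2)
```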

I would love for someone to create this, but we don't use Gradle at FB, so we have no expertise to do it at the moment.

We use buck: https://buckbuild.com/ :-)

@jeromeku the benchmark results show achieved memory bandwidth (GB/s), so higher is better and Triton is the fastest. `torch.jit` is actually the very old TorchScript backend, which in this case doesn't even do any fusion...
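To make the metric concrete, here is a rough sketch of how achieved bandwidth is typically computed for a memory-bound op; the operation, sizes, and timing helper are illustrative assumptions, not the benchmark from the thread.

```python
import torch
import triton

# Illustrative memory-bound op: out = x + y over n float32 elements.
n = 1 << 24
x = torch.randn(n, device="cuda", dtype=torch.float32)
y = torch.randn(n, device="cuda", dtype=torch.float32)

# triton.testing.do_bench returns the runtime in milliseconds.
ms = triton.testing.do_bench(lambda: x + y)

# Bytes moved: read x, read y, write the output.
bytes_moved = 3 * n * x.element_size()
gbps = bytes_moved / (ms * 1e-3) / 1e9
print(f"achieved bandwidth: {gbps:.1f} GB/s")
```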

#3731 uses the `cudaFree(0)` approach, although I'm not sure I love it b/c the rest of that file uses the driver API... happy to switch to a different approach to...
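For reference, `cudaFree(0)` is the usual trick for forcing lazy initialization of the CUDA runtime's primary context. A minimal sketch of the two options, done here from Python via ctypes purely for illustration (the PR itself is C++, and the library names below are assumptions about the local install):

```python
import ctypes

# Runtime-API approach: cudaFree(NULL) is a no-op free, but it forces the
# runtime to create/attach the primary context on the current device.
libcudart = ctypes.CDLL("libcudart.so")  # assumed library name/location
libcudart.cudaFree(ctypes.c_void_p(0))

# Driver-API approach: initialize the driver and retain the primary context
# explicitly, matching a file that already uses cu* calls everywhere else.
libcuda = ctypes.CDLL("libcuda.so")  # assumed library name/location
libcuda.cuInit(0)
dev = ctypes.c_int()
libcuda.cuDeviceGet(ctypes.byref(dev), 0)
ctx = ctypes.c_void_p()
libcuda.cuDevicePrimaryCtxRetain(ctypes.byref(ctx), dev)
```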

Sigh, there's more complexity here involving devices, too. I can get the same crash without threads:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _rua_kernel(hidden_sh_ptr):
    return

def ...
```

Just curious: if we didn't use `fma` here, and instead computed `x*1e6 + y*1e6`, would we still get the wrong result? I'm kind of wondering if fma in general...
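To illustrate why fma can change results at all: it rounds once after the multiply-add, whereas the unfused form rounds the product before the add. A small float32 sketch (the values are illustrative, not the ones from this issue; fma is emulated via float64 so the full product survives to the add):

```python
import numpy as np

e = np.float32(2.0 ** -13)
a, b, c = np.float32(1) + e, np.float32(1) - e, np.float32(-1)

# Unfused: the exact product is 1 - 2**-26, which rounds to 1.0 in float32,
# so the subsequent add cancels to exactly 0.0.
unfused = np.float32(a * b) + c

# fma-style: the full product participates in the add and the result is
# rounded only once; emulated here by doing the product/add in float64.
fused = np.float32(np.float64(a) * np.float64(b) + np.float64(c))

print(unfused)  # 0.0
print(fused)    # -1.4901161e-08 == -2**-26
```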

To help with reviewing, here are some IR dumps from a basic matmul example.

* First, the TTGIR before the prefetch pass happens: https://gist.github.com/bertmaher/9cbf5206ef5d8d5b88fbaee032ab650f
* TTGIR after prefetch, without this...
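For anyone reproducing such dumps: in recent Triton versions the compiled kernel handle exposes each IR stage via an `.asm` dict (and `MLIR_ENABLE_DUMP=1` prints the IR after every pass). The trivial copy kernel below is a placeholder, not the matmul from the gists, and the exact attribute names can vary by version.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _copy_kernel(src_ptr, dst_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(dst_ptr + offs, tl.load(src_ptr + offs, mask=mask), mask=mask)

src = torch.randn(1024, device="cuda")
dst = torch.empty_like(src)
grid = (triton.cdiv(src.numel(), 256),)
handle = _copy_kernel[grid](src, dst, src.numel(), BLOCK=256)

# The compiled handle carries the IR at each stage, keyed by stage name
# (e.g. "ttir", "ttgir", "llir", "ptx"), in versions where the launch
# returns the compiled kernel object.
print(handle.asm["ttgir"])
```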

Thanks for the suggestion @lijinpei! I am curious, though, whether that is a generally desirable change. I would have expected that converting from `mma` to `blocked` layout would generally improve...