Results: 32 comments by Bert Maher

The provided triton.ops.matmul appears to do so: https://github.com/openai/triton/blob/main/python/triton/ops/matmul.py#L96
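For context, a minimal usage sketch of the linked op, assuming a Triton build that still ships the `triton.ops` module; the shapes, dtype, and tolerances below are illustrative only.

```python
import torch
import triton.ops

# Illustrative inputs; triton.ops.matmul dispatches to the autotuned kernel
# defined in python/triton/ops/matmul.py.
a = torch.randn(512, 512, device="cuda", dtype=torch.float16)
b = torch.randn(512, 512, device="cuda", dtype=torch.float16)

c = triton.ops.matmul(a, b)
torch.testing.assert_close(c, a @ b, rtol=1e-2, atol=1e-2)
```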

I would love for someone to create this, but we don't use Gradle at FB, so we have no expertise to do it at the moment.

We use buck: https://buckbuild.com/ :-)

@jeromeku the benchmark results show achieved memory bandwidth (GB/s), so higher is better and Triton is the fastest. `torch.jit` is actually the very old TorchScript backend, which in this case doesn't even do any fusion...
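To make the metric concrete, here is a rough sketch of how achieved bandwidth is typically computed for a memory-bound op; the operation, sizes, and timing helper are illustrative assumptions, not the benchmark from the thread.

```python
import torch
import triton

# Illustrative memory-bound op: out = x + y over n float32 elements.
n = 1 << 24
x = torch.randn(n, device="cuda", dtype=torch.float32)
y = torch.randn(n, device="cuda", dtype=torch.float32)

# triton.testing.do_bench returns the runtime in milliseconds.
ms = triton.testing.do_bench(lambda: x + y)

# Bytes moved: read x, read y, write the output.
bytes_moved = 3 * n * x.element_size()
gbps = bytes_moved / (ms * 1e-3) / 1e9
print(f"achieved bandwidth: {gbps:.1f} GB/s")
```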

#3731 uses the `cudaFree(0)` approach, although I'm not sure I love it b/c the rest of that file uses the driver API... happy to switch to a different approach to...
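For reference, `cudaFree(0)` is the usual trick for forcing lazy initialization of the CUDA runtime's primary context. A minimal sketch of the two options, done here from Python via ctypes purely for illustration (the PR itself is C++, and the library names below are assumptions about the local install):

```python
import ctypes

# Runtime-API approach: cudaFree(NULL) is a no-op free, but it forces the
# runtime to create/attach the primary context on the current device.
libcudart = ctypes.CDLL("libcudart.so")  # assumed library name/location
libcudart.cudaFree(ctypes.c_void_p(0))

# Driver-API approach: initialize the driver and retain the primary context
# explicitly, matching a file that already uses cu* calls everywhere else.
libcuda = ctypes.CDLL("libcuda.so")  # assumed library name/location
libcuda.cuInit(0)
dev = ctypes.c_int()
libcuda.cuDeviceGet(ctypes.byref(dev), 0)
ctx = ctypes.c_void_p()
libcuda.cuDevicePrimaryCtxRetain(ctypes.byref(ctx), dev)
```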

Sigh, there's more complexity here involving devices, too. I can get the same crash without threads:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _rua_kernel(hidden_sh_ptr):
    return

def ...
```

Just curious: if we didn't use `fma` here, and instead computed `x*1e6 + y*1e6`, would we still get the wrong result? I'm kind of wondering if fma in general...
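To illustrate why fma can change results at all: it rounds once after the multiply-add, whereas the unfused form rounds the product before the add. A small float32 sketch (the values are illustrative, not the ones from this issue; fma is emulated via float64 so the full product survives to the add):

```python
import numpy as np

e = np.float32(2.0 ** -13)
a, b, c = np.float32(1) + e, np.float32(1) - e, np.float32(-1)

# Unfused: the exact product is 1 - 2**-26, which rounds to 1.0 in float32,
# so the subsequent add cancels to exactly 0.0.
unfused = np.float32(a * b) + c

# fma-style: the full product participates in the add and the result is
# rounded only once; emulated here by doing the product/add in float64.
fused = np.float32(np.float64(a) * np.float64(b) + np.float64(c))

print(unfused)  # 0.0
print(fused)    # -1.4901161e-08 == -2**-26
```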

To help with reviewing, here are some IR dumps from a basic matmul example.

* First, the TTGIR before the prefetch pass happens: https://gist.github.com/bertmaher/9cbf5206ef5d8d5b88fbaee032ab650f
* TTGIR after prefetch, without this...
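For anyone reproducing such dumps: in recent Triton versions the compiled kernel handle exposes each IR stage via an `.asm` dict (and `MLIR_ENABLE_DUMP=1` prints the IR after every pass). The trivial copy kernel below is a placeholder, not the matmul from the gists, and the exact attribute names can vary by version.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _copy_kernel(src_ptr, dst_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(dst_ptr + offs, tl.load(src_ptr + offs, mask=mask), mask=mask)

src = torch.randn(1024, device="cuda")
dst = torch.empty_like(src)
grid = (triton.cdiv(src.numel(), 256),)
handle = _copy_kernel[grid](src, dst, src.numel(), BLOCK=256)

# The compiled handle carries the IR at each stage, keyed by stage name
# (e.g. "ttir", "ttgir", "llir", "ptx"), in versions where the launch
# returns the compiled kernel object.
print(handle.asm["ttgir"])
```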

Thanks for the suggestion @lijinpei! I am curious, though, whether that is a generally desirable change. I would have expected that converting from `mma` to `blocked` layout would generally improve...