
Flexible and performant GEMM kernels in Julia

Results: 17 GemmKernels.jl issues

This PR adds tensor contraction functionality to GemmKernels.jl using the GEMM-like Tensor-Tensor (GETT) contraction algorithm. The API mimics the cuTENSOR API. It is still a draft; the benchmark scripts...
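
As a rough illustration of the GETT idea (not the code in this PR), the sketch below computes D[m1,m2,n] = Σ_k A[m1,k,m2]·B[k,n] by permuting and reshaping the operands so the contraction becomes an ordinary GEMM; the actual kernel packs tiles on the fly instead of materialising the permutation. The function name and mode layout here are made up for the example.

```julia
using LinearAlgebra

# GETT in a nutshell: group the free modes of each tensor, move the contracted
# mode to the GEMM "k" dimension, multiply, then reshape the result back.
function gett_example(A::Array{T,3}, B::Matrix{T}) where {T}
    m1, k, m2 = size(A)          # A[m1, k, m2]
    k2, n = size(B)              # B[k, n]
    @assert k == k2
    Amat = reshape(permutedims(A, (1, 3, 2)), m1 * m2, k)  # group the two "m" modes
    Dmat = Amat * B                                         # plain GEMM
    return reshape(Dmat, m1, m2, n)                         # D[m1, m2, n]
end

A = rand(Float32, 4, 6, 5)
B = rand(Float32, 6, 3)
size(gett_example(A, B))  # (4, 5, 3)
```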

Add an alternative pipelining kernel. Compared to the old pipelining kernel, the loads/stores are reordered somewhat, and shared memory is split into two stages. This reduces the number of necessary bar.syncs...
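
For readers unfamiliar with the technique, here is a minimal, self-contained CUDA.jl sketch (not the GemmKernels.jl kernel) of a two-stage shared-memory pipeline: while every thread consumes the current stage, the next tile is loaded into the idle stage, so a single `sync_threads()` (which lowers to `bar.sync`) per iteration covers both hazards instead of needing two barriers.

```julia
using CUDA

# Two-stage ("double-buffered") shared-memory pipeline. The one barrier per
# iteration guarantees that (a) writes to the next stage finish before it is
# read, and (b) reads of the current stage finish before it is overwritten.
function pipelined_tile_sum!(out, x, ::Val{T}) where {T}
    smem = CuStaticSharedArray(Float32, (T, 2))
    tid = threadIdx().x
    ntiles = cld(length(x), T)

    # prologue: stage 1 holds the first tile
    smem[tid, 1] = tid <= length(x) ? x[tid] : 0f0
    sync_threads()

    acc = 0f0
    stage = 1
    for t in 1:ntiles
        nxt = 3 - stage
        if t < ntiles                       # load the next tile into the idle stage
            j = t * T + tid
            smem[tid, nxt] = j <= length(x) ? x[j] : 0f0
        end
        for k in 1:T                        # consume the current stage
            acc += smem[k, stage]
        end
        sync_threads()                      # the single bar.sync per iteration
        stage = nxt
    end

    tid == 1 && (out[1] = acc)
    return
end

out = CUDA.zeros(Float32, 1)
x = CUDA.rand(Float32, 1000)
@cuda threads=256 pipelined_tile_sum!(out, x, Val(256))
isapprox(Array(out)[1], sum(Array(x)))      # every thread summed all tiles
```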

- [ ] GitHub comment is too long (ref. https://github.com/JuliaGPU/GemmKernels.jl/pull/186#issuecomment-1912468519)
- [ ] Benchmarks take a long time, requiring us to bump the timeout to 2h+. We should either reduce...

Given that SMs in Volta, Turing, Ampere, and Hopper have four processing blocks, each with its own warp scheduler, I don't think it makes sense to try configurations during...

Something I noticed while working on optimisations for Volta: most of the time, our explicitly vectorised loads and stores of 8 Float16 elements are emitted as e.g. `ld.shared.v4.b32`, as expected...
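
A self-contained way to reproduce this kind of observation outside the library (the kernel and array layout below are made up for the example, not GemmKernels.jl code): move one 16-byte `NTuple{8,Float16}` per thread and dump the PTX to see which load/store instructions the compiler actually emits.

```julia
using CUDA

# Copy one 8×Float16 (16-byte) element per thread; the PTX dump shows whether
# the access lowers to a single 128-bit vectorized instruction
# (e.g. ld.global.v4.b32 / st.global.v4.b32) or gets split into smaller ones.
function vec8_copy!(dst, src)
    i = threadIdx().x
    @inbounds dst[i] = src[i]
    return
end

src = CuArray{NTuple{8,Float16}}(undef, 32)
dst = similar(src)
CUDA.@device_code_ptx @cuda threads=length(src) vec8_copy!(dst, src)
```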

Some of the parameter choices are currently far from optimal, as a quick exploration with the following script shows:

```julia
using CUDA, GemmKernels
using Hyperopt
using Octavian # we don't need...
```
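
For reference, Hyperopt.jl makes this kind of random search over tile parameters quite compact. The sketch below is only illustrative: `fake_runtime` stands in for configuring and timing an actual GemmKernels kernel and is not a real GemmKernels.jl function.

```julia
using Hyperopt

# Stand-in cost: pretend the kernel is fastest near 128×128×32 tiles.
fake_runtime(bm, bn, bk) = abs(bm - 128) + abs(bn - 128) + abs(bk - 32) + rand()

ho = @hyperopt for i = 100,
        bm = [64, 128, 256],
        bn = [64, 128, 256],
        bk = [16, 32, 64]
    fake_runtime(bm, bn, bk)   # Hyperopt minimises the returned value
end

ho.minimizer, ho.minimum
```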

While tuning, I encountered a couple of compilation errors that should either be fixed in the kernel or turned into a ConfigError.

```
GemmKernels.Config{(M = 4096, N = 4096, K =...
```
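
A minimal sketch of what "turn it into a ConfigError" could mean: reject impossible configurations up front, before compilation. `ConfigError` and `check_shmem` below are local stand-ins for illustration, not the library's actual types or checks, and the 48 KiB figure is just the common default static shared-memory limit per block.

```julia
# Reject tiles whose staged shared-memory footprint cannot fit, instead of
# letting kernel compilation or launch fail with an obscure error.
struct ConfigError <: Exception
    msg::String
end
Base.showerror(io::IO, e::ConfigError) = print(io, "ConfigError: ", e.msg)

function check_shmem(block_m, block_n, block_k; T = Float16, stages = 2)
    bytes = stages * (block_m * block_k + block_k * block_n) * sizeof(T)
    limit = 48 * 1024   # default static shared-memory limit per block on most GPUs
    bytes <= limit ||
        throw(ConfigError("tile $block_m×$block_n×$block_k needs $bytes B of shared memory, limit is $limit B"))
    return bytes
end

check_shmem(128, 128, 32)   # 32768 bytes, fine
check_shmem(256, 256, 64)   # throws ConfigError
```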

Is anything needed to add more semirings beyond defining more `GeneralFPUOp`s? It looks like you already have the `max +` semiring.
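
To make the question concrete, a (max, +) semiring GEMM on the CPU looks like the plain-Julia sketch below (not the GemmKernels operator API): `+` plays the role of multiplication, `max` the role of addition, and `-Inf` is the additive identity.

```julia
# Tropical / (max, +) semiring matrix product: C[i,j] = max_k (A[i,k] + B[k,j]).
function maxplus_gemm(A::AbstractMatrix, B::AbstractMatrix)
    M, K = size(A); K2, N = size(B)
    @assert K == K2
    C = fill(typemin(promote_type(eltype(A), eltype(B))), M, N)  # -Inf identity
    for j in 1:N, k in 1:K, i in 1:M
        C[i, j] = max(C[i, j], A[i, k] + B[k, j])
    end
    return C
end

A = rand(Float32, 4, 3); B = rand(Float32, 3, 5)
maxplus_gemm(A, B)
```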

At this point, I don't think there's much value in using our custom LocalArray instead of SArray, which also seems to support getindex and an immutable (non-mutating) setindex.
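
For reference, the non-mutating StaticArrays usage referred to looks roughly like this (assuming the `Base.setindex` methods StaticArrays appears to provide):

```julia
using StaticArrays

# getindex works as usual; setindex (non-mutating, no bang) returns a new
# static array with one element replaced, which is what an immutable
# per-thread register array needs.
v  = SVector{4,Float32}(1, 2, 3, 4)
v2 = Base.setindex(v, 42f0, 2)
v[2], v2[2]    # (2.0f0, 42.0f0)
```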