GemmKernels.jl
Flexible and performant GEMM kernels in Julia
This PR adds Tensor Contraction functionality to GemmKernels.jl using the GEMM-like Tensor Tensor (GETT) multiplication algorithm. The API mimics the cuTENSOR API. It is still a draft; the benchmark scripts...
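As a rough illustration of what a cuTENSOR-style, mode-based contraction call looks like (the names below are hypothetical stand-ins, not the PR's actual interface), a contraction over a shared mode `k` reduces to an ordinary GEMM:

```julia
using CUDA

# Hypothetical GETT-style call: D[m,n] = A[m,k] * B[k,n], with the
# contracted mode given by index labels, as in cuTENSOR:
#   contraction!(1, A, ('m','k'), B, ('k','n'), 0, D, ('m','n'))
A = CUDA.rand(Float16, 64, 32)   # modes (m, k)
B = CUDA.rand(Float16, 32, 48)   # modes (k, n)
D = CUDA.zeros(Float32, 64, 48)  # modes (m, n)

# For this mode layout the contraction is exactly a mixed-precision GEMM:
D .= Float32.(A) * Float32.(B)
```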
Add alternative pipelining kernel. Compared to the old pipelining kernel, the loads/stores are reordered somewhat, and shared memory is split into two stages. This reduces the number of necessary bar.syncs...
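The two-stage idea can be sketched in plain host-side Julia (a hypothetical model, not the kernel itself): while the compute step consumes the current stage, the next tile is prefetched into the other stage, so a single barrier per iteration suffices instead of one around each load and each compute.

```julia
# Double-buffered main-loop sketch over the K-dimension tiles of the input.
# `compute!` consumes one tile; `copyto!` stands in for the global→shared load.
function pipelined_loop(tiles, compute!)
    stage = (similar(tiles[1]), similar(tiles[1]))    # two shared-memory stages
    copyto!(stage[1], tiles[1])                       # prologue: load first tile
    for i in eachindex(tiles)
        cur, nxt = stage[mod1(i, 2)], stage[mod1(i + 1, 2)]
        i < length(tiles) && copyto!(nxt, tiles[i + 1])  # prefetch next tile
        # on the GPU, a single bar.sync would go here, before consuming `cur`
        compute!(cur)
    end
end
```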
- [ ] GitHub comment is too long (ref. https://github.com/JuliaGPU/GemmKernels.jl/pull/186#issuecomment-1912468519) - [ ] Benchmarks take a long time, requiring us to bump the timeout to 2h+. We should either reduce...
Given that SMs in Volta, Turing, Ampere, and Hopper have four processing blocks, each with one warp scheduler, I don't think it makes sense to try configurations during...
Something I noticed while working on optimisations for Volta. Most of the time, our explicitly vectorised loads and stores for 8 Float16 elements are emitted as e.g. `ld.shared.v4.b32`, as expected....
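To check which PTX a given load actually lowers to, the generated code can be inspected with CUDA.jl's `@device_code_ptx` reflection macro; the kernel below is a minimal stand-in, not the GemmKernels load path:

```julia
using CUDA

# Minimal stand-in kernel; grep the printed PTX for vectorised loads/stores
# such as `ld.global.v4.b32` (or `ld.shared.v4.b32` for shared memory).
function copy_kernel(dst, src)
    i = threadIdx().x
    @inbounds dst[i] = src[i]
    return
end

src = CUDA.rand(Float16, 8)
dst = similar(src)
CUDA.@device_code_ptx @cuda threads=8 copy_kernel(dst, src)
```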
Extracted from #179
Some of the choices of parameters are currently far from optimal, as quickly explored using the following script: ```julia using CUDA, GemmKernels using Hyperopt using Octavian # we don't need...
While tuning, I encountered a couple of compilation errors that either should be fixed in the kernel, or generate a ConfigError. ``` GemmKernels.Config{(M = 4096, N = 4096, K =...
Is anything else needed to add more semirings besides defining more `GeneralFPUOp`s? It looks like you already have the `max +` semiring.
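For context, a semiring GEMM only swaps the two scalar operations; e.g. max-plus replaces `*` with `+` and `+` with `max`. A plain-Julia reference (independent of `GeneralFPUOp`, whose exact constructor is not reproduced here):

```julia
# Reference max-plus matrix "product": C[i,j] = max_k (A[i,k] + B[k,j]).
function maxplus_mul(A, B)
    C = fill(-Inf, size(A, 1), size(B, 2))   # -Inf is the additive identity of max
    for j in axes(B, 2), k in axes(A, 2), i in axes(A, 1)
        C[i, j] = max(C[i, j], A[i, k] + B[k, j])
    end
    return C
end
```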
At this point, I don't think there's much value in using our custom LocalArray instead of SArray, which also seems to support immutable versions of getindex/setindex.
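For reference, StaticArrays provides a non-mutating `Base.setindex` that returns a new `SArray`, which covers the functionality `LocalArray` is used for:

```julia
using StaticArrays

v = SVector(1.0, 2.0, 3.0)
w = Base.setindex(v, 42.0, 2)  # returns a new SVector; v is unchanged
# w == SVector(1.0, 42.0, 3.0)
```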