GemmKernels.jl
Flexible and performant GEMM kernels in Julia
This PR adds Tensor Contraction functionality to GemmKernels.jl using the GEMM-like Tensor Tensor (GETT) multiplication algorithm. The API mimics the cuTENSOR API. It is still a draft; the benchmark scripts...
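As a rough illustration of what a cuTENSOR-style, mode-based contraction call looks like (the names below are hypothetical stand-ins, not the PR's actual interface), a contraction over a shared mode `k` reduces to an ordinary GEMM:

```julia
using CUDA

# Hypothetical GETT-style call: D[m,n] = A[m,k] * B[k,n], with the
# contracted mode given by index labels, as in cuTENSOR:
#   contraction!(1, A, ('m','k'), B, ('k','n'), 0, D, ('m','n'))
A = CUDA.rand(Float16, 64, 32)   # modes (m, k)
B = CUDA.rand(Float16, 32, 48)   # modes (k, n)
D = CUDA.zeros(Float32, 64, 48)  # modes (m, n)

# For this mode layout the contraction is exactly a mixed-precision GEMM:
D .= Float32.(A) * Float32.(B)
```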
Add alternative pipelining kernel. Compared to the old pipelining kernel, the loads/stores are reordered somewhat, and shared memory is split into two stages. This reduces the number of necessary bar.syncs...
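The two-stage idea can be sketched in plain host-side Julia (a hypothetical model, not the kernel itself): while the compute step consumes the current stage, the next tile is prefetched into the other stage, so a single barrier per iteration suffices instead of one around each load and each compute.

```julia
# Double-buffered main-loop sketch over the K-dimension tiles of the input.
# `compute!` consumes one tile; `copyto!` stands in for the global→shared load.
function pipelined_loop(tiles, compute!)
    stage = (similar(tiles[1]), similar(tiles[1]))    # two shared-memory stages
    copyto!(stage[1], tiles[1])                       # prologue: load first tile
    for i in eachindex(tiles)
        cur, nxt = stage[mod1(i, 2)], stage[mod1(i + 1, 2)]
        i < length(tiles) && copyto!(nxt, tiles[i + 1])  # prefetch next tile
        # on the GPU, a single bar.sync would go here, before consuming `cur`
        compute!(cur)
    end
end
```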
- [ ] GitHub comment is too long (ref. https://github.com/JuliaGPU/GemmKernels.jl/pull/186#issuecomment-1912468519) - [ ] Benchmarks take a long time, requiring us to bump the timeout to 2h+. We should either reduce...
Given that SMs in Volta, Turing, Ampere, and Hopper have four processing blocks, each with one warp scheduler, I don't think it makes sense to try configurations during...
Something I noticed while working on optimisations for Volta. Most of the time, our explicitly vectorised loads and stores for 8 Float16 elements are emitted as e.g. `ld.shared.v4.b32`, as expected....
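To check which PTX a given load actually lowers to, the generated code can be inspected with CUDA.jl's `@device_code_ptx` reflection macro; the kernel below is a minimal stand-in, not the GemmKernels load path:

```julia
using CUDA

# Minimal stand-in kernel; grep the printed PTX for vectorised loads/stores
# such as `ld.global.v4.b32` (or `ld.shared.v4.b32` for shared memory).
function copy_kernel(dst, src)
    i = threadIdx().x
    @inbounds dst[i] = src[i]
    return
end

src = CUDA.rand(Float16, 8)
dst = similar(src)
CUDA.@device_code_ptx @cuda threads=8 copy_kernel(dst, src)
```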
Extracted from #179
Some of the choices of parameters are currently far from optimal, as quickly explored using the following script: ```julia using CUDA, GemmKernels using Hyperopt using Octavian # we don't need...
While tuning, I encountered a couple of compilation errors that either should be fixed in the kernel, or generate a ConfigError. ``` GemmKernels.Config{(M = 4096, N = 4096, K =...
Is anything else needed to add more semirings besides defining more `GeneralFPUOp`s? It looks like you already have the `max +` semiring.
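For context, a semiring GEMM only swaps the two scalar operations; e.g. max-plus replaces `*` with `+` and `+` with `max`. A plain-Julia reference (independent of `GeneralFPUOp`, whose exact constructor is not reproduced here):

```julia
# Reference max-plus matrix "product": C[i,j] = max_k (A[i,k] + B[k,j]).
function maxplus_mul(A, B)
    C = fill(-Inf, size(A, 1), size(B, 2))   # -Inf is the additive identity of max
    for j in axes(B, 2), k in axes(A, 2), i in axes(A, 1)
        C[i, j] = max(C[i, j], A[i, k] + B[k, j])
    end
    return C
end
```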
At this point, I don't think there's much value in using our custom LocalArray instead of SArray, which also seems to support immutable versions of getindex/setindex.
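For reference, StaticArrays provides a non-mutating `Base.setindex` that returns a new `SArray`, which covers the functionality `LocalArray` is used for:

```julia
using StaticArrays

v = SVector(1.0, 2.0, 3.0)
w = Base.setindex(v, 42.0, 2)  # returns a new SVector; v is unchanged
# w == SVector(1.0, 42.0, 3.0)
```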