CUDA-Learn-Notes
Kernel Trace issue
TODO
- [ ] swish kernel
- [ ] gelu kernel
- [ ] RoPE kernel
- [x] pack elementwise_add
- [x] pack sigmoid
- [x] pack relu
- [x] histogram
- [x] warp/block reduce
- [x] softmax
- [x] pack safe_softmax
- [x] pack layer-norm
- [x] pack rms-norm
- [x] flash-attn-1 f32
- [ ] flash-attn-2 f32
- [ ] flash-attn-2 f16
- [x] MMA(Tensor Cores) flash-attn-2 f16
- [x] warp sgemv
- [x] warp hgemv
- [x] bank conflicts reduce sgemm
- [x] pipelining sgemm
- [ ] split_k sgemm
- [x] pack LDST hgemm
- [x] bank conflicts reduce hgemm
- [x] pipelining hgemm
- [ ] split_k hgemm
- [x] cp.async hgemm
- [x] cp.async sgemm
- [ ] stage3+cp.async/cp.async.reduce.bulk hgemm
- [ ] WMMA API(Tensor Cores) hgemm
- [ ] MMA PTX(Tensor Cores) hgemm
- [ ] pack online_safe_softmax
- [ ] cp.async.reduce.bulk block_all_reduce
- [ ] ...