
Question regarding custom kernel implementation for FMHA Cross attention

Open Ashwin-Ramesh2607 opened this issue 1 year ago • 0 comments

Hi, I am aware that the implementation and source code of kernels like FMHA are not released. However, is there a guide or some reference I can use to create custom kernels related to attention? I would rather not develop something totally from scratch. Specifically, I am interested in implementing a fused kernel for cross-attention.

Currently, FMHA doesn't support cross-attention, but is there a hacky way to use it, short of falling back to unfused operations? In cross-attention, only the sequence length of K and V differs from that of Q. Can I leverage the non-padded (packed) mode of the attention kernel to perform cross-attention with the same kernel, since the packed mode can already handle variable-length input sequences? Thanks!
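For reference, here is the idea in a minimal NumPy sketch (this is not the TensorRT-LLM FMHA kernel, just an illustration of the math): in packed/varlen layout, Q and K/V only need to agree on the head dimension, so each batch element can have independent Q and K/V lengths, which is exactly the cross-attention case. The function name and `cu_seqlens` arguments are hypothetical, modeled after the cumulative-sequence-length convention used by varlen attention interfaces.

```python
import numpy as np

def packed_cross_attention(q, k, v, cu_seqlens_q, cu_seqlens_kv):
    """Packed (non-padded) single-head cross-attention sketch.

    q: [total_q, d]; k, v: [total_kv, d].
    cu_seqlens_*: cumulative sequence lengths per batch,
    e.g. [0, 3, 7] for two sequences of lengths 3 and 4.
    """
    d = q.shape[-1]
    out = np.empty_like(q)
    for b in range(len(cu_seqlens_q) - 1):
        qs, qe = cu_seqlens_q[b], cu_seqlens_q[b + 1]
        ks, ke = cu_seqlens_kv[b], cu_seqlens_kv[b + 1]
        # Q length (qe - qs) and K/V length (ke - ks) may differ per batch.
        scores = q[qs:qe] @ k[ks:ke].T / np.sqrt(d)   # [len_q_b, len_kv_b]
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=-1, keepdims=True)
        out[qs:qe] = probs @ v[ks:ke]                 # [len_q_b, d]
    return out
```

For example, `cu_seqlens_q = [0, 2, 5]` with `cu_seqlens_kv = [0, 5, 9]` describes a batch of two where the first sequence attends 2 queries over 5 keys and the second attends 3 queries over 4 keys, with no padding anywhere.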

Ashwin-Ramesh2607 avatar May 23 '24 22:05 Ashwin-Ramesh2607