Question regarding custom kernel implementation for FMHA Cross attention
Hi, I am aware that the implementation and source code of kernels like FMHA are not released. However, is there a guide or some reference I can use to create custom attention-related kernels? I would prefer not to develop something entirely from scratch. Specifically, I am interested in implementing a fused kernel for cross-attention.
Currently, FMHA doesn't support cross-attention, but is there a hacky way to reuse it rather than falling back to unfused operations? In cross-attention, only the sequence length of K and V differs from that of Q. Could I leverage the non-padded (packed) mode of the attention kernel to perform cross-attention with the same kernel, since packed mode already handles variable-length input sequences? Thanks!
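For context, here is a minimal sketch of the operation I want fused, written with PyTorch's `scaled_dot_product_attention` (all shapes are made up for illustration). The only difference from self-attention is that K/V come from a second sequence with a different length, so the output keeps Q's sequence length:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes for illustration: batch 2, 4 heads, head dim 64.
# Cross-attention: query length 8, key/value length 32 -- only the
# KV sequence length differs from Q's.
B, H, D = 2, 4, 64
q = torch.randn(B, H, 8, D)   # queries from the decoder side
k = torch.randn(B, H, 32, D)  # keys from the encoder side
v = torch.randn(B, H, 32, D)  # values from the encoder side

# PyTorch's SDPA already accepts mismatched Q and KV lengths;
# the output has Q's sequence length (8 here), not K/V's (32).
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4, 8, 64])
```

My question is essentially whether FMHA's packed (non-padded) mode, which already tracks per-sequence lengths, could be fed separate cumulative-length arrays for Q and for K/V to express this same shape mismatch.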