TensorRT-LLM
Question on how to perform cross-attention with FMHA kernel
I am interested in performing multimodal cross-attention. I don't see issues in performing self-attention in the encoder since I can use the `BertAttention` plugin. However, cross-attention would have query from one modality (with seq_len as x) and key/value from another modality (with seq_len as y). It's possible that the 2 modalities have different sequence lengths.
Can I please get some guidance on how to accomplish this?
- Is there a way to perform FMHA when `q` and `kv` have different `seq_len`? Afaik, the `BertAttention` plugin switches to the non-FMHA path for this.
- Can I modify the plugin or hack around this to ensure I use FMHA?
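For context on what the kernel would need to support: cross-attention with mismatched sequence lengths is mathematically well-defined, since the attention-score matrix is simply rectangular (x by y). A minimal NumPy sketch of the reference computation (this is illustrative math only, not TensorRT-LLM or FMHA plugin code):

```python
import numpy as np

def cross_attention(q, k, v):
    """Single-head cross-attention.

    q: (x, d) from the query modality; k, v: (y, d) from the
    key/value modality. x and y may differ freely.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (x, y) rectangular
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over y
    return weights @ v                                 # (x, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 8))   # query modality, seq_len x = 5
k = rng.standard_normal((7, 8))   # key/value modality, seq_len y = 7
v = rng.standard_normal((7, 8))
out = cross_attention(q, k, v)
print(out.shape)  # (5, 8)
```

A fused kernel only needs to handle the rectangular score matrix and the separate q vs. kv lengths; nothing in the math itself requires x == y.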