
Question on how to perform cross-attention with FMHA kernel

Open Ashwin-Ramesh2607 opened this issue 1 year ago • 0 comments

I am interested in performing multimodal cross-attention. I don't see issues performing self-attention in the encoder, since I can use the BertAttention plugin. However, cross-attention would have the query from one modality (with seq_len x) and the key/value from another modality (with seq_len y). It's possible that the two modalities have different sequence lengths.

Can I please get some guidance on how to accomplish this?

  1. Is there a way to perform FMHA when q and kv have different seq_len? AFAIK, the BertAttention plugin switches to the non-FMHA path in this case.
  2. Can I modify the plugin, or work around this, to ensure I use FMHA?
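For context, the operation being asked about reduces to scaled dot-product attention where q has a different length than k/v. A minimal NumPy sketch of that shape contract (this is an illustration only, not the TensorRT-LLM or FMHA kernel API):

```python
import numpy as np

def cross_attention(q, k, v):
    """Scaled dot-product cross-attention.

    q: [x, d] queries from modality A (seq_len x)
    k, v: [y, d] keys/values from modality B (seq_len y)
    returns: [x, d] -- output length follows the query length
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                                # [x, y]
    # numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)               # rows sum to 1
    return weights @ v                                           # [x, d]

rng = np.random.default_rng(0)
x, y, d = 5, 9, 16            # deliberately different seq_len for q vs k/v
q = rng.standard_normal((x, d))
k = rng.standard_normal((y, d))
v = rng.standard_normal((y, d))
out = cross_attention(q, k, v)
print(out.shape)              # (5, 16)
```

Note the [x, y] score matrix: this rectangular shape is exactly what a fused kernel must support for the cross-attention case, whereas a self-attention-only path can assume a square [x, x] matrix.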

Ashwin-Ramesh2607 · Jul 16 '24 17:07