Fused_MHA is not supported when seq_len == 1024, Dh == 72, causal_mask == false
When I use FMHA_v2, I found that it does not support my case. Is there any way to use fmha_v2 without changing the model? Thanks a lot.
From https://github.com/NVIDIA/TensorRT/tree/release/10.3/plugin/bertQKVToContextPlugin#parameters
Per the changelog there, the v2 plugin only added head size 32 support for sequence lengths 128, 256, and 512 (see https://github.com/NVIDIA/TensorRT/tree/release/10.3/plugin/bertQKVToContextPlugin/fused_multihead_attention_v2/src); cubins for sequence length 1024 are not provided. Maybe you can use CUTLASS to implement your case.
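For reference, here is a minimal NumPy sketch of the unfused attention that a custom kernel (for example, one built on CUTLASS's fused multi-head attention example) would have to reproduce for this shape. It is not TensorRT or CUTLASS code; `num_heads` and all names below are illustrative assumptions, not values from the plugin.

```python
# Reference (unfused) multi-head attention for the unsupported shape:
# seq_len = 1024, head_dim (Dh) = 72, no causal mask.
# Purely a numerical reference for what a custom kernel must compute.
import numpy as np

def mha_reference(q, k, v):
    """q, k, v: [batch, num_heads, seq_len, head_dim]"""
    head_dim = q.shape[-1]
    # Scaled dot-product scores: [batch, num_heads, seq_len, seq_len]
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
    # No causal mask is applied (causal_mask == false in this issue).
    scores -= scores.max(axis=-1, keepdims=True)  # softmax stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v  # [batch, num_heads, seq_len, head_dim]

batch, num_heads, seq_len, head_dim = 1, 12, 1024, 72  # num_heads is illustrative
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((batch, num_heads, seq_len, head_dim),
                               dtype=np.float32) for _ in range(3))
out = mha_reference(q, k, v)
print(out.shape)  # (1, 12, 1024, 72)
```

The math itself places no restriction on head_dim or seq_len; the limitation comes from the precompiled cubins shipped with the plugin, which is why an unfused or CUTLASS-based path can cover Dh = 72 at seq_len = 1024.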