How does TensorRT leverage attention masks to speed up inference?
Hello team,
Thanks for all the great work!
I am training a model where I provide tile-wise constant attention masks (see picture below). At inference time, how will TensorRT leverage this type of attention mask to speed up inference?
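In case the picture doesn't come through, here is a rough illustrative sketch (NumPy, with made-up sequence length and tile size) of what I mean by a tile-wise constant mask: the sequence is split into fixed-size tiles, and the mask value is the same for every query/key pair within a given tile pair.

```python
import numpy as np

# Illustrative values only; the real model uses different sizes.
seq_len, tile = 8, 4
n_tiles = seq_len // tile

# One value per (query-tile, key-tile) pair, e.g. block-lower-triangular.
tile_mask = np.tril(np.ones((n_tiles, n_tiles), dtype=int))

# Expand each tile value to a full (seq_len, seq_len) attention mask.
attn_mask = np.kron(tile_mask, np.ones((tile, tile), dtype=int))
print(attn_mask)
```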
cc @nvyihengz for thoughts?
Performance-wise, we currently don't have attention kernels optimized for such masks. Functionality-wise it should work, but the attention might not be fused.
OK, does TensorRT optimise attention with attention masks at all, or never?
It should be able to fuse the attention (which would save some data movement), but the fused kernel might not be optimized in the sense of skipping computations for masked-out regions.
This doc contains some relevant info: https://docs.nvidia.com/deeplearning/tensorrt/latest/performance/best-practices.html#multi-head-attention-fusion
To check whether fusion is happening, you can use `--dumpLayerInfo` (e.g. with trtexec) to see if an MHA layer is created; `--dumpProfile` can give you a per-layer breakdown of the performance data.
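For example, a rough sketch of driving trtexec from Python with those flags (the ONNX path is a placeholder; adjust the flags for your build flow):

```python
import subprocess

# Hypothetical trtexec invocation. --dumpLayerInfo prints the layers of the
# built engine so you can check whether a fused MHA layer was created;
# --dumpProfile prints per-layer timing data.
subprocess.run(
    [
        "trtexec",
        "--onnx=model.onnx",  # placeholder model path
        "--dumpLayerInfo",
        "--dumpProfile",
    ],
    check=True,
)
```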