
How does TensorRT leverage attention masks to speed up inference?

MatthieuToulemont opened this issue on Apr 04 '25 · 4 comments

Hello team,

Thanks for all the great work,

I am training a model in which I provide tile-wise constant attention masks (see picture below). At inference time, how will TensorRT leverage this type of attention mask to speed up inference?

[Image: tile-wise constant attention mask]
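For concreteness, here is a minimal sketch of the kind of tile-wise constant mask I mean, applied through PyTorch's scaled_dot_product_attention (tile size and shapes are just illustrative):

```python
import torch
import torch.nn.functional as F

# Tile-wise constant additive mask: the (seq, seq) score matrix is split into
# tile x tile blocks, and every entry inside a block shares the same value
# (0.0 = attend, -inf = blocked). All sizes here are illustrative.
seq_len, tile = 16, 4
n_tiles = seq_len // tile

# One boolean per tile pair; a random pattern just for illustration.
block_pattern = torch.rand(n_tiles, n_tiles) > 0.5
# Keep diagonal tiles open so no row is fully masked (avoids NaNs in softmax).
block_pattern |= torch.eye(n_tiles, dtype=torch.bool)

# Expand each tile value to a full (seq_len, seq_len) additive mask.
tile_mask = block_pattern.repeat_interleave(tile, dim=0).repeat_interleave(tile, dim=1)
attn_mask = torch.zeros(seq_len, seq_len).masked_fill(~tile_mask, float("-inf"))

q = k = v = torch.randn(1, 8, seq_len, 64)  # (batch, heads, seq, head_dim)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```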

MatthieuToulemont · Apr 04 '25

cc @nvyihengz for thoughts?

yuanyao-nv · Apr 22 '25

Performance-wise, we currently don't have attention kernels optimized for such masks. Functionally it should work, but the attention might not be fused.

nvyihengz · Apr 22 '25

OK, does TensorRT optimize attention with attention masks at all, or never?

MatthieuToulemont · Apr 25 '25

It should be able to fuse (which would save some data movement), but the fused kernel might not be optimized for the mask (in the sense of skipping the masked-out computations). This doc contains some relevant info: https://docs.nvidia.com/deeplearning/tensorrt/latest/performance/best-practices.html#multi-head-attention-fusion. To check whether fusion is happening, you can run trtexec with --dumpLayerInfo to see whether an MHA layer is created; --dumpProfile gives a per-layer breakdown of the performance data.
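For example, a minimal sketch of exporting a toy masked-attention module to ONNX and then checking for the fusion (the module, shapes, and file name are illustrative; the flags are the ones mentioned above):

```python
import torch

class MaskedAttention(torch.nn.Module):
    """Toy attention that takes an additive attention mask (illustrative)."""
    def forward(self, q, k, v, mask):
        return torch.nn.functional.scaled_dot_product_attention(
            q, k, v, attn_mask=mask)

B, H, S, D = 1, 8, 256, 64
args = (torch.randn(B, H, S, D), torch.randn(B, H, S, D),
        torch.randn(B, H, S, D), torch.zeros(S, S))
torch.onnx.export(MaskedAttention(), args, "masked_attention.onnx",
                  input_names=["q", "k", "v", "mask"], opset_version=17)

# Then build with trtexec and inspect the layer info for a fused MHA layer:
#   trtexec --onnx=masked_attention.onnx --dumpLayerInfo --dumpProfile
```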

yuanyao-nv · Apr 25 '25