
Custom Attention implementation not well optimised by TensorRT

Open · david-PHR opened this issue 2 years ago · 12 comments

Description

Hi, I am working on a model that employs a modified attention mechanism, incorporating pooling on top of K and V to reduce computational load. However, the model is not being optimized efficiently by TensorRT. I suspect this is because the non-standard attention implementation does not match the Myelin/Flash Attention pattern-matching support in TensorRT.
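
For illustration, here is a minimal PyTorch sketch of the kind of pooled-K/V attention described above. The function name, tensor shapes, and the average-pooling choice are assumptions for the example, not the actual model:

import torch
import torch.nn.functional as F

def pooled_kv_attention(q, k, v, pool: int = 2):
    # q, k, v: (batch, heads, seq, head_dim). K and V are pooled along the
    # sequence axis before the attention product to reduce compute.
    b, h, s, d = k.shape
    k = F.avg_pool1d(k.reshape(b * h, s, d).transpose(1, 2), pool).transpose(1, 2).reshape(b, h, -1, d)
    v = F.avg_pool1d(v.reshape(b * h, s, d).transpose(1, 2), pool).transpose(1, 2).reshape(b, h, -1, d)
    # Standard scaled dot-product attention on the shortened K/V.
    return F.scaled_dot_product_attention(q, k, v)

q = torch.randn(1, 8, 1024, 64)
k = torch.randn(1, 8, 1024, 64)
v = torch.randn(1, 8, 1024, 64)
out = pooled_kv_attention(q, k, v)  # -> (1, 8, 1024, 64)

The extra pooling ops between the projections and the attention product are the kind of deviation that may stop TensorRT's fused-MHA pattern matching from triggering.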

  • MiniModelV1TensorRT_optimized.onnx: When compiled with TensorRT using specific commands (including --fp16, setting layerPrecisions, and defining input/output formats), it shows an inference time of 10 ms and a GPU memory footprint of 1556 MB.

  • MiniModelV2TensorRT_optimized.onnx: Compiled similarly, it shows an inference time of 10.9 ms but with a significantly lower memory footprint of 54 MB.

The key difference is that MiniModelV2 appears to be optimized by Myelin and uses the MHA operator at runtime, as shown in the nsys report. In contrast, MiniModelV1's results, especially its large memory footprint, suggest it is not being optimized effectively, likely without Myelin and Flash Attention integration.

Do you have any suggestions on how to address these optimization challenges, particularly for forcing Myelin support in ModelV1? Any insights or feedback on managing memory more efficiently in this context would be greatly appreciated.
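
For reference, a rough sketch of an equivalent FP16 build through the TensorRT Python API, mirroring the trtexec options mentioned above (--fp16, per-layer precisions, FP16 I/O). The file names and the exact precision/format choices here are illustrative assumptions, not the commands actually used:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("MiniModelV1TensorRT_optimized.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)                        # trtexec --fp16
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)  # honour per-layer precisions
network.get_input(0).dtype = trt.float16                     # FP16 input format
network.get_output(0).dtype = trt.float16                    # FP16 output format

engine = builder.build_serialized_network(network, config)
with open("model_v1.engine", "wb") as f:                     # placeholder output path
    f.write(engine)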

ONNX Model V1 and V2:

  1. Model V1:

model_v1

  2. Model V2:

model_v2

Environment

TensorRT Version: 8.6

NVIDIA GPU: A10G

david-PHR avatar Dec 05 '23 15:12 david-PHR

I second this. It would also be particularly useful to know how Myelin recognises the attention pattern in order to use the Flash Attention plugin.

Looking at diffusion models, a lot of compute is wasted reprocessing the text embeddings in the cross attention at every step of the model, when we could compute them only once at the beginning of the loop. But doing so might break the Flash Attention pattern matching and lose a lot of inference speed.
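
To make the idea concrete, a minimal sketch of hoisting the text-embedding computation out of the denoising loop. The diffusers-style names (text_encoder, unet, scheduler, prompt_token_ids) are illustrative placeholders for your pipeline components, not a specific API:

import torch

# text_encoder, unet, scheduler and prompt_token_ids are placeholders.
with torch.no_grad():
    # Encode the prompt once, outside the loop.
    text_embeddings = text_encoder(prompt_token_ids)[0]

    latents = torch.randn(1, 4, 64, 64)
    for t in scheduler.timesteps:
        # Every step reuses the cached embeddings as the cross-attention K/V
        # input instead of re-running the text encoder.
        noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample

The concern above is that restructuring the graph this way may change the exported ONNX pattern around the cross attention and stop TensorRT's Flash Attention fusion from matching.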

MatthieuToulemont avatar Dec 05 '23 15:12 MatthieuToulemont

@nvpohanh @zhenhuaw-me ^ ^

zerollzeng avatar Dec 07 '23 12:12 zerollzeng

In addition, I would like to know if there is a way to access the cubin file for the multi-head cross-attention plugin with head dimension == 32, or even the cubin generator for these plugins. In the TensorRT OSS archive there is no support for a head dimension of 32 (64 minimum).

david-PHR avatar Dec 07 '23 15:12 david-PHR

Could you try TensorRT 9.2 release? https://github.com/NVIDIA/TensorRT/tree/release/9.2#setting-up-the-build-environment

We have relaxed the MHA pattern-matching constraints significantly between 8.6 and 9.2. Thanks

nvpohanh avatar Dec 08 '23 03:12 nvpohanh

Hello, thanks for the recommendation.

We looked at this already, but unfortunately it seems NVIDIA is not going to include those changes in the NGC containers anytime soon, and @david-PHR and I use the Triton Inference Server container to run our models in production.

Would you have any guidelines or steps that we could follow to make sure that Triton Inference Server is compatible with TensorRT 9.2?

MatthieuToulemont avatar Dec 08 '23 08:12 MatthieuToulemont

You can follow these commands to install TRT 9.2 in the Triton Inference Server container: https://github.com/NVIDIA/TensorRT/blob/release/9.2/docker/ubuntu-20.04.Dockerfile#L92-L95

nvpohanh avatar Dec 08 '23 08:12 nvpohanh

Is it compatible with Torch-TensorRT?

david-PHR avatar Dec 08 '23 08:12 david-PHR

@nvpohanh thanks for the link. When I compile a model with the trtexec executable from 9.2 (with all the libraries installed), I encounter this issue when I try to execute the serialized engine from the TensorRT Python wrapper:

[12/08/2023-13:57:44] [TRT] [E] 1: [stdArchiveReader.cpp::StdArchiveReaderInitCommon::46] Error Code 1: Serialization (Serialization assertion stdVersionRead == serializationVersion failed. Version tag does not match. Note: Current Version: 236, Serialized Engine Version: 228)

Any idea?
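
One way to confirm that this is a version mismatch between the trtexec build and the Python bindings is sketched below; the engine path is a placeholder:

import tensorrt as trt

# An engine can only be deserialized by a TensorRT runtime whose serialization
# version matches the build that produced it, so this should report a 9.2.x
# runtime if the engine was built with trtexec 9.2.
print("Python TensorRT runtime:", trt.__version__)

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:  # placeholder engine path
    engine = runtime.deserialize_cuda_engine(f.read())  # fails with the error above on a mismatch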

david-PHR avatar Dec 08 '23 14:12 david-PHR

I've found the issue: I had to recompile the plugins from TensorRT OSS 9.2 and set up several environment variables in my Docker session. I still have an issue with Torch-TensorRT, which segfaults with this new TensorRT installation. It would be useful to have all the steps for upgrading each TensorRT component in a Docker session (an NGC container, for example) documented in one place. Guidelines should cover:

  • TensorRT source libraries
  • TensorRT OSS compilation steps
  • TensorRT OSS installation steps
  • Torch-TensorRT upgrade to match the newly installed TensorRT

david-PHR avatar Dec 08 '23 14:12 david-PHR

For some models I am experiencing this issue:

[12/08/2023-14:49:41] [TRT] [W] Using default stream in enqueue()/enqueueV2()/enqueueV3() may lead to performance issues due to additional cudaDeviceSynchronize() calls by TensorRT to ensure correct synchronizations. Please use non-default stream instead.
mem_utils.cpp:72: CHECK(status == cudaSuccess) failed. cudaMemcpy2DAsync failed.
[12/08/2023-14:49:41] [TRT] [E] 1: [runner.cpp::executeMyelinGraph::682] Error Code 1: Myelin (No Myelin Error exists)
[12/08/2023-14:49:41] [TRT] [E] 1: [checkMacros.cpp::catchCudaError::203] Error Code 1: Cuda Runtime (invalid argument)
Traceback (most recent call last):

Is it expected?

david-PHR avatar Dec 08 '23 14:12 david-PHR

It seems that Torch-TRT is using the default stream to call the TRT engine, which is not recommended. Let me ask the Torch-TRT team internally about this issue.

Meanwhile, could you work around this issue by wrapping your PyTorch code with:

import torch

s = torch.cuda.Stream()  # Create a new, non-default CUDA stream.
with torch.cuda.stream(s):
    ...  # call the PyTorch module's forward() here so work is enqueued on this stream

nvpohanh avatar Dec 11 '23 02:12 nvpohanh

@david-PHR hi ~ would you like to share your nsys profile results? I have encountered a similar problem where Myelin fuses the attention but it does not run with the MHA kernel; see:

  • https://github.com/NVIDIA/TensorRT/issues/3620

DefTruth avatar Jan 23 '24 07:01 DefTruth

Closing since there has been no activity for more than 3 weeks, per our policy. Thanks all!

ttyio avatar May 07 '24 18:05 ttyio