TensorRT-LLM
Is there any feature related to GPT-like models that can be applied to BERT-like models?
They share some common optimization ideas, such as fusing the multi-head attention kernel and quantizing the model to INT8 or FP8.
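To make the quantization idea concrete, here is a minimal, generic sketch of symmetric per-tensor INT8 weight quantization. This is illustrative only and does not reflect TensorRT-LLM's actual kernels or calibration pipeline; the function names are hypothetical.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric per-tensor INT8 quantization: the scale maps the
    # largest absolute weight to 127; values are rounded and clipped.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original FP32 weights.
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.02, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
# Per-element reconstruction error is bounded by scale / 2.
```

Because the same linear layers appear in both encoder (BERT-like) and decoder (GPT-like) architectures, this kind of weight quantization applies to either model family.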
Hi @byshiue, a related question: does BertAttentionPlugin also use FlashAttention2, as GptAttention does?
Yes.