Support Qwen2-VL
System Info
Qwen2-VL adds the new M-ROPE feature; please support it.
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
The open-source Qwen2-VL model.
Expected behavior
TensorRT-LLM supports Qwen2-VL.
Actual behavior
TensorRT-LLM does not support Qwen2-VL.
Additional notes
None.
Hi, I'll do it.
Where can I find the PR files for Qwen2-VL support in TensorRT-LLM?
Any updates?
Hi, the work is in progress, I'll update it ASAP.
Any updates?
Any updates?
Any updates?
@sunnyqgg Is there a clear timeline for completing the model? Thanks.
Hi all, this is expected to merge into main in early November.
@pianogGG @sunnyqgg Hi, is this update available? Or is there any branch we can use first? Thanks
Hi, the code is under review and almost done, it'll be public soon.
Hi, is there any update yet?
Any updates?
How is the progress?
It's supported; please see examples/multimodal for more info.
Hi, Qwen2-VL runs successfully, but compared to using transformers directly, there is no significant improvement in time consumption or GPU memory usage. Is this within expectations?
I've encountered the same situation. For the Qwen2-VL 2B model, TRT-LLM is more than twice as slow as vLLM.
Hi @LugerW-A,
- "For the Qwen2-VL 2B model, TRT-LLM is more than twice as slow as vLLM." I have noticed this issue and fixed it already; I hope the fix will be public next week.
@peki12345 GPU memory usage ===> for the ViT part or the LLM part?
@sunnyqgg Thanks for your contribution! @kaiyux Hi, could you help make these changes public? Thanks a lot.
@sunnyqgg @kaiyux Hi, currently tensorrtllm_backend does not support the Qwen2-VL model. Is there a solution for this? Or can you tell us how to add support to tensorrtllm_backend? Thanks!
Any update this week, @sunnyqgg?
Hi, it's updated; please try the latest main code.
Thanks.
@sunnyqgg Do you mean tensorrtllm_backend already supports the Qwen2-VL model? Is that tensorrtllm_backend version v0.15.0?
Hi @fan-niu, unfortunately, as far as I know, tensorrtllm_backend does not support Qwen2-VL, and I'm not sure if anyone is working on it.
Thanks.
@sunnyqgg Since much of tensorrtllm_backend is closed-source, do you have any suggestions for how I could implement this feature in tensorrtllm_backend? Thanks!
I found reduced accuracy and wrong output when sending two pictures to Qwen2-VL-7B, while a single picture is fine. I also found the performance is lower than vLLM.
Message:
min_pixels = 4 * 28 * 28
max_pixels = 1024 * 1024 / 4
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///tmp/tmp.FD4KtwMJkZ/data/panda.jpg",
                "min_pixels": min_pixels,
                "max_pixels": max_pixels,
            },
            {
                "type": "image",
                "image": "file:///tmp/tmp.FD4KtwMJkZ/data/cat.png",
                "min_pixels": min_pixels,
                "max_pixels": max_pixels,
            },
            {"type": "text", "text": "描述一下两张图片的不同。"},
        ],
    }
]
HF output:
The difference between the two images is the type of animal shown. The first image shows a red panda, while the second shows a cat. The backgrounds also differ: the first image's background is a tree, while the second's is a gray concrete floor.
TRT-LLM output:
The two images look exactly the same: both show a red animal with its head resting on a wooden board, with a tree trunk and leaves in the background.
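For reference, the HF output above presumably comes from the standard transformers + qwen_vl_utils pipeline. A minimal sketch of such a run is below; the model ID "Qwen/Qwen2-VL-7B-Instruct" and the generation settings are assumptions, and `messages` is the list defined above.

# Minimal sketch of the HF reference run (assumed, not the reporter's exact script).
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Render the chat template and collect the two images from `messages`;
# process_vision_info honors the per-image min_pixels/max_pixels set above.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate, then strip the prompt tokens before decoding the answer.
generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])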
The VisionAttentionOpt code can be changed as follows, after which multi-picture requests work correctly: https://github.com/NetEase-Media/grps_trtllm/blob/b7bde55c177314621311aed8bc060c6deb9a0ed5/tools/qwen2vl/build_vit_engine.py#L222-L248
But for now I find the performance is lower than vLLM under multiple concurrent requests; at 1~2 concurrency it is better than vLLM.
Hi, please use the latest code, which is public today. For multi-batch accuracy, please change attention_mask_vit in tensorrt_llm/runtime/multimodal_model_runner.py as follows:
# Build a block-diagonal additive attention mask so that vision tokens from
# different images (delimited by cu_seqlens) cannot attend to each other.
attention_mask_vit = torch.full([1, seq_length, seq_length],
                                torch.finfo(torch.float16).min,
                                device=image.device,
                                dtype=image.dtype)
for i in range(1, len(cu_seqlens)):
    # Allow attention only within each image's own token range.
    attention_mask_vit[..., cu_seqlens[i - 1]:cu_seqlens[i],
                       cu_seqlens[i - 1]:cu_seqlens[i]] = 0
Please let me know if there are any other issues.
Thanks.
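To see what this fix does, here is a small standalone sketch (the token counts in cu_seqlens are made up) showing that the mask is block-diagonal, so ViT patches from different images no longer attend to each other:

import torch

# Toy example (made-up sizes): two images whose flattened ViT patch tokens
# occupy the ranges [0, 4) and [4, 10).
cu_seqlens = [0, 4, 10]
seq_length = cu_seqlens[-1]

# Additive mask: float16 min means "masked out", 0 means "may attend".
mask = torch.full([1, seq_length, seq_length],
                  torch.finfo(torch.float16).min, dtype=torch.float16)
for i in range(1, len(cu_seqlens)):
    mask[..., cu_seqlens[i - 1]:cu_seqlens[i],
         cu_seqlens[i - 1]:cu_seqlens[i]] = 0

# Prints a block-diagonal 0/1 pattern: each image's patches attend only to
# patches of the same image.
print((mask[0] == 0).int())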
I think this is indeed the cause of the bug. By debugging, I found that Qwen2-VL uses VisionSdpaAttention by default; using it instead of VisionAttention also fixes the issue.
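For reference, a minimal sketch of forcing that attention class on the HF side before exporting the ViT; the model ID is a placeholder, and attn_implementation="sdpa" is the standard transformers switch that makes the Qwen2-VL vision tower instantiate VisionSdpaAttention instead of the eager VisionAttention:

import torch
from transformers import Qwen2VLForConditionalGeneration

# Sketch only: load the HF model with the SDPA attention implementation so
# the vision blocks use VisionSdpaAttention (which builds its attention mask
# from cu_seqlens) rather than the eager VisionAttention path.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",  # placeholder for the local checkpoint
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
)

# Check which attention class the vision tower actually got.
print(type(model.visual.blocks[0].attn).__name__)  # expect: VisionSdpaAttention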
When processing multiple concurrent requests, inference seems to be queued, and the time consumption seems to increase by a multiple of the concurrency level, even though I set --max_batch_size=4. InternVL2 (which does not use the multi-rope op) is fine. Is this a bug in the multi-rope op?
@sunnyqgg
Could you take a look at this issue? @sunnyqgg