
support Qwen2-VL

junwenZhang opened this issue 1 year ago

System Info

Qwen2-VL adds a new feature, M-RoPE (Multimodal Rotary Position Embedding). Please support it.
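
For context, a minimal illustrative sketch (not TensorRT-LLM code, and simplified: it ignores spatial patch merging) of how M-RoPE assigns three position indices (temporal, height, width) per token, following the behavior of the Hugging Face Qwen2-VL implementation: text tokens share one running index across all three axes, while vision tokens enumerate their grid coordinates offset by the preceding text length.

import torch

def mrope_position_ids(n_text_before, grid_t, grid_h, grid_w, n_text_after):
    # Text tokens before the image: the same running index on all three axes.
    pos = [(i, i, i) for i in range(n_text_before)]
    # Vision tokens: offset by the preceding text length, then enumerate (t, h, w).
    start = n_text_before
    for t in range(grid_t):
        for h in range(grid_h):
            for w in range(grid_w):
                pos.append((start + t, start + h, start + w))
    # Text after the image resumes from the largest index used so far, plus one.
    nxt = max(max(p) for p in pos) + 1
    pos += [(nxt + i, nxt + i, nxt + i) for i in range(n_text_after)]
    return torch.tensor(pos).T  # shape (3, seq_len): temporal, height, width rows

# Example: 2 text tokens, a 1x2x2 vision grid, then 2 more text tokens.
print(mrope_position_ids(2, 1, 2, 2, 2))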

Who can help?

No response

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

The open-source Qwen2-VL model.

Expected behavior

TensorRT-LLM supports Qwen2-VL.

actual behavior

TensorRT-LLM does not support Qwen2-VL.

additional notes

no

junwenZhang avatar Sep 03 '24 09:09 junwenZhang

Hi, I'll do it.

sunnyqgg avatar Sep 04 '24 06:09 sunnyqgg

Where can I find the PR / files for Qwen2-VL support in TensorRT-LLM?

scdotbox avatar Sep 19 '24 08:09 scdotbox

Any updates?

zhaocc1106 avatar Sep 24 '24 03:09 zhaocc1106

Hi, the work is in progress, I'll update it ASAP.

sunnyqgg avatar Sep 24 '24 05:09 sunnyqgg

Any updates?

Chenhaolin6 avatar Oct 08 '24 10:10 Chenhaolin6

Any updates?

junwenZhang avatar Oct 21 '24 09:10 junwenZhang

Any updates?

GuangyanZhang avatar Oct 22 '24 06:10 GuangyanZhang

@sunnyqgg Is there a clear timeline to complete the model? thanks.

chenqy4933 avatar Oct 24 '24 01:10 chenqy4933

Hi all, this is expected to be merged into main in early November.

pianogGG avatar Oct 25 '24 02:10 pianogGG

@pianogGG @sunnyqgg Hi, is this update available? Or is there any branch we can use first? Thanks

fan-niu avatar Nov 11 '24 01:11 fan-niu

Hi, the code is under review and almost done, it'll be public soon.

sunnyqgg avatar Nov 12 '24 03:11 sunnyqgg

Hi, the code is under review and almost done, it'll be public soon.

Hi, is there any update yet?

Hukongtao avatar Nov 18 '24 06:11 Hukongtao

any updates?

linccnu avatar Nov 19 '24 06:11 linccnu

How is the progress?

LugerW-A avatar Nov 19 '24 09:11 LugerW-A

It's supported, pls see examples/multimodal for more info.

sunnyqgg avatar Nov 20 '24 02:11 sunnyqgg

It's supported, pls see examples/multimodal for more info.

Hi, Qwen2-VL runs successfully, but compared to importing it directly via transformers, there is no significant improvement in latency or GPU memory usage. Is this expected?

peki12345 avatar Nov 22 '24 00:11 peki12345

It's supported, pls see examples/multimodal for more info.

Hi, Qwen2-VL runs successfully, but compared to importing it directly via transformers, there is no significant improvement in latency or GPU memory usage. Is this expected?

I've encountered the same situation. For the Qwen2-VL 2B model, TRT-LLM is more than twice as slow as vLLM.
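
For what it's worth, a minimal sketch of how the Hugging Face baseline side of such a comparison can be timed so it is apples-to-apples with TensorRT-LLM / vLLM (same dtype and max_new_tokens); the model name, prompt, and token budget below are assumptions, not values taken from this thread.

import time
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype=torch.float16, device_map="cuda")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

inputs = processor(text=["Describe a cat."], return_tensors="pt").to("cuda")
torch.cuda.synchronize()
t0 = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
dt = time.perf_counter() - t0
new_tokens = out.shape[1] - inputs.input_ids.shape[1]
print(f"{new_tokens} tokens in {dt:.2f}s -> {new_tokens / dt:.1f} tok/s, "
      f"peak GPU mem {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")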

LugerW-A avatar Nov 26 '24 03:11 LugerW-A

Hi @LugerW-A,

  • For the Qwen2-VL 2B model, TRT-LLM is more than twice as slow as vLLM.

I have noticed this issue and already fixed it; I hope the fix will be public next week.

@peki12345 Regarding GPU memory usage: do you mean the ViT part or the LLM part?

sunnyqgg avatar Nov 26 '24 10:11 sunnyqgg

Hi @LugerW-A,

  • For the Qwen2-VL 2B model, TRT-LLM is more than twice as slow as vLLM.

I have noticed this issue and already fixed it; I hope the fix will be public next week.

@peki12345 Regarding GPU memory usage: do you mean the ViT part or the LLM part?

@sunnyqgg Thanks for your contribution! @kaiyux Hi, could you help make these changes public? Thanks a lot.

@sunnyqgg @kaiyux Hi, currently tensorrtllm_backend does not support the Qwen2-VL model. Is there a solution for this? Or could you tell us how to add support to tensorrtllm_backend? Thanks!

fan-niu avatar Nov 27 '24 02:11 fan-niu

Any update this week, @sunnyqgg?

alimgl-pixel avatar Dec 04 '24 07:12 alimgl-pixel

Hi, it's updated; please try the latest main code.

Thanks.

sunnyqgg avatar Dec 11 '24 09:12 sunnyqgg

Hi, it's updated; please try the latest main code.

Thanks.

@sunnyqgg Do you mean tensorrtllm_backend already supports the Qwen2-VL model? Is that tensorrtllm_backend v0.15.0?

fan-niu avatar Dec 11 '24 09:12 fan-niu

Hi @fan-niu, unfortunately, as far as I know tensorrtllm_backend has no support for Qwen2-VL, and I'm not sure if anyone is working on it.

Thanks.

sunnyqgg avatar Dec 12 '24 09:12 sunnyqgg

@sunnyqgg Since tensorrtllm_backend is largely closed-source, do you have any suggestions for how I could implement this feature in tensorrtllm_backend myself? Thanks!

fan-niu avatar Dec 12 '24 09:12 fan-niu

I found reduced accuracy and incorrect output when passing two images to Qwen2-VL-7B, while a single image is fine. I also found the performance is lower than vLLM.

message:

min_pixels = 4 * 28 * 28
max_pixels = 1024 * 1024 / 4
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///tmp/tmp.FD4KtwMJkZ/data/panda.jpg",
                "min_pixels": min_pixels,
                "max_pixels": max_pixels,
            },
            {
                "type": "image",
                "image": "file:///tmp/tmp.FD4KtwMJkZ/data/cat.png",
                "min_pixels": min_pixels,
                "max_pixels": max_pixels,
            },
            # "Describe the differences between the two images."
            {"type": "text", "text": "描述一下两张图片的不同。"},
        ],
    }
]

hf output:

这两个图片的不同之处在于它们展示的动物种类不同。第一张图片展示的是一只小熊猫,而第二张图片展示的是一只猫。此外,这两张图片的背景也不同,第一张图片的背景是一棵树,而第二张图片的背景是一块灰色的水泥地。
(English: The two images differ in the kind of animal shown. The first image shows a red panda, while the second shows a cat. The backgrounds also differ: the first image's background is a tree, while the second's is a gray concrete surface.)

trtllm output:

这两张图片看起来完全相同,都是一个红色的动物,头靠在木板上,背景是树干和树叶。
(English: The two images look exactly the same; both show a red animal resting its head on a wooden board, with a tree trunk and leaves in the background.)
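
For reference, a hedged sketch of how the messages above are typically fed to the Hugging Face baseline (the exact script is not shown in this thread; the checkpoint path and generation settings below are assumptions):

import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Render the chat template and extract the two images referenced in `messages`.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
generated = generated[:, inputs.input_ids.shape[1]:]  # drop the prompt tokens
print(processor.batch_decode(generated, skip_special_tokens=True)[0])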

zhaocc1106 avatar Dec 13 '24 10:12 zhaocc1106

I found reduced accuracy and incorrect output when passing two images to Qwen2-VL-7B, while a single image is fine. I also found the performance is lower than vLLM.


The VisionAttentionOpt code can be changed as shown at the link below (lines 222-248); then multi-image requests work correctly: https://github.com/NetEase-Media/grps_trtllm/blob/b7bde55c177314621311aed8bc060c6deb9a0ed5/tools/qwen2vl/build_vit_engine.py#L222

But for now, I find the performance is lower than vLLM under multiple concurrent requests, while at a concurrency of 1 to 2 it is better than vLLM.

zhaocc1106 avatar Dec 17 '24 12:12 zhaocc1106

Hi, please use the latest code, which is public today. For multi-batch accuracy, please change the attention_mask_vit in tensorrt_llm/runtime/multimodal_model_runner.py:

           # Build a block-diagonal attention mask so that the vision tokens of each
           # image only attend to tokens from the same image.
           attention_mask_vit = torch.full([1, seq_length, seq_length],
                                           torch.finfo(torch.float16).min,
                                           device=image.device,
                                           dtype=image.dtype)
           # cu_seqlens holds cumulative vision-token counts per image; un-mask
           # (set to 0) each per-image diagonal block.
           for i in range(1, len(cu_seqlens)):
               attention_mask_vit[..., cu_seqlens[i - 1]:cu_seqlens[i],
                                  cu_seqlens[i - 1]:cu_seqlens[i]] = 0
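
For intuition, a small self-contained illustration (with made-up per-image token counts, not the runner's real inputs) of the block-diagonal mask this snippet builds, so that each image's patches only attend to patches of the same image:

import torch

seq_lens = [4, 6]        # hypothetical number of vision tokens for image 1 and image 2
cu_seqlens = [0]
for n in seq_lens:
    cu_seqlens.append(cu_seqlens[-1] + n)   # cumulative boundaries: [0, 4, 10]
seq_length = cu_seqlens[-1]

mask = torch.full([1, seq_length, seq_length], torch.finfo(torch.float16).min)
for i in range(1, len(cu_seqlens)):
    mask[..., cu_seqlens[i - 1]:cu_seqlens[i], cu_seqlens[i - 1]:cu_seqlens[i]] = 0

print((mask[0] == 0).int())  # the 1s form a 4x4 and a 6x6 block on the diagonal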

Please let me know if there are any other issues.

Thanks.

sunnyqgg avatar Dec 17 '24 14:12 sunnyqgg

Hi, please use the latest code, which is public today. For multi-batch accuracy, please change the attention_mask_vit in tensorrt_llm/runtime/multimodal_model_runner.py (see the snippet above).


I think this is indeed the cause of the bug. By debugging, I found that Qwen2-VL uses VisionSdpaAttention by default; I used it instead of VisionAttention and found that this also fixes the issue.

zhaocc1106 avatar Dec 17 '24 15:12 zhaocc1106

When processing multiple concurrent requests, inference seems to be queued: the time consumption increases by roughly a multiple of the concurrency level, even though I have set --max_batch_size=4. InternVL2 (which does not use the multi-rope op) is fine. Is this a bug in the multi-rope op?

@sunnyqgg
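
One way to make the queuing visible (a rough sketch; infer_one is a hypothetical stand-in for whatever client or runner call is used here, not a TensorRT-LLM API):

import time
from concurrent.futures import ThreadPoolExecutor

def infer_one(request):
    # Hypothetical placeholder: call the multimodal runner / serving endpoint here.
    pass

def measure(concurrency, request):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(infer_one, [request] * concurrency))
    return time.perf_counter() - start

# If total latency grows roughly linearly with concurrency (t_4 close to 4 * t_1)
# even with --max_batch_size=4, requests are effectively serialized instead of batched.
for c in (1, 2, 4):
    print(c, round(measure(c, "same request"), 3))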

zhaocc1106 avatar Jan 17 '25 16:01 zhaocc1106

When processing multiple concurrent requests, inference seems to be queued: the time consumption increases by roughly a multiple of the concurrency level, even though I have set --max_batch_size=4. InternVL2 (which does not use the multi-rope op) is fine. Is this a bug in the multi-rope op?

@sunnyqgg

Could you take a look at this issue? @sunnyqgg

zhaocc1106 avatar Jan 21 '25 08:01 zhaocc1106