Support Qwen2-VL
System Info
Qwen2-VL adds the new M-ROPE feature; please support it.
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
The open-source Qwen2-VL model.
Expected behavior
TensorRT-LLM supports Qwen2-VL.
Actual behavior
TensorRT-LLM does not support Qwen2-VL.
Additional notes
None.
Hi, I'll do it.
Where can I find the PR files for Qwen2-VL support in TensorRT-LLM?
Any updates?
Hi, the work is in progress, I'll update it ASAP.
Any updates?
Any updates?
Any updates?
@sunnyqgg Is there a clear timeline for completing the model? Thanks.
Hi all, this is expected to merge into main in early November.
@pianogGG @sunnyqgg Hi, is this update available? Or is there any branch we can use first? Thanks
Hi, the code is under review and almost done, it'll be public soon.
Hi, is there any update yet?
Any updates?
How is the progress?
It's supported; please see examples/multimodal for more info.
Hi, Qwen2-VL runs successfully, but compared to using transformers directly, there is no significant improvement in time consumption or GPU memory usage. Is this within expectations?
I've encountered the same situation. For the Qwen2-VL 2B model, TRT-LLM is more than twice as slow as vLLM.
Hi @LugerW-A,
- "For the Qwen2-VL 2B model, TRT-LLM is more than twice as slow as vLLM." I have noticed this issue and fixed it already; I hope the fix will be public next week.
@peki12345 GPU memory usage ===> for the ViT part or the LLM part?
@sunnyqgg Thanks for your contribution! @kaiyux Hi, could you help make these changes public? Thanks a lot.
@sunnyqgg @kaiyux Hi, currently tensorrtllm_backend does not support the Qwen2-VL model. Is there a solution for this? Or can you tell us how to add support to tensorrtllm_backend? Thanks!
Any update this week, @sunnyqgg?
Hi, it's updated; please try the latest main code.
Thanks.
@sunnyqgg Do you mean tensorrtllm_backend already supports the Qwen2-VL model? Is that tensorrtllm_backend version v0.15.0?
Hi @fan-niu, unfortunately, as far as I know, tensorrtllm_backend does not support Qwen2-VL, and I'm not sure if anyone is working on it.
Thanks.
@sunnyqgg Since much of tensorrtllm_backend is closed-source, do you have any suggestions for how I could implement this feature in tensorrtllm_backend? Thanks!
I found reduced accuracy and wrong output when sending two pictures to Qwen2-VL-7B, while a single picture is fine. I also found the performance is lower than vLLM.
Message:
min_pixels = 4 * 28 * 28
max_pixels = 1024 * 1024 / 4
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///tmp/tmp.FD4KtwMJkZ/data/panda.jpg",
                "min_pixels": min_pixels,
                "max_pixels": max_pixels,
            },
            {
                "type": "image",
                "image": "file:///tmp/tmp.FD4KtwMJkZ/data/cat.png",
                "min_pixels": min_pixels,
                "max_pixels": max_pixels,
            },
            {"type": "text", "text": "描述一下两张图片的不同。"},
        ],
    }
]
HF output:
The difference between the two images is the type of animal shown. The first image shows a red panda, while the second shows a cat. The backgrounds also differ: the first image's background is a tree, while the second's is a gray concrete floor.
TRT-LLM output:
The two images look exactly the same: both show a red animal with its head resting on a wooden board, with a tree trunk and leaves in the background.
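For reference, the HF output above presumably comes from the standard transformers + qwen_vl_utils pipeline. A minimal sketch of such a run is below; the model ID "Qwen/Qwen2-VL-7B-Instruct" and the generation settings are assumptions, and `messages` is the list defined above.

# Minimal sketch of the HF reference run (assumed, not the reporter's exact script).
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Render the chat template and collect the two images from `messages`;
# process_vision_info honors the per-image min_pixels/max_pixels set above.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate, then strip the prompt tokens before decoding the answer.
generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])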
The VisionAttentionOpt code can be changed as follows, after which multi-picture requests work correctly: https://github.com/NetEase-Media/grps_trtllm/blob/b7bde55c177314621311aed8bc060c6deb9a0ed5/tools/qwen2vl/build_vit_engine.py#L222-L248
But for now I find the performance is lower than vLLM under multiple concurrent requests; at 1~2 concurrency it is better than vLLM.
Hi, please use the latest code, which is public today. For multi-batch accuracy, please change attention_mask_vit in tensorrt_llm/runtime/multimodal_model_runner.py as follows:
# Build a block-diagonal additive attention mask so that vision tokens from
# different images (delimited by cu_seqlens) cannot attend to each other.
attention_mask_vit = torch.full([1, seq_length, seq_length],
                                torch.finfo(torch.float16).min,
                                device=image.device,
                                dtype=image.dtype)
for i in range(1, len(cu_seqlens)):
    # Allow attention only within each image's own token range.
    attention_mask_vit[..., cu_seqlens[i - 1]:cu_seqlens[i],
                       cu_seqlens[i - 1]:cu_seqlens[i]] = 0
Please let me know if there are any other issues.
Thanks.
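To see what this fix does, here is a small standalone sketch (the token counts in cu_seqlens are made up) showing that the mask is block-diagonal, so ViT patches from different images no longer attend to each other:

import torch

# Toy example (made-up sizes): two images whose flattened ViT patch tokens
# occupy the ranges [0, 4) and [4, 10).
cu_seqlens = [0, 4, 10]
seq_length = cu_seqlens[-1]

# Additive mask: float16 min means "masked out", 0 means "may attend".
mask = torch.full([1, seq_length, seq_length],
                  torch.finfo(torch.float16).min, dtype=torch.float16)
for i in range(1, len(cu_seqlens)):
    mask[..., cu_seqlens[i - 1]:cu_seqlens[i],
         cu_seqlens[i - 1]:cu_seqlens[i]] = 0

# Prints a block-diagonal 0/1 pattern: each image's patches attend only to
# patches of the same image.
print((mask[0] == 0).int())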
I think this is indeed the cause of the bug. By debugging, I found that Qwen2-VL uses VisionSdpaAttention by default; using it instead of VisionAttention also fixes the issue.
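For reference, a minimal sketch of forcing that attention class on the HF side before exporting the ViT; the model ID is a placeholder, and attn_implementation="sdpa" is the standard transformers switch that makes the Qwen2-VL vision tower instantiate VisionSdpaAttention instead of the eager VisionAttention:

import torch
from transformers import Qwen2VLForConditionalGeneration

# Sketch only: load the HF model with the SDPA attention implementation so
# the vision blocks use VisionSdpaAttention (which builds its attention mask
# from cu_seqlens) rather than the eager VisionAttention path.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",  # placeholder for the local checkpoint
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
)

# Check which attention class the vision tower actually got.
print(type(model.visual.blocks[0].attn).__name__)  # expect: VisionSdpaAttention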
When processing multiple concurrent requests, inference seems to be queued, and the time consumption seems to increase by a multiple of the concurrency level, even though I set --max_batch_size=4. InternVL2 (which does not use the multi-rope op) is fine. Is this a bug in the multi-rope op?
@sunnyqgg
Could you take a look at this issue? @sunnyqgg