sunnyqgg
Hi @bnuzhanyu, you got the "TensorrtLLM output" from the source code of "Tensorrt-llm Print input image and output embedding:", right? And how did you get the "Qwen-VL-Chat ModelScope" results? If...
Hi @calico-niko @bnuzhanyu, the ViT is offloaded to TRT, and its FP32 accuracy on TRT 9.3 is aligned with PyTorch. You can also change the version of TRT...
Hi @hezeli123, the diffs are smaller compared with TRT 9.x; do the current ViT diffs have a big impact on the final results? If so, you can try to...
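As a side note for anyone measuring such diffs: a simple way to check whether ViT embedding differences matter is to compare the TRT and PyTorch outputs directly with max absolute error and cosine similarity. The helper below is a hypothetical sketch (not part of TensorRT-LLM); it assumes you have already dumped both embeddings as flat lists of floats.

```python
import math

def embedding_diff_stats(a, b):
    """Compare two flat embedding vectors of equal length.

    Returns (max_abs_diff, cosine_similarity). A tiny max-abs diff
    and cosine close to 1.0 suggest the diffs are numerically benign.
    """
    assert len(a) == len(b), "embeddings must have the same length"
    max_abs = max(abs(x - y) for x, y in zip(a, b))
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = dot / (norm_a * norm_b)
    return max_abs, cosine

# Example: a small uniform perturbation barely moves the cosine similarity.
ref = [1.0] * 8
test = [x + 1e-3 for x in ref]
max_abs, cosine = embedding_diff_stats(ref, test)
print(f"max_abs_diff={max_abs:.4e} cosine={cosine:.6f}")
```

If the cosine similarity stays very close to 1.0 but the end-to-end answers still change, the sensitivity is likely downstream of the ViT rather than in the ViT itself.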
OK. If you have a strong need to use FP16, I'll continue to look at this issue; if not, it will have a lower priority.
Hi @jdmdj1999 @chiquitita-101, which TRT version are you using, and which quantization method are you using for Qwen?
Hi, I'll do it.
Hi, the work is in progress, I'll update it ASAP.
Hi, the code is under review and almost done, it'll be public soon.
It's supported; please see examples/multimodal for more info.
Hi @LugerW-A,

> For the Qwen2-VL 2B model, TRT_LLM is more than twice as slow as vllm.

I have noticed this issue and fixed it already; hope it'll be public...