
LLaVA batch inference: only the result corresponding to the longest prompt is correct, while the other results are incorrect

Open lss15151161 opened this issue 1 year ago • 3 comments

Version: TensorRT-LLM 0.10.0. The official script (TensorRT-LLM/examples/multimodal/run.py) repeats the same prompt to form a batch. But if I use different prompts to form a batch, the results are incorrect. How can I solve this? Because the result corresponding to the longest prompt is correct, I think the cause is padding.

[screenshot: incorrect outputs when the batch uses different prompts]

If I use the same prompts, the results are correct.

[screenshot: correct outputs when the batch repeats the same prompt]

lss15151161 avatar Jul 03 '24 03:07 lss15151161
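For context, here is a minimal, self-contained sketch of the failure mode described above: when different-length prompts are right-padded into one batch, pad tokens end up between a shorter request's real tokens and the start of generation. The token IDs and pad ID below are invented for illustration and are not taken from the example script.

```python
# Toy illustration only: token IDs and PAD_ID are made up for this example.
PAD_ID = 0

def right_pad(batch_token_ids):
    """Pad every sequence to the longest length by appending PAD_ID."""
    max_len = max(len(ids) for ids in batch_token_ids)
    return [ids + [PAD_ID] * (max_len - len(ids)) for ids in batch_token_ids]

batch = [
    [101, 7592, 2088, 102],  # longest request: unchanged
    [101, 7592, 102],        # shorter request: pads get appended at the end
]

print(right_pad(batch))
# [[101, 7592, 2088, 102],
#  [101, 7592, 102, 0]]
# Generation for the shorter request now begins after a pad token rather than
# immediately after its last real token, so its output degrades while the
# longest request's output stays correct.
```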

@lss15151161 This example does not support using different prompts in a batch. Yes, the issue is that pad tokens will be added to the end of the shorter post_prompt when the prompts are different.

amukkara avatar Jul 04 '24 18:07 amukkara

@lss15151161 This example does not support using different prompts in a batch. Yes, the issue is that pad tokens will be added to the end of the shorter post_prompt when the prompts are different.

Thanks for the reply. So, do you know what I should do if I want to do batch inference?

lss15151161 avatar Jul 09 '24 08:07 lss15151161
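Until different prompts are supported in one batch, one workaround that stays within what this thread confirms works (identical prompts in a batch) is to group requests by prompt and run one batch per group. The sketch below is illustrative only; the request tuples and the run_batch helper are hypothetical, not part of the example script.

```python
from collections import defaultdict

def group_requests_by_prompt(requests):
    """Group (prompt, image) pairs so each sub-batch shares a single prompt.

    'requests' is a hypothetical list of (prompt_text, image) tuples; the
    point is only that every sub-batch repeats one prompt, which is the
    pattern the multimodal example is known to handle correctly.
    """
    groups = defaultdict(list)
    for prompt, image in requests:
        groups[prompt].append(image)
    return groups

# Usage sketch: run the example once per group, duplicating the prompt to
# match the number of images in that group.
# for prompt, images in group_requests_by_prompt(requests).items():
#     run_batch([prompt] * len(images), images)   # run_batch is hypothetical
```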

@lss15151161 This example does not support using different prompts in a batch. Yes, the issue is that pad tokens will be added to the end of the shorter post_prompt when the prompts are different.

And doesn't TensorRT-LLM remove pads internally?

lss15151161 avatar Jul 09 '24 08:07 lss15151161
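On the question of whether pads can simply be ignored: in general, runtimes deal with padding either by masking pad positions out of attention or by packing sequences to their true lengths and tracking per-request lengths. Below is a toy illustration of both ideas with made-up token IDs and pad ID 0; it is independent of how TensorRT-LLM handles padding internally.

```python
import numpy as np

PAD_ID = 0
batch = np.array([
    [101, 7592, 2088, 102],   # full-length request
    [101, 7592,  102,   0],   # right-padded request
])

# (a) attention mask: 1 for real tokens, 0 for pads
mask = (batch != PAD_ID).astype(int)

# (b) "packed" form: drop the pads and keep per-sequence lengths instead
lengths = mask.sum(axis=1)
packed = np.concatenate([row[:n] for row, n in zip(batch, lengths)])

print(mask)      # [[1 1 1 1] [1 1 1 0]]
print(lengths)   # [4 3]
print(packed)    # [ 101 7592 2088  102  101 7592  102]
```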

@lss15151161, thank you for raising this question about batch inference in LLaVA, and I'm sorry for the very delayed response. If you are still interested in batch inference, I'm pretty sure you'll find in-flight batching interesting. More details can be found in the TensorRT-LLM documentation on in-flight batching.

karljang avatar Aug 20 '25 22:08 karljang
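For readers landing here later: below is a minimal sketch of batching different-length prompts through the high-level LLM API, which schedules requests with in-flight batching in recent TensorRT-LLM releases. The model name and sampling settings are placeholders, and multimodal (LLaVA) inputs require the corresponding multimodal examples and documentation rather than this text-only snippet.

```python
from tensorrt_llm import LLM, SamplingParams

# Placeholder model; any supported checkpoint or prebuilt engine path works here.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = [
    "Describe the weather in one sentence.",
    "What is the capital of France?",          # different lengths are fine:
]                                              # requests are batched in flight

outputs = llm.generate(prompts, SamplingParams(max_tokens=32))
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```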