
LLaVA batch inference: only the result corresponding to the longest prompt is correct, while the other results are incorrect

Open lss15151161 opened this issue 1 year ago • 3 comments

Version: TensorRT-LLM 0.10.0. The official script (TensorRT-LLM/examples/multimodal/run.py) repeats the same prompt to form a batch. But if I use different prompts to form a batch, the results are incorrect. How can I solve this? Because the result corresponding to the longest prompt is correct, I think the cause is padding.

[screenshot: incorrect outputs when the batch uses different prompts]

If I use the same prompts, the results are correct.

[screenshot: correct outputs when the batch repeats the same prompt]

lss15151161 avatar Jul 03 '24 03:07 lss15151161
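For context, here is a minimal, self-contained sketch of the failure mode described above: when different-length prompts are right-padded into one batch, pad tokens end up between a shorter request's real tokens and the start of generation. The token IDs and pad ID below are invented for illustration and are not taken from the example script.

```python
# Toy illustration only: token IDs and PAD_ID are made up for this example.
PAD_ID = 0

def right_pad(batch_token_ids):
    """Pad every sequence to the longest length by appending PAD_ID."""
    max_len = max(len(ids) for ids in batch_token_ids)
    return [ids + [PAD_ID] * (max_len - len(ids)) for ids in batch_token_ids]

batch = [
    [101, 7592, 2088, 102],  # longest request: unchanged
    [101, 7592, 102],        # shorter request: pads get appended at the end
]

print(right_pad(batch))
# [[101, 7592, 2088, 102],
#  [101, 7592, 102, 0]]
# Generation for the shorter request now begins after a pad token rather than
# immediately after its last real token, so its output degrades while the
# longest request's output stays correct.
```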

@lss15151161 This example does not support using different prompts in a batch. Yes, the issue is that pad tokens will be added to the end of the shorter post_prompt when the prompts are different.

amukkara avatar Jul 04 '24 18:07 amukkara

@lss15151161 This example does not support using different prompts in a batch. Yes, the issue is that pad tokens will be added to the end of the shorter post_prompt when the prompts are different.

Thanks for the reply. So, do you know what I should do if I want to do batch inference?

lss15151161 avatar Jul 09 '24 08:07 lss15151161
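Until different prompts are supported in one batch, one workaround that stays within what this thread confirms works (identical prompts in a batch) is to group requests by prompt and run one batch per group. The sketch below is illustrative only; the request tuples and the run_batch helper are hypothetical, not part of the example script.

```python
from collections import defaultdict

def group_requests_by_prompt(requests):
    """Group (prompt, image) pairs so each sub-batch shares a single prompt.

    'requests' is a hypothetical list of (prompt_text, image) tuples; the
    point is only that every sub-batch repeats one prompt, which is the
    pattern the multimodal example is known to handle correctly.
    """
    groups = defaultdict(list)
    for prompt, image in requests:
        groups[prompt].append(image)
    return groups

# Usage sketch: run the example once per group, duplicating the prompt to
# match the number of images in that group.
# for prompt, images in group_requests_by_prompt(requests).items():
#     run_batch([prompt] * len(images), images)   # run_batch is hypothetical
```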

@lss15151161 This example does not support using different prompts in a batch. Yes, the issue is that pad tokens will be added to the end of the shorter post_prompt when the prompts are different.

And doesn't TensorRT-LLM remove pads internally?

lss15151161 avatar Jul 09 '24 08:07 lss15151161
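On the question of whether pads can simply be ignored: in general, runtimes deal with padding either by masking pad positions out of attention or by packing sequences to their true lengths and tracking per-request lengths. Below is a toy illustration of both ideas with made-up token IDs and pad ID 0; it is independent of how TensorRT-LLM handles padding internally.

```python
import numpy as np

PAD_ID = 0
batch = np.array([
    [101, 7592, 2088, 102],   # full-length request
    [101, 7592,  102,   0],   # right-padded request
])

# (a) attention mask: 1 for real tokens, 0 for pads
mask = (batch != PAD_ID).astype(int)

# (b) "packed" form: drop the pads and keep per-sequence lengths instead
lengths = mask.sum(axis=1)
packed = np.concatenate([row[:n] for row, n in zip(batch, lengths)])

print(mask)      # [[1 1 1 1] [1 1 1 0]]
print(lengths)   # [4 3]
print(packed)    # [ 101 7592 2088  102  101 7592  102]
```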

@lss15151161, thank you for raising this question about batch inference in LLaVA, and I'm sorry for the very delayed response. If you are still interested in batch inference, I'm pretty sure you'll find in-flight batching interesting. More details can be found in the TensorRT-LLM documentation on in-flight batching.

karljang avatar Aug 20 '25 22:08 karljang
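For readers landing here later: below is a minimal sketch of batching different-length prompts through the high-level LLM API, which schedules requests with in-flight batching in recent TensorRT-LLM releases. The model name and sampling settings are placeholders, and multimodal (LLaVA) inputs require the corresponding multimodal examples and documentation rather than this text-only snippet.

```python
from tensorrt_llm import LLM, SamplingParams

# Placeholder model; any supported checkpoint or prebuilt engine path works here.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = [
    "Describe the weather in one sentence.",
    "What is the capital of France?",          # different lengths are fine:
]                                              # requests are batched in flight

outputs = llm.generate(prompts, SamplingParams(max_tokens=32))
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```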