LLaVA batch inference: only the result corresponding to the longest prompt is correct, while the other results are incorrect
version: TensorRT-LLM 0.10.0
The official script (TensorRT-LLM/examples/multimodal/run.py) repeats the same prompt to form a batch, but if I use different prompts in a batch, the results are incorrect. How can I solve this?
Because the result corresponding to the longest prompt is correct, I suspect the cause is padding.
If I use the same prompts, the results are correct.
@lss15151161 This example does not support using different prompts in a batch. Yes, the issue is that pad tokens are added to the end of the shorter post_prompt when the prompts are different.
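For illustration, here is a minimal sketch of why trailing pads corrupt generation for a decoder-only model, using only the Hugging Face tokenizer (the gpt2 tokenizer and the prompts are placeholders, not part of the TensorRT-LLM example):

```python
# Sketch only: shows where pad tokens land for a batch of
# different-length prompts. Tokenizer and prompts are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

prompts = ["Describe the image.", "What is the main object shown in this image?"]

# Right padding (what effectively happens to the shorter post_prompt):
# pad tokens end up between the prompt and the position where generation
# starts, so the shorter sequence conditions on pads instead of its prompt.
tokenizer.padding_side = "right"
right = tokenizer(prompts, padding=True, return_tensors="pt")
print(right.input_ids)

# Left padding keeps every prompt flush against the generation position,
# which is what decoder-only generation needs.
tokenizer.padding_side = "left"
left = tokenizer(prompts, padding=True, return_tensors="pt")
print(left.input_ids)
```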
Thanks for the reply! So, do you know what I should do if I want to do batch inference?
And doesn't TensorRT-LLM remove pads internally?
@lss15151161 Thank you for raising this question about batch inference in LLaVA, and I'm sorry for the very delayed response. If you are still interested in batch inference, I'm pretty sure you'll find in-flight batching interesting. More details can be found here:
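In the meantime, here is a minimal sketch of what that looks like with the high-level LLM API, which runs on the in-flight batching runtime so each request is scheduled independently and no cross-prompt padding is involved. This assumes a TensorRT-LLM release newer than 0.10.0 that ships this API; the model path is a placeholder, and multimodal support through this API may depend on the release:

```python
# Sketch only: assumes a recent TensorRT-LLM release with the
# high-level LLM API (in-flight batching underneath).
# The model path below is a placeholder.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="path/to/your/model")

prompts = [
    "Describe a cat in one sentence.",
    "What is the capital of France?",  # different lengths are fine here
]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))

for out in outputs:
    print(out.outputs[0].text)
```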