The LLaVA model's batch inference results differ from those with batch_size=1
System info
- GPU: A100
- TensorRT: 9.3.0.post12.dev1
- TensorRT-LLM: 0.9.0
- torch: 2.2.2
Reproduction
export MODEL_NAME="llava-1.5-7b-hf"
git clone https://huggingface.co/llava-hf/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
python ../llama/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--dtype float16
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
--gemm_plugin float16 \
--use_fused_mlp \
--max_batch_size 16 \
--max_input_len 2048 \
--max_output_len 512 \
--max_multimodal_len 9216 # 16 (max_batch_size) * 576 (num_visual_features)
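As a quick sanity check on that flag (the names below simply mirror the comment above, not any TRT-LLM API), the multimodal length budget should be the batch size times the number of visual features per image:

```python
# --max_multimodal_len must cover the visual tokens of a full batch.
# LLaVA-1.5 contributes 576 visual features per image.
max_batch_size = 16
num_visual_features = 576
max_multimodal_len = max_batch_size * num_visual_features
print(max_multimodal_len)  # 9216
```

If the engine were built with the comment's stated 1 * 576 = 576 instead, a batch of 16 images would exceed the multimodal budget.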
python build_visual_engine.py --model_path tmp/hf_models/${MODEL_NAME} --model_type llava # or "--model_type vila" for VILA
python run.py \
--max_new_tokens 20 \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--visual_engine_dir visual_engines/${MODEL_NAME} \
--llm_engine_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
--decoder_llm \
--input_text "Question: which city is this? Answer:" \
--batch_size 16
If I use the same data to form a batch, the result looks like this:
And if I use two different prompts to form a batch, the result looks like this:
The image used is: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png
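To make the reported mismatch concrete, here is a minimal, hypothetical consistency check (outputs_match is not part of run.py) that compares the token ids a batched run produces against per-prompt batch_size=1 runs:

```python
def outputs_match(batched_ids, single_ids):
    """Return True if every sample's token ids from the batched run
    equal the ids produced by running that sample at batch_size=1."""
    return all(b == s for b, s in zip(batched_ids, single_ids))

# With the same prompt repeated across the batch, all rows should agree
# (dummy token ids for illustration):
same_prompt_batch = [[101, 2009, 2003], [101, 2009, 2003]]
per_sample_runs = [[101, 2009, 2003], [101, 2009, 2003]]
print(outputs_match(same_prompt_batch, per_sample_runs))  # True
```

The bug described above is exactly this check failing: rows of a batch diverge from what each prompt produces on its own.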
I saw similar results with Llama 3. Mine was resolved when I disabled 'use_custom_all_reduce' at engine build time.
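For reference, a rebuild sketch with custom all-reduce disabled; the --use_custom_all_reduce flag and its accepted values may differ between TRT-LLM versions, so verify against trtllm-build --help before using it:

```shell
# Rebuild the LLM engine with the custom all-reduce kernel disabled.
# Flag availability varies by TRT-LLM version (check trtllm-build --help).
trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --output_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --gemm_plugin float16 \
    --use_custom_all_reduce disable \
    --max_batch_size 16 \
    --max_input_len 2048 \
    --max_output_len 512 \
    --max_multimodal_len 9216
```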
Could you try the latest version, TRT-LLM 0.11+? https://nvidia.github.io/TensorRT-LLM/installation/linux.html
I tried trt_llm 0.10 and 0.11; the problem appears when I import tensorrt_llm.
If I install TRT_LLM 0.11, this occurs: ModuleNotFoundError: No module named 'tensorrt_llm.bindings.BuildInfo'. Do you know how to solve it?
Does 'use_custom_all_reduce' affect a single GPU? I don't use TP or PP.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.