
Cannot change max_input_len of the encoder when building engines for an encoder-decoder model (T5)

Open • thanhlt998 opened this issue 1 year ago • 0 comments

I built the engines for a T5 model with the following script, using the latest version of TensorRT-LLM:

export MODEL_DIR="path_to_t5_model" # or "flan-t5-small"
export MODEL_NAME="t5model"
export MODEL_TYPE="t5"
export INFERENCE_PRECISION="float16"
export TP_SIZE=1
export PP_SIZE=1
export WORLD_SIZE=1
export MAX_BEAM_WIDTH=1
export OUTPUT_DIR="triton_model_repos/${MODEL_NAME}/tensorrt_llm/1"

trtllm-build --checkpoint_dir "${MODEL_DIR}/trt_models/${INFERENCE_PRECISION}/tp${TP_SIZE}/pp${PP_SIZE}/encoder" \
                --output_dir "${OUTPUT_DIR}/${WORLD_SIZE}-gpu/${INFERENCE_PRECISION}/tp${TP_SIZE}/encoder" \
                --paged_kv_cache enable \
                --moe_plugin disable \
                --enable_xqa disable \
                --use_custom_all_reduce disable \
                --max_beam_width ${MAX_BEAM_WIDTH} \
                --max_batch_size 8 \
                --max_input_len 4096 \
                --max_encoder_input_len 4096 \
                --max_output_len 4096 \
                --gemm_plugin ${INFERENCE_PRECISION} \
                --bert_attention_plugin ${INFERENCE_PRECISION} \
                --gpt_attention_plugin ${INFERENCE_PRECISION} \
                --remove_input_padding disable \
                --context_fmha disable

# For the decoder, reuse the settings above and set --max_input_len appropriately
trtllm-build --checkpoint_dir "${MODEL_DIR}/trt_models/${INFERENCE_PRECISION}/tp${TP_SIZE}/pp${PP_SIZE}/decoder" \
                --output_dir "${OUTPUT_DIR}/${WORLD_SIZE}-gpu/${INFERENCE_PRECISION}/tp${TP_SIZE}/decoder" \
                --paged_kv_cache enable \
                --moe_plugin disable \
                --enable_xqa disable \
                --use_custom_all_reduce disable \
                --max_beam_width ${MAX_BEAM_WIDTH} \
                --max_batch_size 8 \
                --max_output_len 4096 \
                --max_encoder_input_len 4096 \
                --gemm_plugin ${INFERENCE_PRECISION} \
                --bert_attention_plugin ${INFERENCE_PRECISION} \
                --gpt_attention_plugin ${INFERENCE_PRECISION} \
                --remove_input_padding disable \
                --context_fmha disable \
                --max_input_len 1
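
To double-check which limits actually ended up in the built engines, here is a minimal sketch that reads the config.json that trtllm-build writes next to the engine. This assumes the recent layout with a build_config section (field names can differ between releases), and the path below just mirrors the encoder output directory from the script above:

import json
import os

# Path mirrors OUTPUT_DIR/.../encoder from the build script above; adjust as needed.
engine_dir = "triton_model_repos/t5model/tensorrt_llm/1/1-gpu/float16/tp1/encoder"

with open(os.path.join(engine_dir, "config.json")) as f:
    config = json.load(f)

# Limits recorded at build time (assuming a build_config section exists in this version).
build_config = config.get("build_config", {})
for key in ("max_batch_size", "max_beam_width", "max_input_len", "max_encoder_input_len"):
    print(key, "=", build_config.get(key))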

Then I ran inference with the built engines, but it only works for inputs of length <= 1024, even though I built with --max_input_len=4096 and --max_encoder_input_len=4096. When I run with inputs longer than 1024 tokens, the Triton server log shows the following error:

[05/16/2024-08:43:31] [TRT] [E] 3: [executionContext.cpp::setInputShape::2068] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2068, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)

I also patched the error message at https://github.com/NVIDIA/TensorRT-LLM/blob/bf0a5afc92f4b2b3191e9e55073953c1f600cf2d/tensorrt_llm/runtime/session.py#L170 (replacing f"engine supports [min, opt, max] = {self.engine.get_profile_shape(context.active_optimization_profile, name)}" with f"engine supports [min, opt, max] = {self.engine.get_tensor_profile_shape(name, context.active_optimization_profile)}") so that it prints the actual profile limits.
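
For reference, the edit described above amounts to this one-line swap (the name-based get_tensor_profile_shape API takes the tensor name first, then the profile index):

-        f"engine supports [min, opt, max] = {self.engine.get_profile_shape(context.active_optimization_profile, name)}"
+        f"engine supports [min, opt, max] = {self.engine.get_tensor_profile_shape(name, context.active_optimization_profile)}"

With that change applied, the Triton server log shows: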

I0516 07:36:03.073916 35685 pb_stub.cc:715] Failed to process the request(s) for model 'tensorrt_llm_0_0', message: ValueError: Couldn't assign input_ids with shape torch.Size([1, 1398]), engine supports [min, opt, max] = [(1, 1), (4, 512), (8, 1024)]

It seems like max_input_len is being capped at 1024 somewhere in the code?
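
As an independent sanity check, the optimization profile baked into the serialized encoder engine can also be dumped with the TensorRT Python API. A minimal sketch, assuming the engine file is named rank0.engine inside the encoder output directory (adjust the path and file name to whatever trtllm-build actually produced):

import tensorrt as trt

# Hypothetical path and file name; adjust to the actual encoder engine produced by trtllm-build.
engine_path = "triton_model_repos/t5model/tensorrt_llm/1/1-gpu/float16/tp1/encoder/rank0.engine"

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open(engine_path, "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# Print [min, opt, max] shapes of every input tensor for optimization profile 0.
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
        print(name, engine.get_tensor_profile_shape(name, 0))

For the failing setup above, I would expect this to print something like (1, 1), (4, 512), (8, 1024) for input_ids, matching the error message.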

thanhlt998 • May 16 '24 10:05