TensorRT-LLM
Engines built with v0.11.0 and served with Triton Server achieve lower concurrency than engines built with v0.10.0, even though the build scripts are nearly identical (see Reproduction below).
System Info
GPU: A100
Who can help?
No response
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
The following is the v0.11.0 build script:
```bash
python3 hf_convert_trtllm.py --model_dir $input_dir \
    --output_dir $input_temp_dir \
    --dtype float16 \
    --calib_dataset $calib_dataset_path

trtllm-build --checkpoint_dir $input_temp_dir \
    --output_dir $output_dir \
    --remove_input_padding enable \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --use_paged_context_fmha enable \
    --use_fused_mlp \
    --context_fmha enable \
    --context_fmha_fp32_acc enable \
    --multi_block_mode enable \
    --nccl_plugin disable \
    --paged_state disable \
    --tokens_per_block 16 \
    --use_custom_all_reduce disable
```
The v0.10.0 build script is:
```bash
python3 hf_convert_trtllm.py --model_dir $input_dir \
    --output_dir $input_temp_dir \
    --dtype float16 \
    --calib_dataset $calib_dataset_path

trtllm-build --checkpoint_dir $input_temp_dir \
    --output_dir $output_dir \
    --remove_input_padding enable \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --use_paged_context_fmha enable \
    --max_batch_size 128 \
    --max_input_len 2048 \
    --max_num_tokens 32768 \
    --tokens_per_block 16 \
    --use_fused_mlp \
    --context_fmha enable \
    --context_fmha_fp32_acc enable \
    --multi_block_mode enable \
    --use_custom_all_reduce disable \
    --strongly_typed
```
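Note: comparing the two invocations above, the v0.11.0 build omits the explicit sizing flags that the v0.10.0 build sets (`--max_batch_size 128`, `--max_input_len 2048`, `--max_num_tokens 32768`, plus `--strongly_typed`), so the v0.11.0 engine falls back to the release defaults for those limits. As one diagnostic (a sketch, not a confirmed fix), the v0.11.0 engine can be rebuilt with the same explicit limits:

```bash
# Sketch: the v0.11.0 build command from above, with the explicit sizing
# limits from the v0.10.0 script added back, to rule out differing defaults.
trtllm-build --checkpoint_dir $input_temp_dir \
    --output_dir $output_dir \
    --remove_input_padding enable \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --use_paged_context_fmha enable \
    --use_fused_mlp \
    --context_fmha enable \
    --context_fmha_fp32_acc enable \
    --multi_block_mode enable \
    --nccl_plugin disable \
    --paged_state disable \
    --tokens_per_block 16 \
    --use_custom_all_reduce disable \
    --max_batch_size 128 \
    --max_input_len 2048 \
    --max_num_tokens 32768
```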
With the v0.11.0 engine, GPU utilization is only 52%-80%, while the v0.10.0 engine stays stable at 99%. As a result, the achievable concurrency with v0.11.0 is only about half that of v0.10.0. What could cause this? Has anyone else encountered it?
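To confirm which limits each engine was actually built with, the values that `trtllm-build` records in the engine directory's `config.json` can be compared. A minimal sketch, assuming `$engine_dir_v0_10` and `$engine_dir_v0_11` are placeholders for the two output directories and that the keys live under `build_config` (the JSON layout may differ between releases):

```bash
# Sketch: print the sizing limits baked into each engine. The directory
# variables are placeholders, and the build_config key layout is an
# assumption that may vary across TensorRT-LLM releases.
for dir in "$engine_dir_v0_10" "$engine_dir_v0_11"; do
  echo "== $dir =="
  jq '.build_config | {max_batch_size, max_input_len, max_num_tokens}' "$dir/config.json"
done
```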
Expected behavior
Same concurrency and GPU utilization as the v0.10.0 engine.
actual behavior
GPU utilization of only 52%-80% and roughly half the concurrency of the v0.10.0 engine.
additional notes
None.