TensorRT-LLM

Engines built with v0.11.0 and served with Triton Inference Server reach lower concurrency than engines built with v0.10.0; the build scripts are essentially the same.


System Info

GPU: A100

Who can help?

No response

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

The following is the v0.11.0 build script:

python3 hf_convert_trtllm.py --model_dir $input_dir \
                             --output_dir $input_temp_dir \
                             --dtype float16 \
                             --calib_dataset $calib_dataset_path

trtllm-build --checkpoint_dir $input_temp_dir \
             --output_dir $output_dir \
             --remove_input_padding enable \
             --gemm_plugin float16 \
             --gpt_attention_plugin float16 \
             --paged_kv_cache enable \
             --use_paged_context_fmha enable \
             --use_fused_mlp \
             --context_fmha enable \
             --context_fmha_fp32_acc enable \
             --multi_block_mode enable \
             --nccl_plugin disable \
             --paged_state disable \
             --tokens_per_block 16 \
             --use_custom_all_reduce disable

And this is the v0.10.0 build script:

python3 hf_convert_trtllm.py --model_dir $input_dir \
                             --output_dir $input_temp_dir \
                             --dtype float16 \
                             --calib_dataset $calib_dataset_path

trtllm-build --checkpoint_dir $input_temp_dir \
             --output_dir $output_dir \
             --remove_input_padding enable \
             --gemm_plugin float16 \
             --gpt_attention_plugin float16 \
             --paged_kv_cache enable \
             --use_paged_context_fmha enable \
             --max_batch_size 128 \
             --max_input_len 2048 \
             --max_num_tokens 32768 \
             --tokens_per_block 16 \
             --use_fused_mlp \
             --context_fmha enable \
             --context_fmha_fp32_acc enable \
             --multi_block_mode enable \
             --use_custom_all_reduce disable \
             --strongly_typed

With the v0.11.0 engine, GPU utilization fluctuates between 52% and 80%, whereas the v0.10.0 engine holds a steady 99%. As a result, the v0.11.0 engine sustains only about half the concurrency of the v0.10.0 engine. What could be causing this? Has anyone else run into it?
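One difference that stands out when comparing the two commands: the v0.11.0 build omits --max_batch_size, --max_input_len, and --max_num_tokens, so trtllm-build falls back to its built-in defaults for those values. If those defaults are smaller than the explicit v0.10.0 values (128 / 2048 / 32768), the in-flight batching scheduler has less room to form large batches, which could show up as exactly this kind of utilization and concurrency drop. Purely as a diagnostic, a sketch of a v0.11.0 rebuild with the sizing flags pinned back to the v0.10.0 values (every flag below is taken from the two scripts above; whether the defaults really differ is an assumption worth checking against trtllm-build --help):

# Diagnostic rebuild (sketch): the v0.11.0 command with the sizing
# flags the v0.10.0 build set explicitly added back in.
trtllm-build --checkpoint_dir $input_temp_dir \
             --output_dir $output_dir \
             --remove_input_padding enable \
             --gemm_plugin float16 \
             --gpt_attention_plugin float16 \
             --paged_kv_cache enable \
             --use_paged_context_fmha enable \
             --max_batch_size 128 \
             --max_input_len 2048 \
             --max_num_tokens 32768 \
             --tokens_per_block 16 \
             --use_fused_mlp \
             --context_fmha enable \
             --context_fmha_fp32_acc enable \
             --multi_block_mode enable \
             --nccl_plugin disable \
             --paged_state disable \
             --use_custom_all_reduce disable

(--strongly_typed is also dropped in the v0.11.0 script; newer releases build strongly typed engines by default, so that difference may well be immaterial, but the sizing flags are worth ruling out first.)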

Expected behavior

The v0.11.0 engine should reach the same GPU utilization and concurrency as the v0.10.0 engine.

actual behavior

GPU utilization sits at 52%-80% instead of 99%, and concurrency is roughly half that of v0.10.0.
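For anyone trying to reproduce the comparison, utilization can be sampled while the load test runs; the snippet below is illustrative, since the report does not state how utilization was measured:

# Illustrative: poll GPU utilization and memory once per second
# during the load test.
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used \
           --format=csv -l 1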

additional notes

None.

white-wolf-tech, Aug 15 '24 02:08