TensorRT-LLM

Engines built with v0.11.0 and served with Triton Inference Server reach lower concurrency than engines built with v0.10.0; the build scripts are essentially the same.


System Info

GPU: A100

Who can help?

No response

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

The following is the v0.11.0 build script:

python3 hf_convert_trtllm.py --model_dir $input_dir \
                             --output_dir $input_temp_dir \
                             --dtype float16 \
                             --calib_dataset $calib_dataset_path

trtllm-build --checkpoint_dir $input_temp_dir \
             --output_dir $output_dir \
             --remove_input_padding enable \
             --gemm_plugin float16 \
             --gpt_attention_plugin float16 \
             --paged_kv_cache enable \
             --use_paged_context_fmha enable \
             --use_fused_mlp \
             --context_fmha enable \
             --context_fmha_fp32_acc enable \
             --multi_block_mode enable \
             --nccl_plugin disable \
             --paged_state disable \
             --tokens_per_block 16 \
             --use_custom_all_reduce disable

And this is the v0.10.0 build script:

python3 hf_convert_trtllm.py --model_dir $input_dir \
                             --output_dir $input_temp_dir \
                             --dtype float16 \
                             --calib_dataset $calib_dataset_path

trtllm-build --checkpoint_dir $input_temp_dir \
             --output_dir $output_dir \
             --remove_input_padding enable \
             --gemm_plugin float16 \
             --gpt_attention_plugin float16 \
             --paged_kv_cache enable \
             --use_paged_context_fmha enable \
             --max_batch_size 128 \
             --max_input_len 2048 \
             --max_num_tokens 32768 \
             --tokens_per_block 16 \
             --use_fused_mlp \
             --context_fmha enable \
             --context_fmha_fp32_acc enable \
             --multi_block_mode enable \
             --use_custom_all_reduce disable \
             --strongly_typed

With the v0.11.0 engine, GPU utilization fluctuates between 52% and 80%, whereas the v0.10.0 engine holds a steady 99%. As a result, the v0.11.0 engine sustains only about half the concurrency of the v0.10.0 engine. What could be causing this? Has anyone else run into it?
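One difference that stands out when comparing the two commands: the v0.11.0 build omits --max_batch_size, --max_input_len, and --max_num_tokens, so trtllm-build falls back to its built-in defaults for those values. If those defaults are smaller than the explicit v0.10.0 values (128 / 2048 / 32768), the in-flight batching scheduler has less room to form large batches, which could show up as exactly this kind of utilization and concurrency drop. Purely as a diagnostic, a sketch of a v0.11.0 rebuild with the sizing flags pinned back to the v0.10.0 values (every flag below is taken from the two scripts above; whether the defaults really differ is an assumption worth checking against trtllm-build --help):

# Diagnostic rebuild (sketch): the v0.11.0 command with the sizing
# flags the v0.10.0 build set explicitly added back in.
trtllm-build --checkpoint_dir $input_temp_dir \
             --output_dir $output_dir \
             --remove_input_padding enable \
             --gemm_plugin float16 \
             --gpt_attention_plugin float16 \
             --paged_kv_cache enable \
             --use_paged_context_fmha enable \
             --max_batch_size 128 \
             --max_input_len 2048 \
             --max_num_tokens 32768 \
             --tokens_per_block 16 \
             --use_fused_mlp \
             --context_fmha enable \
             --context_fmha_fp32_acc enable \
             --multi_block_mode enable \
             --nccl_plugin disable \
             --paged_state disable \
             --use_custom_all_reduce disable

(--strongly_typed is also dropped in the v0.11.0 script; newer releases build strongly typed engines by default, so that difference may well be immaterial, but the sizing flags are worth ruling out first.)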

Expected behavior

The v0.11.0 engine should reach the same GPU utilization and concurrency as the v0.10.0 engine.

actual behavior

GPU utilization sits at 52%-80% instead of 99%, and concurrency is roughly half that of v0.10.0.
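For anyone trying to reproduce the comparison, utilization can be sampled while the load test runs; the snippet below is illustrative, since the report does not state how utilization was measured:

# Illustrative: poll GPU utilization and memory once per second
# during the load test.
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used \
           --format=csv -l 1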

additional notes

None.

white-wolf-tech, Aug 15 '24 02:08