Cannot reach the throughput value described in your performance doc for FP16 LLaMA-7B
Background:
The performance doc https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance.md states that LLaMA-7B, FP16, batch size 256, input_len 128, output_len 128 on an A100 reaches a throughput of 5,353 tok/s/GPU.
Problem:
Under the same conditions, we only reach 435 tok/s/GPU.
Library:
NGC 24.01 TensorRT-LLM - v0.7.0 branch
our build command
```
python3 build.py --model_dir $llm_model_path \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --use_inflight_batching \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --paged_kv_cache \
    --use_fused_mlp \
    --max_batch_size 256 \
    --output_dir $llm_model_path/trt_engines/bs256/fp16/1-gpu/
```
our running params
```
MAX_BATCH_SIZE=512
python3 tools/fill_template.py -i $proj_model_repo/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:$MAX_BATCH_SIZE,preprocessing_instance_count:1
python3 tools/fill_template.py -i $proj_model_repo/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:$MAX_BATCH_SIZE,postprocessing_instance_count:1
python3 tools/fill_template.py -i $proj_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:$MAX_BATCH_SIZE,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i $proj_model_repo/ensemble/config.pbtxt triton_max_batch_size:$MAX_BATCH_SIZE
python3 tools/fill_template.py -i $proj_model_repo/tensorrt_llm/config.pbtxt triton_max_batch_size:$MAX_BATCH_SIZE,decoupled_mode:False,max_beam_width:1,engine_dir:$MODEL_DIR,max_attention_window_size:8192,kv_cache_free_gpu_mem_fraction:0.9,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,enable_trt_overlap:False,max_queue_delay_microseconds:600
```
We have not set `--max_tokens_in_paged_kv_cache`; we just use `kv_cache_free_gpu_mem_fraction:0.9` so that the KV cache consumes the remaining memory.
our GPU MEM
Total: 78.4 GB, KV-cache MEM: 27.4 GB
"[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size" **So,maybe the max_num_sequences param have been set to 256
our machine
A800
**Only one GPU card is used.**
our test script
```
cd triton_backend/tools/inflight_batcher_llm/ && \
python3 benchmark_core_model.py -i grpc --request-rate 4 --max-input-len 1024 \
    --num-requests 50 --exclude-input-in-output \
    token-norm-dist --input-mean 128 --input-stdev 5 --output-mean 128 --output-stdev 2
```
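One sanity check on the benchmark settings themselves (my own arithmetic, assuming `--request-rate` is in requests per second and that throughput is counted on generated tokens only, since `--exclude-input-in-output` is set):

```python
# Offered load implied by the benchmark flags above. If the client only issues
# 4 requests per second, the sustained output-token throughput over the run is
# roughly capped by request_rate * output_mean, regardless of engine capacity.
request_rate = 4      # --request-rate, requests per second (assumed unit)
output_mean = 128     # --output-mean, generated tokens per request
num_requests = 50     # --num-requests

print(f"offered load ~= {request_rate * output_mean} output tok/s")   # 512 tok/s
print(f"total generated tokens ~= {num_requests * output_mean}")      # 6400 tokens
```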
our result
Question:
Why is the throughput gap so huge? Could you share your test method, or point out what we are doing wrong?
`max_tokens_in_paged_kv_cache` is defined here.
`max_tokens_in_paged_kv_cache` and `kv_cache_free_gpu_mem_fraction` control the KV cache memory usage together. More details are described here.
If you don't set a proper value to prevent allocating too much memory to the KV cache, the real inference batch size might not reach the possible maximum batch size.
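As a rough illustration of how `kv_cache_free_gpu_mem_fraction` turns into a KV-cache budget (a sketch based on my understanding of the behavior, not the backend's exact allocation code; `torch` is only used here to query free GPU memory):

```python
import torch

# Sketch: the backend looks at the GPU memory still free after the engine is
# loaded and reserves roughly free * fraction of it for the paged KV cache
# (the exact accounting inside TensorRT-LLM may differ).
free_bytes, total_bytes = torch.cuda.mem_get_info()
fraction = 0.9                                    # kv_cache_free_gpu_mem_fraction
kv_budget_bytes = int(free_bytes * fraction)

print(f"free: {free_bytes / 2**30:.1f} GiB of {total_bytes / 2**30:.1f} GiB")
print(f"KV-cache budget at fraction {fraction}: {kv_budget_bytes / 2**30:.1f} GiB")
```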
As mentioned in the doc https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/perf_best_practices.md:
First, we set the batch size to 256 at the engine-building stage. Then we set kv_cache_free_gpu_mem_fraction to 0.95. The total memory is 78.4 GB and the KV-cache memory is 27.4 GB on the A100, so we believe we allocate as much memory as possible to the engine and the KV cache. But the actual throughput is 435 tok/s/GPU, far below your evaluated value of 5,353 tok/s/GPU.
How can we reproduce your performance? Could you share your command and triton-trtllm parameters?
Unless users clearly know the maximum number of tokens in the KV cache needed by the model, it is recommended to leave max_tokens_in_paged_kv_cache unset. For kv_cache_free_gpu_mem_fraction, if no other programs are executed on the same GPU, it is recommended to test with as high a value as 0.95 to target high throughput. Note that kv_cache_free_gpu_mem_fraction cannot be set to 1.0 because some amount of memory has to be reserved for inputs and outputs.
I'm confused too
Please share your steps to get the performance.