
Cannot reach the throughput value described in your performance doc for FP16 LLaMA 7B

felixslu opened this issue 1 year ago · 4 comments

Background:

The performance doc https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance.md states that LLaMA 7B in FP16, with batch size 256, input_len 128, output_len 128 on an A100, reaches a throughput of 5,353 tok/s/GPU.


Problem:

Under the same conditions, we only reach 435 tok/s/GPU.

Library:

NGC 24.01 TensorRT-LLM - v0.7.0 branch

our build command

```bash
python3 build.py --model_dir $llm_model_path \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --use_inflight_batching \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --paged_kv_cache \
    --use_fused_mlp \
    --max_batch_size 256 \
    --output_dir $llm_model_path/trt_engines/bs256/fp16/1-gpu/
```

our running params

```bash
MAX_BATCH_SIZE=512

python3 tools/fill_template.py -i $proj_model_repo/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:$MAX_BATCH_SIZE,preprocessing_instance_count:1

python3 tools/fill_template.py -i $proj_model_repo/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:$MAX_BATCH_SIZE,postprocessing_instance_count:1

python3 tools/fill_template.py -i $proj_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:$MAX_BATCH_SIZE,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False

python3 tools/fill_template.py -i $proj_model_repo/ensemble/config.pbtxt triton_max_batch_size:$MAX_BATCH_SIZE

python3 tools/fill_template.py -i $proj_model_repo/tensorrt_llm/config.pbtxt triton_max_batch_size:$MAX_BATCH_SIZE,decoupled_mode:False,max_beam_width:1,engine_dir:$MODEL_DIR,max_attention_window_size:8192,kv_cache_free_gpu_mem_fraction:0.9,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,enable_trt_overlap:False,max_queue_delay_microseconds:600
```

We have not set `--max_tokens_in_paged_kvcache`; we just use `kv_cache_free_gpu_mem_fraction:0.9` so that the remaining memory is used for the KV cache.

our GPU MEM

Total: 78.4 GB, KV-cache MEM: 27.4 GB
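
For reference, a back-of-the-envelope sketch of what that 27.4 GB pool can hold. It assumes LLaMA-7B's standard shape of 32 layers, 32 KV heads, head dim 128 in FP16; those numbers are assumptions, not taken from the screenshots above:

```python
# Rough KV-cache capacity estimate for LLaMA-7B in FP16.
# Assumed model shape (not from this issue): 32 layers, 32 KV heads,
# head_dim 128, 2 bytes per FP16 element, K and V both cached.
num_layers, num_kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K + V
kv_pool_bytes = 27.4 * 1024**3  # the 27.4 GB KV-cache pool reported above

print(f"{bytes_per_token / 2**20:.2f} MiB of KV cache per token")          # ~0.50 MiB
print(f"~{kv_pool_bytes / bytes_per_token:,.0f} tokens fit in the pool")   # ~56,000 tokens
```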


"[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size" **So,maybe the max_num_sequences param have been set to 256


our machine

A800, using only one GPU card.

our test script

```bash
cd triton_backend/tools/inflight_batcher_llm/ && python3 benchmark_core_model.py -i grpc --request-rate 4 --max-input-len 1024 --num-requests 50 --exclude-input-in-output token-norm-dist --input-mean 128 --input-stdev 5 --output-mean 128 --output-stdev 2
```
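
One thing worth double-checking about these parameters (a rough sanity check only; it assumes the reported number is end-to-end output tokens per second): with `--request-rate 4` and `--output-mean 128`, the client only offers about 4 × 128 ≈ 512 output tok/s, and 50 requests are unlikely to ever fill a batch of 256, so the measurement may be client-limited rather than GPU-limited:

```python
# Offered load implied by the benchmark parameters above
# (a rough sketch; assumes throughput is reported as output tokens per second).
request_rate = 4     # --request-rate (requests per second)
num_requests = 50    # --num-requests
output_mean  = 128   # --output-mean (tokens per request)

client_ceiling = request_rate * output_mean   # ~512 tok/s offered by the client
total_output   = num_requests * output_mean   # 6,400 output tokens in the whole run

print(f"offered load ≈ {client_ceiling} tok/s")
print(f"total output tokens ≈ {total_output:,}")
# A saturation test needs a much higher request rate and many more requests
# so that ~256 sequences actually stay in flight at once.
```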

our result

435 tok/s/GPU

Question:

Why is the throughput gap so large? Could you share your test method, or point out what is wrong on our side?

felixslu · Mar 08 '24

max_tokens_in_paged_kv_cache is defined here.

max_tokens_in_paged_kv_cache and kv_cache_free_gpu_mem_fraction control the KV-cache memory usage together. More details are described here.

If you don't set a proper value to prevent allocating too much memory to the KV cache, the real inference batch size might not reach the possible maximum batch size.
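
To put numbers on that point for this issue, a rough sketch (assuming, as an approximation rather than the batch manager's exact accounting, that the runtime can only keep as many requests in flight as have KV-cache pages available):

```python
# Illustration: how the KV-cache budget can cap the effective batch size.
# Approximation only, not the exact in-flight batching scheduler logic.
engine_max_batch_size = 256
tokens_per_request    = 128 + 128     # input_len + output_len
kv_pool_tokens        = 56_000        # e.g. the ~56k-token estimate above

effective_batch = min(engine_max_batch_size,
                      kv_pool_tokens // tokens_per_request)
print(effective_batch)  # ~218: fewer than 256 requests can be resident at once
```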

byshiue · Mar 12 '24

> max_tokens_in_paged_kv_cache is defined here.
>
> max_tokens_in_paged_kv_cache and kv_cache_free_gpu_mem_fraction control the KV-cache memory usage together. More details are described here.
>
> If you don't set a proper value to prevent allocating too much memory to the KV cache, the real inference batch size might not reach the possible maximum batch size.

As mentioned in the doc https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/perf_best_practices.md:

> Unless users clearly know the maximum number of tokens in the KV cache needed by the model, it is recommended to leave max_tokens_in_paged_kv_cache unset. For kv_cache_free_gpu_mem_fraction, if no other programs are executed on the same GPU, it is recommended to test with a value as high as 0.95 to target a high throughput. Note that the kv_cache_free_gpu_mem_fraction parameter cannot be set to 1.0 because some amount of memory has to be reserved for inputs and outputs.

Firstly, we set the batch size to 256 at engine-build time. Then we set kv_cache_free_gpu_mem_fraction to 0.95. With that, the total memory is 78.4 GB and the KV-cache memory is 27.4 GB on the A100, so we believe we allocate as much memory as possible to the engine and the KV cache. But the actual throughput is 435 tok/s/GPU, far below your reported value of 5,353 tok/s/GPU.

How can we reproduce your performance? Could you share your command and your Triton TensorRT-LLM backend parameters?

felixslu · Mar 13 '24

I'm confused too

jaywongs · Apr 23 '24

Please share the steps you used to measure the performance.

byshiue · Apr 24 '24