Regarding GPU memory usage and inference speed of the Qwen2 0.5B model
CPU: x86_64
GPU: NVIDIA H20
CUDA version: 12.4
TensorRT-LLM version: 0.14.0

I followed https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/qwen/README.md to run the Qwen2 0.5B model; the results I obtained for each case are listed below.
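For reference, the environment details above can be confirmed with the usual checks (driver/toolkit queries plus the installed tensorrt_llm package version):

```bash
# GPU model and total memory reported by the driver
nvidia-smi --query-gpu=name,memory.total --format=csv
# CUDA toolkit version
nvcc --version
# installed TensorRT-LLM version
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```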
case0 (fp16):

```bash
python convert_checkpoint.py --model_dir ./tmp/Qwen/7B/ \
    --output_dir ./tllm_checkpoint_1gpu_fp16 \
    --dtype float16

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
    --output_dir ./tmp/qwen/7B/trt_engines/fp16/1-gpu \
    --gemm_plugin float16

python ../summarize.py --test_trt_llm \
    --hf_model_dir ./tmp/Qwen/0.5B/ \
    --data_type fp16 \
    --engine_dir ./tmp/qwen/7B/trt_engines/fp16/1-gpu \
    --max_input_length 2048 \
    --output_len 2048
```
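As a rough sanity check on my expectations (my own back-of-the-envelope arithmetic, not a measured number): 0.5B parameters in fp16 is about 0.5e9 × 2 bytes ≈ 1 GB of weights, and int4 weight-only should be roughly a quarter of that, so the serialized engine size gives a quick idea of how much of the footprint comes from weights rather than from runtime/KV-cache allocations:

```bash
# size of the serialized fp16 engine from case0; for a 0.5B model the engine file
# should be on the order of ~1 GB, since it is dominated by the fp16 weights
du -sh ./tmp/qwen/7B/trt_engines/fp16/1-gpu/
ls -lh ./tmp/qwen/7B/trt_engines/fp16/1-gpu/
```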
case1 (int4 weight-only):

```bash
python convert_checkpoint.py --model_dir ./tmp/Qwen2/0.5B/ \
    --output_dir ./tllm_checkpoint_1gpu_fp16_wq \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16_wq \
    --output_dir ./tmp/qwen/0.5B/trt_engines/weight_only/1-gpu/ \
    --gemm_plugin float16

python ../summarize.py --test_trt_llm \
    --hf_model_dir ./tmp/Qwen/0.5B/ \
    --data_type fp16 \
    --engine_dir ./tmp/qwen/7B/trt_engines/weight_only/1-gpu/ \
    --max_input_length 2048 \
    --output_len 2048
```
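To double-check that the int4 weight-only conversion actually took effect, the converted checkpoint's config.json can be inspected; I am assuming here that the 0.14.0 checkpoint format keeps its quantization settings under a "quantization" key:

```bash
# print the quantization section of the converted int4 checkpoint
# ("quantization" key assumed from the TensorRT-LLM checkpoint format)
python3 -c "import json; print(json.load(open('./tllm_checkpoint_1gpu_fp16_wq/config.json'))['quantization'])"
```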
case2 (SmoothQuant 0.5):

```bash
python3 convert_checkpoint.py --model_dir ./tmp/Qwen2/0.5B/ \
    --output_dir ./tllm_checkpoint_1gpu_sq \
    --dtype float16 \
    --smoothquant 0.5

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_sq \
    --output_dir ./engine_outputs \
    --gemm_plugin float16

python ../summarize.py --test_trt_llm \
    --hf_model_dir ./tmp/Qwen2/0.5B/ \
    --data_type fp16 \
    --engine_dir ./engine_outputs \
    --max_input_length 2048 \
    --output_len 2048
```
case3 (int8 KV cache):

```bash
python convert_checkpoint.py --model_dir ./tmp/Qwen/7B/ \
    --output_dir ./tllm_checkpoint_1gpu_fp16_int8kv \
    --dtype float16 \
    --int8_kv_cache

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_sq \
    --output_dir ./engine_outputs \
    --gemm_plugin float16

python ../summarize.py --test_trt_llm \
    --hf_model_dir ./tmp/Qwen/0.5B/ \
    --data_type fp16 \
    --engine_dir ./engine_outputs \
    --max_input_length 2048 \
    --output_len 2048
```
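Independently of summarize.py, GPU memory for each case can be sampled from a second terminal with plain nvidia-smi while the benchmark is running (driver-level sampling only, nothing TensorRT-LLM-specific):

```bash
# sample used/total GPU memory once per second during a summarize.py run
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```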
The results are summarized as follows:

- case0 (fp16): memory = 87.60, speed = 581 tokens/s
- case1 (int4 weight-only): memory = 87.50, speed = 533 tokens/s
- case2 (SmoothQuant): memory = 87.60, speed = 510 tokens/s
- case3 (int8 KV cache): memory = 87.60, speed = 547 tokens/s
I have two questions:

1. Why is the GPU memory usage almost identical across the four cases, when the difference I expected (e.g. between fp16 and int4 weights) is much larger?
2. Why is inference fastest in the fp16 case?
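For question 1, one thing I have not ruled out is KV-cache pre-allocation: if the runtime reserves a fixed fraction of free GPU memory for the paged KV cache, nvidia-smi would show nearly the same number for all four engines regardless of weight precision. Assuming summarize.py in 0.14.0 accepts --kv_cache_free_gpu_memory_fraction (an assumption on my side, please correct me if the flag differs), a run like the following should show whether the gap between the cases reappears with a smaller pool:

```bash
# re-run the fp16 case with a smaller KV-cache pool
# (--kv_cache_free_gpu_memory_fraction assumed to be available in 0.14.0 summarize.py)
python ../summarize.py --test_trt_llm \
    --hf_model_dir ./tmp/Qwen/0.5B/ \
    --data_type fp16 \
    --engine_dir ./tmp/qwen/7B/trt_engines/fp16/1-gpu \
    --max_input_length 2048 \
    --output_len 2048 \
    --kv_cache_free_gpu_memory_fraction 0.2
```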
Thank you. I hope to receive an answer as soon as possible.