Regarding GPU memory usage and inference speed of the Qwen2 0.5B model
CPU: x86_64
GPU: NVIDIA H20
CUDA version: 12.4
TensorRT-LLM version: 0.14.0

I followed https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/qwen/README.md to run the Qwen2 0.5B model; the results I obtained for each case are listed below.
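For reference, the environment details above can be confirmed with the usual checks (driver/toolkit queries plus the installed tensorrt_llm package version):

```bash
# GPU model and total memory reported by the driver
nvidia-smi --query-gpu=name,memory.total --format=csv
# CUDA toolkit version
nvcc --version
# installed TensorRT-LLM version
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```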
case0 (fp16):

```bash
python convert_checkpoint.py --model_dir ./tmp/Qwen/7B/ \
    --output_dir ./tllm_checkpoint_1gpu_fp16 \
    --dtype float16

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
    --output_dir ./tmp/qwen/7B/trt_engines/fp16/1-gpu \
    --gemm_plugin float16

python ../summarize.py --test_trt_llm \
    --hf_model_dir ./tmp/Qwen/0.5B/ \
    --data_type fp16 \
    --engine_dir ./tmp/qwen/7B/trt_engines/fp16/1-gpu \
    --max_input_length 2048 \
    --output_len 2048
```
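As a rough sanity check on my expectations (my own back-of-the-envelope arithmetic, not a measured number): 0.5B parameters in fp16 is about 0.5e9 × 2 bytes ≈ 1 GB of weights, and int4 weight-only should be roughly a quarter of that, so the serialized engine size gives a quick idea of how much of the footprint comes from weights rather than from runtime/KV-cache allocations:

```bash
# size of the serialized fp16 engine from case0; for a 0.5B model the engine file
# should be on the order of ~1 GB, since it is dominated by the fp16 weights
du -sh ./tmp/qwen/7B/trt_engines/fp16/1-gpu/
ls -lh ./tmp/qwen/7B/trt_engines/fp16/1-gpu/
```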
case1 (int4 weight-only):

```bash
python convert_checkpoint.py --model_dir ./tmp/Qwen2/0.5B/ \
    --output_dir ./tllm_checkpoint_1gpu_fp16_wq \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16_wq \
    --output_dir ./tmp/qwen/0.5B/trt_engines/weight_only/1-gpu/ \
    --gemm_plugin float16

python ../summarize.py --test_trt_llm \
    --hf_model_dir ./tmp/Qwen/0.5B/ \
    --data_type fp16 \
    --engine_dir ./tmp/qwen/7B/trt_engines/weight_only/1-gpu/ \
    --max_input_length 2048 \
    --output_len 2048
```
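To double-check that the int4 weight-only conversion actually took effect, the converted checkpoint's config.json can be inspected; I am assuming here that the 0.14.0 checkpoint format keeps its quantization settings under a "quantization" key:

```bash
# print the quantization section of the converted int4 checkpoint
# ("quantization" key assumed from the TensorRT-LLM checkpoint format)
python3 -c "import json; print(json.load(open('./tllm_checkpoint_1gpu_fp16_wq/config.json'))['quantization'])"
```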
case2 (SmoothQuant 0.5):

```bash
python3 convert_checkpoint.py --model_dir ./tmp/Qwen2/0.5B/ \
    --output_dir ./tllm_checkpoint_1gpu_sq \
    --dtype float16 \
    --smoothquant 0.5

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_sq \
    --output_dir ./engine_outputs \
    --gemm_plugin float16

python ../summarize.py --test_trt_llm \
    --hf_model_dir ./tmp/Qwen2/0.5B/ \
    --data_type fp16 \
    --engine_dir ./engine_outputs \
    --max_input_length 2048 \
    --output_len 2048
```
case3 (int8 KV cache):

```bash
python convert_checkpoint.py --model_dir ./tmp/Qwen/7B/ \
    --output_dir ./tllm_checkpoint_1gpu_fp16_int8kv \
    --dtype float16 \
    --int8_kv_cache

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_sq \
    --output_dir ./engine_outputs \
    --gemm_plugin float16

python ../summarize.py --test_trt_llm \
    --hf_model_dir ./tmp/Qwen/0.5B/ \
    --data_type fp16 \
    --engine_dir ./engine_outputs \
    --max_input_length 2048 \
    --output_len 2048
```
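Independently of summarize.py, GPU memory for each case can be sampled from a second terminal with plain nvidia-smi while the benchmark is running (driver-level sampling only, nothing TensorRT-LLM-specific):

```bash
# sample used/total GPU memory once per second during a summarize.py run
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```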
The results are summarized as follows:

- case0 (fp16): memory = 87.60, speed = 581 tokens/s
- case1 (int4 weight-only): memory = 87.50, speed = 533 tokens/s
- case2 (SmoothQuant): memory = 87.60, speed = 510 tokens/s
- case3 (int8 KV cache): memory = 87.60, speed = 547 tokens/s
I have two questions:

1. Why is the GPU memory usage almost identical across the four cases, when the difference I expected (e.g. between fp16 and int4 weights) is much larger?
2. Why is inference fastest in the fp16 case?
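For question 1, one thing I have not ruled out is KV-cache pre-allocation: if the runtime reserves a fixed fraction of free GPU memory for the paged KV cache, nvidia-smi would show nearly the same number for all four engines regardless of weight precision. Assuming summarize.py in 0.14.0 accepts --kv_cache_free_gpu_memory_fraction (an assumption on my side, please correct me if the flag differs), a run like the following should show whether the gap between the cases reappears with a smaller pool:

```bash
# re-run the fp16 case with a smaller KV-cache pool
# (--kv_cache_free_gpu_memory_fraction assumed to be available in 0.14.0 summarize.py)
python ../summarize.py --test_trt_llm \
    --hf_model_dir ./tmp/Qwen/0.5B/ \
    --data_type fp16 \
    --engine_dir ./tmp/qwen/7B/trt_engines/fp16/1-gpu \
    --max_input_length 2048 \
    --output_len 2048 \
    --kv_cache_free_gpu_memory_fraction 0.2
```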
Thank you. I hope to receive an answer as soon as possible.