
Questions about GPU memory usage and inference speed of the Qwen2 0.5B model


CPU: x86_64
GPU: NVIDIA H20
CUDA version: 12.4
TensorRT-LLM version: 0.14.0

I followed https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/qwen/README.md to run the Qwen2 0.5B model. The results I obtained are as follows:

  • case0 (FP16 baseline):

    • python convert_checkpoint.py --model_dir ./tmp/Qwen/0.5B/ \
      --output_dir ./tllm_checkpoint_1gpu_fp16 \
      --dtype float16
    • trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
      --output_dir ./tmp/qwen/7B/trt_engines/fp16/1-gpu \
      --gemm_plugin float16
    • python ../summarize.py --test_trt_llm \
      --hf_model_dir ./tmp/Qwen/0.5B/ \
      --data_type fp16 \
      --engine_dir ./tmp/qwen/7B/trt_engines/fp16/1-gpu \
      --max_input_length 2048 \
      --output_len 2048
  • case1 (INT4 weight-only quantization):

    • python convert_checkpoint.py --model_dir ./tmp/Qwen2/0.5B/ \
      --output_dir ./tllm_checkpoint_1gpu_fp16_wq \
      --dtype float16 \
      --use_weight_only \
      --weight_only_precision int4
    • trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16_wq \
      --output_dir ./tmp/qwen/0.5B/trt_engines/weight_only/1-gpu/ \
      --gemm_plugin float16
    • python ../summarize.py --test_trt_llm \
      --hf_model_dir ./tmp/Qwen2/0.5B/ \
      --data_type fp16 \
      --engine_dir ./tmp/qwen/0.5B/trt_engines/weight_only/1-gpu/ \
      --max_input_length 2048 \
      --output_len 2048
  • case2 (SmoothQuant):

    • python3 convert_checkpoint.py --model_dir ./tmp/Qwen2/0.5B/ \
      --output_dir ./tllm_checkpoint_1gpu_sq \
      --dtype float16 \
      --smoothquant 0.5
    • trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_sq \
      --output_dir ./engine_outputs \
      --gemm_plugin float16
    • python ../summarize.py --test_trt_llm \
      --hf_model_dir ./tmp/Qwen2/0.5B/ \
      --data_type fp16 \
      --engine_dir ./engine_outputs \
      --max_input_length 2048 \
      --output_len 2048
  • case3 (INT8 KV cache):

    • python convert_checkpoint.py --model_dir ./tmp/Qwen/0.5B/ \
      --output_dir ./tllm_checkpoint_1gpu_fp16_int8kv \
      --dtype float16 \
      --int8_kv_cache
    • trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16_int8kv \
      --output_dir ./engine_outputs \
      --gemm_plugin float16
    • python ../summarize.py --test_trt_llm \
      --hf_model_dir ./tmp/Qwen/0.5B/ \
      --data_type fp16 \
      --engine_dir ./engine_outputs \
      --max_input_length 2048 \
      --output_len 2048
  • Summary of the results:

  • case0 (FP16): memory = 87.60, throughput = 581 tokens/s

  • case1 (INT4 weight-only): memory = 87.50, throughput = 533 tokens/s

  • case2 (SmoothQuant): memory = 87.60, throughput = 510 tokens/s

  • case3 (INT8 KV cache): memory = 87.60, throughput = 547 tokens/s
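
Relevant to my first question below: I wonder whether the near-identical memory numbers are dominated by a pre-allocated KV-cache pool rather than by the engine weights themselves. Below is a minimal polling sketch I can run in a second terminal to check that; it is not from the README, it assumes the pynvml package is installed, and it assumes the engine runs on GPU index 0.

```python
# poll_gpu_mem.py -- minimal sketch (not from the README).
# Assumptions: pynvml is installed and the engine runs on GPU index 0.
# Run this in a second terminal while summarize.py executes, to see whether
# the usage jumps to ~87 GB in a single step (a one-time pool allocation)
# or grows gradually as tokens are generated.
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    while True:
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"used {mem.used / 2**30:6.2f} GiB / total {mem.total / 2**30:6.2f} GiB")
        time.sleep(1.0)
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()
```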

I have some questions:

1. Why is the GPU memory usage nearly identical across the four cases, when I expected the quantized engines to use noticeably less?
2. Why is inference fastest with the plain FP16 engine rather than with any of the quantized ones?
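
To get at question 1 from a different angle, the quantization effect should at least be visible in the size of the engines on disk, even if runtime GPU usage looks the same. A throwaway sketch (the paths are just the engine_dir values from the commands above and may need adjusting; note that case2 and case3 both wrote to ./engine_outputs, so only the most recent build is there):

```python
# engine_size.py -- throwaway sketch; the paths are the engine_dir values
# used above and may need adjusting for your setup.
import os

engine_dirs = {
    "case0 fp16": "./tmp/qwen/7B/trt_engines/fp16/1-gpu",
    "case1 int4 weight-only": "./tmp/qwen/0.5B/trt_engines/weight_only/1-gpu",
    "case2/case3 (last build wins)": "./engine_outputs",
}

for name, path in engine_dirs.items():
    total = sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(path)
        for f in files
    )
    print(f"{name}: {total / 2**30:.2f} GiB on disk in {path}")
```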

Thank you. I hope to get an answer soon.

GuangyanZhang, Oct 10, 2024