
qwen1.5-7b: why does it need 37GB of GPU memory?

Open xiangxinhello opened this issue 1 year ago • 1 comments

System Info

nvidia A100 PCIE 40g
TensorRT-LLM version: 0.12.0.dev2024070200

python convert_checkpoint.py --model_dir ./tmp/Qwen/7B/ \
    --output_dir ./tllm_checkpoint_1gpu_fp16 \
    --dtype float16

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
    --output_dir ./tmp/qwen/7B/trt_engines/fp16/1-gpu \
    --gemm_plugin float16 --max_batch_size 1 \
    --max_input_len 1 --max_seq_len 3 --max_num_tokens 1

When I run qwen1.5-7b with TensorRT-LLM 0.12.0, it requires 37GB of GPU memory. nvidia-smi on the A100 PCIE 40g reports: 37957MiB / 40960MiB (N/A 32C P0 40W / 250W).

Who can help?

No response

Information

  • [x] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

1. Excessive GPU memory: with 0.12.0, trt-llm qwen1.5-7b uses 37GB of GPU memory.
2. With 0.8.0, trt-llm qwen1.5-7b uses only 17GB of GPU memory.

Expected behavior

I want 0.12.0 trtllm qwen1.5-7b to use less GPU memory, ideally comparable to the ~17GB used by 0.8.0.

actual behavior

0.12.0 trtllm qwen1.5-7b uses 37GB of GPU memory.


xiangxinhello avatar Jul 26 '24 10:07 xiangxinhello

The default value of the kv_cache_free_gpu_memory_fraction parameter is 0.9. This means the runtime will use approximately 90% of the free GPU memory: one portion is allocated for the normal operation of the model, and the remainder is pre-allocated as a pool for the KV cache. The high usage you observe is therefore expected behavior, not a leak; lowering the fraction reduces the footprint.
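As a rough back-of-the-envelope sketch of that explanation (the ~14 GB weight figure is an assumption: 7e9 params × 2 bytes for fp16, ignoring activations and fragmentation, on a 40 GB card), the default 0.9 fraction alone accounts for the observed ~37 GB:

```python
# Sketch of how kv_cache_free_gpu_memory_fraction drives total GPU usage.
# Assumed round numbers: 40 GB card, ~14 GB fp16 weights for a 7B model,
# fraction = 0.9 (the default mentioned above).

def expected_usage_gb(total_gb, weights_gb, kv_fraction=0.9):
    """Weights plus the KV-cache pool carved out of the remaining free memory."""
    free_after_weights = total_gb - weights_gb
    kv_pool = kv_fraction * free_after_weights
    return weights_gb + kv_pool

usage = expected_usage_gb(total_gb=40.0, weights_gb=14.0)
print(f"{usage:.1f} GB")  # ~37.4 GB, close to the 37957 MiB nvidia-smi reports
```

Passing a smaller fraction to the runtime (check the flags your version's run.py actually exposes) shrinks the pre-allocated pool, at the cost of maximum KV-cache capacity and thus batch size / sequence length headroom.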

vonchenplus avatar Jul 30 '24 10:07 vonchenplus

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] avatar Sep 05 '24 01:09 github-actions[bot]