qwen1.5-7b: Why do I need 37GB of GPU memory?
System Info
- GPU: NVIDIA A100 PCIE 40G
- TensorRT-LLM version: 0.12.0.dev2024070200

Build commands:

```shell
python convert_checkpoint.py --model_dir ./tmp/Qwen/7B/ \
                             --output_dir ./tllm_checkpoint_1gpu_fp16 \
                             --dtype float16

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
             --output_dir ./tmp/qwen/7B/trt_engines/fp16/1-gpu \
             --gemm_plugin float16 \
             --max_batch_size 1 \
             --max_input_len 1 \
             --max_seq_len 3 \
             --max_num_tokens 1
```

When I run qwen1.5-7b with TensorRT-LLM 0.12.0, it requires 37GB of GPU memory. From nvidia-smi:

```
nvidia A100 PCIE 40g   N/A   32C   P0   40W / 250W | 37957MiB / 40960MiB |
```
Who can help?
No response
Information
- [x] The official example scripts
- [ ] My own modified scripts

Tasks
- [x] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
1. With TensorRT-LLM 0.12.0, qwen1.5-7b uses 37GB of GPU memory, which is excessive.
2. With TensorRT-LLM 0.8.0, the same qwen1.5-7b uses only 17GB of GPU memory.
Expected behavior
GPU memory usage comparable to 0.8.0 (around 17GB); I want the 0.12.0 version to reduce its GPU memory usage.
actual behavior
With 0.12.0, qwen1.5-7b uses 37GB of GPU memory.
additional notes
The default value of the kv_cache_free_gpu_memory_fraction parameter is 0.9. This means the runner will use roughly 90% of the GPU memory: one portion for the normal operation of the user's model, and the rest pre-allocated as a pool for the KV cache. Total usage therefore approaches the full card regardless of engine size; lowering this fraction shrinks the pre-allocated pool.
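As a rough illustration of that arithmetic (a sketch with assumed numbers, not measured values: ~14 GiB is a ballpark for 7B fp16 weights at 2 bytes per parameter, and the model is assumed to reserve the fraction out of the memory left after weights are loaded):

```python
# Sketch: how kv_cache_free_gpu_memory_fraction drives total GPU usage.
# All sizes in GiB. The 14 GiB weight figure is an assumption (~2 bytes
# per parameter for a 7B fp16 model), not a measurement.

def estimated_usage(total_gib, weights_gib, kv_fraction):
    """Return (kv_cache_gib, total_used_gib), assuming the runtime
    reserves kv_fraction of the memory left after loading weights."""
    free_after_weights = total_gib - weights_gib
    kv_cache = kv_fraction * free_after_weights
    return kv_cache, weights_gib + kv_cache

# A100 40G, default fraction 0.9
kv, used = estimated_usage(40.0, 14.0, 0.9)
print(f"kv cache ~{kv:.1f} GiB, total ~{used:.1f} GiB")  # near the 37957MiB observed

# A smaller fraction leaves most of the card free
kv_small, used_small = estimated_usage(40.0, 14.0, 0.3)
print(f"with fraction 0.3: total ~{used_small:.1f} GiB")
```

Under these assumptions the default 0.9 lands close to the 37GB reported above, so the high usage is expected pre-allocation rather than a leak. If your TensorRT-LLM example scripts expose this parameter as a flag (the exact flag name may differ by version), passing a smaller value such as 0.3 should cap the reserved pool.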
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.