
qwen1.5-7b: why does it need 37GB of GPU memory?

Open xiangxinhello opened this issue 1 year ago • 1 comments

System Info

nvidia A100 PCIE 40g
TensorRT-LLM version: 0.12.0.dev2024070200

python convert_checkpoint.py --model_dir ./tmp/Qwen/7B/ \
    --output_dir ./tllm_checkpoint_1gpu_fp16 \
    --dtype float16

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
    --output_dir ./tmp/qwen/7B/trt_engines/fp16/1-gpu \
    --gemm_plugin float16 --max_batch_size 1 \
    --max_input_len 1 --max_seq_len 3 --max_num_tokens 1

When I run qwen1.5-7b with TensorRT-LLM 0.12.0, it requires 37GB of GPU memory. nvidia-smi on the A100 PCIE 40g reports: 37957MiB / 40960MiB (N/A 32C P0 40W / 250W).

Who can help?

No response

Information

  • [x] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

1. Excessive GPU memory: with 0.12.0, trt-llm qwen1.5-7b uses 37GB of GPU memory.
2. With 0.8.0, trt-llm qwen1.5-7b uses only 17GB of GPU memory.

Expected behavior

I want 0.12.0 trtllm qwen1.5-7b to use less GPU memory, ideally comparable to the ~17GB used by 0.8.0.

actual behavior

0.12.0 trtllm qwen1.5-7b uses 37GB of GPU memory.


xiangxinhello avatar Jul 26 '24 10:07 xiangxinhello

The default value of the kv_cache_free_gpu_memory_fraction parameter is 0.9. This means the runtime will use approximately 90% of the free GPU memory: one portion is allocated for the normal operation of the model, and the remainder is pre-allocated as a pool for the KV cache. The high usage you observe is therefore expected behavior, not a leak; lowering the fraction reduces the footprint.
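As a rough back-of-the-envelope sketch of that explanation (the ~14 GB weight figure is an assumption: 7e9 params × 2 bytes for fp16, ignoring activations and fragmentation, on a 40 GB card), the default 0.9 fraction alone accounts for the observed ~37 GB:

```python
# Sketch of how kv_cache_free_gpu_memory_fraction drives total GPU usage.
# Assumed round numbers: 40 GB card, ~14 GB fp16 weights for a 7B model,
# fraction = 0.9 (the default mentioned above).

def expected_usage_gb(total_gb, weights_gb, kv_fraction=0.9):
    """Weights plus the KV-cache pool carved out of the remaining free memory."""
    free_after_weights = total_gb - weights_gb
    kv_pool = kv_fraction * free_after_weights
    return weights_gb + kv_pool

usage = expected_usage_gb(total_gb=40.0, weights_gb=14.0)
print(f"{usage:.1f} GB")  # ~37.4 GB, close to the 37957 MiB nvidia-smi reports
```

Passing a smaller fraction to the runtime (check the flags your version's run.py actually exposes) shrinks the pre-allocated pool, at the cost of maximum KV-cache capacity and thus batch size / sequence length headroom.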

vonchenplus avatar Jul 30 '24 10:07 vonchenplus

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] avatar Sep 05 '24 01:09 github-actions[bot]