FEAT: Add command cal-model-mem
Implement model.llm.memory.estimate_llm_gpu_memory, which outputs model_mem, kv_cache, overhead, and active_mem.
- Download config.json from Hugging Face / ModelScope and load the model's layer info
- Support kv_cache_dtype 8/16/32 (gpu_poor appears to only calculate fp32)
The algorithm follows https://github.com/RahulSChand/gpu_poor.
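For orientation, the kv_cache term in this style of estimate is 2 (K and V) * num_layers * hidden_size * context_length * bytes per element. A minimal sketch, assuming the layer count and hidden size have already been read from config.json; the helper name is illustrative, not the actual implementation:

```python
def estimate_kv_cache_mb(num_layers: int, hidden_size: int,
                         context_length: int, kv_cache_dtype: int = 16) -> float:
    """Rough KV-cache size in MB: K and V tensors for every layer and token."""
    bytes_per_element = kv_cache_dtype // 8
    total_bytes = 2 * num_layers * hidden_size * context_length * bytes_per_element
    return total_bytes / 1024 ** 2

# Qwen1.5-7B (32 layers, hidden_size 4096) at a 16384-token context with a 16-bit
# cache: 2 * 32 * 4096 * 16384 * 2 bytes = 8192 MB, matching the first example below.
```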
model.llm.utils: Add convert_model_size_to_float
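Judging from the -s values in the examples below, the helper parses underscore-style size strings into billions of parameters; a plausible sketch (the real implementation in model.llm.utils may differ):

```python
def convert_model_size_to_float(model_size: str) -> float:
    """Parse -s style sizes in billions: "7" -> 7.0, "1_8" -> 1.8."""
    return float(str(model_size).replace("_", "."))
```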
Usage:
$ env HF_ENDPOINT=https://hf-mirror.com xinference cal-model-mem -s 7 -q Int4 -f gptq -c 16384 -n qwen1.5-chat
model_name: qwen1.5-chat
kv_cache_dtype: 16
model size: 7.0 B
quant: Int4
context: 16384
gpu mem usage:
model mem: 4139 MB
kv_cache: 8192 MB
overhead: 650 MB
active: 17024 MB
total: 30005 MB (30 GB)
$ env HF_ENDPOINT=https://hf-mirror.com xinference cal-model-mem -s 1_8 -q Int4 -f gptq -c 32768 -n qwen1.5-chat
model_name: qwen1.5-chat
kv_cache_dtype: 16
model size: 1.8 B
quant: Int4
context: 32768
gpu mem usage:
model mem: 1065 MB
kv_cache: 6144 MB
overhead: 650 MB
active: 33408 MB
total: 41267 MB (41 GB)
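The same estimate should also be reachable from Python. The keyword names below are guesses mirroring the CLI flags (-s, -q, -f, -c, -n), not a confirmed signature:

```python
from xinference.model.llm.memory import estimate_llm_gpu_memory

# Argument names are assumptions based on the CLI flags, not the verified API.
info = estimate_llm_gpu_memory(
    model_size_in_billions="7",
    quantization="Int4",
    model_format="gptq",
    context_length=16384,
    model_name="qwen1.5-chat",
)
print(info)  # expected to expose model_mem, kv_cache, overhead and active_mem
```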