
FEAT: Add command cal-model-mem

frostyplanet opened this issue 1 year ago · 0 comments

Implement model.llm.memory.estimate_llm_gpu_memory,

which outputs model_mem, kv_cache, overhead, and active_mem.

  • Download config.json from huggingface/modelscope and load the model's layer info

  • Support kv_cache_dtype of 8/16/32 bits (gpu_poor appears to only calculate fp32)

The algorithm follows https://github.com/RahulSChand/gpu_poor
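A minimal sketch of the estimator's shape, assuming the output fields listed above; the `GPUMemoryEstimate` container and `kv_cache_mb` helper names are hypothetical, not the actual implementation. The kv-cache term is the standard one: K and V each store `context * hidden_size` values per layer.

```python
from dataclasses import dataclass


@dataclass
class GPUMemoryEstimate:
    """All sizes in MB; fields mirror the outputs listed above (hypothetical container)."""
    model_mem: int   # quantized weights
    kv_cache: int    # K/V cache at the requested context length
    overhead: int    # roughly constant runtime overhead
    active_mem: int  # activation memory during inference

    @property
    def total_mb(self) -> int:
        return self.model_mem + self.kv_cache + self.overhead + self.active_mem


def kv_cache_mb(num_layers: int, hidden_size: int, context: int,
                kv_cache_dtype: int = 16) -> int:
    """K and V each hold context * hidden_size values per layer."""
    bytes_per_value = kv_cache_dtype // 8
    total_bytes = 2 * num_layers * context * hidden_size * bytes_per_value
    return total_bytes // (1024 * 1024)
```

With Qwen1.5-7B's config (32 layers, hidden size 4096) and a 16384 context, this gives the 8192 MB kv_cache shown in the usage output below.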

model.llm.utils: Add convert_model_size_to_float
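Judging from the usage below (`-s 7` → "7.0 B", `-s 1_8` → "1.8 B"), the underscore stands in for a decimal point so the size can appear in model identifiers; a one-line sketch of what `convert_model_size_to_float` likely does:

```python
def convert_model_size_to_float(size_in_billions: str) -> float:
    # "_" stands in for ".", e.g. "7" -> 7.0, "1_8" -> 1.8
    return float(size_in_billions.replace("_", "."))
```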

Usage:

$ env HF_ENDPOINT=https://hf-mirror.com xinference cal-model-mem -s 7 -q Int4 -f gptq -c 16384 -n qwen1.5-chat
model_name: qwen1.5-chat
kv_cache_dtype: 16
model size: 7.0 B
quant: Int4
context: 16384
gpu mem usage:
  model mem: 4139 MB
  kv_cache: 8192 MB
  overhead: 650 MB
  active: 17024 MB
  total: 30005 MB (30 GB)
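The kv_cache figure above can be reproduced by hand. Assuming Qwen1.5-7B's config.json values (32 hidden layers, hidden size 4096), with fp16 cache entries:

```python
# Qwen1.5-7B config values (assumed from its config.json):
num_layers, hidden_size = 32, 4096
context, bytes_per_value = 16384, 2  # 16-bit kv cache

# K and V each store context * hidden_size values per layer
kv_bytes = 2 * num_layers * context * hidden_size * bytes_per_value
print(kv_bytes // (1024 * 1024))  # 8192 MB, matching the output above
```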

$ env HF_ENDPOINT=https://hf-mirror.com xinference cal-model-mem -s 1_8 -q Int4 -f gptq -c 32768 -n qwen1.5-chat
model_name: qwen1.5-chat
kv_cache_dtype: 16
model size: 1.8 B
quant: Int4
context: 32768
gpu mem usage:
  model mem: 1065 MB
  kv_cache: 6144 MB
  overhead: 650 MB
  active: 33408 MB
  total: 41267 MB (41 GB)
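In both runs the reported total is exactly the sum of the four components, which can be checked directly:

```python
runs = {
    "qwen1.5-chat 7B @ 16384":   (4139, 8192, 650, 17024),  # -> 30005 MB
    "qwen1.5-chat 1.8B @ 32768": (1065, 6144, 650, 33408),  # -> 41267 MB
}
for name, (model_mem, kv_cache, overhead, active) in runs.items():
    print(name, model_mem + kv_cache + overhead + active, "MB")
```

The second run also shows the kv-cache term scaling as expected: Qwen1.5-1.8B (assuming 24 layers, hidden size 2048) at context 32768 gives 2·24·32768·2048·2 B = 6144 MB.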


frostyplanet · May 09 '24 08:05