[WIP] Reduce peak memory to 8 GB

Open · andy-yang-1 opened this issue

When applying the delta, we currently load two complete models (the base and the delta) into memory at the same time, which puts heavy pressure on CPU RAM. This PR makes it possible to apply the delta within 10 GB of memory.
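As rough arithmetic, assuming fp16 weights, a 13B-parameter model takes about 13 × 10^9 × 2 bytes ≈ 26 GB, so holding base and delta together needs over 50 GB before any overhead. One way to cut the peak is to never hold either full model: stream the delta one shard at a time and load only the base shards it needs. The following is a minimal, hypothetical illustration of that shard-by-shard idea, not the code in this PR. Plain Python lists stand in for tensors; `apply_delta_by_shard`, `base_index`, and the loader/saver callbacks are invented names; and it assumes target = base + delta element-wise with matching shapes.

```python
# Hypothetical sketch: shard-by-shard delta application (not the PR's code).
# Plain Python lists stand in for tensors; target = base + delta element-wise,
# and each delta tensor is assumed to have the same shape as its base tensor.
from typing import Callable, Dict, Iterable, List

Tensor = List[float]       # stand-in for a real weight tensor
Shard = Dict[str, Tensor]  # tensor name -> tensor, i.e. one checkpoint file


def apply_delta_by_shard(
    base_index: Dict[str, str],                # tensor name -> base shard path
    load_base_shard: Callable[[str], Shard],   # reads one base shard
    delta_shards: Iterable[Shard],             # one delta shard per item (e.g. a generator)
    save_target_shard: Callable[[int, Shard], None],
) -> int:
    """Write target = base + delta one delta shard at a time.

    Returns the peak number of elements resident at once, so it can be
    compared with holding both full models.
    """
    peak = 0
    for shard_id, delta in enumerate(delta_shards):
        out: Shard = {}
        # Read only the base shards that hold this delta shard's tensors.
        for path in sorted({base_index[name] for name in delta}):
            base = load_base_shard(path)
            for name, d in delta.items():
                if base_index[name] == path:
                    out[name] = [b + x for b, x in zip(base[name], d)]
            # Resident now: one delta shard, one base shard, partial output.
            resident = sum(len(t) for s in (delta, base, out) for t in s.values())
            peak = max(peak, resident)
            del base  # release this base shard before loading the next one
        save_target_shard(shard_id, out)
    return peak


if __name__ == "__main__":
    base_files = {
        "base-0": {"a": [1.0, 2.0], "b": [3.0]},
        "base-1": {"c": [4.0, 5.0, 6.0]},
    }
    index = {name: path for path, shard in base_files.items() for name in shard}
    deltas = [{"a": [0.5, 0.5], "b": [1.0]}, {"c": [1.0, 1.0, 1.0]}]
    saved: Dict[int, Shard] = {}
    peak = apply_delta_by_shard(index, base_files.__getitem__, deltas, saved.__setitem__)
    print(saved)  # {0: {'a': [1.5, 2.5], 'b': [4.0]}, 1: {'c': [5.0, 6.0, 7.0]}}
    print(peak)   # 9 elements, versus 12 for both full "models" at once
```

In a real run, `delta_shards` would be a generator that loads each delta checkpoint file in turn, so only one delta shard, one base shard, and one output shard are alive at any time; the peak then scales with the shard size rather than the model size.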

Run with:

```shell
python3 -m fastchat.model.apply_delta \
  --base /path/to/llama-13b \
  --target /output/path/to/vicuna-13b \
  --delta lmsys/vicuna-13b-delta-v0
```
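To check a peak-memory figure like the one above on your own machine, one option is to run the command as a child process and read its peak resident set size. This is a minimal sketch using only the Python standard library (`resource` is Unix-only); the helper name `run_and_report_peak_rss` is hypothetical, and note that `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS.

```python
import resource
import subprocess
import sys
from typing import List


def run_and_report_peak_rss(cmd: List[str]) -> float:
    """Run cmd to completion and return its peak resident set size in GiB.

    RUSAGE_CHILDREN reports the largest RSS among all children this process
    has waited for, so call it from a fresh interpreter to isolate one command.
    """
    subprocess.run(cmd, check=True)
    max_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is in bytes on macOS and in kilobytes on Linux.
    rss_bytes = max_rss if sys.platform == "darwin" else max_rss * 1024
    return rss_bytes / 2**30


if __name__ == "__main__":
    # Tiny self-check: a child process that allocates about 64 MiB.
    child = [sys.executable, "-c", "x = b'x' * (64 * 1024 * 1024)"]
    print(f"peak RSS: {run_and_report_peak_rss(child):.3f} GiB")
```

To measure the delta application itself, pass the same arguments as the command above, e.g. `run_and_report_peak_rss([sys.executable, "-m", "fastchat.model.apply_delta", "--base", "/path/to/llama-13b", "--target", "/output/path/to/vicuna-13b", "--delta", "lmsys/vicuna-13b-delta-v0"])`. On Linux, GNU `/usr/bin/time -v` reports the same figure as "Maximum resident set size".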

Apr 13 '23 06:04 andy-yang-1