TensorRT-LLM
Parameter --load_model_on_cpu is ignored by llama convert_checkpoint.py (version 0.9.0)
System Info
- OS: Ubuntu 22.04.4 LTS
- Nvidia driver version: 545.23.08
- CPU architecture: x86
- RAM size: ~500GB
- GPUs: 2x L40S (48 GB each)
- Docker container image: manually compiled tensorrtllm_backend v0.9.0
- TensorRT-LLM version: 0.9.0
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
cd tensorrt_llm/examples/llama
python convert_checkpoint.py --model_dir mistralai/Mixtral-8x7B-Instruct-v0.1 --output_dir /output/mixtral-w4a16-tp2/ --dtype bfloat16 --tp_size 2 --use_weight_only --weight_only_precision int4 --moe_num_experts 8 --moe_top_k 2 --load_model_on_cpu
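For context, below is a minimal sketch (not the actual convert_checkpoint.py code) of what --load_model_on_cpu is expected to achieve: loading the Hugging Face checkpoint into host RAM instead of onto the GPUs before quantization and conversion. It assumes transformers and accelerate are installed; the model name is taken from the command above.

```python
import torch
from transformers import AutoModelForCausalLM

# Keep the full bf16 checkpoint (roughly 90+ GB for Mixtral-8x7B) in host RAM,
# which is what --load_model_on_cpu should allow convert_checkpoint.py to do.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)
print(next(model.parameters()).device)  # expected: cpu
```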
Expected behavior
The script should not use the GPUs.
Actual behavior
The script uses the GPUs, which also leads to an out-of-memory (OOM) error.
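One simple way to confirm the GPUs are being used is to poll nvidia-smi from a second shell while convert_checkpoint.py runs; the helper below is just an illustration of that check (the query flags are standard nvidia-smi options).

```python
import subprocess

# Print per-GPU memory usage; run this while convert_checkpoint.py is executing.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,memory.used",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout
for line in out.strip().splitlines():
    idx, mem = [field.strip() for field in line.split(",")]
    print(f"GPU {idx}: {mem} MiB in use")
```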
Additional notes
On the same system, version 0.8.0 didn't have this problem.
Same issue here.
Sorry for the late reply. @fedem96 @ChristianPala Could you please try a later version? This bug is fixed in later releases. Thanks!