TensorRT-LLM
Parameter --load_model_on_cpu is ignored by llama convert_checkpoint.py (version 0.9.0)
System Info
- OS: Ubuntu 22.04.4 LTS
- Nvidia driver version: 545.23.08
- CPU architecture: x86
- RAM size: ~500GB
- GPUs: 2x L40S (48 GB each)
- Docker container image: manually compiled tensorrtllm_backend v0.9.0
- TensorRT-LLM version: 0.9.0
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
cd tensorrt_llm/examples/llama
python convert_checkpoint.py --model_dir mistralai/Mixtral-8x7B-Instruct-v0.1 --output_dir /output/mixtral-w4a16-tp2/ --dtype bfloat16 --tp_size 2 --use_weight_only --weight_only_precision int4 --moe_num_experts 8 --moe_top_k 2 --load_model_on_cpu
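For context, below is a minimal sketch (not the actual convert_checkpoint.py code) of what --load_model_on_cpu is expected to achieve: loading the Hugging Face checkpoint into host RAM instead of onto the GPUs before quantization and conversion. It assumes transformers and accelerate are installed; the model name is taken from the command above.

```python
import torch
from transformers import AutoModelForCausalLM

# Keep the full bf16 checkpoint (roughly 90+ GB for Mixtral-8x7B) in host RAM,
# which is what --load_model_on_cpu should allow convert_checkpoint.py to do.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)
print(next(model.parameters()).device)  # expected: cpu
```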
Expected behavior
The script should not use the GPUs.
Actual behavior
The script uses the GPUs, which also leads to an out-of-memory (OOM) error.
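One simple way to confirm the GPUs are being used is to poll nvidia-smi from a second shell while convert_checkpoint.py runs; the helper below is just an illustration of that check (the query flags are standard nvidia-smi options).

```python
import subprocess

# Print per-GPU memory usage; run this while convert_checkpoint.py is executing.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,memory.used",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout
for line in out.strip().splitlines():
    idx, mem = [field.strip() for field in line.split(",")]
    print(f"GPU {idx}: {mem} MiB in use")
```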
Additional notes
On the same system, version 0.8.0 didn't have this problem.
Same issue here.
Sorry for the late reply. @fedem96 @ChristianPala Could you please try a later version? This bug is fixed in later releases. Thanks!