Auto parallel + int4 weight-only quantization is unsupported
System Info
TensorRT-LLM release 0.9.0
Who can help?
@Tracin
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
In the 0.9.0 release, I try to apply auto parallel together with quantization on the Llama 70B model; each of them works fine independently.
I convert the checkpoint using:

```bash
python3 ./TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir ./llama-2/llama-2-70b \
    --dtype float16 \
    --output_dir ./model_profile_tmp/ckpt/2 \
    --use_weight_only \
    --weight_only_precision int4
```
and then I try to build with auto parallel using:

```bash
trtllm-build \
    --checkpoint_dir ./model_profile_tmp/ckpt/2/ \
    --gemm_plugin float16 \
    --use_custom_all_reduce disable \
    --output_dir ./model_profile_tmp/engine/2/ \
    --workers 8 \
    --max_batch_size 1 \
    --auto_parallel 8 \
    --weight_only_precision int4
```
Expected behavior
It should build the engine successfully, with float16 activations and int4 weight-only weights.
Actual behavior
However, it fails with:
```
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 291, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 284, in build_model
    return build(model, build_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 669, in build
    model = optimize_model(model, use_unfused_qkv_gemm=use_auto_parallel)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 890, in optimize_model
    model = unfuse_qkv_gemm(model)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 767, in unfuse_qkv_gemm
    gemm.weight.value = weight
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 120, in value
    assert v.shape == self._shape,
AssertionError: The value updated is not the same shape as the original. Updated: (8192, 5120), original: (8192, 8192)
```
Additional notes
Looks like auto parallel cannot deal with quantization modes that change weight dimensions in TRT: int4 weight-only packing stores two int4 values per int8 byte, halving the last dimension of each weight, so the shape check in `unfuse_qkv_gemm` fails (see the sketch below).
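To make the dimension change concrete, here is a minimal NumPy sketch of two-per-byte int4 packing; the `pack_int4` helper is illustrative, not TensorRT-LLM's implementation, but it reproduces the "Updated" shape from the assertion:

```python
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack a tensor of int4 values (stored as int8) two-per-byte along the
    last dimension -- an illustrative stand-in for the weight-only int4
    packing TensorRT-LLM applies to the checkpoint."""
    low = (q[..., 0::2] & 0x0F).astype(np.uint8)   # even columns -> low nibble
    high = (q[..., 1::2] & 0x0F).astype(np.uint8)  # odd columns -> high nibble
    return ((high << 4) | low).astype(np.int8)

# Fused QKV weight for Llama-2-70B: hidden size 8192; with GQA (8 KV heads,
# head size 128), Q contributes 8192 columns and K/V contribute 1024 each.
fused_qkv = np.zeros((8192, 8192 + 1024 + 1024), dtype=np.int8)
print(pack_int4(fused_qkv).shape)  # (8192, 5120) -- the "Updated" shape above

# unfuse_qkv_gemm declares the unfused Q GEMM parameter with the unpacked
# float16 shape (8192, 8192), so assigning the packed tensor trips the
# shape assertion in parameter.py.
```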
Auto parallel does not support working with quantization right now. We will add an assertion to make the error message clearer.
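For reference, such a guard might look like the sketch below. This is hypothetical, not the patch that landed on main; names like `check_auto_parallel_compat`, `auto_parallel_config.world_size`, and `quantization.quant_mode` are assumptions based on the 0.9 Python API:

```python
# Hypothetical sketch of the clearer assertion -- not the actual fix.
# `build_config.auto_parallel_config.world_size` and
# `model.config.quantization.quant_mode` are assumed attribute names.

def check_auto_parallel_compat(model, build_config):
    use_auto_parallel = build_config.auto_parallel_config.world_size > 1
    quant_mode = model.config.quantization.quant_mode
    if use_auto_parallel and quant_mode.has_any_quant():
        raise RuntimeError(
            "Auto parallel does not support quantized checkpoints yet: "
            "weight packing changes weight shapes, which breaks the "
            "unfused QKV GEMM rewrite. Convert the checkpoint with "
            "explicit --tp_size/--pp_size parallelism instead.")
```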
Clearer error information has been added to the latest main branch; closing.
Please let us know if there are any questions, thanks.