Auto parallel + int4 weight-only quantization is unsupported
System Info
TensorRT-LLM release 0.9.0
Who can help?
@Tracin
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
In the 0.9.0 release, I try to apply auto parallel together with quantization on the Llama 70B model; each of them works fine independently.
I convert the checkpoint using:

```bash
python3 ./TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir ./llama-2/llama-2-70b \
    --dtype float16 \
    --output_dir ./model_profile_tmp/ckpt/2 \
    --use_weight_only \
    --weight_only_precision int4
```
and then I try to build with auto parallel using:

```bash
trtllm-build \
    --checkpoint_dir ./model_profile_tmp/ckpt/2/ \
    --gemm_plugin float16 \
    --use_custom_all_reduce disable \
    --output_dir ./model_profile_tmp/engine/2/ \
    --workers 8 \
    --max_batch_size 1 \
    --auto_parallel 8 \
    --weight_only_precision int4
```
Expected behavior
It should build the engine successfully, with float16 activations and int4 weight-only weights.
Actual behavior
However, it fails with:
```
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 291, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 284, in build_model
    return build(model, build_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 669, in build
    model = optimize_model(model, use_unfused_qkv_gemm=use_auto_parallel)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 890, in optimize_model
    model = unfuse_qkv_gemm(model)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 767, in unfuse_qkv_gemm
    gemm.weight.value = weight
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 120, in value
    assert v.shape == self._shape,
AssertionError: The value updated is not the same shape as the original. Updated: (8192, 5120), original: (8192, 8192)
```
Additional notes
Looks like auto parallel cannot deal with quantization modes that change weight dimensions in TRT: int4 weight-only packing stores two int4 values per int8 byte, halving the last dimension of each weight, so the shape check in `unfuse_qkv_gemm` fails (see the sketch below).
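To make the dimension change concrete, here is a minimal NumPy sketch of two-per-byte int4 packing; the `pack_int4` helper is illustrative, not TensorRT-LLM's implementation, but it reproduces the "Updated" shape from the assertion:

```python
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack a tensor of int4 values (stored as int8) two-per-byte along the
    last dimension -- an illustrative stand-in for the weight-only int4
    packing TensorRT-LLM applies to the checkpoint."""
    low = (q[..., 0::2] & 0x0F).astype(np.uint8)   # even columns -> low nibble
    high = (q[..., 1::2] & 0x0F).astype(np.uint8)  # odd columns -> high nibble
    return ((high << 4) | low).astype(np.int8)

# Fused QKV weight for Llama-2-70B: hidden size 8192; with GQA (8 KV heads,
# head size 128), Q contributes 8192 columns and K/V contribute 1024 each.
fused_qkv = np.zeros((8192, 8192 + 1024 + 1024), dtype=np.int8)
print(pack_int4(fused_qkv).shape)  # (8192, 5120) -- the "Updated" shape above

# unfuse_qkv_gemm declares the unfused Q GEMM parameter with the unpacked
# float16 shape (8192, 8192), so assigning the packed tensor trips the
# shape assertion in parameter.py.
```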
Auto parallel does not support working with quantization right now. We will add an assertion to make the error message clearer.
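For reference, such a guard might look like the sketch below. This is hypothetical, not the patch that landed on main; names like `check_auto_parallel_compat`, `auto_parallel_config.world_size`, and `quantization.quant_mode` are assumptions based on the 0.9 Python API:

```python
# Hypothetical sketch of the clearer assertion -- not the actual fix.
# `build_config.auto_parallel_config.world_size` and
# `model.config.quantization.quant_mode` are assumed attribute names.

def check_auto_parallel_compat(model, build_config):
    use_auto_parallel = build_config.auto_parallel_config.world_size > 1
    quant_mode = model.config.quantization.quant_mode
    if use_auto_parallel and quant_mode.has_any_quant():
        raise RuntimeError(
            "Auto parallel does not support quantized checkpoints yet: "
            "weight packing changes weight shapes, which breaks the "
            "unfused QKV GEMM rewrite. Convert the checkpoint with "
            "explicit --tp_size/--pp_size parallelism instead.")
```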
Clearer error information has been added to the latest main branch; closing.
Please let us know if there are any questions, thanks.