
Unsupported auto parallel + int4 quantization on models

Hudayday opened this issue on May 19, 2024

System Info

TensorRT-LLM release 0.9.0

Who can help?

@Tracin

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

In the 0.9.0 release, I try to apply auto parallel together with int4 weight-only quantization on the Llama 2 70B model; each of them works fine on its own.

I convert the checkpoint with:

python3 ./TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir ./llama-2/llama-2-70b --dtype float16 --output_dir ./model_profile_tmp/ckpt/2 --use_weight_only --weight_only_precision int4

and then build with auto parallel:

trtllm-build --checkpoint_dir ./model_profile_tmp/ckpt/2/ --gemm_plugin float16 --use_custom_all_reduce disable --output_dir ./model_profile_tmp/engine/2/ --workers 8 --max_batch_size 1 --auto_parallel 8 --weight_only_precision int4
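Before building, it can help to confirm the converted checkpoint really carries int4 weight-only quantization. A minimal sketch, assuming the 0.9.0 checkpoint layout where convert_checkpoint.py writes a config.json alongside the weights (the exact field names are my reading of that format, so check your own file):

```python
import json

# Inspect the converted checkpoint's config before running trtllm-build.
# The path matches the --output_dir used in the convert command above.
with open("./model_profile_tmp/ckpt/2/config.json") as f:
    config = json.load(f)

# In the 0.9.0 checkpoint format, int4 weight-only typically shows up as a
# quant_algo value such as "W4A16" (assumption; verify against your file).
print(config.get("dtype"))             # expect "float16"
print(config.get("quantization", {}))  # expect a quant_algo indicating int4
```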

Expected behavior

It should build the engine as float16.

Actual behavior

However, it fails with:

concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 291, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 284, in build_model
    return build(model, build_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 669, in build
    model = optimize_model(model, use_unfused_qkv_gemm=use_auto_parallel)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 890, in optimize_model
    model = unfuse_qkv_gemm(model)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 767, in unfuse_qkv_gemm
    gemm.weight.value = weight
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 120, in value
    assert v.shape == self._shape,
AssertionError: The value updated is not the same shape as the original. Updated: (8192, 5120), original: (8192, 8192)
"""

Additional notes

It looks like auto parallel cannot deal with quantization modes that change weight tensor dimensions in TRT (see the shape arithmetic sketched below).
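For reference, the mismatched shapes in the assertion line up with int4 packing on the fused QKV GEMM. A back-of-the-envelope sketch using Llama 2 70B dimensions (my own illustration, not TensorRT-LLM code):

```python
# Llama 2 70B attention dimensions.
hidden = 8192                    # hidden_size
q_out = 64 * 128                 # 64 query heads x head_dim 128 = 8192
kv_out = 8 * 128                 # 8 KV heads (GQA) x head_dim 128 = 1024
fused_out = q_out + 2 * kv_out   # fused QKV output dim = 10240

# int4 weight-only storage packs two 4-bit weights per int8 byte,
# halving one dimension of the stored tensor.
packed_fused = (hidden, fused_out // 2)  # (8192, 5120) -> the "Updated" shape
unquantized_q = (hidden, q_out)          # (8192, 8192) -> the "original" shape

# unfuse_qkv_gemm splits the fused weight assuming unpacked float16 shapes,
# so the packed int4 tensor no longer matches what the unfused Q/K/V
# parameters expect, hence the AssertionError above.
print(packed_fused, unquantized_q)
```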

Hudayday (May 19, 2024)

Auto parallel does not support working with quantization right now. We will add an assertion to make the error message clearer.
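A hypothetical sketch of the kind of guard this describes (the function and attribute names are assumptions for illustration, not the actual patch):

```python
def check_auto_parallel_build(build_config, model_config):
    # Hypothetical guard: reject engine builds that combine auto parallel
    # with quantization, since unfuse_qkv_gemm assumes unquantized weight
    # shapes. Attribute names below are illustrative assumptions.
    uses_auto_parallel = build_config.auto_parallel_config.world_size > 1
    if uses_auto_parallel and model_config.quant_mode.has_any_quant():
        raise AssertionError(
            "auto parallel does not support quantized models yet; "
            "build without --auto_parallel or without quantization")
```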

yuxianq (May 21, 2024)

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 15 days.

github-actions[bot] (Jun 23, 2024)

A clearer error message has been added to the latest main branch; closing.

Please let us know if there are any questions, thanks.

kaiyux (Jun 25, 2024)