TypeError: weight_only_quantize() got an unexpected keyword argument 'group_size'
System Info
Using 1 A100 GPU with nvidia-docker.
Slightly modified Dockerfile:
```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04
RUN apt update \
    && apt upgrade -y
RUN apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev
RUN apt install -y git
RUN pip3 install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com
RUN apt install -y git-lfs \
    && apt install -y zsh \
    && apt install -y wget
RUN sh -c "$(wget https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh -O -)"
RUN pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
RUN pip install pynvml
WORKDIR /workspace
```
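I build and run the image roughly like this (the tag name and mount path are my own choices, not from any official docs):
```bash
# Tag name and mount path are placeholders; adjust to your setup.
docker build -t trtllm-dev .
docker run --rm -it --gpus all -v "$(pwd)":/workspace trtllm-dev bash
```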
Who can help?
@trac
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- Build and run the Docker image.
- Navigate to the `examples/llama` folder.
- Try building a model.
For example:
```bash
python3 build.py \
    --model_dir meta-llama/Llama-2-7b-chat-hf \
    --dtype bfloat16 \
    --use_gpt_attention_plugin bfloat16 \
    --use_gemm_plugin bfloat16 \
    --remove_input_padding \
    --use_inflight_batching \
    --paged_kv_cache \
    --output_dir llama7b_tensorrt_bfloat16_int4awq \
    --use_weight_only \
    --weight_only_precision int4_awq \
    --max_batch_size 8 \
    --enable_context_fmha \
    --gpus_per_node 1 \
    --max_output_len 2048 \
    --parallel_build
```
It will fail with:
```
Traceback (most recent call last):
  File "/workspace/TensorRT-LLM/examples/llama/build.py", line 906, in <module>
    build(0, args)
  File "/workspace/TensorRT-LLM/examples/llama/build.py", line 850, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/workspace/TensorRT-LLM/examples/llama/build.py", line 661, in build_rank_engine
    tensorrt_llm_llama = quantize_model(tensorrt_llm_llama, args.quant_mode,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/quantized/quant.py", line 346, in quantize_model
    model = weight_only_quantize(model, quant_mode, **kwargs)
TypeError: weight_only_quantize() got an unexpected keyword argument 'group_size'
```
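Judging from the last two frames, `quantize_model` forwards all of its `**kwargs` into `weight_only_quantize`, which has no `group_size` parameter. A minimal sketch of the same failure pattern (illustrative only, not the actual TensorRT-LLM code):
```python
# Illustrative only: shows how unconditionally forwarding **kwargs
# reproduces the TypeError above.
def weight_only_quantize(model, quant_mode):  # accepts no group_size
    return model

def quantize_model(model, quant_mode, **kwargs):
    # group_size is presumably meant for the AWQ/GPTQ path, but it is
    # forwarded here regardless of the quantization mode
    return weight_only_quantize(model, quant_mode, **kwargs)

quantize_model("llama", "int4_awq", group_size=128)
# TypeError: weight_only_quantize() got an unexpected keyword argument 'group_size'
```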
Expected behavior
I am not sure whether build.py expects an already-quantized GPTQ/AWQ model as input, or whether it performs the quantization itself (I believe it is the latter).
Either way, it should either emit a clearer warning (in the first case) or produce a quantized engine.
Actual behavior
It errors out.
Additional notes
I tried other quantization modes, such as AWQ, as well; same issue.
In case the issue is related to the PyTorch changes in the Docker image: I had to pin those versions to work around another issue with TensorRT-LLM.
SmoothQuant seems broken as well.
Our latest main branch doesn't contain build.py under the examples/llama path. Are you using a legacy version of the code base? Please refer to the new workflow documentation for details on our latest code.
I am using v0.7.1, the latest tag.
Please try the main branch if possible, since our upcoming release will also use the new build workflow.
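The new workflow splits into checkpoint conversion plus `trtllm-build`; a rough sketch (flag names as in the current examples/llama docs, so double-check against your checkout):
```bash
# Sketch of the new two-step workflow; verify flags against the docs
# for your exact version.
python3 convert_checkpoint.py --model_dir meta-llama/Llama-2-7b-chat-hf \
                              --dtype bfloat16 \
                              --output_dir ./tllm_checkpoint
trtllm-build --checkpoint_dir ./tllm_checkpoint \
             --gemm_plugin bfloat16 \
             --output_dir ./llama-7b-engine
```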
I am using this software together with tensorrtllm_backend.
I forget which of the two projects was having issues, but I was unable to build the Docker image at the time.
I will try again for the quantized models; bfloat16 seems to be working fine.
@nv-guomingz correct me if I am wrong, but tensorrtllm_backend is currently only compatible with TensorRT-LLM v0.7.1?
@mallorbc I got TensorRT-LLM v0.7.1 working with tensorrtllm_backend v0.7.2 using this docker run command:
```bash
docker run --rm -it -p 0.0.0.0:8000:8000 --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v $(pwd)/all_models:/all_models \
    -v $(pwd)/scripts:/opt/scripts \
    -v ${HOME}/.cache/huggingface/:/root/.cache/huggingface/ \
    nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3 bash
```
The 24.01 part of the tag is important; the reason is that the main version of TensorRT-LLM is not compatible with the backend.
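To double-check which TensorRT-LLM version a container actually ships, a quick sanity check (my suggestion, not from the thread):
```bash
# Run inside the container: print the installed tensorrt_llm version
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```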