TypeError: weight_only_quantize() got an unexpected keyword argument 'group_size'
System Info
Using 1 A100 GPU with nvidia-docker.
Slightly modified Dockerfile:
```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04
RUN apt update \
    && apt upgrade -y
RUN apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev
RUN apt install -y git
RUN pip3 install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com
RUN apt install -y git-lfs \
    && apt install -y zsh \
    && apt install -y wget
RUN sh -c "$(wget https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh -O -)"
RUN pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
RUN pip install pynvml
WORKDIR /workspace
```
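I build and run the image roughly like this (the tag name and mount path are my own choices, not from any official docs):
```bash
# Tag name and mount path are placeholders; adjust to your setup.
docker build -t trtllm-dev .
docker run --rm -it --gpus all -v "$(pwd)":/workspace trtllm-dev bash
```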
Who can help?
@trac
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- Build and run the Docker image.
- Navigate to the `examples/llama` folder.
- Try building a model.
For example:
```bash
python3 build.py \
    --model_dir meta-llama/Llama-2-7b-chat-hf \
    --dtype bfloat16 \
    --use_gpt_attention_plugin bfloat16 \
    --use_gemm_plugin bfloat16 \
    --remove_input_padding \
    --use_inflight_batching \
    --paged_kv_cache \
    --output_dir llama7b_tensorrt_bfloat16_int4awq \
    --use_weight_only \
    --weight_only_precision int4_awq \
    --max_batch_size 8 \
    --enable_context_fmha \
    --gpus_per_node 1 \
    --max_output_len 2048 \
    --parallel_build
```
It will fail with:
```
Traceback (most recent call last):
  File "/workspace/TensorRT-LLM/examples/llama/build.py", line 906, in <module>
    build(0, args)
  File "/workspace/TensorRT-LLM/examples/llama/build.py", line 850, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/workspace/TensorRT-LLM/examples/llama/build.py", line 661, in build_rank_engine
    tensorrt_llm_llama = quantize_model(tensorrt_llm_llama, args.quant_mode,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/quantized/quant.py", line 346, in quantize_model
    model = weight_only_quantize(model, quant_mode, **kwargs)
TypeError: weight_only_quantize() got an unexpected keyword argument 'group_size'
```
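Judging from the last two frames, `quantize_model` forwards all of its `**kwargs` into `weight_only_quantize`, which has no `group_size` parameter. A minimal sketch of the same failure pattern (illustrative only, not the actual TensorRT-LLM code):
```python
# Illustrative only: shows how unconditionally forwarding **kwargs
# reproduces the TypeError above.
def weight_only_quantize(model, quant_mode):  # accepts no group_size
    return model

def quantize_model(model, quant_mode, **kwargs):
    # group_size is presumably meant for the AWQ/GPTQ path, but it is
    # forwarded here regardless of the quantization mode
    return weight_only_quantize(model, quant_mode, **kwargs)

quantize_model("llama", "int4_awq", group_size=128)
# TypeError: weight_only_quantize() got an unexpected keyword argument 'group_size'
```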
Expected behavior
I am not sure whether build.py expects an already-quantized GPTQ/AWQ model as input, or whether it performs the quantization itself (I believe it is the latter).
Either way, it should either emit a clearer warning (in the first case) or produce a quantized engine.
Actual behavior
It errors out.
Additional notes
I tried other quantization modes, such as AWQ, as well; same issue.
In case the issue is related to the PyTorch changes in the Docker image: I had to pin those versions to work around another issue with TensorRT-LLM.
SmoothQuant seems broken as well.
Our latest main branch doesn't contain build.py under the examples/llama path. Are you using a legacy version of the code base? Please refer to the new workflow documentation for details on our latest code.
I am using v0.7.1, the latest tag.
Please try the main branch if possible, since our upcoming release will also use the new build workflow.
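The new workflow splits into checkpoint conversion plus `trtllm-build`; a rough sketch (flag names as in the current examples/llama docs, so double-check against your checkout):
```bash
# Sketch of the new two-step workflow; verify flags against the docs
# for your exact version.
python3 convert_checkpoint.py --model_dir meta-llama/Llama-2-7b-chat-hf \
                              --dtype bfloat16 \
                              --output_dir ./tllm_checkpoint
trtllm-build --checkpoint_dir ./tllm_checkpoint \
             --gemm_plugin bfloat16 \
             --output_dir ./llama-7b-engine
```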
I am using this software together with tensorrtllm_backend.
I forget which of the two projects was having issues, but I was unable to build the Docker image at the time.
I will try again for the quantized models; bfloat16 seems to be working fine.
@nv-guomingz correct me if I am wrong, but tensorrtllm_backend is currently only compatible with TensorRT-LLM v0.7.1?
@mallorbc I got TensorRT-LLM v0.7.1 working with tensorrtllm_backend v0.7.2 using this docker run command:
```bash
docker run --rm -it -p 0.0.0.0:8000:8000 --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v $(pwd)/all_models:/all_models \
    -v $(pwd)/scripts:/opt/scripts \
    -v ${HOME}/.cache/huggingface/:/root/.cache/huggingface/ \
    nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3 bash
```
The 24.01 part of the tag is important; the reason is that the main version of TensorRT-LLM is not compatible with the backend.
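To double-check which TensorRT-LLM version a container actually ships, a quick sanity check (my suggestion, not from the thread):
```bash
# Run inside the container: print the installed tensorrt_llm version
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```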