TensorRT-LLM
ammo_cuda_ext and ammo_cuda_ext_fp8 build failed
System Info
- CPU: Intel(R) Xeon(R) Platinum 8369B
- GPU: a single NVIDIA A10
- Driver Version: 550.54.14
- CUDA Version: 12.4
- NVCC Version: 12.1.105
- TensorRT-LLM Version: 0.9.0.dev2024022700
- nvidia-ammo Version: 0.7.4
Who can help?
@Tracin
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
python ../quantization/quantize.py \
--model_dir ~/.cache/modelscope/hub/ZhipuAI/chatglm3-6b/ \
--dtype float16 \
--qformat int4_awq \
--output_dir trt_ckpt/chatglm3_6b/int4_awq/1-gpu
Expected behavior
ammo_cuda_ext and ammo_cuda_ext_fp8 build successfully, and quantization runs on the GPU.
actual behavior
Loading extension ammo_cuda_ext...
[NeMo W 2024-03-24 23:19:07 nemo_logging:349] /home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/ammo/torch/utils/cpp_extension.py:57: UserWarning: Error building extension 'ammo_cuda_ext': [1/2] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=ammo_cuda_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/torch/include -isystem /home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/torch/include/TH -isystem /home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/cac/miniconda3/envs/trtllm/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++17 -c /home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/ammo/torch/quantization/src/tensor_quant_gpu.cu -o tensor_quant_gpu.cuda.o
FAILED: tensor_quant_gpu.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=ammo_cuda_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/torch/include -isystem /home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/torch/include/TH -isystem /home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/cac/miniconda3/envs/trtllm/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++17 -c /home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/ammo/torch/quantization/src/tensor_quant_gpu.cu -o tensor_quant_gpu.cuda.o
/home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/ammo/torch/quantization/src/tensor_quant_gpu.cu:23: warning: "AT_DISPATCH_CASE_FLOATING_TYPES" redefined
23 | #define AT_DISPATCH_CASE_FLOATING_TYPES(...) \
|
In file included from /home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/torch/include/ATen/ATen.h:11,
from /home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/ammo/torch/quantization/src/tensor_quant_gpu.cu:13:
/home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/torch/include/ATen/Dispatch.h:232: note: this is the location of the previous definition
232 | #define AT_DISPATCH_CASE_FLOATING_TYPES(...) \
|
/home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/ammo/torch/quantization/src/tensor_quant_gpu.cu:23: warning: "AT_DISPATCH_CASE_FLOATING_TYPES" redefined
23 | #define AT_DISPATCH_CASE_FLOATING_TYPES(...) \
|
In file included from /home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/torch/include/ATen/ATen.h:11,
from /home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/ammo/torch/quantization/src/tensor_quant_gpu.cu:13:
/home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/torch/include/ATen/Dispatch.h:232: note: this is the location of the previous definition
232 | #define AT_DISPATCH_CASE_FLOATING_TYPES(...) \
|
/home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/torch/include/pybind11/cast.h: In function ‘typename pybind11::detail::type_caster<typename pybind11::detail::intrinsic_type<T>::type>::cast_op_type<T> pybind11::detail::cast_op(make_caster<T>&)’:
/home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/torch/include/pybind11/cast.h:45:120: error: expected template-name before ‘<’ token
45 | return caster.operator typename make_caster<T>::template cast_op_type<T>();
| ^
/home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/torch/include/pybind11/cast.h:45:120: error: expected identifier before ‘<’ token
/home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/torch/include/pybind11/cast.h:45:123: error: expected primary-expression before ‘>’ token
45 | return caster.operator typename make_caster<T>::template cast_op_type<T>();
| ^
/home/cac/miniconda3/envs/trtllm/lib/python3.10/site-packages/torch/include/pybind11/cast.h:45:126: error: expected primary-expression before ‘)’ token
45 | return caster.operator typename make_caster<T>::template cast_op_type<T>();
| ^
ninja: build stopped: subcommand failed.
Unable to load extension ammo_cuda_ext and falling back to CPU version.
warnings.warn(f"{e}\nUnable to load extension {name} and falling back to CPU version.")
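The last two lines come from a load-with-fallback pattern in ammo's `cpp_extension.py`: the build error is caught, surfaced as a warning, and a CPU implementation is used instead, which is why quantization still runs but slowly. A minimal sketch of that pattern (the loader below is hypothetical, not AMMO's actual code):

```python
import warnings

def load_extension(name, loader):
    """Try to build/load a compiled extension; return None (CPU fallback) on failure."""
    try:
        return loader()
    except Exception as e:
        warnings.warn(f"{e}\nUnable to load extension {name} and falling back to CPU version.")
        return None

def failing_loader():
    # Mimics the nvcc build failure shown in the log above.
    raise RuntimeError("Error building extension 'ammo_cuda_ext'")

ext = load_extension("ammo_cuda_ext", failing_loader)
print("CPU fallback" if ext is None else "CUDA extension loaded")  # prints "CPU fallback"
```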
additional notes
Is there a conflict between the software versions listed above?
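One thing worth checking: the system info reports driver-level CUDA 12.4 but an NVCC 12.1.105 toolkit, and the pybind11 `cast.h` errors above are the kind that can appear when the host compiler (gcc) and the CUDA toolkit are incompatible. A small self-contained sketch that just compares the major.minor components of the reported version strings (values hard-coded from the system info above, not queried live):

```python
def major_minor(version: str) -> tuple[int, int]:
    """Extract the (major, minor) components of a dotted version string."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)

# Versions copied from the "System Info" section above.
driver_cuda = "12.4"    # reported by nvidia-smi
nvcc_cuda = "12.1.105"  # reported by nvcc --version

if major_minor(driver_cuda) != major_minor(nvcc_cuda):
    print(f"possible mismatch: nvcc {nvcc_cuda} vs driver CUDA {driver_cuda}")
```

A minor version skew like this is often harmless on its own, but it is a reasonable first thing to rule out alongside the gcc version when nvcc fails inside pybind11 headers.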
Hi, did you resolve the issue?
No, I am not working on it anymore.