
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: 'NoneType' object is not iterable

Whisht opened this issue 1 year ago • 6 comments

System Info

  • CPU architecture: X86
  • CPU/Host memory size: 500GB
  • GPU properties:
    • GPU name: NVIDIA V100
    • GPU memory size: 32GB * 2
  • Libraries:
    • TensorRT-LLM branch: tag 0.16.0
    • CUDA version: 12.1
    • Container used: nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3
    • NVIDIA driver version: 530.30.02
  • OS: Ubuntu 24.04.1 LTS

Who can help?

@Tracin @kaiyux @byshiue

Information

  • [x] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

  1. Pull the Triton server container nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3.
  2. Download the model weights Qwen2.5-32B-Instruct-GPTQ-Int4.
  3. Run convert_checkpoint.py:

python3 convert_checkpoint.py --model_dir /root/models/Qwen2.5-32B-Instruct-GPTQ-Int4/ \
                              --output_dir /root/checkpoint/qwen2.5 \
                              --dtype float16 \
                              --use_weight_only \
                              --weight_only_precision int4_gptq \
                              --per_group \
                              --tp_size 2

Expected behavior

Conversion succeeds.

Actual behavior

[TensorRT-LLM] TensorRT-LLM version: 0.16.0
0.16.0
5it [00:01,  4.17it/s]
Traceback (most recent call last):
  File "/root/tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 335, in <module>
    main()
  File "/root/tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 327, in main
    convert_and_save_hf(args)
  File "/root/tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 283, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/root/tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 290, in execute
    f(args, rank)
  File "/root/tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 275, in convert_and_save_rank
    qwen = QWenForCausalLM.from_hugging_face(model_dir,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/qwen/model.py", line 438, in from_hugging_face
    loader.generate_tllm_weights(model, arg_dict)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/model_weights_loader.py", line 391, in generate_tllm_weights
    self.load(tllm_key,
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/model_weights_loader.py", line 305, in load
    v = sub_module.postprocess(tllm_key, v, **postprocess_kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/quantization/layers.py", line 1067, in postprocess
    return postprocess_weight_only_groupwise(tllm_key, weights, torch_dtype,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/quantization/functional.py", line 978, in postprocess_weight_only_groupwise
    qweight = torch.ops.trtllm.preprocess_weights_for_mixed_gemm(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1120, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Unsupported Arch (/workspace/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:138)
1       0x7fcb8d7687df tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 95
2       0x7fcd6862b771 tensorrt_llm::kernels::cutlass_kernels::getLayoutDetailsForTransform(tensorrt_llm::kernels::cutlass_kernels::QuantType, int) + 321
3       0x7fcd6862bc60 tensorrt_llm::kernels::cutlass_kernels::preprocess_weights_for_mixed_gemm(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, tensorrt_llm::kernels::cutlass_kernels::QuantType, bool) + 240
4       0x7fcd68622964 torch_ext::preprocess_weights_for_mixed_gemm(at::Tensor, c10::ScalarType, c10::ScalarType) + 644
5       0x7fcd686259d3 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor, c10::ScalarType, c10::ScalarType), at::Tensor, c10::guts::typelist::typelist<at::Tensor, c10::ScalarType, c10::ScalarType> >, true>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) + 147
6       0x7fce3c790677 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(+0x5312677) [0x7fce3c790677]
7       0x7fce447fcb7b torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, pybind11::args const&, pybind11::kwargs const&, std::optional<c10::DispatchKey>) + 251
8       0x7fce447fcfcd torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args const&, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>) + 557
9       0x7fce446e0847 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x8da847) [0x7fce446e0847]
10      0x7fce442773d5 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x4713d5) [0x7fce442773d5]
11            0x5820ff python3() [0x5820ff]
12            0x54b07c PyObject_Call + 156
13            0x5db68a _PyEval_EvalFrameDefault + 19514
14            0x54a712 _PyObject_Call_Prepend + 194
15            0x5a3698 python3() [0x5a3698]
16            0x548ec5 _PyObject_MakeTpCall + 117
17            0x5d74d9 _PyEval_EvalFrameDefault + 2697
18            0x54cae4 python3() [0x54cae4]
19            0x54b0f9 PyObject_Call + 281
20            0x5db68a _PyEval_EvalFrameDefault + 19514
21            0x54cae4 python3() [0x54cae4]
22            0x54b0f9 PyObject_Call + 281
23            0x5db68a _PyEval_EvalFrameDefault + 19514
24            0x5d59fb PyEval_EvalCode + 347
25            0x608b52 python3() [0x608b52]
26            0x6b4d83 python3() [0x6b4d83]
27            0x6b4aea _PyRun_SimpleFileObject + 426
28            0x6b491f _PyRun_AnyFileObject + 79
29            0x6bc985 Py_RunMain + 949
30            0x6bc46d Py_BytesMain + 45
31      0x7fce523c91ca /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7fce523c91ca]
32      0x7fce523c928b __libc_start_main + 139
33            0x657c15 _start + 37
Exception ignored in: <function PretrainedModel.__del__ at 0x7fcb39f23b00>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/modeling_utils.py", line 607, in __del__
    self.release()
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/modeling_utils.py", line 604, in release
    release_gc()
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_utils.py", line 533, in release_gc
    torch.cuda.ipc_collect()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 968, in ipc_collect
    _lazy_init()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 338, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: 'NoneType' object is not iterable

CUDA call was originally invoked at:

  File "/root/tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 7, in <module>
    from transformers import AutoConfig
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/usr/local/lib/python3.12/dist-packages/transformers/__init__.py", line 26, in <module>
    from . import dependency_versions_check
  File "<frozen importlib._bootstrap>", line 1415, in _handle_fromlist
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/usr/local/lib/python3.12/dist-packages/transformers/dependency_versions_check.py", line 16, in <module>
    from .utils.versions import require_version, require_version_core
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load

additional notes

No.

Whisht avatar Jan 03 '25 07:01 Whisht

Volta has not been supported since TensorRT-LLM 0.14, which is why the cutlass weight preprocessor asserts with "Unsupported Arch" on your V100.
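
A quick way to confirm which architecture you are on (a minimal sketch, assuming PyTorch, which the container already ships; V100 reports compute capability 7.0, i.e. Volta):

import torch

# Report the compute capability of GPU 0. Volta is SM 7.0; TensorRT-LLM
# releases from 0.14 on no longer support it.
major, minor = torch.cuda.get_device_capability(0)
print(f"SM {major}.{minor}")
if (major, minor) <= (7, 0):
    print("Volta or older: not supported by TensorRT-LLM >= 0.14")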

nv-guomingz avatar Jan 07 '25 14:01 nv-guomingz

Thanks for your reply. Is there any other way I can run Qwen2.5 with TensorRT-LLM on a V100 GPU?

Whisht avatar Jan 08 '25 09:01 Whisht

@nv-guomingz Hi, I hit the same issue with this command on an RTX 4090:

python3 /tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py --model_dir /root/7b \
                              --output_dir /root/converted/7b/f16/1gpu-int4_gptq \
                              --dtype auto \
                              --use_weight_only \
                              --tp_size 1 \
                              --weight_only_precision int4_gptq
Traceback (most recent call last):
  File "/tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 335, in <module>
    main()
  File "/tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 327, in main
    convert_and_save_hf(args)
  File "/tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 283, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 290, in execute
    f(args, rank)
  File "/tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 275, in convert_and_save_rank
    qwen = QWenForCausalLM.from_hugging_face(model_dir,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/qwen/model.py", line 438, in from_hugging_face
    loader.generate_tllm_weights(model, arg_dict)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/model_weights_loader.py", line 391, in generate_tllm_weights
    self.load(tllm_key,
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/model_weights_loader.py", line 305, in load
    v = sub_module.postprocess(tllm_key, v, **postprocess_kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/quantization/layers.py", line 1067, in postprocess
    return postprocess_weight_only_groupwise(tllm_key, weights, torch_dtype,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/quantization/functional.py", line 937, in postprocess_weight_only_groupwise
    torch.cat(weights[i::len(weights) // 3], dim=1)
TypeError: expected Tensor as element 0 in argument 0, but got NoneType
Exception ignored in: <function PretrainedModel.__del__ at 0x7f24e388fd80>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/modeling_utils.py", line 607, in __del__
    self.release()
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/modeling_utils.py", line 604, in release
    release_gc()
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_utils.py", line 533, in release_gc
    torch.cuda.ipc_collect()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 968, in ipc_collect
    _lazy_init()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 338, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: 'NoneType' object is not iterable

CUDA call was originally invoked at:

  File "/tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 7, in <module>
    from transformers import AutoConfig
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed

Justin-12138 avatar Jan 09 '25 15:01 Justin-12138

@Justin-12138 Hi, did you solve that? I also encounter that issue with RTX4090 =(

DancingKitty avatar Mar 25 '25 01:03 DancingKitty

@DancingKitty Maybe you can try a command like this:

python3 /tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py  --model_dir /root/7b \
                              --output_dir /root/converted/7b/f16/1gpu \
                              --dtype float16 \
                              --use_weight_only \
                              --tp_size 1 \
                              --weight_only_precision int4
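
My reading of the traceback (an assumption, not confirmed in this thread): --weight_only_precision int4_gptq expects a checkpoint that already contains GPTQ tensors (qweight/qzeros/scales), so when those keys are missing the loader passes None into postprocess_weight_only_groupwise, producing the TypeError. Plain int4 instead quantizes the float16 weights during conversion. A rough sketch to check which kind of checkpoint you have (the shard file name below is an example; adjust it to your model directory):

from safetensors import safe_open

# GPTQ-format checkpoints store packed int4 weights under keys ending in
# "qweight" / "qzeros" / "scales"; plain FP16/BF16 checkpoints do not.
with safe_open("/root/7b/model.safetensors", framework="pt") as f:  # example path
    keys = list(f.keys())

if any(k.endswith("qweight") for k in keys):
    print("GPTQ checkpoint: int4_gptq should apply")
else:
    print("Plain checkpoint: use --weight_only_precision int4")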

Justin-12138 avatar Mar 25 '25 01:03 Justin-12138

@Justin-12138 Thanks for the quick reply! I converted it successfully after setting the environment variables from PR2658 and upgrading NumPy. Sincere thanks!

DancingKitty avatar Mar 25 '25 03:03 DancingKitty