Failed to build w4a8_awq on Llama-13B
System Info
ubuntu 20.04
tensorrt 10.0.1
tensorrt-cu12 10.0.1
tensorrt-cu12-bindings 10.0.1
tensorrt-cu12-libs 10.0.1
tensorrt-llm 0.11.0.dev2024052100
nvidia A100
Who can help?
@Tracin @byshiue
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Quantize using w4a8_awq:
python examples/quantization/quantize.py --model_dir /target/model/v19/hb_v19 \
--dtype float16 \
--qformat w4a8_awq \
--awq_block_size 64 \
--output_dir /target/model/quantized_w4a8-awq \
--calib_size 32
Build the engine:
trtllm-build --checkpoint_dir /target/model/quantized_w4a8-awq \
--output_dir /target/model/trt_engines/w4a8_AWQ/1-gpu/ \
--gemm_plugin float16 \
--gpt_attention_plugin float16 \
--context_fmha enable \
--remove_input_padding enable \
--paged_kv_cache enable \
--max_batch_size 50 \
--max_input_len 3000 \
--max_output_len 3000
Expected behavior
trtllm-build should complete successfully.
actual behavior
trtllm-build failed with the following error:
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052100
[06/12/2024-03:24:42] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set gemm_plugin to float16.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set nccl_plugin to float16.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set lookup_plugin to None.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set lora_plugin to None.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set moe_plugin to float16.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set context_fmha to True.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set paged_kv_cache to True.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set remove_input_padding to True.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set multi_block_mode to False.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set enable_xqa to True.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set tokens_per_block to 64.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set multiple_profiles to False.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set paged_state to True.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set streamingllm to False.
[06/12/2024-03:24:42] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len.
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[06/12/2024-03:24:42] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
Traceback (most recent call last):
File "/usr/local/bin/trtllm-build", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 496, in main
parallel_build(source, build_config, args.output_dir, workers,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 377, in parallel_build
passed = build_and_save(rank, rank % workers, ckpt_dir,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 336, in build_and_save
engine = build_model(build_config,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 308, in build_model
model = load_model(rank_config, ckpt_dir, model_cls)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 1150, in load_model
preprocess_weights(weights,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 1013, in preprocess_weights
weights[name] = preprocessor(param.T.contiguous(),
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 755, in __call__
return self._op(*args, **(kwargs or {}))
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: The number of rows must be a multiple of 128 but the number of rows is 13632. (/target/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:528)
1 0x7fe16e5213ea tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7fe3ac2bd06c tensorrt_llm::kernels::cutlass_kernels::interleave_column_major_tensor(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, tensorrt_llm::kernels::cutlass_kernels::QuantType, tensorrt_llm::kernels::cutlass_kernels::LayoutDetails) + 732
3 0x7fe3ac2bd663 tensorrt_llm::kernels::cutlass_kernels::preprocess_weights_for_mixed_gemm(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, tensorrt_llm::kernels::cutlass_kernels::QuantType, bool) + 963
4 0x7fe3ac29f271 torch_ext::preprocess_weights_for_mixed_gemm(at::Tensor, c10::ScalarType, c10::ScalarType) + 609
5 0x7fe3ac2a5319 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor, c10::ScalarType, c10::ScalarType), at::Tensor, c10::guts::typelist::typelist<at::Tensor, c10::ScalarType, c10::ScalarType> >, true>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) + 137
6 0x7fe22958a818 c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const + 568
7 0x7fe22931b4f3 torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, pybind11::args, pybind11::kwargs const&, std::optional<c10::DispatchKey>) + 451
8 0x7fe22931bd41 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>) + 1329
9 0x7fe2291ff833 /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x848833) [0x7fe2291ff833]
10 0x7fe228dcaea4 /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x413ea4) [0x7fe228dcaea4]
11 0x560e5abd3e0e /usr/bin/python(+0x15fe0e) [0x560e5abd3e0e]
12 0x560e5abe312b PyObject_Call + 187
13 0x560e5abbf2c1 _PyEval_EvalFrameDefault + 11121
14 0x560e5abc9784 _PyObject_FastCallDictTstate + 196
15 0x560e5abdf54c _PyObject_Call_Prepend + 92
16 0x560e5acf81e0 /usr/bin/python(+0x2841e0) [0x560e5acf81e0]
17 0x560e5abca5eb _PyObject_MakeTpCall + 603
18 0x560e5abc2c66 _PyEval_EvalFrameDefault + 25878
19 0x560e5abd470c _PyFunction_Vectorcall + 124
20 0x560e5abbe0d1 _PyEval_EvalFrameDefault + 6529
21 0x560e5abd470c _PyFunction_Vectorcall + 124
22 0x560e5abbce0d _PyEval_EvalFrameDefault + 1725
23 0x560e5abd470c _PyFunction_Vectorcall + 124
24 0x560e5abe3192 PyObject_Call + 290
25 0x560e5abbf2c1 _PyEval_EvalFrameDefault + 11121
26 0x560e5abd470c _PyFunction_Vectorcall + 124
27 0x560e5abe3192 PyObject_Call + 290
28 0x560e5abbf2c1 _PyEval_EvalFrameDefault + 11121
29 0x560e5abd470c _PyFunction_Vectorcall + 124
30 0x560e5abe3192 PyObject_Call + 290
31 0x560e5abbf2c1 _PyEval_EvalFrameDefault + 11121
32 0x560e5abd470c _PyFunction_Vectorcall + 124
33 0x560e5abbce0d _PyEval_EvalFrameDefault + 1725
34 0x560e5acade56 /usr/bin/python(+0x239e56) [0x560e5acade56]
35 0x560e5acadcf6 PyEval_EvalCode + 134
36 0x560e5acd87d8 /usr/bin/python(+0x2647d8) [0x560e5acd87d8]
37 0x560e5acd20bb /usr/bin/python(+0x25e0bb) [0x560e5acd20bb]
38 0x560e5acd8525 /usr/bin/python(+0x264525) [0x560e5acd8525]
39 0x560e5acd7a08 _PyRun_SimpleFileObject + 424
40 0x560e5acd7653 _PyRun_AnyFileObject + 67
41 0x560e5acca41e Py_RunMain + 702
42 0x560e5aca0cad Py_BytesMain + 45
43 0x7fe3b15f3d90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fe3b15f3d90]
44 0x7fe3b15f3e40 __libc_start_main + 128
45 0x560e5aca0ba5 _start + 37
highlight: RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: The number of rows must be a multiple of 128 but the number of rows is 13632. (/target/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:528)
I think it should be 64 instead of 128, because I set awq_block_size = 64.
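For reference, a quick divisibility check (a minimal sketch in Python; the interpretation of the 128 requirement is my reading of the assertion message, not something stated in the build log itself):

intermediate_size = 13632
awq_block_size = 64

print(intermediate_size % awq_block_size)  # 0  -> compatible with group_size = 64
print(intermediate_size % 128)             # 64 -> trips the "multiple of 128" assertion above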
additional notes
My customized model is based on Llama2-13B with intermediate_size = 13632, so I set awq_block_size = 64. Model config:
{
"producer": {
"name": "modelopt",
"version": "0.11.2"
},
"architecture": "LlamaForCausalLM",
"dtype": "float16",
"num_hidden_layers": 40,
"num_attention_heads": 40,
"num_key_value_heads": 8,
"hidden_size": 5120,
"norm_epsilon": 1e-06,
"vocab_size": 50432,
"max_position_embeddings": 4096,
"hidden_act": "silu",
"use_parallel_embedding": true,
"embedding_sharding_dim": 0,
"quantization": {
"quant_algo": "W4A8_AWQ",
"kv_cache_quant_algo": null,
"group_size": 64,
"has_zero_point": false,
"pre_quant_scale": true,
"exclude_modules": [
"lm_head"
]
},
"mapping": {
"world_size": 1,
"tp_size": 1,
"pp_size": 1
},
"head_size": 128,
"intermediate_size": 13632,
"position_embedding_type": "rope_gpt_neox",
"share_embedding_table": false,
"residual_mlp": false,
"bias": true,
"rotary_pct": 1.0,
"rank": 0,
"decoder": "llama",
"rmsnorm": false,
"lm_head_bias": false,
"rotary_base": 10000.0
}
quant config:
{
  'quant_cfg': {
    '*weight_quantizer': [
      {'num_bits': 4, 'block_sizes': {-1: 64, 'type': 'static'}, 'enable': True},
      {'num_bits': (4, 3), 'axis': None, 'enable': True}
    ],
    '*input_quantizer': {'num_bits': (4, 3), 'axis': -1, 'enable': True},
    '*lm_head*': {'enable': False},
    '*block_sparse_moe.gate*': {'enable': False},
    '*output_layer*': {'enable': False},
    'default': {'enable': False}
  },
  'algorithm': 'awq_lite'
}
LlamaDecoderLayer:
LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): QuantLinear(
in_features=5120, out_features=5120, bias=True
(input_quantizer): TensorQuantizer((4, 3) bit fake per-tensor amax=0.6914 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): SequentialQuantizer(
(0): TensorQuantizer(4 bit fake block_sizes={-1: 64, 'type': 'static'}, amax=[0.0176, 0.6953](409600) calibrator=MaxCalibrator quant)
(1): TensorQuantizer((4, 3) bit fake per-tensor amax=0.6953 calibrator=MaxCalibrator quant)
)
)
(k_proj): QuantLinear(
in_features=5120, out_features=1024, bias=True
(input_quantizer): TensorQuantizer((4, 3) bit fake per-tensor amax=0.7852 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): SequentialQuantizer(
(0): TensorQuantizer(4 bit fake block_sizes={-1: 64, 'type': 'static'}, amax=[0.0266, 1.3594](81920) calibrator=MaxCalibrator quant)
(1): TensorQuantizer((4, 3) bit fake per-tensor amax=1.3594 calibrator=MaxCalibrator quant)
)
)
(v_proj): QuantLinear(
in_features=5120, out_features=1024, bias=True
(input_quantizer): TensorQuantizer((4, 3) bit fake per-tensor amax=0.9883 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): SequentialQuantizer(
(0): TensorQuantizer(4 bit fake block_sizes={-1: 64, 'type': 'static'}, amax=[0.0072, 0.1631](81920) calibrator=MaxCalibrator quant)
(1): TensorQuantizer((4, 3) bit fake per-tensor amax=0.1631 calibrator=MaxCalibrator quant)
)
)
(o_proj): QuantLinear(
in_features=5120, out_features=5120, bias=True
(input_quantizer): TensorQuantizer((4, 3) bit fake per-tensor amax=0.8555 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): SequentialQuantizer(
(0): TensorQuantizer(4 bit fake block_sizes={-1: 64, 'type': 'static'}, amax=[0.0152, 1.4062](409600) calibrator=MaxCalibrator quant)
(1): TensorQuantizer((4, 3) bit fake per-tensor amax=1.4062 calibrator=MaxCalibrator quant)
)
)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): QuantLinear(
in_features=5120, out_features=13632, bias=True
(input_quantizer): TensorQuantizer((4, 3) bit fake per-tensor amax=2.9375 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): SequentialQuantizer(
(0): TensorQuantizer(4 bit fake block_sizes={-1: 64, 'type': 'static'}, amax=[0.0029, 1.2734](1090560) calibrator=MaxCalibrator quant)
(1): TensorQuantizer((4, 3) bit fake per-tensor amax=1.2734 calibrator=MaxCalibrator quant)
)
)
(up_proj): QuantLinear(
in_features=5120, out_features=13632, bias=True
(input_quantizer): TensorQuantizer((4, 3) bit fake per-tensor amax=3.9062 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): SequentialQuantizer(
(0): TensorQuantizer(4 bit fake block_sizes={-1: 64, 'type': 'static'}, amax=[0.0039, 0.7305](1090560) calibrator=MaxCalibrator quant)
(1): TensorQuantizer((4, 3) bit fake per-tensor amax=0.7305 calibrator=MaxCalibrator quant)
)
)
(down_proj): QuantLinear(
in_features=13632, out_features=5120, bias=True
(input_quantizer): TensorQuantizer((4, 3) bit fake per-tensor amax=30.6250 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): SequentialQuantizer(
(0): TensorQuantizer(4 bit fake block_sizes={-1: 64, 'type': 'static'}, amax=[0.0053, 25.8750](1090560) calibrator=MaxCalibrator quant)
(1): TensorQuantizer((4, 3) bit fake per-tensor amax=25.8750 calibrator=MaxCalibrator quant)
)
)
(act_fn): SiLU()
)
(input_layernorm): LayerNorm((5120,), eps=1e-06, elementwise_affine=True)
(post_attention_layernorm): LayerNorm((5120,), eps=1e-06, elementwise_affine=True)
)
Hi @Hongbosherlock, there are several reasons for your case:
- w4a8_awq only supports Ada and Hopper; A100 is not supported.
- w4a8_awq only supports group_size = 128 at the moment.
- The assertion you encountered is related to a CUTLASS requirement rather than awq_block_size, so the intermediate_size you used is not supported for 4-bit quantization.
Thanks for your reply. I will eventually use L40s for w4a8_awq inference.
When I tried v0.9 a month ago, I could successfully quantize my model with AMMO and eventually build both int4_awq and w4a8_awq engines (group_size = 64). Inference with the int4_awq quantized model works fine. However, at that time, inference on Ada with w4a8_awq was not yet supported.
Now the new version supports w4a8_awq on Ada, and the quantization tool is modelopt (renamed, I think). However, using the same quantization configuration (group_size = 64) as before, I can no longer build the w4a8_awq engine.
I am very confused by the changes in the new version. Can you check it for me? Thanks.
@Barry-Delaney I think CUTLASS has supported group_size=64 for w4a8_awq for at least two months, as seen in https://github.com/NVIDIA/cutlass/issues/1332
The assertion came from the preprocessor for interleaving.
- In v0.9.0, the behavior of w4a8_awq on Ada is undefined; an engine that builds successfully is not guaranteed to be correct for inference.
- In v0.10.0, we added Ada support and w4a8_awq is specialized as its own option, hence you run into the restrictions added only for w4a8_awq.
Regarding the update in CUTLASS, supporting group_size=64 also requires several changes in TensorRT-LLM kernels, hence it's not addressed yet.
As for your customized model, at the moment we suggest padding the intermediate_size and changing the group_size; a rough padding sketch is shown below. Also, please feel free to file a feature request if you need this.
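A minimal sketch of how such padding might be done offline in PyTorch before re-running quantization; this is illustrative only, not part of TensorRT-LLM. The helper names, the 128 multiple, and the HuggingFace-style module layout (gate_proj/up_proj/down_proj, as printed above) are assumptions.

import torch
import torch.nn.functional as F

def round_up(x: int, multiple: int = 128) -> int:
    # e.g. 13632 -> 13696
    return ((x + multiple - 1) // multiple) * multiple

@torch.no_grad()
def pad_llama_mlp(mlp, multiple: int = 128):
    """Zero-pad the MLP intermediate dimension so the weight shapes meet the
    row-multiple requirement. Zero gate/up rows yield zero activations
    (SiLU(0) * 0 = 0), so the layer output is unchanged."""
    old = mlp.gate_proj.out_features
    new = round_up(old, multiple)
    pad = new - old
    if pad == 0:
        return

    for proj in (mlp.gate_proj, mlp.up_proj):
        # weight is [out_features, in_features]: append zero output rows
        proj.weight.data = F.pad(proj.weight.data, (0, 0, 0, pad))
        if proj.bias is not None:
            proj.bias.data = F.pad(proj.bias.data, (0, pad))
        proj.out_features = new

    # down_proj weight is [out_features, in_features]: append zero input columns
    mlp.down_proj.weight.data = F.pad(mlp.down_proj.weight.data, (0, pad))
    mlp.down_proj.in_features = new

The checkpoint config's intermediate_size would also need to be updated to the padded value, and it is worth checking that the all-zero weight groups do not upset the AWQ scale calibration; treat this purely as a starting point.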
Now group sizes 64 and 128 are supported for int4_awq, but only group size 128 is supported for w4a8_awq.
Is group size 64 for w4a8_awq on the roadmap?
Afaik, it's not on the roadmap now.
@Barry-Delaney When I try w4a8_awq on Meta Llama2-13B with intermediate_size=13824,
I run the command:
python3 run.py --max_output_len=500 \
--tokenizer_dir=/target/model/llama_13B \
--engine_dir=/target/model/trt_engines/w4a8_AWQ/1-gpu/
but got the following error:
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] Assertion failed: No valid weight only groupwise GEMM tactic(It is usually caused by the failure to execute all candidate configurations of the CUTLASS kernel, please pay attention to the warning information when building the engine.) (/target/cpp/tensorrt_llm/plugins/weightOnlyGroupwiseQuantMatmulPlugin/weightOnlyGroupwiseQuantMatmulPlugin.cpp:452)
1 0x7f4e9cd08073 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x56073) [0x7f4e9cd08073]
2 0x7f4e9cd9f9da tensorrt_llm::plugins::WeightOnlyGroupwiseQuantMatmulPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 1786
3 0x7f4fa6d8cbec /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1060bec) [0x7f4fa6d8cbec]
4 0x7f4fa6d42217 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1016217) [0x7f4fa6d42217]
5 0x7f4fa6d43b79 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1017b79) [0x7f4fa6d43b79]
6 0x7f4ed311af34 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 52
7 0x7f4ed311b496 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) + 502
8 0x7f4ed3128094 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 2164
9 0x7f4ed314c214 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 100
10 0x7f4ed314e49c tensorrt_llm::executor::Executor::Impl::executionLoop() + 380
11 0x7f508a4b0253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f508a4b0253]
12 0x7f51348bfb43 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f51348bfb43]
13 0x7f5134950bb4 clone + 68
[gajl-sys-sys-test11ef7ljc:26732] *** Process received signal ***
[gajl-sys-sys-test11ef7ljc:26732] Signal: Aborted (6)
[gajl-sys-sys-test11ef7ljc:26732] Signal code: (-6)
[gajl-sys-sys-test11ef7ljc:26732] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f513486d520]
[gajl-sys-sys-test11ef7ljc:26732] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f51348c1a7c]
[gajl-sys-sys-test11ef7ljc:26732] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f513486d476]
[gajl-sys-sys-test11ef7ljc:26732] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f51348537f3]
[gajl-sys-sys-test11ef7ljc:26732] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7f508a476b9e]
[gajl-sys-sys-test11ef7ljc:26732] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f508a48220c]
[gajl-sys-sys-test11ef7ljc:26732] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7f508a4811e9]
[gajl-sys-sys-test11ef7ljc:26732] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7f508a481959]
[gajl-sys-sys-test11ef7ljc:26732] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7f5133d7e884]
[gajl-sys-sys-test11ef7ljc:26732] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x311)[0x7f5133d7ef41]
[gajl-sys-sys-test11ef7ljc:26732] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x3b)[0x7f508a4824cb]
[gajl-sys-sys-test11ef7ljc:26732] [11] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x560a1)[0x7f4e9cd080a1]
[gajl-sys-sys-test11ef7ljc:26732] [12] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins36WeightOnlyGroupwiseQuantMatmulPlugin7enqueueEPKN8nvinfer116PluginTensorDescES5_PKPKvPKPvSA_P11CUstream_st+0x6fa)[0x7f4e9cd9f9da]
[gajl-sys-sys-test11ef7ljc:26732] [13] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1060bec)[0x7f4fa6d8cbec]
[gajl-sys-sys-test11ef7ljc:26732] [14] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1016217)[0x7f4fa6d42217]
[gajl-sys-sys-test11ef7ljc:26732] [15] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1017b79)[0x7f4fa6d43b79]
[gajl-sys-sys-test11ef7ljc:26732] [16] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching14executeContextEi+0x34)[0x7f4ed311af34]
[gajl-sys-sys-test11ef7ljc:26732] [17] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching12executeBatchERKNS0_17ScheduledRequestsE+0x1f6)[0x7f4ed311b496]
[gajl-sys-sys-test11ef7ljc:26732] [18] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching12forwardAsyncERKSt4listISt10shared_ptrINS0_10LlmRequestEESaIS5_EE+0x874)[0x7f4ed3128094]
[gajl-sys-sys-test11ef7ljc:26732] [19] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl12forwardAsyncERSt4listISt10shared_ptrINS_13batch_manager10LlmRequestEESaIS7_EE+0x64)[0x7f4ed314c214]
[gajl-sys-sys-test11ef7ljc:26732] [20] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl13executionLoopEv+0x17c)[0x7f4ed314e49c]
[gajl-sys-sys-test11ef7ljc:26732] [21] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x7f508a4b0253]
[gajl-sys-sys-test11ef7ljc:26732] [22] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7f51348bfb43]
[gajl-sys-sys-test11ef7ljc:26732] [23] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44)[0x7f5134950bb4]
[gajl-sys-sys-test11ef7ljc:26732] *** End of error message ***
Aborted (core dumped)
Are you using L40s, w4a8_awq and group_size = 128 for inference? Thx!
Yes, I used the following commands on L40s:
python examples/quantization/quantize.py --model_dir /target/model/llama_13B \
--dtype float16 \
--qformat w4a8_awq \
--awq_block_size 128 \
--output_dir /target/model/quantized_w4a8_awq_13b \
--calib_size 32
trtllm-build --checkpoint_dir /target/model/quantized_w4a8_awq_13b \
--output_dir /target/model/trt_engines/w4a8_AWQ/1-gpu/ \
--gemm_plugin float16 \
--gpt_attention_plugin float16 \
--context_fmha enable \
--remove_input_padding enable \
--paged_kv_cache enable \
--max_batch_size 50 \
--max_input_len 3000 \
--max_output_len 3000
I tried to reproduce with the command you provided, and it works fine on L40S. Here's what I got:
Input [Text 0]: "<s> Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: "chef in Paris before moving to London in 1837. He worked at the French restaurant, La Ville de Paris, in Regent Street, and then at the Café Royal in Regent Street. In 1846 he opened his own restaurant, the French House, in Soho, which was a great success.
In 1851 he was appointed chef at the Reform Club, where he remained for 20 years. He was famous for his elaborate dishes, which were often served in theatrical settings. He was also a pioneer of the use of gas for cooking.
Soyer’s cookbooks were very popular in their day, and he is considered one of the most important chefs of the 19th century.
1 What is the most famous French chef?
2 Who is the most famous chef in the world?
3 Who is the best chef in the world?
4 Who is the best chef in France?
5 Who is the best chef in the world 2022?
6 Who is the best chef in the world 2022 female?
7 Who is the best chef in the world 2022 male?
What is the most famous French chef?
There are many famous French chefs, but the most famous is probably Alain Ducasse. He has restaurants all over the world, and his cooking is considered some of the best in the world.
Who is the most famous chef in the world?
There is no one definitive answer to this question. However, there are a few chefs who have achieved a level of fame that is unmatched by their peers.
One of the most famous chefs in the world is Gordon Ramsay. He is a British chef who has achieved international fame for his fiery temper and his innovative cooking techniques. Ramsay has appeared on a number of television shows, including “Hell’s Kitchen” and “Kitchen Nightmares.”
Another famous chef is Jamie Oliver. He is a British chef who is known for his healthy cooking style. Oliver has appeared on a number of television shows, including “The Naked Chef” and “Jamie’s Food Revolution.”
Finally, there is Wolfgang Puck. He is an Austrian-born chef who is known for his"
Could you please try with the latest main branch and do a clean build?
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
Hi @Hongbosherlock, do you still have any further issues or questions? If not, we'll close it soon.