Failed to build w4a8_awq on Llama-13B
System Info
ubuntu 20.04
tensorrt 10.0.1
tensorrt-cu12 10.0.1
tensorrt-cu12-bindings 10.0.1
tensorrt-cu12-libs 10.0.1
tensorrt-llm 0.11.0.dev2024052100
nvidia A100
Who can help?
@Tracin @byshiue
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Quantize using w4a8_awq:
python examples/quantization/quantize.py --model_dir /target/model/v19/hb_v19 \
--dtype float16 \
--qformat w4a8_awq \
--awq_block_size 64 \
--output_dir /target/model/quantized_w4a8-awq \
--calib_size 32
Build the engine:
trtllm-build --checkpoint_dir /target/model/quantized_w4a8-awq \
--output_dir /target/model/trt_engines/w4a8_AWQ/1-gpu/ \
--gemm_plugin float16 \
--gpt_attention_plugin float16 \
--context_fmha enable \
--remove_input_padding enable \
--paged_kv_cache enable \
--max_batch_size 50 \
--max_input_len 3000 \
--max_output_len 3000
Expected behavior
trtllm-build should complete successfully.
actual behavior
trtllm-build failed with the following error:
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052100
[06/12/2024-03:24:42] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set gemm_plugin to float16.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set nccl_plugin to float16.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set lookup_plugin to None.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set lora_plugin to None.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set moe_plugin to float16.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set context_fmha to True.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set paged_kv_cache to True.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set remove_input_padding to True.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set multi_block_mode to False.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set enable_xqa to True.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set tokens_per_block to 64.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set multiple_profiles to False.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set paged_state to True.
[06/12/2024-03:24:42] [TRT-LLM] [I] Set streamingllm to False.
[06/12/2024-03:24:42] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len.
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[06/12/2024-03:24:42] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
Traceback (most recent call last):
File "/usr/local/bin/trtllm-build", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 496, in main
parallel_build(source, build_config, args.output_dir, workers,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 377, in parallel_build
passed = build_and_save(rank, rank % workers, ckpt_dir,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 336, in build_and_save
engine = build_model(build_config,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 308, in build_model
model = load_model(rank_config, ckpt_dir, model_cls)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 1150, in load_model
preprocess_weights(weights,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 1013, in preprocess_weights
weights[name] = preprocessor(param.T.contiguous(),
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 755, in __call__
return self._op(*args, **(kwargs or {}))
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: The number of rows must be a multiple of 128 but the number of rows is 13632. (/target/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:528)
1 0x7fe16e5213ea tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7fe3ac2bd06c tensorrt_llm::kernels::cutlass_kernels::interleave_column_major_tensor(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, tensorrt_llm::kernels::cutlass_kernels::QuantType, tensorrt_llm::kernels::cutlass_kernels::LayoutDetails) + 732
3 0x7fe3ac2bd663 tensorrt_llm::kernels::cutlass_kernels::preprocess_weights_for_mixed_gemm(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, tensorrt_llm::kernels::cutlass_kernels::QuantType, bool) + 963
4 0x7fe3ac29f271 torch_ext::preprocess_weights_for_mixed_gemm(at::Tensor, c10::ScalarType, c10::ScalarType) + 609
5 0x7fe3ac2a5319 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor, c10::ScalarType, c10::ScalarType), at::Tensor, c10::guts::typelist::typelist<at::Tensor, c10::ScalarType, c10::ScalarType> >, true>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) + 137
6 0x7fe22958a818 c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const + 568
7 0x7fe22931b4f3 torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, pybind11::args, pybind11::kwargs const&, std::optional<c10::DispatchKey>) + 451
8 0x7fe22931bd41 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>) + 1329
9 0x7fe2291ff833 /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x848833) [0x7fe2291ff833]
10 0x7fe228dcaea4 /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x413ea4) [0x7fe228dcaea4]
11 0x560e5abd3e0e /usr/bin/python(+0x15fe0e) [0x560e5abd3e0e]
12 0x560e5abe312b PyObject_Call + 187
13 0x560e5abbf2c1 _PyEval_EvalFrameDefault + 11121
14 0x560e5abc9784 _PyObject_FastCallDictTstate + 196
15 0x560e5abdf54c _PyObject_Call_Prepend + 92
16 0x560e5acf81e0 /usr/bin/python(+0x2841e0) [0x560e5acf81e0]
17 0x560e5abca5eb _PyObject_MakeTpCall + 603
18 0x560e5abc2c66 _PyEval_EvalFrameDefault + 25878
19 0x560e5abd470c _PyFunction_Vectorcall + 124
20 0x560e5abbe0d1 _PyEval_EvalFrameDefault + 6529
21 0x560e5abd470c _PyFunction_Vectorcall + 124
22 0x560e5abbce0d _PyEval_EvalFrameDefault + 1725
23 0x560e5abd470c _PyFunction_Vectorcall + 124
24 0x560e5abe3192 PyObject_Call + 290
25 0x560e5abbf2c1 _PyEval_EvalFrameDefault + 11121
26 0x560e5abd470c _PyFunction_Vectorcall + 124
27 0x560e5abe3192 PyObject_Call + 290
28 0x560e5abbf2c1 _PyEval_EvalFrameDefault + 11121
29 0x560e5abd470c _PyFunction_Vectorcall + 124
30 0x560e5abe3192 PyObject_Call + 290
31 0x560e5abbf2c1 _PyEval_EvalFrameDefault + 11121
32 0x560e5abd470c _PyFunction_Vectorcall + 124
33 0x560e5abbce0d _PyEval_EvalFrameDefault + 1725
34 0x560e5acade56 /usr/bin/python(+0x239e56) [0x560e5acade56]
35 0x560e5acadcf6 PyEval_EvalCode + 134
36 0x560e5acd87d8 /usr/bin/python(+0x2647d8) [0x560e5acd87d8]
37 0x560e5acd20bb /usr/bin/python(+0x25e0bb) [0x560e5acd20bb]
38 0x560e5acd8525 /usr/bin/python(+0x264525) [0x560e5acd8525]
39 0x560e5acd7a08 _PyRun_SimpleFileObject + 424
40 0x560e5acd7653 _PyRun_AnyFileObject + 67
41 0x560e5acca41e Py_RunMain + 702
42 0x560e5aca0cad Py_BytesMain + 45
43 0x7fe3b15f3d90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fe3b15f3d90]
44 0x7fe3b15f3e40 __libc_start_main + 128
45 0x560e5aca0ba5 _start + 37
highlight: RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: The number of rows must be a multiple of 128 but the number of rows is 13632. (/target/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:528)
I think it should be 64 instead of 128, because I set awq_block_size = 64.
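For reference, a quick divisibility check (a minimal sketch in Python; the interpretation of the 128 requirement is my reading of the assertion message, not something stated in the build log itself):

intermediate_size = 13632
awq_block_size = 64

print(intermediate_size % awq_block_size)  # 0  -> compatible with group_size = 64
print(intermediate_size % 128)             # 64 -> trips the "multiple of 128" assertion above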
additional notes
My customized model is based on Llama2-13B with intermediate_size = 13632, so I set awq_block_size = 64. Model config:
{
"producer": {
"name": "modelopt",
"version": "0.11.2"
},
"architecture": "LlamaForCausalLM",
"dtype": "float16",
"num_hidden_layers": 40,
"num_attention_heads": 40,
"num_key_value_heads": 8,
"hidden_size": 5120,
"norm_epsilon": 1e-06,
"vocab_size": 50432,
"max_position_embeddings": 4096,
"hidden_act": "silu",
"use_parallel_embedding": true,
"embedding_sharding_dim": 0,
"quantization": {
"quant_algo": "W4A8_AWQ",
"kv_cache_quant_algo": null,
"group_size": 64,
"has_zero_point": false,
"pre_quant_scale": true,
"exclude_modules": [
"lm_head"
]
},
"mapping": {
"world_size": 1,
"tp_size": 1,
"pp_size": 1
},
"head_size": 128,
"intermediate_size": 13632,
"position_embedding_type": "rope_gpt_neox",
"share_embedding_table": false,
"residual_mlp": false,
"bias": true,
"rotary_pct": 1.0,
"rank": 0,
"decoder": "llama",
"rmsnorm": false,
"lm_head_bias": false,
"rotary_base": 10000.0
}
quant config:
{
  'quant_cfg': {
    '*weight_quantizer': [
      {'num_bits': 4, 'block_sizes': {-1: 64, 'type': 'static'}, 'enable': True},
      {'num_bits': (4, 3), 'axis': None, 'enable': True}
    ],
    '*input_quantizer': {'num_bits': (4, 3), 'axis': -1, 'enable': True},
    '*lm_head*': {'enable': False},
    '*block_sparse_moe.gate*': {'enable': False},
    '*output_layer*': {'enable': False},
    'default': {'enable': False}
  },
  'algorithm': 'awq_lite'
}
LlamaDecoderLayer:
LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): QuantLinear(
in_features=5120, out_features=5120, bias=True
(input_quantizer): TensorQuantizer((4, 3) bit fake per-tensor amax=0.6914 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): SequentialQuantizer(
(0): TensorQuantizer(4 bit fake block_sizes={-1: 64, 'type': 'static'}, amax=[0.0176, 0.6953](409600) calibrator=MaxCalibrator quant)
(1): TensorQuantizer((4, 3) bit fake per-tensor amax=0.6953 calibrator=MaxCalibrator quant)
)
)
(k_proj): QuantLinear(
in_features=5120, out_features=1024, bias=True
(input_quantizer): TensorQuantizer((4, 3) bit fake per-tensor amax=0.7852 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): SequentialQuantizer(
(0): TensorQuantizer(4 bit fake block_sizes={-1: 64, 'type': 'static'}, amax=[0.0266, 1.3594](81920) calibrator=MaxCalibrator quant)
(1): TensorQuantizer((4, 3) bit fake per-tensor amax=1.3594 calibrator=MaxCalibrator quant)
)
)
(v_proj): QuantLinear(
in_features=5120, out_features=1024, bias=True
(input_quantizer): TensorQuantizer((4, 3) bit fake per-tensor amax=0.9883 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): SequentialQuantizer(
(0): TensorQuantizer(4 bit fake block_sizes={-1: 64, 'type': 'static'}, amax=[0.0072, 0.1631](81920) calibrator=MaxCalibrator quant)
(1): TensorQuantizer((4, 3) bit fake per-tensor amax=0.1631 calibrator=MaxCalibrator quant)
)
)
(o_proj): QuantLinear(
in_features=5120, out_features=5120, bias=True
(input_quantizer): TensorQuantizer((4, 3) bit fake per-tensor amax=0.8555 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): SequentialQuantizer(
(0): TensorQuantizer(4 bit fake block_sizes={-1: 64, 'type': 'static'}, amax=[0.0152, 1.4062](409600) calibrator=MaxCalibrator quant)
(1): TensorQuantizer((4, 3) bit fake per-tensor amax=1.4062 calibrator=MaxCalibrator quant)
)
)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): QuantLinear(
in_features=5120, out_features=13632, bias=True
(input_quantizer): TensorQuantizer((4, 3) bit fake per-tensor amax=2.9375 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): SequentialQuantizer(
(0): TensorQuantizer(4 bit fake block_sizes={-1: 64, 'type': 'static'}, amax=[0.0029, 1.2734](1090560) calibrator=MaxCalibrator quant)
(1): TensorQuantizer((4, 3) bit fake per-tensor amax=1.2734 calibrator=MaxCalibrator quant)
)
)
(up_proj): QuantLinear(
in_features=5120, out_features=13632, bias=True
(input_quantizer): TensorQuantizer((4, 3) bit fake per-tensor amax=3.9062 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): SequentialQuantizer(
(0): TensorQuantizer(4 bit fake block_sizes={-1: 64, 'type': 'static'}, amax=[0.0039, 0.7305](1090560) calibrator=MaxCalibrator quant)
(1): TensorQuantizer((4, 3) bit fake per-tensor amax=0.7305 calibrator=MaxCalibrator quant)
)
)
(down_proj): QuantLinear(
in_features=13632, out_features=5120, bias=True
(input_quantizer): TensorQuantizer((4, 3) bit fake per-tensor amax=30.6250 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): SequentialQuantizer(
(0): TensorQuantizer(4 bit fake block_sizes={-1: 64, 'type': 'static'}, amax=[0.0053, 25.8750](1090560) calibrator=MaxCalibrator quant)
(1): TensorQuantizer((4, 3) bit fake per-tensor amax=25.8750 calibrator=MaxCalibrator quant)
)
)
(act_fn): SiLU()
)
(input_layernorm): LayerNorm((5120,), eps=1e-06, elementwise_affine=True)
(post_attention_layernorm): LayerNorm((5120,), eps=1e-06, elementwise_affine=True)
)
Hi @Hongbosherlock, there are several reasons for your case:
- w4a8_awq only supports Ada and Hopper; A100 is not supported.
- w4a8_awq only supports group_size = 128 at the moment.
- The assertion you encountered is related to a CUTLASS requirement rather than awq_block_size, so the intermediate_size you used is not supported for 4-bit quantization.
Thanks for your reply. I will eventually use L40s for w4a8_awq inference.
When I tried v0.9 a month ago, I could successfully quantize my model with AMMO and eventually build both int4_awq and w4a8_awq engines (group_size = 64). Inference with the int4_awq quantized model works fine. However, at that time, inference on Ada with w4a8_awq was not yet supported.
Now the new version supports w4a8_awq on Ada, and the quantization tool is modelopt (renamed, I think). However, using the same quantization configuration (group_size = 64) as before, I can no longer build the w4a8_awq engine.
I am very confused by the changes in the new version. Can you check it for me? Thanks.
@Barry-Delaney I think CUTLASS has supported group_size=64 for w4a8_awq for at least two months, as seen in https://github.com/NVIDIA/cutlass/issues/1332
The assertion came from the preprocessor for interleaving.
- In v0.9.0, the behavior of w4a8_awq on Ada is undefined; an engine that builds successfully is not guaranteed to be correct for inference.
- In v0.10.0, we added Ada support and w4a8_awq is specialized as its own option, hence you run into the restrictions added only for w4a8_awq.
Regarding the update in CUTLASS, supporting group_size=64 also requires several changes in TensorRT-LLM kernels, hence it's not addressed yet.
As for your customized model, at the moment we suggest padding the intermediate_size and changing the group_size; a rough padding sketch is shown below. Also, please feel free to file a feature request if you need this.
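A minimal sketch of how such padding might be done offline in PyTorch before re-running quantization; this is illustrative only, not part of TensorRT-LLM. The helper names, the 128 multiple, and the HuggingFace-style module layout (gate_proj/up_proj/down_proj, as printed above) are assumptions.

import torch
import torch.nn.functional as F

def round_up(x: int, multiple: int = 128) -> int:
    # e.g. 13632 -> 13696
    return ((x + multiple - 1) // multiple) * multiple

@torch.no_grad()
def pad_llama_mlp(mlp, multiple: int = 128):
    """Zero-pad the MLP intermediate dimension so the weight shapes meet the
    row-multiple requirement. Zero gate/up rows yield zero activations
    (SiLU(0) * 0 = 0), so the layer output is unchanged."""
    old = mlp.gate_proj.out_features
    new = round_up(old, multiple)
    pad = new - old
    if pad == 0:
        return

    for proj in (mlp.gate_proj, mlp.up_proj):
        # weight is [out_features, in_features]: append zero output rows
        proj.weight.data = F.pad(proj.weight.data, (0, 0, 0, pad))
        if proj.bias is not None:
            proj.bias.data = F.pad(proj.bias.data, (0, pad))
        proj.out_features = new

    # down_proj weight is [out_features, in_features]: append zero input columns
    mlp.down_proj.weight.data = F.pad(mlp.down_proj.weight.data, (0, pad))
    mlp.down_proj.in_features = new

The checkpoint config's intermediate_size would also need to be updated to the padded value, and it is worth checking that the all-zero weight groups do not upset the AWQ scale calibration; treat this purely as a starting point.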
Now group sizes 64 and 128 are supported for int4_awq, but only group size 128 is supported for w4a8_awq.
Is group size 64 for w4a8_awq on the roadmap?
Afaik, it's not on the roadmap now.
@Barry-Delaney When I try w4a8_awq on Meta Llama2-13B with intermediate_size=13824,
I run the command:
python3 run.py --max_output_len=500 \
--tokenizer_dir=/target/model/llama_13B \
--engine_dir=/target/model/trt_engines/w4a8_AWQ/1-gpu/
but got the following error:
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] Assertion failed: No valid weight only groupwise GEMM tactic(It is usually caused by the failure to execute all candidate configurations of the CUTLASS kernel, please pay attention to the warning information when building the engine.) (/target/cpp/tensorrt_llm/plugins/weightOnlyGroupwiseQuantMatmulPlugin/weightOnlyGroupwiseQuantMatmulPlugin.cpp:452)
1 0x7f4e9cd08073 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x56073) [0x7f4e9cd08073]
2 0x7f4e9cd9f9da tensorrt_llm::plugins::WeightOnlyGroupwiseQuantMatmulPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 1786
3 0x7f4fa6d8cbec /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1060bec) [0x7f4fa6d8cbec]
4 0x7f4fa6d42217 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1016217) [0x7f4fa6d42217]
5 0x7f4fa6d43b79 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1017b79) [0x7f4fa6d43b79]
6 0x7f4ed311af34 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 52
7 0x7f4ed311b496 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) + 502
8 0x7f4ed3128094 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 2164
9 0x7f4ed314c214 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 100
10 0x7f4ed314e49c tensorrt_llm::executor::Executor::Impl::executionLoop() + 380
11 0x7f508a4b0253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f508a4b0253]
12 0x7f51348bfb43 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f51348bfb43]
13 0x7f5134950bb4 clone + 68
[gajl-sys-sys-test11ef7ljc:26732] *** Process received signal ***
[gajl-sys-sys-test11ef7ljc:26732] Signal: Aborted (6)
[gajl-sys-sys-test11ef7ljc:26732] Signal code: (-6)
[gajl-sys-sys-test11ef7ljc:26732] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f513486d520]
[gajl-sys-sys-test11ef7ljc:26732] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f51348c1a7c]
[gajl-sys-sys-test11ef7ljc:26732] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f513486d476]
[gajl-sys-sys-test11ef7ljc:26732] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f51348537f3]
[gajl-sys-sys-test11ef7ljc:26732] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7f508a476b9e]
[gajl-sys-sys-test11ef7ljc:26732] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f508a48220c]
[gajl-sys-sys-test11ef7ljc:26732] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7f508a4811e9]
[gajl-sys-sys-test11ef7ljc:26732] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7f508a481959]
[gajl-sys-sys-test11ef7ljc:26732] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7f5133d7e884]
[gajl-sys-sys-test11ef7ljc:26732] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x311)[0x7f5133d7ef41]
[gajl-sys-sys-test11ef7ljc:26732] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x3b)[0x7f508a4824cb]
[gajl-sys-sys-test11ef7ljc:26732] [11] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x560a1)[0x7f4e9cd080a1]
[gajl-sys-sys-test11ef7ljc:26732] [12] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins36WeightOnlyGroupwiseQuantMatmulPlugin7enqueueEPKN8nvinfer116PluginTensorDescES5_PKPKvPKPvSA_P11CUstream_st+0x6fa)[0x7f4e9cd9f9da]
[gajl-sys-sys-test11ef7ljc:26732] [13] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1060bec)[0x7f4fa6d8cbec]
[gajl-sys-sys-test11ef7ljc:26732] [14] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1016217)[0x7f4fa6d42217]
[gajl-sys-sys-test11ef7ljc:26732] [15] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1017b79)[0x7f4fa6d43b79]
[gajl-sys-sys-test11ef7ljc:26732] [16] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching14executeContextEi+0x34)[0x7f4ed311af34]
[gajl-sys-sys-test11ef7ljc:26732] [17] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching12executeBatchERKNS0_17ScheduledRequestsE+0x1f6)[0x7f4ed311b496]
[gajl-sys-sys-test11ef7ljc:26732] [18] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching12forwardAsyncERKSt4listISt10shared_ptrINS0_10LlmRequestEESaIS5_EE+0x874)[0x7f4ed3128094]
[gajl-sys-sys-test11ef7ljc:26732] [19] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl12forwardAsyncERSt4listISt10shared_ptrINS_13batch_manager10LlmRequestEESaIS7_EE+0x64)[0x7f4ed314c214]
[gajl-sys-sys-test11ef7ljc:26732] [20] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl13executionLoopEv+0x17c)[0x7f4ed314e49c]
[gajl-sys-sys-test11ef7ljc:26732] [21] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x7f508a4b0253]
[gajl-sys-sys-test11ef7ljc:26732] [22] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7f51348bfb43]
[gajl-sys-sys-test11ef7ljc:26732] [23] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44)[0x7f5134950bb4]
[gajl-sys-sys-test11ef7ljc:26732] *** End of error message ***
Aborted (core dumped)
Are you using L40s, w4a8_awq and group_size = 128 for inference? Thx!
Yes, I used the following commands on L40s:
python examples/quantization/quantize.py --model_dir /target/model/llama_13B \
--dtype float16 \
--qformat w4a8_awq \
--awq_block_size 128 \
--output_dir /target/model/quantized_w4a8_awq_13b \
--calib_size 32
trtllm-build --checkpoint_dir /target/model/quantized_w4a8_awq_13b \
--output_dir /target/model/trt_engines/w4a8_AWQ/1-gpu/ \
--gemm_plugin float16 \
--gpt_attention_plugin float16 \
--context_fmha enable \
--remove_input_padding enable \
--paged_kv_cache enable \
--max_batch_size 50 \
--max_input_len 3000 \
--max_output_len 3000
I tried to reproduce with the command you provided, and it works fine on L40S. Here's what I got:
Input [Text 0]: "<s> Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: "chef in Paris before moving to London in 1837. He worked at the French restaurant, La Ville de Paris, in Regent Street, and then at the Café Royal in Regent Street. In 1846 he opened his own restaurant, the French House, in Soho, which was a great success.
In 1851 he was appointed chef at the Reform Club, where he remained for 20 years. He was famous for his elaborate dishes, which were often served in theatrical settings. He was also a pioneer of the use of gas for cooking.
Soyer’s cookbooks were very popular in their day, and he is considered one of the most important chefs of the 19th century.
1 What is the most famous French chef?
2 Who is the most famous chef in the world?
3 Who is the best chef in the world?
4 Who is the best chef in France?
5 Who is the best chef in the world 2022?
6 Who is the best chef in the world 2022 female?
7 Who is the best chef in the world 2022 male?
What is the most famous French chef?
There are many famous French chefs, but the most famous is probably Alain Ducasse. He has restaurants all over the world, and his cooking is considered some of the best in the world.
Who is the most famous chef in the world?
There is no one definitive answer to this question. However, there are a few chefs who have achieved a level of fame that is unmatched by their peers.
One of the most famous chefs in the world is Gordon Ramsay. He is a British chef who has achieved international fame for his fiery temper and his innovative cooking techniques. Ramsay has appeared on a number of television shows, including “Hell’s Kitchen” and “Kitchen Nightmares.”
Another famous chef is Jamie Oliver. He is a British chef who is known for his healthy cooking style. Oliver has appeared on a number of television shows, including “The Naked Chef” and “Jamie’s Food Revolution.”
Finally, there is Wolfgang Puck. He is an Austrian-born chef who is known for his"
Could you please try with the latest main branch and do a clean build?
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
Hi @Hongbosherlock, do you still have any further issues or questions? If not, we'll close it soon.