v0.8.0 KeyError: 'builder_config' when benchmarking with the new version's config.json
System Info
- GPU: 4090 * 4
- TensorRT-LLM: v0.8.0
- CUDA Version: 12.3
- NVIDIA-SMI 545.29.06
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I quantize and build my engine following the steps in the v0.8.0 Mixtral README.md:
python3 ../llama/convert_checkpoint.py --model_dir ./Nous-Hermes-2-Mixtral-8x7B-DPO/ --output_dir ./tllm_checkpoint_mixtral_pp2 --dtype float16 --pp_size 2 --use_weight_only --weight_only_precision int4
trtllm-build --checkpoint_dir ./tllm_checkpoint_mixtral_pp2/ --output_dir ./trt_engines/mixtral/pp2 --gemm_plugin float16 --max_batch_size 32
The engine builds successfully and the run.py script works fine.
But when I run the benchmark test with the following command, an exception occurs:
python3 benchmarks/python/benchmark.py -m mixtral_8x7b --mode plugin --engine_dir examples/mixtral/trt_engines/mixtral/pp2 --batch_size 1 --input_output_len "512,128"
[TensorRT-LLM] TensorRT-LLM version: 0.8.0
Traceback (most recent call last):
File "/tensorrtllm_backend/tensorrt_llm/benchmarks/python/benchmark.py", line 405, in <module>
main(args)
File "/tensorrtllm_backend/tensorrt_llm/benchmarks/python/benchmark.py", line 299, in main
benchmarker = GPTBenchmark(args, batch_size_options, in_out_len_options,
File "/tensorrtllm_backend/tensorrt_llm/benchmarks/python/gpt_benchmark.py", line 44, in __init__
super().__init__(args.engine_dir, args.model, args.dtype, rank,
File "/tensorrtllm_backend/tensorrt_llm/benchmarks/python/base_benchmark.py", line 83, in __init__
config_dtype = self.config['builder_config']['precision']
KeyError: 'builder_config'
root@4090-node2:/tensorrtllm_backend/tensorrt_llm#
I checked the engine config file. There is no builder_config item (and no top-level plugin_config), which base_benchmark.py needs.
{
"version": "0.8.0",
"pretrained_config": {
"architecture": "MixtralForCausalLM",
"dtype": "float16",
"logits_dtype": "float32",
"vocab_size": 32002,
"max_position_embeddings": 32768,
"hidden_size": 4096,
"num_hidden_layers": 32,
"num_attention_heads": 32,
"num_key_value_heads": 8,
"head_size": 128,
"hidden_act": "swiglu",
"intermediate_size": 14336,
"norm_epsilon": 1e-05,
"position_embedding_type": "rope_gpt_neox",
"use_prompt_tuning": false,
"use_parallel_embedding": false,
"embedding_sharding_dim": 0,
"share_embedding_table": false,
"mapping": {
"world_size": 2,
"tp_size": 1,
"pp_size": 2
},
"kv_dtype": "float16",
"max_lora_rank": 64,
"rotary_base": 1000000.0,
"rotary_scaling": null,
"moe_num_experts": 8,
"moe_top_k": 2,
"moe_tp_mode": 2,
"moe_normalization_mode": 1,
"enable_pos_shift": false,
"dense_context_fmha": false,
"lora_target_modules": null,
"hf_modules_to_trtllm_modules": {
"q_proj": "attn_q",
"k_proj": "attn_k",
"v_proj": "attn_v",
"o_proj": "attn_dense",
"gate_proj": "mlp_h_to_4h",
"down_proj": "mlp_4h_to_h",
"up_proj": "mlp_gate"
},
"trtllm_modules_to_hf_modules": {
"attn_q": "q_proj",
"attn_k": "k_proj",
"attn_v": "v_proj",
"attn_dense": "o_proj",
"mlp_h_to_4h": "gate_proj",
"mlp_4h_to_h": "down_proj",
"mlp_gate": "up_proj"
},
"disable_weight_only_quant_plugin": false,
"mlp_bias": false,
"attn_bias": false,
"quantization": {
"quant_algo": "W4A16",
"kv_cache_quant_algo": null,
"group_size": 128,
"has_zero_point": false,
"pre_quant_scale": false,
"exclude_modules": [
"lm_head",
"router"
],
"sq_use_plugin": false
}
},
"build_config": {
"max_input_len": 1024,
"max_output_len": 1024,
"max_batch_size": 32,
"max_beam_width": 1,
"max_num_tokens": 32768,
"max_prompt_embedding_table_size": 0,
"gather_context_logits": false,
"gather_generation_logits": false,
"strongly_typed": false,
"builder_opt": null,
"profiling_verbosity": "layer_names_only",
"enable_debug_output": false,
"plugin_config": {
"bert_attention_plugin": "float16",
"gpt_attention_plugin": "float16",
"gemm_plugin": "float16",
"smooth_quant_gemm_plugin": null,
"identity_plugin": null,
"layernorm_quantization_plugin": null,
"rmsnorm_quantization_plugin": null,
"nccl_plugin": "float16",
"lookup_plugin": null,
"lora_plugin": null,
"weight_only_groupwise_quant_matmul_plugin": null,
"weight_only_quant_matmul_plugin": "float16",
"quantize_per_token_plugin": false,
"quantize_tensor_plugin": false,
"context_fmha": true,
"context_fmha_fp32_acc": false,
"paged_kv_cache": true,
"remove_input_padding": true,
"use_custom_all_reduce": true,
"multi_block_mode": false,
"enable_xqa": true,
"attention_qk_half_accumulation": false,
"tokens_per_block": 128,
"use_paged_context_fmha": false,
"use_context_fmha_for_generation": false
}
}
}
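(For anyone comparing against their own engine: a quick, illustrative check of which top-level sections the config actually has. The path is the engine directory from the commands above; the snippet is only a sketch, not part of the repo.)

import json

with open("examples/mixtral/trt_engines/mixtral/pp2/config.json") as f:
    config = json.load(f)

# Engines built by trtllm-build in v0.8.0 expose these sections:
print(list(config.keys()))           # ['version', 'pretrained_config', 'build_config']
print('builder_config' in config)    # False, hence the KeyError in base_benchmark.py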
Expected behavior
The benchmark should not error.
Actual behavior
[TensorRT-LLM] TensorRT-LLM version: 0.8.0
Traceback (most recent call last):
File "/tensorrtllm_backend/tensorrt_llm/benchmarks/python/benchmark.py", line 405, in <module>
main(args)
File "/tensorrtllm_backend/tensorrt_llm/benchmarks/python/benchmark.py", line 299, in main
benchmarker = GPTBenchmark(args, batch_size_options, in_out_len_options,
File "/tensorrtllm_backend/tensorrt_llm/benchmarks/python/gpt_benchmark.py", line 44, in __init__
super().__init__(args.engine_dir, args.model, args.dtype, rank,
File "/tensorrtllm_backend/tensorrt_llm/benchmarks/python/base_benchmark.py", line 83, in __init__
config_dtype = self.config['builder_config']['precision']
KeyError: 'builder_config'
Additional notes
v0.7.1 is OK.
I am not sure whether there is a problem with the parameters when I build the engine, or whether benchmark.py has not been updated for the new version.
Also see https://github.com/triton-inference-server/tensorrtllm_backend/issues/330
I am trying to test the engine file with run.py and am getting the same error. Model: Llama-2 7B Chat.
There are two different ways of running models: python and cpp. run.py decides between the two here:
https://github.com/NVIDIA/TensorRT-LLM/blob/728cc0044bb76d1fafbcaa720c403e8de4f81906/examples/run.py#L393
The python way seems to be very out of date. It will throw the builder_config error and, I imagine, other errors too.
The cpp way works perfectly for me, so try to figure out why args.use_py_session is being set to True for you and get it set to False in order to use the cpp way and work around these errors.
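For illustration, with the pp2 engine from the original report (its config shows world_size 2, so it needs two ranks; the tokenizer path and flags below are assumptions rather than a verified command), simply omitting --use_py_session leaves args.use_py_session at False and selects the cpp session:
mpirun -n 2 python3 examples/run.py --engine_dir examples/mixtral/trt_engines/mixtral/pp2 --tokenizer_dir ./Nous-Hermes-2-Mixtral-8x7B-DPO/ --max_output_len 128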
Thanks. The cpp version works.
Thank you for the report. The cpp runtime is the recommended way. You could also try the latest main branch if you still want to run on the Python runtime. In the latest main branch, we added a check:
if 'pretrained_config' in self.config:
    ...
else:
    ...
    config_dtype = self.config['builder_config']['precision']
to fix this issue.
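Roughly, the idea of that check is the following sketch (my own reading, not the exact code on main; the new-style config stores the dtype under pretrained_config, as seen in the config.json above):

if 'pretrained_config' in self.config:
    # new layout written by trtllm-build (v0.8.0 and later)
    config_dtype = self.config['pretrained_config']['dtype']
else:
    # legacy layout with a top-level builder_config section
    config_dtype = self.config['builder_config']['precision']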
I have the same problem. When I run 'python3 run.py xxx', the error is:
Traceback (most recent call last):
File "/root/models/TensorRT-LLM/examples/llama/../run.py", line 564, in <module>
main(args)
File "/root/models/TensorRT-LLM/examples/llama/../run.py", line 340, in main
model_name, model_version = read_model_name(args.engine_dir)
File "/root/models/TensorRT-LLM/examples/utils.py", line 56, in read_model_name
return config['builder_config']['name'], None
KeyError: 'builder_config'
How can I run the TRT model? Can you help me, @byshiue?
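While waiting for a fix, one possible stop-gap is a layout-tolerant lookup. This is only a sketch, under the assumption that the new-style config keeps the model class under pretrained_config.architecture (as in the config.json earlier in this thread); read_model_name_compat is a hypothetical helper, not something in the repo:

import json
import os

def read_model_name_compat(engine_dir):
    # Sketch only: handle both the legacy and the new engine config layouts.
    with open(os.path.join(engine_dir, "config.json")) as f:
        config = json.load(f)
    if "builder_config" in config:
        # legacy layout, as examples/utils.py expects today
        return config["builder_config"]["name"], None
    # new layout from trtllm-build, e.g. "MixtralForCausalLM" or "LlamaForCausalLM"
    return config["pretrained_config"]["architecture"], None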
@byshiue is the workaround to force the script to run on the cpp runtime? I just cloned the repo and I still have the same issue with the Python runtime.
For any issue from other users, please also share your reproduction steps. Otherwise, I am not sure how to reproduce your issue.
run.py shouldn't require this workaround. Please make sure you use the latest ToT TensorRT-LLM to convert the checkpoint, build the engine, and run inference.
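One quick, illustrative sanity check is to confirm that the installed package matches the code used to build the engine, by comparing the runtime version against the version field at the top of the engine's config.json:
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"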
This is resolved. Thanks! It seems like using the latest stable version helped
You mean you used v0.8.0? I tried the latest main (commit 118b3d7) and got the error. I will test the aforementioned release to see if the error persists.
I'm on v0.8.0 and I'm facing the issue too. Is there a fix for this?
I tried to test with run.py on master (66ef1df492f7bc9c8eeb) and hit the same issue, as below:
Model: baichuan2 7B
.....
No protocol specified
No protocol specified
No protocol specified
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024042300
Traceback (most recent call last):
File "/home/xxx/TensorRT-LLM/examples/baichuan/../run.py", line 565, in
Could you share your reproduction steps?