v0.8.0 KeyError: 'builder_config' when benchmarking with the new version's config.json
System Info
- GPU: 4090 * 4
- TensorRT-LLM: v0.8.0
- CUDA Version: 12.3
- NVIDIA-SMI 545.29.06
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I quantize and build my engine following the steps in the v0.8.0 Mixtral README.md:
python3 ../llama/convert_checkpoint.py --model_dir ./Nous-Hermes-2-Mixtral-8x7B-DPO/ --output_dir ./tllm_checkpoint_mixtral_pp2 --dtype float16 --pp_size 2 --use_weight_only --weight_only_precision int4
trtllm-build --checkpoint_dir ./tllm_checkpoint_mixtral_pp2/ --output_dir ./trt_engines/mixtral/pp2 --gemm_plugin float16 --max_batch_size 32
The engine builds successfully and the run.py script works fine.
But when I run the benchmark test with the following command, an exception occurs:
python3 benchmarks/python/benchmark.py -m mixtral_8x7b --mode plugin --engine_dir examples/mixtral/trt_engines/mixtral/pp2 --batch_size 1 --input_output_len "512,128"
[TensorRT-LLM] TensorRT-LLM version: 0.8.0
Traceback (most recent call last):
File "/tensorrtllm_backend/tensorrt_llm/benchmarks/python/benchmark.py", line 405, in <module>
main(args)
File "/tensorrtllm_backend/tensorrt_llm/benchmarks/python/benchmark.py", line 299, in main
benchmarker = GPTBenchmark(args, batch_size_options, in_out_len_options,
File "/tensorrtllm_backend/tensorrt_llm/benchmarks/python/gpt_benchmark.py", line 44, in __init__
super().__init__(args.engine_dir, args.model, args.dtype, rank,
File "/tensorrtllm_backend/tensorrt_llm/benchmarks/python/base_benchmark.py", line 83, in __init__
config_dtype = self.config['builder_config']['precision']
KeyError: 'builder_config'
root@4090-node2:/tensorrtllm_backend/tensorrt_llm#
I checked the engine config file. There is no builder_config item (and no top-level plugin_config), which base_benchmark.py needs.
{
"version": "0.8.0",
"pretrained_config": {
"architecture": "MixtralForCausalLM",
"dtype": "float16",
"logits_dtype": "float32",
"vocab_size": 32002,
"max_position_embeddings": 32768,
"hidden_size": 4096,
"num_hidden_layers": 32,
"num_attention_heads": 32,
"num_key_value_heads": 8,
"head_size": 128,
"hidden_act": "swiglu",
"intermediate_size": 14336,
"norm_epsilon": 1e-05,
"position_embedding_type": "rope_gpt_neox",
"use_prompt_tuning": false,
"use_parallel_embedding": false,
"embedding_sharding_dim": 0,
"share_embedding_table": false,
"mapping": {
"world_size": 2,
"tp_size": 1,
"pp_size": 2
},
"kv_dtype": "float16",
"max_lora_rank": 64,
"rotary_base": 1000000.0,
"rotary_scaling": null,
"moe_num_experts": 8,
"moe_top_k": 2,
"moe_tp_mode": 2,
"moe_normalization_mode": 1,
"enable_pos_shift": false,
"dense_context_fmha": false,
"lora_target_modules": null,
"hf_modules_to_trtllm_modules": {
"q_proj": "attn_q",
"k_proj": "attn_k",
"v_proj": "attn_v",
"o_proj": "attn_dense",
"gate_proj": "mlp_h_to_4h",
"down_proj": "mlp_4h_to_h",
"up_proj": "mlp_gate"
},
"trtllm_modules_to_hf_modules": {
"attn_q": "q_proj",
"attn_k": "k_proj",
"attn_v": "v_proj",
"attn_dense": "o_proj",
"mlp_h_to_4h": "gate_proj",
"mlp_4h_to_h": "down_proj",
"mlp_gate": "up_proj"
},
"disable_weight_only_quant_plugin": false,
"mlp_bias": false,
"attn_bias": false,
"quantization": {
"quant_algo": "W4A16",
"kv_cache_quant_algo": null,
"group_size": 128,
"has_zero_point": false,
"pre_quant_scale": false,
"exclude_modules": [
"lm_head",
"router"
],
"sq_use_plugin": false
}
},
"build_config": {
"max_input_len": 1024,
"max_output_len": 1024,
"max_batch_size": 32,
"max_beam_width": 1,
"max_num_tokens": 32768,
"max_prompt_embedding_table_size": 0,
"gather_context_logits": false,
"gather_generation_logits": false,
"strongly_typed": false,
"builder_opt": null,
"profiling_verbosity": "layer_names_only",
"enable_debug_output": false,
"plugin_config": {
"bert_attention_plugin": "float16",
"gpt_attention_plugin": "float16",
"gemm_plugin": "float16",
"smooth_quant_gemm_plugin": null,
"identity_plugin": null,
"layernorm_quantization_plugin": null,
"rmsnorm_quantization_plugin": null,
"nccl_plugin": "float16",
"lookup_plugin": null,
"lora_plugin": null,
"weight_only_groupwise_quant_matmul_plugin": null,
"weight_only_quant_matmul_plugin": "float16",
"quantize_per_token_plugin": false,
"quantize_tensor_plugin": false,
"context_fmha": true,
"context_fmha_fp32_acc": false,
"paged_kv_cache": true,
"remove_input_padding": true,
"use_custom_all_reduce": true,
"multi_block_mode": false,
"enable_xqa": true,
"attention_qk_half_accumulation": false,
"tokens_per_block": 128,
"use_paged_context_fmha": false,
"use_context_fmha_for_generation": false
}
}
}
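(For anyone comparing against their own engine: a quick, illustrative check of which top-level sections the config actually has. The path is the engine directory from the commands above; the snippet is only a sketch, not part of the repo.)

import json

with open("examples/mixtral/trt_engines/mixtral/pp2/config.json") as f:
    config = json.load(f)

# Engines built by trtllm-build in v0.8.0 expose these sections:
print(list(config.keys()))           # ['version', 'pretrained_config', 'build_config']
print('builder_config' in config)    # False, hence the KeyError in base_benchmark.py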
Expected behavior
The benchmark should not error.
Actual behavior
[TensorRT-LLM] TensorRT-LLM version: 0.8.0
Traceback (most recent call last):
File "/tensorrtllm_backend/tensorrt_llm/benchmarks/python/benchmark.py", line 405, in <module>
main(args)
File "/tensorrtllm_backend/tensorrt_llm/benchmarks/python/benchmark.py", line 299, in main
benchmarker = GPTBenchmark(args, batch_size_options, in_out_len_options,
File "/tensorrtllm_backend/tensorrt_llm/benchmarks/python/gpt_benchmark.py", line 44, in __init__
super().__init__(args.engine_dir, args.model, args.dtype, rank,
File "/tensorrtllm_backend/tensorrt_llm/benchmarks/python/base_benchmark.py", line 83, in __init__
config_dtype = self.config['builder_config']['precision']
KeyError: 'builder_config'
Additional notes
v0.7.1 is OK.
I am not sure whether there is a problem with the parameters when I build the engine, or whether benchmark.py has not been updated for the new version.
Also see https://github.com/triton-inference-server/tensorrtllm_backend/issues/330
I am trying to test the engine file with run.py and am getting the same error. Model: Llama-2 7B Chat.
There are two different ways of running models: python and cpp. run.py decides between the two here:
https://github.com/NVIDIA/TensorRT-LLM/blob/728cc0044bb76d1fafbcaa720c403e8de4f81906/examples/run.py#L393
The python way seems to be very out of date. It will throw the builder_config error and, I imagine, other errors too.
The cpp way works perfectly for me, so try to figure out why args.use_py_session is being set to True for you and get it set to False in order to use the cpp way and work around these errors.
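For illustration, with the pp2 engine from the original report (its config shows world_size 2, so it needs two ranks; the tokenizer path and flags below are assumptions rather than a verified command), simply omitting --use_py_session leaves args.use_py_session at False and selects the cpp session:
mpirun -n 2 python3 examples/run.py --engine_dir examples/mixtral/trt_engines/mixtral/pp2 --tokenizer_dir ./Nous-Hermes-2-Mixtral-8x7B-DPO/ --max_output_len 128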
Thanks. The cpp version works.
Thank you for the report. The cpp runtime is the recommended way. You could also try the latest main branch if you still want to run on the Python runtime. In the latest main branch, we added a check:
if 'pretrained_config' in self.config:
    ...
else:
    ...
    config_dtype = self.config['builder_config']['precision']
to fix this issue.
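Roughly, the idea of that check is the following sketch (my own reading, not the exact code on main; the new-style config stores the dtype under pretrained_config, as seen in the config.json above):

if 'pretrained_config' in self.config:
    # new layout written by trtllm-build (v0.8.0 and later)
    config_dtype = self.config['pretrained_config']['dtype']
else:
    # legacy layout with a top-level builder_config section
    config_dtype = self.config['builder_config']['precision']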
I have the same problem. When I run 'python3 run.py xxx', the error is:
Traceback (most recent call last):
File "/root/models/TensorRT-LLM/examples/llama/../run.py", line 564, in <module>
main(args)
File "/root/models/TensorRT-LLM/examples/llama/../run.py", line 340, in main
model_name, model_version = read_model_name(args.engine_dir)
File "/root/models/TensorRT-LLM/examples/utils.py", line 56, in read_model_name
return config['builder_config']['name'], None
KeyError: 'builder_config'
How can I run the TRT model? Can you help me, @byshiue?
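While waiting for a fix, one possible stop-gap is a layout-tolerant lookup. This is only a sketch, under the assumption that the new-style config keeps the model class under pretrained_config.architecture (as in the config.json earlier in this thread); read_model_name_compat is a hypothetical helper, not something in the repo:

import json
import os

def read_model_name_compat(engine_dir):
    # Sketch only: handle both the legacy and the new engine config layouts.
    with open(os.path.join(engine_dir, "config.json")) as f:
        config = json.load(f)
    if "builder_config" in config:
        # legacy layout, as examples/utils.py expects today
        return config["builder_config"]["name"], None
    # new layout from trtllm-build, e.g. "MixtralForCausalLM" or "LlamaForCausalLM"
    return config["pretrained_config"]["architecture"], None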
@byshiue is the workaround to force the script to run on the cpp runtime? I just cloned the repo and I still have the same issue with the Python runtime.
For any issue from other users, please also share your reproduction steps. Otherwise, I am not sure how to reproduce your issue.
run.py shouldn't require this workaround. Please make sure you use the latest ToT TensorRT-LLM to convert the checkpoint, build the engine, and run inference.
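One quick, illustrative sanity check is to confirm that the installed package matches the code used to build the engine, by comparing the runtime version against the version field at the top of the engine's config.json:
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"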
This is resolved. Thanks! It seems like using the latest stable version helped
You mean you used v0.8.0? I tried the latest main (commit 118b3d7) and got the error. I will test the aforementioned release to see if the error persists.
I'm on v0.8.0 and I'm facing the issue too. Is there a fix for this?
I tried to test with run.py on master (66ef1df492f7bc9c8eeb) and hit the same issue, as below:
Model: baichuan2 7B
.....
No protocol specified
No protocol specified
No protocol specified
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024042300
Traceback (most recent call last):
File "/home/xxx/TensorRT-LLM/examples/baichuan/../run.py", line 565, in
Could you share your reproduction steps?