Failed to deserialize cuda engine when using "tp_size=4"
System Info
- CPU architecture (x86_64)
- GPU name (NVIDIA A10)
- TensorRT-LLM commit (built from tensorrtllm_backend at commit 3608b0)
Who can help?
Hi all, we use "triton + tensorrtllm_backend + TensorRT-LLM" to deploy a mistral-7b model. We build the model with "tp_size=4" and deploy the engine on A10 GPUs, but loading always fails with: "UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)"
Here is my build config:

"build_config": {
  "max_input_len": 16384,
  "max_output_len": 1024,
  "max_batch_size": 8,
  "max_beam_width": 1,
  "max_num_tokens": 8192,
  "max_prompt_embedding_table_size": 0,
  "gather_context_logits": false,
  "gather_generation_logits": false,
  "strongly_typed": false,
  "builder_opt": null,
  "profiling_verbosity": "layer_names_only",
  "plugin_config": {
    "bert_attention_plugin": "float16",
    "gpt_attention_plugin": "float16",
    "gemm_plugin": "float16",
    "smooth_quant_gemm_plugin": null,
    "identity_plugin": null,
    "layernorm_quantization_plugin": null,
    "rmsnorm_quantization_plugin": null,
    "nccl_plugin": "float16",
    "lookup_plugin": null,
    "lora_plugin": null,
    "weight_only_groupwise_quant_matmul_plugin": null,
    "weight_only_quant_matmul_plugin": null,
    "quantize_per_token_plugin": false,
    "quantize_tensor_plugin": false,
    "context_fmha": true,
    "context_fmha_fp32_acc": false,
    "paged_kv_cache": true,
    "remove_input_padding": true,
    "use_custom_all_reduce": false,
    "multi_block_mode": false,
    "enable_xqa": true,
    "attention_qk_half_accumulation": false,
    "tokens_per_block": 128,
    "use_paged_context_fmha": false,
    "use_context_fmha_for_generation": false
  }
}

It also fails on release 0.7.1.
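As a quick sanity check (not part of the original report; the engine directory path is a placeholder, and exact field names may differ slightly between versions), the parallelism settings actually recorded in the built engine can be inspected from the config.json that trtllm-build writes next to the engine files:

```bash
# Confirm the tp_size / world_size baked into the engine match the
# deployment (--world_size 4 for launch_triton_server.py).
# /path/to/engine_output_dir is a placeholder for the trtllm-build --output_dir.
grep -E '"(tp_size|pp_size|world_size)"' /path/to/engine_output_dir/config.json
```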
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- start container (triton_trt_llm:main-3608b0), which is built from https://github.com/triton-inference-server/tensorrtllm_backend/tree/main (Option 2)
- convert checkpoint:
  python /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir xxxx \
    --output_dir xxx \
    --dtype float16 \
    --tp_size 4
- build engine:
  trtllm-build \
    --checkpoint_dir xxxx \
    --output_dir xxxx \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --remove_input_padding enable \
    --context_fmha enable \
    --paged_kv_cache enable \
    --use_custom_all_reduce disable \
    --max_input_len=16384 \
    --max_output_len=1024 \
    --max_num_tokens=8192 \
    --max_batch_size=8
- start triton server:
  python3 /app/scripts/launch_triton_server.py --world_size 4 --model_repo=xxxx
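For completeness, a sketch of a pre-launch check (paths are placeholders; engine file naming varies by version) to confirm that the engine directory referenced by the Triton model actually contains one serialized engine per rank:

```bash
# A tp_size=4 build should produce one engine per rank
# (e.g. *rank0.engine ... *rank3.engine, naming varies by version).
# Missing rank files, or a model repo pointing at a directory from a
# different build, leads to a deserialization failure at load time.
ls -lh /path/to/engine_output_dir/*.engine

# The gpt_model_path in the tensorrt_llm model config must point at the
# same directory that trtllm-build wrote the engines to.
grep -A1 gpt_model_path /path/to/model_repo/tensorrt_llm/config.pbtxt
```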
Expected behavior
The model loads successfully.
actual behavior
E0202 05:43:08.193459 60 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)
[2024-02-02 13:43:08] 1 0x7f7e8c2617da tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
[2024-02-02 13:43:08] 2 0x7f7e8c28522e /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x79e22e) [0x7f7e8c28522e]
[2024-02-02 13:43:08] 3 0x7f7e8e150ea1 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator
additional notes
Everything works fine for tp_size=1 and tp_size=2. Note that I build the model engine on a single A10 GPU and deploy the engine on other A10 GPU nodes.
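Since the engine is built on one node and deployed on others, a minimal comparison of the two environments may help rule out a TensorRT mismatch, which can cause "Failed to deserialize cuda engine". This is only a sketch; run it on both the build machine and a serving node, inside the containers actually used:

```bash
# GPU model and driver on the build node vs. the deployment nodes.
nvidia-smi --query-gpu=name,driver_version --format=csv

# TensorRT / TensorRT-LLM versions in the container used for trtllm-build
# vs. the triton_trt_llm:main-3608b0 serving container; an engine built
# with one TensorRT version generally cannot be deserialized by another.
python3 -c "import tensorrt; print('TensorRT', tensorrt.__version__)"
python3 -c "import tensorrt_llm; print('TensorRT-LLM', tensorrt_llm.__version__)"
```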