
Failed to deserialize cuda engine when using "tp_size=4"


System Info

  • CPU architecture (x86_64)
  • GPU name (NVIDIA A10)
  • TensorRT-LLM commit (built from tensorrtllm_backend at commit 3608b0)

Who can help?

Hi all, we use "triton + tensorrtllm_backend + TensorRT-LLM" to deploy a mistral-7b model. We build the model with "tp_size=4" and deploy the engine on A10 GPUs, but loading always fails with: "UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)"

Here is my build config:

    "build_config": {
      "max_input_len": 16384,
      "max_output_len": 1024,
      "max_batch_size": 8,
      "max_beam_width": 1,
      "max_num_tokens": 8192,
      "max_prompt_embedding_table_size": 0,
      "gather_context_logits": false,
      "gather_generation_logits": false,
      "strongly_typed": false,
      "builder_opt": null,
      "profiling_verbosity": "layer_names_only",
      "plugin_config": {
        "bert_attention_plugin": "float16",
        "gpt_attention_plugin": "float16",
        "gemm_plugin": "float16",
        "smooth_quant_gemm_plugin": null,
        "identity_plugin": null,
        "layernorm_quantization_plugin": null,
        "rmsnorm_quantization_plugin": null,
        "nccl_plugin": "float16",
        "lookup_plugin": null,
        "lora_plugin": null,
        "weight_only_groupwise_quant_matmul_plugin": null,
        "weight_only_quant_matmul_plugin": null,
        "quantize_per_token_plugin": false,
        "quantize_tensor_plugin": false,
        "context_fmha": true,
        "context_fmha_fp32_acc": false,
        "paged_kv_cache": true,
        "remove_input_padding": true,
        "use_custom_all_reduce": false,
        "multi_block_mode": false,
        "enable_xqa": true,
        "attention_qk_half_accumulation": false,
        "tokens_per_block": 128,
        "use_paged_context_fmha": false,
        "use_context_fmha_for_generation": false
      }
    }

It also fails on release 0.7.1.
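A quick way to confirm what actually got baked into the engine is to inspect the trtllm-build output directory. The sketch below is only an illustration, not part of the original report: the exact key names inside config.json differ between TensorRT-LLM versions and the engine_dir path is a placeholder, so it scans generically for parallelism-related fields and counts the per-rank engine files.

    import glob
    import json
    import os

    engine_dir = "/path/to/engine_dir"  # placeholder: the --output_dir used with trtllm-build

    with open(os.path.join(engine_dir, "config.json")) as f:
        config = json.load(f)

    def find_keys(node, names, prefix=""):
        """Recursively print values whose key looks like a parallelism setting."""
        if isinstance(node, dict):
            for k, v in node.items():
                path = f"{prefix}.{k}" if prefix else k
                if any(n in k for n in names):
                    print(f"{path} = {v}")
                find_keys(v, names, path)

    find_keys(config, ("tp_size", "pp_size", "world_size", "tensor_parallel", "pipeline_parallel"))

    engines = sorted(glob.glob(os.path.join(engine_dir, "*.engine")))
    print(f"{len(engines)} engine file(s): {[os.path.basename(e) for e in engines]}")
    # For tp_size=4 there should be four engine files, and the reported world/tp size
    # should match the --world_size passed to launch_triton_server.py.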

Information

  • [x] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

  1. Start the container (triton_trt_llm:main-3608b0), which is built from https://github.com/triton-inference-server/tensorrtllm_backend/tree/main following Option 2.
  2. Convert the checkpoint:
     python /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
       --model_dir xxxx \
       --output_dir xxx \
       --dtype float16 \
       --tp_size 4
  3. Build the engine:
     trtllm-build \
       --checkpoint_dir xxxx \
       --output_dir xxxx \
       --gpt_attention_plugin float16 \
       --gemm_plugin float16 \
       --remove_input_padding enable \
       --context_fmha enable \
       --paged_kv_cache enable \
       --use_custom_all_reduce disable \
       --max_input_len=16384 \
       --max_output_len=1024 \
       --max_num_tokens=8192 \
       --max_batch_size=8
  4. Start the Triton server (see the sanity-check sketch after these steps): python3 /app/scripts/launch_triton_server.py --world_size 4 --model_repo=xxxx
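Before step 4, it can help to confirm that the node running launch_triton_server.py actually exposes at least --world_size GPUs; a mismatch between visible GPUs and the engine's world size is one common source of multi-GPU startup failures. This is a minimal sketch added here for illustration (nvidia-smi is assumed to be on PATH inside the container):

    import subprocess

    world_size = 4  # must match --tp_size at build time and --world_size at launch

    out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True, check=True)
    gpus = [line for line in out.stdout.splitlines() if line.startswith("GPU")]
    print(f"visible GPUs: {len(gpus)}")
    for g in gpus:
        print(" ", g)

    assert len(gpus) >= world_size, (
        f"only {len(gpus)} GPU(s) visible, but world_size={world_size}"
    )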

Expected behavior

The model loads successfully.

actual behavior

E0202 05:43:08.193459 60 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)
[2024-02-02 13:43:08] 1 0x7f7e8c2617da tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
[2024-02-02 13:43:08] 2 0x7f7e8c28522e /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x79e22e) [0x7f7e8c28522e]
[2024-02-02 13:43:08] 3 0x7f7e8e150ea1 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1025
[2024-02-02 13:43:08] 4 0x7f7e8e1275a9 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1449
[2024-02-02 13:43:08] 5 0x7f7e8e11d3a0 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int)>, std::function<void (unsigned long, std::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator<tensorrt_llm::batch_manager::NamedTensor> > const&, bool, std::string const&)>, std::function<std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > ()>, std::function<void (std::string const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional, std::optional, bool) + 320
[2024-02-02 13:43:08] 6 0x7f7f90028a11 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x19a11) [0x7f7f90028a11]
[2024-02-02 13:43:08] 7 0x7f7f90029c52 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1ac52) [0x7f7f90029c52]
[2024-02-02 13:43:08] 8 0x7f7f9001afc5 TRITONBACKEND_ModelInstanceInitialize + 101
[2024-02-02 13:43:08] 9 0x7f7fa9b89226 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a7226) [0x7f7fa9b89226]
[2024-02-02 13:43:08] 10 0x7f7fa9b8a466 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a8466) [0x7f7fa9b8a466]
[2024-02-02 13:43:08] 11 0x7f7fa9b6d165 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b165) [0x7f7fa9b6d165]
[2024-02-02 13:43:08] 12 0x7f7fa9b6d7a6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b7a6) [0x7f7fa9b6d7a6]
[2024-02-02 13:43:08] 13 0x7f7fa9b79a1d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x197a1d) [0x7f7fa9b79a1d]
[2024-02-02 13:43:08] 14 0x7f7fa91e4ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7f7fa91e4ee8]
[2024-02-02 13:43:08] 15 0x7f7fa9b63feb /opt/tritonserver/bin/../lib/libtritonserver.so(+0x181feb) [0x7f7fa9b63feb]
[2024-02-02 13:43:08] 16 0x7f7fa9b73dc5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191dc5) [0x7f7fa9b73dc5]
[2024-02-02 13:43:08] 17 0x7f7fa9b78d36 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x196d36) [0x7f7fa9b78d36]
[2024-02-02 13:43:08] 18 0x7f7fa9c69330 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x287330) [0x7f7fa9c69330]
[2024-02-02 13:43:08] 19 0x7f7fa9c6ca23 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28aa23) [0x7f7fa9c6ca23]
[2024-02-02 13:43:08] 20 0x7f7fa9dc0d82 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3ded82) [0x7f7fa9dc0d82]
[2024-02-02 13:43:08] 21 0x7f7fa944f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f7fa944f253]
[2024-02-02 13:43:08] 22 0x7f7fa91dfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f7fa91dfac3]
[2024-02-02 13:43:08] 23 0x7f7fa9270814 clone + 68
[2024-02-02 13:43:08] I0202 05:43:08.193560 60 model_lifecycle.cc:756] failed to load 'tensorrt_llm'
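To tell whether this is specific to the Triton backend or reproduces with plain TensorRT, one option is to deserialize a single rank's engine directly with the TensorRT Python API inside the same container. This is only a sketch; the rank0.engine file name and the path are assumptions about the trtllm-build output layout, and verbose logging should surface the underlying deserialization error.

    import tensorrt as trt

    # Verbose logging makes TensorRT print the concrete reason the
    # deserialization fails (version mismatch, corrupt file, ...).
    logger = trt.Logger(trt.Logger.VERBOSE)
    runtime = trt.Runtime(logger)

    with open("/path/to/engine_dir/rank0.engine", "rb") as f:  # placeholder path
        engine = runtime.deserialize_cuda_engine(f.read())

    print("deserialized OK" if engine is not None else "deserialize failed, see log above")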

additional notes

Everything works fine for tp_size=1 and tp_size=2. Note that I build the model engine on a single A10 GPU node and deploy the engine on other A10 GPU nodes.
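Since the engine is built on one A10 node and deployed on others, one thing worth ruling out (a guess, not a confirmed cause) is a build/runtime mismatch: a serialized TensorRT engine is generally only guaranteed to load with the same TensorRT version and a matching GPU architecture as the build environment. A small sketch to print the relevant versions for comparison between the build and deploy containers:

    import tensorrt as trt
    import tensorrt_llm
    import torch

    # Run this in the build container and in the deploy container;
    # the outputs should match line for line.
    print("TensorRT      :", trt.__version__)
    print("TensorRT-LLM  :", tensorrt_llm.__version__)
    print("GPU           :", torch.cuda.get_device_name(0))
    print("Compute cap.  :", torch.cuda.get_device_capability(0))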

PeterWang1986 (Feb 02 '24)