Failed to deserialize cuda engine when using "tp_size=4"
System Info
- CPU architecture (x86_64)
- GPU name (NVIDIA A10)
- TensorRT-LLM commit (built from tensorrtllm_backend at commit 3608b0)
Who can help?
Hi all, we use "triton + tensorrtllm_backend + TensorRT-LLM" to deploy a mistral-7b model. We build the model with "tp_size=4" and deploy the engine on A10 GPUs, but loading always fails with: "UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)"
Here is my build config:

"build_config": {
  "max_input_len": 16384,
  "max_output_len": 1024,
  "max_batch_size": 8,
  "max_beam_width": 1,
  "max_num_tokens": 8192,
  "max_prompt_embedding_table_size": 0,
  "gather_context_logits": false,
  "gather_generation_logits": false,
  "strongly_typed": false,
  "builder_opt": null,
  "profiling_verbosity": "layer_names_only",
  "plugin_config": {
    "bert_attention_plugin": "float16",
    "gpt_attention_plugin": "float16",
    "gemm_plugin": "float16",
    "smooth_quant_gemm_plugin": null,
    "identity_plugin": null,
    "layernorm_quantization_plugin": null,
    "rmsnorm_quantization_plugin": null,
    "nccl_plugin": "float16",
    "lookup_plugin": null,
    "lora_plugin": null,
    "weight_only_groupwise_quant_matmul_plugin": null,
    "weight_only_quant_matmul_plugin": null,
    "quantize_per_token_plugin": false,
    "quantize_tensor_plugin": false,
    "context_fmha": true,
    "context_fmha_fp32_acc": false,
    "paged_kv_cache": true,
    "remove_input_padding": true,
    "use_custom_all_reduce": false,
    "multi_block_mode": false,
    "enable_xqa": true,
    "attention_qk_half_accumulation": false,
    "tokens_per_block": 128,
    "use_paged_context_fmha": false,
    "use_context_fmha_for_generation": false
  }
}

It also fails on release 0.7.1.
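As a quick sanity check (not part of the original report; the engine directory path is a placeholder, and exact field names may differ slightly between versions), the parallelism settings actually recorded in the built engine can be inspected from the config.json that trtllm-build writes next to the engine files:

```bash
# Confirm the tp_size / world_size baked into the engine match the
# deployment (--world_size 4 for launch_triton_server.py).
# /path/to/engine_output_dir is a placeholder for the trtllm-build --output_dir.
grep -E '"(tp_size|pp_size|world_size)"' /path/to/engine_output_dir/config.json
```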
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- start container (triton_trt_llm:main-3608b0), which is built from https://github.com/triton-inference-server/tensorrtllm_backend/tree/main (Option 2)
- convert checkpoint:
  python /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir xxxx \
    --output_dir xxx \
    --dtype float16 \
    --tp_size 4
- build engine:
  trtllm-build \
    --checkpoint_dir xxxx \
    --output_dir xxxx \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --remove_input_padding enable \
    --context_fmha enable \
    --paged_kv_cache enable \
    --use_custom_all_reduce disable \
    --max_input_len=16384 \
    --max_output_len=1024 \
    --max_num_tokens=8192 \
    --max_batch_size=8
- start triton server:
  python3 /app/scripts/launch_triton_server.py --world_size 4 --model_repo=xxxx
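For completeness, a sketch of a pre-launch check (paths are placeholders; engine file naming varies by version) to confirm that the engine directory referenced by the Triton model actually contains one serialized engine per rank:

```bash
# A tp_size=4 build should produce one engine per rank
# (e.g. *rank0.engine ... *rank3.engine, naming varies by version).
# Missing rank files, or a model repo pointing at a directory from a
# different build, leads to a deserialization failure at load time.
ls -lh /path/to/engine_output_dir/*.engine

# The gpt_model_path in the tensorrt_llm model config must point at the
# same directory that trtllm-build wrote the engines to.
grep -A1 gpt_model_path /path/to/model_repo/tensorrt_llm/config.pbtxt
```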
Expected behavior
The model loads successfully.
actual behavior
E0202 05:43:08.193459 60 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)
[2024-02-02 13:43:08] 1 0x7f7e8c2617da tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
[2024-02-02 13:43:08] 2 0x7f7e8c28522e /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x79e22e) [0x7f7e8c28522e]
[2024-02-02 13:43:08] 3 0x7f7e8e150ea1 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator
additional notes
Everything works fine for tp_size=1 and tp_size=2. Note that I build the model engine on a single A10 GPU and deploy the engine on other A10 GPU nodes.
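Since the engine is built on one node and deployed on others, a minimal comparison of the two environments may help rule out a TensorRT mismatch, which can cause "Failed to deserialize cuda engine". This is only a sketch; run it on both the build machine and a serving node, inside the containers actually used:

```bash
# GPU model and driver on the build node vs. the deployment nodes.
nvidia-smi --query-gpu=name,driver_version --format=csv

# TensorRT / TensorRT-LLM versions in the container used for trtllm-build
# vs. the triton_trt_llm:main-3608b0 serving container; an engine built
# with one TensorRT version generally cannot be deserialized by another.
python3 -c "import tensorrt; print('TensorRT', tensorrt.__version__)"
python3 -c "import tensorrt_llm; print('TensorRT-LLM', tensorrt_llm.__version__)"
```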