[TensorRT-LLM][ERROR] Assertion failed: mNextBlocks.empty() (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:160)
First of all, I would like to commend you on the recent KV cache reuse functionality; it is an incredible piece of work. We have been using it with a Llama 70B model and have seen significant improvements: the time to first token has dropped from 1s to 100ms, which is an astounding result.
However, we have encountered an issue in our usage: once the number of concurrent requests reaches a certain threshold, the server crashes with the assertion shown in the log below. We have so far been unable to identify the root cause, and it is proving to be a significant roadblock in our operations.
We would greatly appreciate any assistance or guidance you could provide to help us resolve this issue. If you need any additional information or data to assist in troubleshooting, please let us know.
Thank you for your time and consideration.
Log: [TensorRT-LLM][ERROR] Assertion failed: mNextBlocks.empty() (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:160) 1 0x7f396b49d68d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1668d) [0x7f396b49d68d] 2 0x7f396b4a4ec1 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1dec1) [0x7f396b4a4ec1] 3 0x7f396b5218b5 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x9a8b5) [0x7f396b5218b5] 4 0x7f396b521e9b /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x9ae9b) [0x7f396b521e9b] 5 0x7f396b523494 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x9c494) [0x7f396b523494] 6 0x7f396b523b93 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x9cb93) [0x7f396b523b93] 7 0x7f396b525141 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x9e141) [0x7f396b525141] 8 0x7f396b52a537 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xa3537) [0x7f396b52a537] 9 0x7f396b530886 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xa9886) [0x7f396b530886] 10 0x7f396b4ff2ea /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x782ea) [0x7f396b4ff2ea] 11 0x7f396b50223a /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7b23a) [0x7f396b50223a] 12 0x7f396b504765 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7d765) [0x7f396b504765] 13 0x7f396b4edd24 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x66d24) [0x7f396b4edd24] 14 0x7f396b4f3dff /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6cdff) [0x7f396b4f3dff] 15 0x7f3b09064253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f3b09064253] 16 0x7f3b08df4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f3b08df4ac3] 17 0x7f3b08e85bf4 clone + 68
Build Script: python build.py --rotary_scaling linear 2 --model_dir /datas/models/llama-70b --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --max_input_len 8192 --max_output_len 4096 --use_inflight_batching --paged_kv_cache --vocab_size 37632 --n_layer 80 --n_head 64 --n_kv_head 8 --n_positions 4096 --n_embd 8192 --world_size 8 --tp_size 8 --use_paged_context_fmha --max_batch_size 20
Device: 8 × A100 80GB
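To illustrate the load pattern, here is a rough sketch of the kind of concurrent client traffic that triggers the crash. It is only a minimal example, not our actual client: the endpoint URL, model name, JSON field names ("text_input", "max_tokens", "stream"), prompt, and concurrency level are placeholders based on the default tensorrtllm_backend ensemble served over Triton's HTTP generate endpoint, so adjust them to your deployment.

# Reproduction sketch (hypothetical client): send many concurrent generate requests
# to Triton. Field names follow the default tensorrtllm_backend ensemble model and are
# assumptions here; adapt them to your own config. The assertion fires once the number
# of in-flight requests passes a certain threshold.
import concurrent.futures
import requests

TRITON_URL = "http://localhost:8000/v2/models/ensemble/generate"  # assumed endpoint
CONCURRENCY = 32      # in-flight requests; raise this until the crash reproduces
NUM_REQUESTS = 200    # total requests to send
# Long shared prefix so that requests exercise KV cache reuse.
PROMPT = "Summarize the following document:\n" + "some long shared context " * 200

def send_one(i: int) -> int:
    payload = {"text_input": PROMPT, "max_tokens": 256, "stream": False}
    resp = requests.post(TRITON_URL, json=payload, timeout=600)
    return resp.status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    status_codes = list(pool.map(send_one, range(NUM_REQUESTS)))

print("non-200 responses:", [c for c in status_codes if c != 200])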
I also hit this problem on branch v0.7.1, but the current main branch is OK.
the same issue on tensorrt-llm 0.7.1 [TensorRT-LLM][ERROR] Encountered error for requestId 487602187: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: mNextBlocks.empty() (/home/jenkins/agent/workspace/LLM/release-0.7/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:160) 1 0x7fd6cf49d68d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1668d) [0x7fd6cf49d68d] 2 0x7fd6cf4a4ebf /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1debf) [0x7fd6cf4a4ebf] 3 0x7fd6cf521945 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x9a945) [0x7fd6cf521945] 4 0x7fd6cf521f2b /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x9af2b) [0x7fd6cf521f2b] 5 0x7fd6cf523524 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x9c524) [0x7fd6cf523524] 6 0x7fd6cf523c23 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x9cc23) [0x7fd6cf523c23] 7 0x7fd6cf5251d1 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x9e1d1) [0x7fd6cf5251d1] 8 0x7fd6cf52a5c7 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xa35c7) [0x7fd6cf52a5c7] 9 0x7fd6cf5308f6 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xa98f6) [0x7fd6cf5308f6] 10 0x7fd6cf4ff34a /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7834a) [0x7fd6cf4ff34a] 11 0x7fd6cf50229a /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7b29a) [0x7fd6cf50229a] 12 0x7fd6cf5047c5 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7d7c5) [0x7fd6cf5047c5] 13 0x7fd6cf4edc44 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x66c44) [0x7fd6cf4edc44] 14 0x7fd6cf4f3d1f /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6cd1f) [0x7fd6cf4f3d1f] 15 0x7fd739c64253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fd739c64253] 16 0x7fd7399f4ac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fd7399f4ac3] 17 0x7fd739a85bf4 clone + 68
Here is the build script:
python $BASE_DIR/build.py --hf_model_dir $MODEL_DIR \
--dtype float16 \
--max_batch_size 16 \
--max_output_len 1024 \
--remove_input_padding \
--use_gpt_attention_plugin float16 \
--use_gemm_plugin float16 \
--use_weight_only \
--weight_only_precision int8 \
--output_dir $OUTPUT_DIR \
--use_paged_context_fmha \
--enable_context_fmha \
--use_inflight_batching
And here is the tensorrt_llm config:
name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 16
input {
name: "input_ids"
data_type: TYPE_INT32
dims: [-1]
allow_ragged_batch: true
}
input {
name: "input_lengths"
data_type: TYPE_INT32
dims: [1]
reshape {
}
}
input {
name: "request_output_len"
data_type: TYPE_UINT32
dims: [1]
}
input {
name: "end_id"
data_type: TYPE_UINT32
dims: [1]
reshape {
}
optional: true
}
input {
name: "pad_id"
data_type: TYPE_UINT32
dims: [1]
reshape {
}
optional: true
}
input {
name: "stop_words_list"
data_type: TYPE_INT32
dims: [2, -1]
allow_ragged_batch: true
optional: true
}
input {
name: "bad_words_list"
data_type: TYPE_INT32
dims: [2, -1]
allow_ragged_batch: true
optional: true
}
input {
name: "embedding_bias"
data_type: TYPE_FP32
dims: [-1]
allow_ragged_batch: true
optional: true
}
input {
name: "beam_width"
data_type: TYPE_UINT32
dims: [1]
reshape {
}
optional: true
}
input {
name: "temperature"
data_type: TYPE_FP32
dims: [1]
reshape {
}
optional: true
}
input {
name: "runtime_top_k"
data_type: TYPE_UINT32
dims: [1]
reshape {
}
optional: true
}
input {
name: "runtime_top_p"
data_type: TYPE_FP32
dims: [1]
reshape {
}
optional: true
}
input {
name: "len_penalty"
data_type: TYPE_FP32
dims: [1]
reshape {
}
optional: true
}
input {
name: "repetition_penalty"
data_type: TYPE_FP32
dims: [1]
reshape {
}
optional: true
}
input {
name: "min_length"
data_type: TYPE_UINT32
dims: [1]
reshape {
}
optional: true
}
input {
name: "presence_penalty"
data_type: TYPE_FP32
dims: [1]
reshape {
}
optional: true
}
input {
name: "random_seed"
data_type: TYPE_UINT64
dims: [1]
reshape {
}
optional: true
}
input {
name: "return_log_probs"
data_type: TYPE_BOOL
dims: [1]
reshape {
}
optional: true
}
input {
name: "stop"
data_type: TYPE_BOOL
dims: [1]
optional: true
}
input {
name: "streaming"
data_type: TYPE_BOOL
dims: [1]
optional: true
}
input {
name: "prompt_embedding_table"
data_type: TYPE_FP16
dims: [-1, -1]
allow_ragged_batch: true
optional: true
}
input {
name: "prompt_vocab_size"
data_type: TYPE_UINT32
dims: [1]
reshape {
}
optional: true
}
output {
name: "output_ids"
data_type: TYPE_INT32
dims: [-1, -1]
}
output {
name: "sequence_length"
data_type: TYPE_INT32
dims: [-1]
}
output {
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [-1]
}
output {
name: "output_log_probs"
data_type: TYPE_FP32
dims: [-1, -1]
}
instance_group {
kind: KIND_CPU
count: 1
}
parameters {
key: "FORCE_CPU_ONLY_INPUT_TENSORS"
value {
string_value: "no"
}
}
parameters {
key: "batch_scheduler_policy"
value {
string_value: "max_utilization"
}
}
parameters {
key: "enable_trt_overlap"
value {
string_value: "False"
}
}
parameters {
key: "gpt_model_path"
value {
string_value: "/workspace/[email protected]/engine_model/ss_engine_model/qwen-14b-merge-dora-0426-v5-a100-fmha-0.7.1"
}
}
parameters {
key: "gpt_model_type"
value {
string_value: "inflight_fused_batching"
}
}
parameters {
key: "kv_cache_free_gpu_mem_fraction"
value {
string_value: "0.8"
}
}
parameters {
key: "max_beam_width"
value {
string_value: "1"
}
}
parameters {
key: "enable_kv_cache_reuse"
value {
string_value: "True"
}
}
parameters {
key: "max_num_sequences"
value {
string_value: "8"
}
}
model_transaction_policy {
}
Update: upgrading to TensorRT-LLM 0.9.0 fixed this problem.
The issue still exists: https://github.com/NVIDIA/TensorRT-LLM/issues/2708