enc_dec: prompt_embedding_table not passed to encoder model
System Info
TensorRT-LLM commit: 2a115dae84f13daaa54727534daa837c534eceb4
TensorRT-LLM version: 0.11.0.dev2024061800
Who can help?
No response
Information
- [X] The official example scripts
- [X] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Build bart-large-cnn engines using official examples (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec) with two modifications:
- Add `--context_fmha disable` (because of https://github.com/NVIDIA/TensorRT-LLM/issues/1883)
- Add `--max_prompt_embedding_table_size 32`
When running run.py, provide `--prompt_table_path tmp/ptable_1024.npy`, where ptable_1024.npy was generated by:

```python
import numpy as np

# 1 request, 10 virtual tokens, hidden size 1024 (bart-large-cnn)
table = np.random.randn(1, 10, 1024).astype(np.float32)
np.save('tmp/ptable_1024.npy', table)
```
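As a side note, the table can be sanity-checked against the build configuration before running (a minimal sketch; the constraints are assumptions based on bart-large-cnn's hidden size of 1024 and the `--max_prompt_embedding_table_size 32` flag above):

```python
import numpy as np

# Assumed constraints: the last dimension must match the model hidden size
# (1024 for bart-large-cnn), and the number of virtual tokens must not
# exceed --max_prompt_embedding_table_size (32 in the build above).
table = np.load('tmp/ptable_1024.npy')
assert table.dtype == np.float32
assert table.shape[-1] == 1024
assert table.shape[-2] <= 32
```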
Expected behavior
run.py works without errors, and prompt_embedding_table is passed to the encoder engine
(EncoderModel does have the corresponding input: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/models/enc_dec/model.py#L628)
Actual behavior
```
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061800
[07/03/2024-09:52:17] [TRT-LLM] [W] This path is an encoder-decoder model. Using different handling.
[07/03/2024-09:52:19] [TRT-LLM] [I] Load engine takes: 1.5770442485809326 sec
[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: Input tensor 'prompt_embedding_table' not found; expected shape: (-1, 1024) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:198)
1 0x7f47aa13a79e tensorrt_llm::runtime::TllmRuntime::setInputTensors(int, std::unordered_map<std::string, std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::shared_ptr<tensorrt_llm::runtime::ITensor> > > > const&) + 558
2 0x7f47aa382a68 tensorrt_llm::batch_manager::TrtEncoderModel::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) + 120
3 0x7f47aa3863d2 tensorrt_llm::batch_manager::TrtEncoderModel::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1554
4 0x7f47aa3b7fc1 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 113
5 0x7f47aa3bb6fd tensorrt_llm::executor::Executor::Impl::executionLoop() + 301
6 0x7f48eaeb0253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f48eaeb0253]
7 0x7f4a69f2bac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f4a69f2bac3]
8 0x7f4a69fbca04 clone + 68
Traceback (most recent call last):
File "/app/tensorrt_llm/examples/run.py", line 505, in <module>
main(args)
File "/app/tensorrt_llm/examples/run.py", line 345, in main
outputs = runner.generate(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 466, in generate
return self._initialize_and_fill_output(request_ids, end_id,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 520, in _initialize_and_fill_output
return self._fill_output(responses, output_ids, end_id, return_dict,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 556, in _fill_output
raise RuntimeError(response.error_msg)
RuntimeError: Encountered an error in forwardAsync function: Input tensor 'prompt_embedding_table' not found; expected shape: (-1, 1024) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:198)
1 0x7f47aa13a79e tensorrt_llm::runtime::TllmRuntime::setInputTensors(int, std::unordered_map<std::string, std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::shared_ptr<tensorrt_llm::runtime::ITensor> > > > const&) + 558
2 0x7f47aa382a68 tensorrt_llm::batch_manager::TrtEncoderModel::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) + 120
3 0x7f47aa3863d2 tensorrt_llm::batch_manager::TrtEncoderModel::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1554
4 0x7f47aa3b7fc1 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 113
5 0x7f47aa3bb6fd tensorrt_llm::executor::Executor::Impl::executionLoop() + 301
6 0x7f48eaeb0253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f48eaeb0253]
7 0x7f4a69f2bac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f4a69f2bac3]
8 0x7f4a69fbca04 clone + 68
```
With the help of pdb, I confirmed that the request passed to the executor does contain a valid prompt_tuning_config, with an embedding_table of shape (10, 1024) and dtype float32.
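For completeness, here is a minimal sketch that lists the I/O tensors of the built encoder engine (the engine path is an assumption about the local build layout); `prompt_embedding_table` shows up among the inputs, which matches the error above: the runtime expects the tensor but never feeds it.

```python
import tensorrt as trt

# Enumerate the I/O tensors of the serialized encoder engine to confirm that
# 'prompt_embedding_table' is declared as an input (path below is an assumption).
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open('tmp/trt_engines/bart-large-cnn/encoder/rank0.engine', 'rb') as f:
    engine = runtime.deserialize_cuda_engine(f.read())

for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    print(name, engine.get_tensor_mode(name), engine.get_tensor_shape(name))
```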
Additional notes
I understand that 0.11.0.dev is not a stable version of TensorRT-LLM, but hopefully this will be fixed in a stable release (or sooner).
The end goal is to use enc-dec + prompt_embedding_table to run the Whisper model with the TensorRT-LLM C++ runtime, but the issue is easier to illustrate using the official examples.
@thefacetakt Would you mind sharing more details? Are you going to do prompt-tuning for Whisper?
@yuekaizhang
Well, the plan is:
- modify `WhisperEncoder` to have the same signature as the regular `EncoderModel`
- use the `prompt_embedding_table` input to pass the actual fbank features to `WhisperEncoder` (see the sketch after this list)
- use tritonserver with tensorrtllm_backend for inference.
Seems like it should work?
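For illustration, a minimal sketch of the second step (all dimensions and paths are assumptions, with random stand-ins for the real fbank features):

```python
import numpy as np

# Pack per-request encoder features into the .npy layout that run.py accepts
# for --prompt_table_path. Dimensions below are illustrative assumptions.
num_requests = 1
num_frames = 3000   # e.g. 30 s of audio at a 10 ms hop
num_mel_bins = 128  # fbank feature dimension

features = np.random.randn(num_requests, num_frames, num_mel_bins).astype(np.float32)
np.save('tmp/whisper_features.npy', features)
# then e.g.: python3 run.py ... --prompt_table_path tmp/whisper_features.npy
```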
@thefacetakt We're currently implementing Whisper support for the Triton TensorRT-LLM backend. You could wait for the release, or try the Python backend first: https://github.com/k2-fsa/sherpa/tree/master/triton/whisper.
@thefacetakt if you have no further questions, we will close this issue in one week.