`tensorrt_llm.bindings.Request` class is not usable for non-text inputs
System Info
TRT-LLM version: 0.12.0.dev2024070900
Who can help?
@ncomly-nvidia
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
import tensorrt_llm.bindings.executor as trtllm
# Create the executor.
executor = trtllm.Executor(engine_dir+"/decoder/", trtllm.ModelType.DECODER_ONLY, trtllm.ExecutorConfig(1))
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024070900 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 32
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 32
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 114
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 114
[TensorRT-LLM][INFO] TRTGptModel computeContextLogits: 0
[TensorRT-LLM][INFO] TRTGptModel computeGenerationLogits: 0
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 3648
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 113 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 236 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1319.39 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 235 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 3.26 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 26.89 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 8.00 GiB, available: 5.36 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 9873
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 2
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 4.82 GiB for max tokens in paged KV cache (631872).
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 8.00 GiB, available: 0.53 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 984
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 0
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.48 GiB for max tokens in paged KV cache (62976).
request = trtllm.Request(
input_token_ids=[50258, 50259, 50359, 50363],
max_new_tokens=10,
encoder_input_token_ids=encoder_output[0],  # Whisper encoder output: torch.Size([750, 1024]) float tensor
)
The request fails because encoder_input_token_ids expects a list of integer token IDs. When the decoder expects non-integer encoder input (such as encoder output embeddings), the Request class is unusable, and with it the whole executor API.
Here I'm using Whisper as an example: the decoder engine is built with in-flight batching support and loads successfully, but there is no way to feed it the encoder output. I suppose the same problem applies to any modality other than text.
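For reference, the token-ID-only path works with the same executor; the failure is specific to passing the encoder's float output where integer token IDs are expected. Below is a minimal sketch of both cases, assuming the executor and encoder_output objects from the reproduction above (enqueue_request and await_responses are executor-binding methods; the exact exception type and message may differ):
# Token-ID-only request: the only input form the Request class accepts today.
text_request = trtllm.Request(
    input_token_ids=[50258, 50259, 50359, 50363],
    max_new_tokens=10,
)
request_id = executor.enqueue_request(text_request)   # works
responses = executor.await_responses(request_id)      # returns generated tokens
# Passing the encoder's float output is rejected at construction time,
# because encoder_input_token_ids is typed as a list of integers.
try:
    trtllm.Request(
        input_token_ids=[50258, 50259, 50359, 50363],
        max_new_tokens=10,
        encoder_input_token_ids=encoder_output[0],     # float tensor, not token IDs
    )
except TypeError as err:
    print(err)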
Expected behavior
The Request class should accept multimodal inputs (such as encoder output embeddings), not just token IDs.
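As an illustration only: something along these lines would cover the Whisper case. The encoder_input_features parameter name is hypothetical, it does not exist in the 0.12 bindings, and is shown purely to sketch the kind of interface this issue is asking for.
# Hypothetical interface sketch -- encoder_input_features is NOT a real
# parameter in 0.12.0.dev2024070900; it only illustrates the request.
request = trtllm.Request(
    input_token_ids=[50258, 50259, 50359, 50363],     # decoder prompt tokens
    max_new_tokens=10,
    encoder_input_features=encoder_output[0],         # float tensor of shape [750, 1024]
)
executor.enqueue_request(request)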
Actual behavior
It only accepts token IDs, i.e., text (or otherwise tokenizable) inputs.
Additional notes
There's also another point about the usage of encoder-decoder models: the executor needs a single directory containing config.json, but all of the examples for building encoder-decoder models build the encoder and decoder engines separately. You therefore have to create one executor for the encoder and another for the decoder, which is not ideal for workload management.
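For illustration, this is the two-executor setup the current layout forces. It's a sketch only: it assumes the encoder and decoder engines live in sibling directories under engine_dir and that ModelType.ENCODER_ONLY is exposed by the bindings.
# Each engine directory carries its own config.json, so the encoder and the
# decoder each need their own Executor instance.
encoder_executor = trtllm.Executor(
    engine_dir + "/encoder/", trtllm.ModelType.ENCODER_ONLY, trtllm.ExecutorConfig(1))
decoder_executor = trtllm.Executor(
    engine_dir + "/decoder/", trtllm.ModelType.DECODER_ONLY, trtllm.ExecutorConfig(1))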
Is there any progress? I have the same problem.
not yet
Solved in #2269