`tensorrt_llm.bindings.Request` class is not usable for non-text inputs
System Info
TRT-LLM version: 0.12.0.dev2024070900
Who can help?
@ncomly-nvidia
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
import tensorrt_llm.bindings.executor as trtllm
# Create the executor.
executor = trtllm.Executor(engine_dir+"/decoder/", trtllm.ModelType.DECODER_ONLY, trtllm.ExecutorConfig(1))
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024070900 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 32
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 32
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 114
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 114
[TensorRT-LLM][INFO] TRTGptModel computeContextLogits: 0
[TensorRT-LLM][INFO] TRTGptModel computeGenerationLogits: 0
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 3648
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 113 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 236 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1319.39 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 235 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 3.26 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 26.89 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 8.00 GiB, available: 5.36 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 9873
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 2
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 4.82 GiB for max tokens in paged KV cache (631872).
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 8.00 GiB, available: 0.53 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 984
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 0
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.48 GiB for max tokens in paged KV cache (62976).
request = trtllm.Request(
input_token_ids=[50258, 50259, 50359, 50363],
max_new_tokens=10,
encoder_input_token_ids=encoder_output[0],  # Whisper encoder output: torch.Size([750, 1024]) float tensor
)
The request fails because encoder_input_token_ids expects a list of integer token IDs. When the decoder expects non-integer encoder input (such as encoder output embeddings), the Request class is unusable, and with it the whole executor API.
Here I'm using Whisper as an example: the decoder engine is built with in-flight batching support and loads successfully, but there is no way to feed it the encoder output. I suppose the same problem applies to any modality other than text.
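For reference, the token-ID-only path works with the same executor; the failure is specific to passing the encoder's float output where integer token IDs are expected. Below is a minimal sketch of both cases, assuming the executor and encoder_output objects from the reproduction above (enqueue_request and await_responses are executor-binding methods; the exact exception type and message may differ):
# Token-ID-only request: the only input form the Request class accepts today.
text_request = trtllm.Request(
    input_token_ids=[50258, 50259, 50359, 50363],
    max_new_tokens=10,
)
request_id = executor.enqueue_request(text_request)   # works
responses = executor.await_responses(request_id)      # returns generated tokens
# Passing the encoder's float output is rejected at construction time,
# because encoder_input_token_ids is typed as a list of integers.
try:
    trtllm.Request(
        input_token_ids=[50258, 50259, 50359, 50363],
        max_new_tokens=10,
        encoder_input_token_ids=encoder_output[0],     # float tensor, not token IDs
    )
except TypeError as err:
    print(err)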
Expected behavior
The Request class should accept multimodal inputs (such as encoder output embeddings), not just token IDs.
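As an illustration only: something along these lines would cover the Whisper case. The encoder_input_features parameter name is hypothetical, it does not exist in the 0.12 bindings, and is shown purely to sketch the kind of interface this issue is asking for.
# Hypothetical interface sketch -- encoder_input_features is NOT a real
# parameter in 0.12.0.dev2024070900; it only illustrates the request.
request = trtllm.Request(
    input_token_ids=[50258, 50259, 50359, 50363],     # decoder prompt tokens
    max_new_tokens=10,
    encoder_input_features=encoder_output[0],         # float tensor of shape [750, 1024]
)
executor.enqueue_request(request)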
Actual behavior
It only accepts token IDs, i.e., text (or otherwise tokenizable) inputs.
Additional notes
There's also another point about the usage of encoder-decoder models: the executor needs a single directory containing config.json, but all of the examples for building encoder-decoder models build the encoder and decoder engines separately. You therefore have to create one executor for the encoder and another for the decoder, which is not ideal for workload management.
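For illustration, this is the two-executor setup the current layout forces. It's a sketch only: it assumes the encoder and decoder engines live in sibling directories under engine_dir and that ModelType.ENCODER_ONLY is exposed by the bindings.
# Each engine directory carries its own config.json, so the encoder and the
# decoder each need their own Executor instance.
encoder_executor = trtllm.Executor(
    engine_dir + "/encoder/", trtllm.ModelType.ENCODER_ONLY, trtllm.ExecutorConfig(1))
decoder_executor = trtllm.Executor(
    engine_dir + "/decoder/", trtllm.ModelType.DECODER_ONLY, trtllm.ExecutorConfig(1))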
Is there any progress? I have the same problem.
not yet
Solved in #2269