
[FeatureRequest] Gather sparse logprobs

Open Marks101 opened this issue 1 year ago • 6 comments

Hello team,

We typically use gather_all_token_logits to collect the logit tensors for post-processing. Especially for large vocabulary sizes (e.g., 128,000), this can require a lot of GPU memory. For example, when running inference with input and output lengths of 1024 and a batch size of 32, the collected logits tensor requires roughly 32 GB of memory in fp32.
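For reference, the back-of-the-envelope calculation behind that number:

```python
# Memory for the full logits tensor gathered over all positions (fp32):
batch_size = 32
seq_len = 1024 + 1024          # input + output tokens
vocab_size = 128_000
bytes_per_elem = 4             # fp32

total = batch_size * seq_len * vocab_size * bytes_per_elem
print(f"{total / 2**30:.1f} GiB")  # -> 31.2 GiB, i.e. roughly 32 GB
```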

In vLLM it is possible to collect only the top-k logprobs (see here). This is much more memory-efficient and would be sufficient for our purposes. Is there currently a way to do this in TensorRT-LLM as well? If not, we would really appreciate this feature in both ModelRunner and ModelRunnerCpp.
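To illustrate, this is the kind of reduction we have in mind. The function name is hypothetical; the point is that the runtime would only need to return the top-k values and indices instead of the full tensor:

```python
import torch

def gather_topk_logprobs(logits: torch.Tensor, k: int = 5):
    """Reduce full logits of shape [batch, seq, vocab] to top-k logprobs.

    Returns values and token indices of shape [batch, seq, k]. For k=5
    and vocab=128,000 this is ~25,000x smaller than the full tensor.
    """
    logprobs = torch.log_softmax(logits.float(), dim=-1)
    values, indices = torch.topk(logprobs, k, dim=-1)
    return values, indices
```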

This issue is related to https://github.com/NVIDIA/TensorRT-LLM/issues/1040, since we could solve it on our side if it were possible to collect arbitrary model outputs.

Thank you

Marks101 avatar Apr 08 '24 09:04 Marks101

@byshiue @ncomly-nvidia we figured that this feature could be implemented on our side with a LogitsProcessor. However, these are currently not supported by ModelRunnerCpp / tensorrt_llm.bindings.GptSession: https://github.com/NVIDIA/TensorRT-LLM/blob/71d8d4d3dc655671f32535d6d2b60cab87f36e87/tensorrt_llm/runtime/model_runner_cpp.py#L310-L312 Is there any plan to extend the support?
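As a sketch of what we would do with such a hook (the __call__ signature below follows the HuggingFace style; the actual TensorRT-LLM interface may differ):

```python
import torch

class TopKLogprobRecorder:
    """Pass-through logits processor that records top-k logprobs per step.

    Note: the call signature is assumed here in the HuggingFace style;
    adapt it to whatever interface TensorRT-LLM exposes.
    """

    def __init__(self, k: int = 5):
        self.k = k
        self.steps = []  # one (values, indices) pair per decoding step

    def __call__(self, step, input_ids, logits):
        logprobs = torch.log_softmax(logits.float(), dim=-1)
        self.steps.append(torch.topk(logprobs, self.k, dim=-1))
        return logits  # leave the sampling distribution untouched
```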

Marks101 avatar Apr 17 '24 08:04 Marks101

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] avatar May 18 '24 01:05 github-actions[bot]

@Marks101, the logits processor is supported on ModelRunnerCppExecutor: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/model_runner_cpp.py#L48 Could you try that please?

MartinMarciniszyn avatar May 22 '24 15:05 MartinMarciniszyn

Hi @MartinMarciniszyn thank you for the update. We will take a look at this 😃

Marks101 avatar May 23 '24 13:05 Marks101

> @Marks101, the logits processor is supported on ModelRunnerCppExecutor: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/model_runner_cpp.py#L48 Could you try that please?

Hello, it looks like the logits processor is disabled here: [screenshot]

shangshng avatar May 24 '24 06:05 shangshng

Thanks for the feedback @shangshng. It should be supported in the Python bindings of the Executor API. @dcampora, could you please add support to ModelRunnerCpp?

@Marks101, you can use the Executor API directly instead of going through ModelRunnerCpp for now.
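Roughly along these lines (a sketch only: the field and method names are assumptions based on the Executor Python bindings in recent releases and may differ in your version, so please check tensorrt_llm.bindings.executor; engine_dir, prompt_ids, and my_processor are placeholders):

```python
import tensorrt_llm.bindings.executor as trtllm

# Register the processor when building the executor config. The
# logits_post_processor_map field and the per-request processor name
# are assumptions based on recent bindings; verify against your release.
config = trtllm.ExecutorConfig(
    logits_post_processor_map={"topk_recorder": my_processor},
)
executor = trtllm.Executor(engine_dir, trtllm.ModelType.DECODER_ONLY, config)

request = trtllm.Request(
    input_token_ids=prompt_ids,
    max_new_tokens=64,
    logits_post_processor_name="topk_recorder",
)
request_id = executor.enqueue_request(request)
responses = executor.await_responses(request_id)
```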

MartinMarciniszyn avatar May 24 '24 08:05 MartinMarciniszyn

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] avatar Jun 24 '24 01:06 github-actions[bot]