
Can multiple threads share a GptSession? Can multiple threads call GptSession::generate() concurrently?

Open yifeihappy opened this issue 2 years ago • 4 comments

In a service setting, requests need to be processed concurrently. Can multiple threads share a GptSession? Can multiple threads call GptSession::generate() concurrently?

yifeihappy avatar Jan 21 '24 14:01 yifeihappy

For "requests processed concurrently", this is supported by gptManager with in-flight batching. What is your reason for using multiple threads with GptSession?

byshiue avatar Jan 22 '24 09:01 byshiue

> For "requests processed concurrently", this is supported by gptManager with in-flight batching. What is your reason for using multiple threads with GptSession?

Thank you for your response. In my application I need to process requests concurrently and also obtain contextLogits. The 0.7.0 release of gptManager does not appear to expose an interface for obtaining contextLogits.

yifeihappy avatar Jan 22 '24 17:01 yifeihappy

Hi @yifeihappy, the main branch now supports obtaining contextLogits through gptManager; the related docs are here. You can retrieve them from SendResponseCallback (such as here); response_tensors will contain contextLogits.

yweng0828 avatar Jan 30 '24 08:01 yweng0828

> Hi @yifeihappy, the main branch now supports obtaining contextLogits through gptManager; the related docs are here. You can retrieve them from SendResponseCallback (such as here); response_tensors will contain contextLogits.

Thank you for your answer! I look forward to seeing this feature in the next tagged release.

yifeihappy avatar Jan 31 '24 08:01 yifeihappy