
Can multiple threads share a GptSession? Can multiple threads call GptSession::generate() concurrently?

Open yifeihappy opened this issue 2 years ago • 4 comments

In a service setting, requests need to be processed concurrently. Can multiple threads share a GptSession? Can multiple threads call GptSession::generate() concurrently?

yifeihappy avatar Jan 21 '24 14:01 yifeihappy

For "requests processed concurrently", this is supported by gptManager with in-flight batching. What is your reason for using multiple threads with GptSession?

byshiue avatar Jan 22 '24 09:01 byshiue

> For "requests processed concurrently", this is supported by gptManager with in-flight batching. What is your reason for using multiple threads with GptSession?

Thank you for your response. In my application I need to process requests concurrently and also obtain contextLogits. The 0.7.0 release of gptManager does not appear to expose an interface for obtaining contextLogits.

yifeihappy avatar Jan 22 '24 17:01 yifeihappy

Hi @yifeihappy, the main branch now supports obtaining contextLogits through gptManager; the related docs are here. You can retrieve them from SendResponseCallback (such as here); response_tensors will contain contextLogits.

yweng0828 avatar Jan 30 '24 08:01 yweng0828

> Hi @yifeihappy, the main branch now supports obtaining contextLogits through gptManager; the related docs are here. You can retrieve them from SendResponseCallback (such as here); response_tensors will contain contextLogits.

Thank you for your answer! I look forward to seeing this feature in the next tagged release.

yifeihappy avatar Jan 31 '24 08:01 yifeihappy