YueWeng comments


Can multiple threads share a GptSession? Can multiple threads call GptSession::generate() concurrently?

Hi @yifeihappy, the main branch now supports obtaining `contextLogits` under gptManager; the related docs are [here](https://github.com/NVIDIA/TensorRT-LLM/issues/926). You can get them from `SendResponseCallback`, as shown [here](https://github.com/NVIDIA/TensorRT-LLM/blob/main/benchmarks/cpp/gptManagerBenchmark.cpp#L405): `response_tensors` will contain `contextLogits`.

identifier "cudaGraphExecKernelNodeSetParams" is undefined

@MrBurmark Thanks for your reply, it really helps! I used gcc 5.5.0 and CUDA 10.1 this time, and the above problems did not occur. But I still get a lot...

High memory consumption for `ModelRunnerCpp` combined with `gather_all_token_logits`

Hi @Marks101 @vnkc1, thank you for your feedback. This memory usage is expected. The logits take twice the amount of GPU memory because: - The new...

Performance Issue with return_context_logits Enabled in TensorRT-LLM

Hi @metterian, thanks for your feedback. Is the performance data you show based on Triton? If so, could you please try using only TRT-LLM (not based on Triton)...

[fix] Eagle-2 LLMAPI pybind argument fix.

LGTM, thanks for the fix!