Simeng Liu

5 comments by Simeng Liu

Closing this PR, as the NIM release will be based off release/1.1. Moving to https://github.com/NVIDIA/TensorRT-LLM/pull/9471.

Hi @khayamgondal, the end-to-end throughput statistics are calculated, not directly reported. For example, `Token Throughput (tokens/sec) = total_output_tokens / total_latency` and `Request Throughput (req/sec) = total_num_requests / total_latency`. For the...
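The two formulas above can be sketched in a few lines of Python; the counter values below are illustrative placeholders, not real benchmark output:

```python
def throughput_stats(total_output_tokens: int,
                     total_num_requests: int,
                     total_latency: float) -> dict:
    """Derive end-to-end throughput from aggregate counters,
    following the formulas quoted in the comment above."""
    return {
        "token_throughput_tok_per_sec": total_output_tokens / total_latency,
        "request_throughput_req_per_sec": total_num_requests / total_latency,
    }

# Example: 100 requests producing 25,600 output tokens over 40 s of wall-clock time.
stats = throughput_stats(25_600, 100, 40.0)
print(stats)  # 640.0 tok/s, 2.5 req/s
```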

@khayamgondal You can try adding the `host_cache_size` option in https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/bench/dataclasses/configuration.py#L215-L220.
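As a sketch of what that could look like, here is a hypothetical extra-options YAML fragment enabling host (CPU) KV-cache offloading; the exact field names and units should be verified against the configuration dataclasses linked above:

```yaml
# Hypothetical extra LLM API options file (e.g. extra_options.yaml).
# Field names are assumptions; check them against
# tensorrt_llm/bench/dataclasses/configuration.py before use.
kv_cache_config:
  # Host (CPU) memory made available for offloading KV-cache blocks, in bytes.
  host_cache_size: 10000000000  # ~10 GB
```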

@khayamgondal You can think of on-GPU kv_cache memory as serving two main purposes: 1. Per-iteration allocation: At the start of each iteration, enough GPU memory must be available to hold...
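To put rough numbers on that per-iteration budget, the per-token KV-cache footprint can be estimated from the model shape. A minimal sketch; the 32-layer, 32-KV-head, FP16 dimensions below are illustrative assumptions, not figures from this thread:

```python
def kv_cache_bytes_per_token(num_layers: int,
                             num_kv_heads: int,
                             head_dim: int,
                             dtype_bytes: int = 2) -> int:
    """Estimate on-GPU KV-cache bytes needed per token:
    one key and one value vector per layer (hence the factor of 2)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Illustrative model: 32 layers, 32 KV heads of dim 128, FP16 (2 bytes).
per_token = kv_cache_bytes_per_token(32, 32, 128)  # 524288 bytes = 0.5 MiB/token

# With an assumed 8 GiB of GPU memory reserved for the KV cache,
# this bounds how many tokens can be resident across all requests.
budget_bytes = 8 * 1024**3
max_resident_tokens = budget_bytes // per_token
print(per_token, max_resident_tokens)  # 524288 16384
```

The point of the estimate: each iteration can only schedule as many active tokens as fit in this budget, which is why increasing KV-cache memory (or offloading to host) raises achievable concurrency.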