Simeng Liu
Closing this PR as the NIM release will be based off release/1.1. Moving to https://github.com/NVIDIA/TensorRT-LLM/pull/9471.
Hi @khayamgondal, the end-to-end throughput statistics are calculated, not directly reported. For example, `Token Throughput (tokens/sec) = total_output_tokens / total_latency` and `Request Throughput (req/sec) = total_num_requests / total_latency`. For the...
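To make the derivation concrete, here is a minimal sketch of how those two metrics fall out of the aggregate totals. The function and variable names are illustrative, not the actual internals of `trtllm-bench`:

```python
# Hypothetical sketch: derive end-to-end throughput stats from aggregate
# benchmark totals, mirroring the formulas quoted above.

def throughput_stats(total_output_tokens: int,
                     total_num_requests: int,
                     total_latency_sec: float) -> dict:
    """Token and request throughput computed from run-level totals."""
    return {
        "token_throughput_tok_per_sec": total_output_tokens / total_latency_sec,
        "request_throughput_req_per_sec": total_num_requests / total_latency_sec,
    }

# Example: 200,000 output tokens across 100 requests in 50 seconds.
stats = throughput_stats(200_000, 100, 50.0)
print(stats["token_throughput_tok_per_sec"])   # 4000.0 tokens/sec
print(stats["request_throughput_req_per_sec"]) # 2.0 req/sec
```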
@khayamgondal You can try adding the `host_cache_size` option in https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/bench/dataclasses/configuration.py#L215-L220.
@khayamgondal You can think of on-GPU kv_cache memory as serving two main purposes: 1. Per-iteration allocation: At the start of each iteration, enough GPU memory must be available to hold...
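A back-of-the-envelope calculation may help make the per-iteration allocation point concrete. The formula below (two tensors, K and V, per layer, per KV head) is the standard KV-cache sizing estimate; the model shape is illustrative, not tied to any specific checkpoint:

```python
# Hypothetical sketch: estimate how many bytes of KV cache a single token
# occupies across all transformer layers (one K and one V tensor per layer).

def kv_cache_bytes_per_token(num_layers: int,
                             num_kv_heads: int,
                             head_dim: int,
                             dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Example: a 7B-class config (32 layers, 32 KV heads, head_dim 128, fp16).
per_token = kv_cache_bytes_per_token(32, 32, 128, dtype_bytes=2)
print(per_token)  # 524288 bytes, i.e. 512 KiB per token
```

Multiplying this per-token figure by the number of tokens admitted in an iteration gives a rough lower bound on the GPU memory that must be free before that iteration can run.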