
test(perf): Extend the Llama-Nemotron-Nano-8B perf-integration-tests (pyt)

Open venkywonka opened this issue 8 months ago • 0 comments

Expand PyT llama_v3.1_nemotron_nano_8b perf tests coverage

Description

This PR adds end-to-end performance results for the llama_v3.1_nemotron_nano_8b bfloat16 engine on a single H100 GPU.
Two broad load patterns were evaluated on the PyTorch (PyT) backend across several input/output sequence length (ISL/OSL) combinations:

  • Low concurrency: concurrency = 1, requests = 8
  • High concurrency: concurrency = 250, requests = 500

All tests use max_batch_size = 512.
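
For orientation, the test matrix described above can be enumerated as follows. This is only an illustrative sketch: `PerfCase` and `build_matrix` are hypothetical names, not part of the TensorRT-LLM test harness, and the ISL/OSL pairs are the ones listed in the Performance Summary table.

```python
from dataclasses import dataclass
from itertools import product

# ISL/OSL pairs as they appear in the Performance Summary table.
ISL_OSL = [(500, 2000), (1000, 1000), (5000, 500), (20000, 2000)]
# (concurrency, number of requests) for the low and high concurrency patterns.
LOAD_PATTERNS = [(1, 8), (250, 500)]
MAX_BATCH_SIZE = 512


@dataclass(frozen=True)
class PerfCase:
    """Hypothetical container for one benchmark configuration."""
    concurrency: int
    num_requests: int
    input_len: int
    output_len: int
    max_batch_size: int = MAX_BATCH_SIZE

    @property
    def nominal_tokens(self) -> int:
        # Nominal token volume: requests * (input + output length).
        return self.num_requests * (self.input_len + self.output_len)


def build_matrix():
    """Cross every load pattern with every ISL/OSL pair."""
    return [
        PerfCase(conc, reqs, isl, osl)
        for (conc, reqs), (isl, osl) in product(LOAD_PATTERNS, ISL_OSL)
    ]


if __name__ == "__main__":
    for case in build_matrix():
        print(case, case.nominal_tokens)
```

Crossing two load patterns with four ISL/OSL pairs gives the eight rows reported in the table.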

Performance Summary
| Concurrency | Input Len | Output Len | #Reqs | Req Throughput (req/s) | Output TPS (tok/s) | Avg Latency (ms) |
|---|---|---|---|---|---|---|
| 1 | 500 | 2000 | 8 | 0.0629 | 125.79 | 15,898.9 |
| 1 | 1000 | 1000 | 8 | 0.1660 | 166.00 | 6,023.7 |
| 1 | 5000 | 500 | 8 | 0.2961 | 148.06 | 3,376.8 |
| 1 | 20000 | 2000 | 8 | 0.0637 | 127.40 | 15,698.0 |
| 250 | 5000 | 500 | 500 | 2.7919 | 1,395.94 | 77,524.8 |
| 250 | 500 | 2000 | 500 | 3.2334 | 6,466.84 | 67,673.7 |
| 250 | 1000 | 1000 | 500 | 6.0589 | 6,058.94 | 40,414.9 |
| 250 | 20000 | 2000 | 500 | 0.2835 | 566.96 | 686,971.0 |
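
The reported columns can be cross-checked against each other. The sketch below assumes two relationships that the PR does not state explicitly but that the numbers appear to follow: output TPS is roughly request throughput times output length, and (by Little's law) request throughput times average latency approximates the number of requests in flight. The function names are illustrative only.

```python
# Rows copied from the Performance Summary table:
# (concurrency, input_len, output_len, num_requests,
#  req_per_s, output_tok_per_s, avg_latency_ms)
ROWS = [
    (1, 500, 2000, 8, 0.0629, 125.79, 15898.9),
    (1, 1000, 1000, 8, 0.1660, 166.00, 6023.7),
    (1, 5000, 500, 8, 0.2961, 148.06, 3376.8),
    (1, 20000, 2000, 8, 0.0637, 127.40, 15698.0),
    (250, 5000, 500, 500, 2.7919, 1395.94, 77524.8),
    (250, 500, 2000, 500, 3.2334, 6466.84, 67673.7),
    (250, 1000, 1000, 500, 6.0589, 6058.94, 40414.9),
    (250, 20000, 2000, 500, 0.2835, 566.96, 686971.0),
]


def estimated_output_tps(req_per_s, output_len):
    """Assumed relation: every request emits output_len tokens."""
    return req_per_s * output_len


def implied_in_flight(req_per_s, avg_latency_ms):
    """Little's law estimate of concurrently active requests."""
    return req_per_s * avg_latency_ms / 1000.0


def check_rows(rows, rel_tol=0.01):
    """Return (concurrency, isl, osl, tps_consistent, in_flight) per row."""
    report = []
    for conc, isl, osl, _reqs, rps, tps, latency_ms in rows:
        est = estimated_output_tps(rps, osl)
        consistent = abs(est - tps) <= rel_tol * tps
        report.append((conc, isl, osl, consistent,
                       implied_in_flight(rps, latency_ms)))
    return report


if __name__ == "__main__":
    for conc, isl, osl, consistent, in_flight in check_rows(ROWS):
        print(f"con={conc:>3} isl={isl:>5} osl={osl:>4} "
              f"tps_consistent={consistent} in_flight~{in_flight:.1f}")
```

Under these assumptions, the concurrency = 1 rows give an implied in-flight count close to 1, while the concurrency = 250 rows come out below 250. One possible explanation is that ramp-up and drain take up a noticeable share of a 500-request run, but the PR does not confirm this.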

venkywonka, May 16 '25 23:05