TensorRT-LLM
test(perf): Extend the Llama-Nemotron-Nano-8B perf-integration-tests (pyt)
Expand PyT llama_v3.1_nemotron_nano_8b perf tests coverage
Description
This PR adds end-to-end performance results for the llama_v3.1_nemotron_nano_8b bfloat16 engine on a single H100 GPU.
Two broad load patterns were evaluated on the PyT backend across several ISL/OSL (input/output sequence length) combinations:
- Low concurrency: concurrency = 1, requests = 8
- High concurrency: concurrency = 250, requests = 500
All tests use max_batch_size = 512.
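For orientation, the test matrix described above can be sketched as a cross product of the two load patterns and the ISL/OSL pairs listed in the results table. This is only an illustrative sketch: the dictionary layout and the names `LOAD_PATTERNS`, `ISL_OSL`, and `build_matrix` are hypothetical and are not the format used by the repository's perf test lists.

```python
from itertools import product

# Load patterns from the description: name -> (concurrency, number of requests).
LOAD_PATTERNS = {"low": (1, 8), "high": (250, 500)}

# (input length, output length) pairs as they appear in the results table.
ISL_OSL = [(500, 2000), (1000, 1000), (5000, 500), (20000, 2000)]

# Shared by all tests according to the description.
MAX_BATCH_SIZE = 512


def build_matrix():
    """Return one dict per benchmark case (hypothetical layout)."""
    cases = []
    for (pattern, (concurrency, num_requests)), (isl, osl) in product(
        LOAD_PATTERNS.items(), ISL_OSL
    ):
        cases.append(
            {
                "pattern": pattern,
                "concurrency": concurrency,
                "num_requests": num_requests,
                "input_len": isl,
                "output_len": osl,
                "max_batch_size": MAX_BATCH_SIZE,
            }
        )
    return cases


if __name__ == "__main__":
    for case in build_matrix():
        print(case)
```

Under these assumptions the matrix contains eight cases, one per row of the results table.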
Performance Summary
| Concurrency | Input Len | Output Len | #Reqs | Req Throughput (req/s) | Output TPS (tok/s) | Avg Latency (ms) |
|---|---|---|---|---|---|---|
| 1 | 500 | 2000 | 8 | 0.0629 | 125.79 | 15 898.9 |
| 1 | 1000 | 1000 | 8 | 0.1660 | 166.00 | 6 023.7 |
| 1 | 5000 | 500 | 8 | 0.2961 | 148.06 | 3 376.8 |
| 1 | 20000 | 2000 | 8 | 0.0637 | 127.40 | 15 698.0 |
| 250 | 5000 | 500 | 500 | 2.7919 | 1 395.94 | 77 524.8 |
| 250 | 500 | 2000 | 500 | 3.2334 | 6 466.84 | 67 673.7 |
| 250 | 1000 | 1000 | 500 | 6.0589 | 6 058.94 | 40 414.9 |
| 250 | 20000 | 2000 | 500 | 0.2835 | 566.96 | 686 971.0 |
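
The columns in the table above can be cross-checked against each other. The sketch below assumes that every request generates exactly `Output Len` tokens (the PR does not state this explicitly), in which case output TPS should equal request throughput times output length. It also applies Little's law (average in-flight requests = throughput × average latency) as a rough plausibility check. The helper names are hypothetical, and the thousands separators in the table values are dropped.

```python
# Rows copied from the table above:
# (concurrency, input_len, output_len, num_reqs, req_per_s, output_tps, avg_latency_ms)
ROWS = [
    (1, 500, 2000, 8, 0.0629, 125.79, 15898.9),
    (1, 1000, 1000, 8, 0.1660, 166.00, 6023.7),
    (1, 5000, 500, 8, 0.2961, 148.06, 3376.8),
    (1, 20000, 2000, 8, 0.0637, 127.40, 15698.0),
    (250, 5000, 500, 500, 2.7919, 1395.94, 77524.8),
    (250, 500, 2000, 500, 3.2334, 6466.84, 67673.7),
    (250, 1000, 1000, 500, 6.0589, 6058.94, 40414.9),
    (250, 20000, 2000, 500, 0.2835, 566.96, 686971.0),
]


def expected_output_tps(req_per_s, output_len):
    """Output tokens/s if every request emits exactly output_len tokens (assumption)."""
    return req_per_s * output_len


def in_flight_estimate(req_per_s, avg_latency_ms):
    """Little's law: average number of requests in flight."""
    return req_per_s * avg_latency_ms / 1000.0


def relative_error(value, reference):
    """Relative deviation of value from reference."""
    return abs(value - reference) / abs(reference)


if __name__ == "__main__":
    for con, isl, osl, reqs, rps, tps, lat_ms in ROWS:
        err = relative_error(expected_output_tps(rps, osl), tps)
        in_flight = in_flight_estimate(rps, lat_ms)
        print(f"con={con:>3} isl={isl:>5} osl={osl:>4} "
              f"tps_rel_err={err:.1e} in_flight={in_flight:.2f}")
```

With the reported figures, the throughput identity holds to within rounding for every row, and the in-flight estimate is close to 1 for the concurrency = 1 runs. For the concurrency = 250 runs the estimate comes out below 250, which would be consistent with ramp-up and tail effects when only 500 requests are issued, though the PR does not confirm that explanation.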