test(perf): Extend the Llama-Nemotron-Nano-8B perf-integration-tests (pyt)

Open venkywonka opened this issue 8 months ago • 0 comments

Expand PyT `llama_v3.1_nemotron_nano_8b` perf tests coverage

Description

This PR adds end-to-end performance results for the `llama_v3.1_nemotron_nano_8b` bfloat16 engine on a single H100 GPU.
Two broad load patterns were evaluated on the PyT (PyTorch) backend across several input/output sequence length (ISL/OSL) combinations:

Low concurrency: concurrency = 1, requests = 8
High concurrency: concurrency = 250, requests = 500

All tests use max_batch_size = 512.
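
As a rough illustration of the resulting test matrix, the sketch below crosses the two load patterns with the four ISL/OSL pairs reported in the Performance Summary. The names `LoadPattern`, `ISL_OSL_PAIRS`, and `build_matrix` are hypothetical and are not taken from the perf-test harness; only the numeric values come from this description.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class LoadPattern:
    """One load pattern from the description (hypothetical container)."""
    name: str
    concurrency: int
    num_requests: int


# Values taken from the PR description.
LOAD_PATTERNS = [
    LoadPattern("low_concurrency", concurrency=1, num_requests=8),
    LoadPattern("high_concurrency", concurrency=250, num_requests=500),
]

# (input length, output length) pairs that appear in the results table.
ISL_OSL_PAIRS = [(500, 2000), (1000, 1000), (5000, 500), (20000, 2000)]

MAX_BATCH_SIZE = 512  # shared by all tests


def build_matrix():
    """Return one dict per (load pattern, ISL/OSL) combination."""
    return [
        {
            "pattern": pattern.name,
            "concurrency": pattern.concurrency,
            "num_requests": pattern.num_requests,
            "input_len": isl,
            "output_len": osl,
            "max_batch_size": MAX_BATCH_SIZE,
        }
        for pattern, (isl, osl) in product(LOAD_PATTERNS, ISL_OSL_PAIRS)
    ]


if __name__ == "__main__":
    for case in build_matrix():
        print(case)
```

This yields eight cases, one per row of the results table.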

Performance Summary

| Concurrency | Input Len | Output Len | #Reqs | Req Throughput (req/s) | Output TPS (tok/s) | Avg Latency (ms) |
|---|---|---|---|---|---|---|
| 1 | 500 | 2000 | 8 | 0.0629 | 125.79 | 15,898.9 |
| 1 | 1000 | 1000 | 8 | 0.1660 | 166.00 | 6,023.7 |
| 1 | 5000 | 500 | 8 | 0.2961 | 148.06 | 3,376.8 |
| 1 | 20000 | 2000 | 8 | 0.0637 | 127.40 | 15,698.0 |
| 250 | 5000 | 500 | 500 | 2.7919 | 1,395.94 | 77,524.8 |
| 250 | 500 | 2000 | 500 | 3.2334 | 6,466.84 | 67,673.7 |
| 250 | 1000 | 1000 | 500 | 6.0589 | 6,058.94 | 40,414.9 |
| 250 | 20000 | 2000 | 500 | 0.2835 | 566.96 | 686,971.0 |
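
The columns in the summary are related by simple arithmetic, which can serve as a sanity check on the reported numbers. The sketch below assumes that output TPS is request throughput multiplied by the output length and that, at concurrency 1, average latency is roughly the inverse of request throughput. Both relations are inferred from the figures above rather than from the benchmark tool's documentation, and the helper names are hypothetical.

```python
# Rows copied from the Performance Summary:
# (concurrency, input_len, output_len, num_reqs, req_per_s, output_tps, avg_latency_ms)
ROWS = [
    (1, 500, 2000, 8, 0.0629, 125.79, 15898.9),
    (1, 1000, 1000, 8, 0.1660, 166.00, 6023.7),
    (1, 5000, 500, 8, 0.2961, 148.06, 3376.8),
    (1, 20000, 2000, 8, 0.0637, 127.40, 15698.0),
    (250, 5000, 500, 500, 2.7919, 1395.94, 77524.8),
    (250, 500, 2000, 500, 3.2334, 6466.84, 67673.7),
    (250, 1000, 1000, 500, 6.0589, 6058.94, 40414.9),
    (250, 20000, 2000, 500, 0.2835, 566.96, 686971.0),
]


def relative_error(expected, actual):
    """Relative difference between a derived and a reported value."""
    return abs(expected - actual) / abs(actual)


def tps_errors(rows):
    """Compare req/s * output_len against the reported output TPS."""
    return [relative_error(r[4] * r[2], r[5]) for r in rows]


def sequential_latency_errors(rows):
    """For concurrency 1, compare 1000 / (req/s) against avg latency in ms."""
    return [relative_error(1000.0 / r[4], r[6]) for r in rows if r[0] == 1]


if __name__ == "__main__":
    print("max TPS relative error:", max(tps_errors(ROWS)))
    print("max latency relative error:", max(sequential_latency_errors(ROWS)))
```

Under these assumptions every row agrees to within rounding of the reported digits. The latency relation is applied only to the concurrency-1 rows, since at concurrency 250 requests overlap.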

May 16 '25 23:05 venkywonka
