README.md: Add 3rd Party Inference Speed Dashboard
Hi TensorRT-LLM team,
As an NVIDIA Inception startup, I would like to add a community resource link to an inference speed dashboard.
The inference speed dashboard feature includes:
- Comprehensive benchmarks across several quantization techniques, including FP16, FP8, INT8 weight-only, and INT4 weight-only.
- Comprehensive benchmarks across model architectures such as Llama3-8b, Gemma2-27b, RecurrentGemma-9b, and Mamba2-2.7b.
- Comprehensive coverage of batch sizes, input lengths, and output lengths: batch sizes range from 1 to 32 (up to 128 for Mamba), and input and output lengths range from 32 to 4096.
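As a rough illustration, the sweep described above could be enumerated like this. This is only a hedged sketch: the constant names, the power-of-two batch steps, and the sampled sequence lengths are my assumptions, not the actual benchmark script (which is planned for a later release).

```python
from itertools import product

# Assumed dimensions of the sweep, taken from the bullet list above.
QUANT_MODES = ["FP16", "FP8", "INT8-weight-only", "INT4-weight-only"]
MODELS = ["Llama3-8b", "Gemma2-27b", "RecurrentGemma-9b", "Mamba2-2.7b"]
# Assumed sample points within the stated 32..4096 range.
SEQ_LENGTHS = [32, 128, 512, 2048, 4096]

def batch_sizes(model: str) -> list[int]:
    """Batch sizes 1..32 in powers of two, extended to 128 for Mamba."""
    upper = 128 if "Mamba" in model else 32
    sizes, b = [], 1
    while b <= upper:
        sizes.append(b)
        b *= 2
    return sizes

def benchmark_grid():
    """Yield every (quant, model, batch, input_len, output_len) combination."""
    for quant, model in product(QUANT_MODES, MODELS):
        for batch, in_len, out_len in product(
            batch_sizes(model), SEQ_LENGTHS, SEQ_LENGTHS
        ):
            yield (quant, model, batch, in_len, out_len)
```

Each tuple from `benchmark_grid()` would correspond to one row on the dashboard.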
Additionally, I plan to publish the source code (website) and benchmark script by the end of this year.
Hi @matichon-vultureprime, we're discussing the best way to manage community highlights -- thanks for the PR and your patience!
@matichon-vultureprime, thank you for this thoughtful contribution showcasing community inference benchmarking efforts! We appreciate the work you put into creating this 3rd party inference speed dashboard.
Since this PR was opened, the README.md structure has been significantly reorganized and the section where this content would have been placed has been restructured. The current README focuses on official tech blogs, news, and documentation links.
Because of that reorganization, we can't merge this change as-is, but we value community contributions like yours. If you'd like to share performance benchmarking resources with the community, you might consider:
- Contributing to the community discussions or examples sections
- Sharing this resource through community channels or forums
- Exploring other documentation areas where community tools might be appropriate
Thanks again for your engagement with the TensorRT-LLM community and for creating valuable resources for other users.