**Description** I am experiencing an issue where the TensorRT `.engine` file is recompiled every time the prompt length changes when using the ONNX Runtime backend with...
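The usual way to avoid per-shape rebuilds is to compile the engine once with a dynamic-shape optimization profile that covers the whole prompt-length range. A minimal sketch with the TensorRT Python API, assuming an ONNX export with a dynamic input named `input_ids` (the file name, tensor name, and shape bounds are placeholders):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Hypothetical model path; replace with the actual ONNX export.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()

# One profile spanning all expected prompt lengths, so the engine
# is built once instead of being recompiled for each new shape.
profile = builder.create_optimization_profile()
profile.set_shape("input_ids",   # dynamic input name (assumed)
                  min=(1, 1),    # shortest prompt
                  opt=(1, 64),   # typical prompt
                  max=(1, 512))  # longest prompt
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
```

Any prompt length inside the `min`/`max` bounds then runs against the same serialized engine.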
**Is your feature request related to a problem? Please describe.** The ONNX Runtime backend in Triton Inference Server lacks direct support for `minShapes`, `optShapes`, and `maxShapes` in the model configuration...
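For context, recent ONNX Runtime releases expose the equivalent of trtexec's `--minShapes`/`--optShapes`/`--maxShapes` as TensorRT execution provider options; the feature request is essentially to surface these through Triton's model configuration. A sketch of the standalone ONNX Runtime usage, with a placeholder model path and tensor name:

```python
import onnxruntime as ort

# TensorRT EP options corresponding to min/opt/max shape profiles.
# Option names are from recent ONNX Runtime releases; shapes are
# illustrative.
trt_options = {
    "trt_profile_min_shapes": "input_ids:1x1",
    "trt_profile_opt_shapes": "input_ids:1x64",
    "trt_profile_max_shapes": "input_ids:1x512",
    "trt_engine_cache_enable": True,  # reuse the built engine across runs
}

session = ort.InferenceSession(
    "model.onnx",
    providers=[("TensorrtExecutionProvider", trt_options),
               "CUDAExecutionProvider"],
)
```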
## Description

I recently attempted to use INT8 quantization with Stable Diffusion XL to improve inference performance, based on the claims in a recent [TensorRT blog post](https://developer.nvidia.com/blog/tensorrt-accelerates-stable-diffusion-nearly-2x-faster-with-8-bit-post-training-quantization/), which suggested...
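The blog's workflow relies on NVIDIA's post-training quantization tooling, which is not shown here; generically, the engine-side part of an INT8 build with the TensorRT Python API looks like the sketch below, assuming you supply a prebuilt `network`, `builder`, and an `IInt8EntropyCalibrator2` instance:

```python
import tensorrt as trt

def build_int8_engine(builder, network, calibrator):
    """Generic TensorRT INT8 build sketch. The calibrator supplies
    representative inputs for computing quantization scales."""
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.INT8)
    config.set_flag(trt.BuilderFlag.FP16)  # FP16 fallback for layers left unquantized
    config.int8_calibrator = calibrator
    return builder.build_serialized_network(network, config)
```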
### System Info

- GPU name: NVIDIA A100
- GPU memory size: 80 GB
- TensorRT-LLM branch: main
- TensorRT-LLM version: 0.11.0.dev2024052800
- OS: Ubuntu 22.04

### Who can help?

@kaiyux, @byshiue

###...