Mehdi Bahri
Hi, thanks for your reply. I think I might be a bit confused here, but could you explain why I don't need to use dynamic batching? The way I thought...
Thanks @dtrawins. So to confirm: with parallel model execution and NUM_STREAMS set, would I just use a batch size of 1 for each model instance?
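For reference, here is a rough sketch of what that setup might look like in a Triton `config.pbtxt`. This is only my guess at the shape of the config, not a confirmed working example; the `NUM_STREAMS` parameter name and the instance count are assumptions:

```
# Hypothetical config.pbtxt sketch: parallel instances, batch size 1 per request
backend: "openvino"
max_batch_size: 1          # assumption: no dynamic batching, one request per execution
instance_group [
  {
    count: 2               # assumption: two parallel model instances
    kind: KIND_CPU
  }
]
parameters: {
  key: "NUM_STREAMS"       # assumption: passed through to the OpenVINO backend
  value: { string_value: "2" }
}
```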
Hi,

Thank you for the reminder. I'll try to clean up and share the code soon.

Best,
Having the tensorrt_llm package installed would also make it much easier to deploy TensorRT-LLM models with the Python backend for Triton (needed for multimodal and encoder-decoder for now)
> Can you specify transformers version as well? They made a lot of changes recently.

Hi,

```python
In [6]: import transformers

In [7]: transformers.__version__
Out[7]: '4.41.2'
```

I also tried...
Sweet! I assumed that PR was already part of the stable release. It's looking much better with `ort-nightly`, thank you!

```
Using PyTorch 2.3.1, Transformers 4.41.2, and ORT 1.19.0
Optimizing...
```