Olga Andreeva
We've identified a potential cause: CPU overhead for small batch sizes causes the FP16 model to be slower than FP32. More on this issue can be found here: https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/ We will...
This feature will be supported starting from the 23.08 release.
Hi @Kellel, thanks for the suggestion! I'll file a feature request for our team.
Thank you @jsoto-gladia for reporting this issue; I filed a ticket for our team to investigate.
I believe this issue asks us to make sure that during a graceful shutdown of Triton Inference Server we properly handle in-flight requests, i.e., instead of returning an error to the...
Hi @MatthieuToulemont, have you tried specifying [`parameters`](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md#parameters) for the TRT model in the `config.pbtxt`? For example: https://github.com/triton-inference-server/server/blob/d6bd668cf2208ef70d951182f0fda7d5a7e21c82/docs/examples/model_repository/simple_dyna_sequence/config.pbtxt#L90-L95
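For reference, a minimal sketch of what such a `parameters` entry can look like in `config.pbtxt` (the key and value below are placeholders, not actual backend options; substitute the parameter your backend expects):

```
parameters: {
  key: "EXAMPLE_PARAMETER"              # placeholder key name
  value: { string_value: "example_value" }
}
```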
I am not familiar with the intricacies of your model, though. If you could provide an illustrative example of what you mean, it would be easier for us to...
Would you consider the [BLS](https://github.com/triton-inference-server/python_backend?tab=readme-ov-file#business-logic-scripting) approach instead of an ensemble? This is definitely possible in BLS; see the sketch below.
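As a rough illustration (the model name and tensor names below are placeholders), a BLS model's `execute` can call another model loaded in Triton directly:

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Forward the incoming tensor to another model loaded in Triton.
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            bls_request = pb_utils.InferenceRequest(
                model_name="my_downstream_model",  # placeholder model name
                requested_output_names=["OUTPUT0"],
                inputs=[input_tensor],
            )
            bls_response = bls_request.exec()
            if bls_response.has_error():
                raise pb_utils.TritonModelException(bls_response.error().message())
            output_tensor = pb_utils.get_output_tensor_by_name(bls_response, "OUTPUT0")
            responses.append(pb_utils.InferenceResponse(output_tensors=[output_tensor]))
        return responses
```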
This is true when you use the `inference_request.exec` function, which executes a blocking inference request. You can also explore `inference_request.async_exec`, which allows you to perform `async` inference requests. This...
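A minimal sketch of the `async` variant (same placeholder names as above; note that `execute` must be a coroutine so that `async_exec` results can be awaited):

```python
import asyncio

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    async def execute(self, requests):
        responses = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            # Issue several BLS requests concurrently instead of blocking on each one.
            awaitables = [
                pb_utils.InferenceRequest(
                    model_name="my_downstream_model",  # placeholder model name
                    requested_output_names=["OUTPUT0"],
                    inputs=[input_tensor],
                ).async_exec()
                for _ in range(2)
            ]
            bls_responses = await asyncio.gather(*awaitables)
            # For brevity, only the first response is returned to the client.
            output_tensor = pb_utils.get_output_tensor_by_name(bls_responses[0], "OUTPUT0")
            responses.append(pb_utils.InferenceResponse(output_tensors=[output_tensor]))
        return responses
```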
@dzier or @GuanLuo, could you clarify the license, please?